.. _interactive-analysis:

Interactive Analysis
====================

DFAnalyzer provides a Python API for interactive analysis, allowing detailed exploration of I/O traces in environments such as Jupyter notebooks. This guide walks through a typical interactive analysis workflow.

.. contents::
   :local:

Prepare Environment
-------------------

First, ensure DFAnalyzer is installed in your Python environment. For detailed instructions, refer to the :doc:`getting-started` guide.

Prepare Trace Data
------------------

Next, ensure your trace data is accessible. You can use the sample datasets located in the ``tests/data`` directory. For this example, we extract a sample trace archive. The leading ``!`` runs each line as a shell command from a Jupyter notebook cell; omit it when running the commands in a terminal.

.. code-block:: bash

   !mkdir -p ./data
   !tar -xzf ../../tests/data/dftracer-dlio.tar.gz -C ./data

Run Analysis
------------

With the environment and data ready, you can run the analysis.

Initialize DFAnalyzer
~~~~~~~~~~~~~~~~~~~~~

Initialize DFAnalyzer using ``init_with_hydra``, providing configuration overrides as needed. In this example, the overrides select the ``dftracer`` analyzer with the ``dlio`` preset, disable checkpointing, and set the time granularity, the run directory, and the trace path.

.. code-block:: python

   from dftracer.analyzer import init_with_hydra

   run_dir = "./unet3d_v100_hdf5"
   time_granularity = 5  # 5 seconds
   trace_path = "./data/dftracer-dlio"
   view_types = ["time_range", "proc_name"]

   dfa = init_with_hydra(
       hydra_overrides=[
           "analyzer=dftracer",
           "analyzer/preset=dlio",
           "analyzer.checkpoint=False",
           f"analyzer.time_granularity={time_granularity}",
           f"hydra.run.dir={run_dir}",
           f"trace_path={trace_path}",
       ]
   )

You can inspect the Dask client and the preset configuration:

.. code-block:: python

   # Access the Dask client
   dfa.client

   # View the preset configuration
   dict(dfa.analyzer.preset.layer_defs)

Execute the Analysis
~~~~~~~~~~~~~~~~~~~~

Run the trace analysis using the ``analyze_trace`` method.

.. code-block:: python

   result = dfa.analyze_trace(view_types=view_types)

The results can then be passed to the output handler to display a summary.

.. code-block:: python

   dfa.output.handle_result(result)

Result Exploration
------------------

The ``result`` object (of type ``AnalyzerResultType``) contains detailed views of the analyzed data, which you can explore as DataFrames. The ``AnalyzerResultType`` provides convenient methods to access different aspects of the analysis results.

AnalyzerResultType Class
~~~~~~~~~~~~~~~~~~~~~~~~

The ``AnalyzerResultType`` dataclass encapsulates all the results from a DFAnalyzer analysis run. It provides both direct attribute access and convenience methods for exploring the data.

**Key Distinction**: Most users should primarily use ``flat_views`` (pandas DataFrames) for interactive analysis. The other views are Dask DataFrames exposed for advanced users who need distributed processing capabilities.

Key Attributes:

- ``layers``: List of layer names available in the analysis
- ``view_types``: List of view types used in the analysis
- ``flat_views``: Dictionary of flattened pandas DataFrames for quick access to aggregated metrics (recommended for most users)
- ``views``: Nested dictionary of Dask DataFrames organized by layer and view type (for advanced distributed processing)
- ``raw_stats``: Basic statistics about the trace data
- ``checkpoint_dir``: Directory where analysis checkpoints are stored

Primary Method (Recommended for most users):

View aggregated metrics across all layers, grouped by time intervals (returns pandas DataFrame):

.. code-block:: python

   result.get_flat_view('time_range').head(10)

List all the layers available for detailed analysis:

.. code-block:: python

   result.layers

Advanced Methods (Dask DataFrames - for distributed processing):

Show the high-level metrics for a specific layer (returns Dask DataFrame):

.. code-block:: python

   result.get_hlm('app').head()

Display a layered main view for a specific layer (returns Dask DataFrame):

.. code-block:: python

   result.get_main_view('reader_posix_lustre').head()

Access a specific view for a layer, grouped by a particular dimension (returns Dask DataFrame):

.. code-block:: python

   result.get_layer_view('reader_posix_lustre', 'time_range').head()

Display the raw trace data, showing individual I/O events (returns Dask DataFrame):

.. code-block:: python

   result._traces.head()
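
As a rough intuition for what a view grouped by ``time_range`` represents, the sketch below buckets a handful of events into fixed-width intervals using only the Python standard library. It is an illustration under stated assumptions, not DFAnalyzer's implementation: the event tuples, the ``count`` and ``bytes`` metric names, and the ``bucket_by_time_range`` helper are all hypothetical, and the real views may define intervals and metrics differently.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical events as (timestamp_seconds, bytes_transferred).
   # This is made-up data, not DFAnalyzer's trace schema.
   events = [(0.4, 4096), (3.9, 8192), (5.1, 4096), (12.7, 1024), (14.2, 2048)]

   time_granularity = 5  # seconds, the same value used earlier in this guide


   def bucket_by_time_range(events, granularity):
       """Count events and sum bytes per fixed-width time interval."""
       buckets = defaultdict(lambda: {"count": 0, "bytes": 0})
       for timestamp, size in events:
           # Interval index: 0 covers [0, 5), 1 covers [5, 10), and so on.
           time_range = int(timestamp // granularity)
           buckets[time_range]["count"] += 1
           buckets[time_range]["bytes"] += size
       return dict(sorted(buckets.items()))


   summary = bucket_by_time_range(events, time_granularity)
   for time_range, metrics in summary.items():
       print(time_range, metrics)

With a granularity of 5 seconds, the five sample events fall into three intervals. A smaller granularity gives finer time resolution at the cost of more rows to inspect.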