Interactive Analysis

DFAnalyzer provides a Python API for interactive analysis, allowing for detailed exploration of I/O traces within environments like Jupyter notebooks. This guide walks through a typical interactive analysis workflow.

Prepare Environment

First, ensure DFAnalyzer is installed in your Python environment. For detailed instructions, please refer to the Getting Started guide.

Prepare Trace Data

Next, ensure your trace data is accessible. You can use the sample datasets located in the tests/data directory. For this example, we extract a sample trace archive.

!mkdir -p ./data
!tar -xzf ../../tests/data/dftracer-dlio.tar.gz -C ./data

Run Analysis

With the environment and data ready, you can run the analysis.

Initialize DFAnalyzer

Initialize DFAnalyzer using init_with_hydra, providing configuration overrides as needed. This sets up the analyzer, such as dftracer with a specific preset like dlio.

from dftracer.analyzer import init_with_hydra

run_dir = f"./unet3d_v100_hdf5"
time_granularity = 5  # 5 seconds
trace_path = f"./data/dftracer-dlio"
view_types = ["time_range", "proc_name"]

dfa = init_with_hydra(
    hydra_overrides=[
        'analyzer=dftracer',
        'analyzer/preset=dlio',
        'analyzer.checkpoint=False',
        f"analyzer.time_granularity={time_granularity}",
        f"hydra.run.dir={run_dir}",
        f"trace_path={trace_path}",
    ]
)

You can inspect the Dask client and the preset configuration:

# Access the Dask client
dfa.client

# View the preset configuration
dict(dfa.analyzer.preset.layer_defs)

Execute the Analysis

Run the trace analysis using the analyze_trace method.

result = dfa.analyze_trace(view_types=view_types)

The results can then be passed to the output handler to display a summary.

dfa.output.handle_result(result)

Result Exploration

The result object (of type AnalyzerResultType) contains detailed views of the analyzed data, which you can explore using pandas DataFrames. The AnalyzerResultType provides convenient methods to access different aspects of the analysis results.

AnalyzerResultType Class

The AnalyzerResultType dataclass encapsulates all the results from a DFAnalyzer analysis run. It provides both direct attribute access and convenience methods for exploring the data.

Key Distinction: Most users should primarily use flat_views (pandas DataFrames) for interactive analysis. The other views are Dask DataFrames exposed for advanced users who need distributed processing capabilities.

Key Attributes:

  • layers: List of layer names available in the analysis

  • view_types: List of view types used in the analysis

  • flat_views: Dictionary of flattened pandas DataFrames for quick access to aggregated metrics (recommended for most users)

  • views: Nested dictionary of Dask DataFrames organized by layer and view type (for advanced distributed processing)

  • raw_stats: Basic statistics about the trace data

  • checkpoint_dir: Directory where analysis checkpoints are stored

Primary Method (Recommended for most users):

View aggregated metrics across all layers, grouped by time intervals (returns pandas DataFrame):

result.get_flat_view('time_range').head(10)

List all the layers available for detailed analysis:

result.layers

Advanced Methods (Dask DataFrames - for distributed processing):

Show the high-level metrics for a specific layer (returns Dask DataFrame):

result.get_hlm('app').head()

Display a layered main view for a specific layer (returns Dask DataFrame):

result.get_main_view('reader_posix_lustre').head()

Access a specific view for a layer, grouped by a particular dimension (returns Dask DataFrame):

result.get_layer_view('reader_posix_lustre', 'time_range').head()

Display the raw trace data, showing individual I/O events (returns Dask DataFrame):

result._traces.head()