Tools

This page provides an overview of the supplementary tools distributed with DFAnalyzer.

dfanalyzer-recorder2parquet

The dfanalyzer-recorder2parquet tool is a command-line utility that converts I/O trace files generated by the Recorder tracing tool into the Apache Parquet format. The conversion makes traces more efficient to store and to analyze afterwards, because Parquet is a columnar storage format optimized for analytical workloads.

Functionality

  • Input: Takes raw trace files generated by the Recorder tool. These files typically contain detailed records of I/O operations performed by an application.

  • Processing:

    • Parses individual trace records, extracting information such as function calls (e.g., POSIX I/O and MPI I/O calls such as open, read, and write), timestamps, file identifiers, process/rank information, and data transfer sizes.

    • Categorizes I/O operations (e.g., read, write, metadata).

    • Extracts metadata from the input trace file paths, such as hostname, application name, and process ID.

  • Output: Generates Parquet files containing the structured I/O trace data. The schema of the Parquet files includes the following fields:

    Field Name   Data Type     Description
    ----------   -----------   ------------------------------------------
    index        Int64         Record index
    level        Int32         Call stack level (if available)
    tstart       Float32       Start timestamp
    tmid         Int64         Timestamp midpoint
    tend         Float32       End timestamp
    duration     Float32       Duration of the operation
    hostname     UTF8 String   Hostname where the operation occurred
    app          UTF8 String   Application name
    rank         Int32         MPI rank
    proc_name    UTF8 String   Process name
    proc_id      Int64         Unique process identifier
    thread_id    Int32         Thread identifier
    cat          Int32         Operation category
    io_cat       Int32         I/O category (Read, Write, Metadata)
    func_id      UTF8 String   Function name/identifier
    acc_pat      Int32         Access pattern (e.g., sequential, random)
    file_id      Int64         Unique file identifier
    file_name    UTF8 String   Name of the file involved in the operation
    size         Int64         Size of the I/O operation (bytes)
    bandwidth    Float32       Calculated bandwidth for the operation
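
The categorization step and the derived duration and bandwidth fields can be illustrated with a short Python sketch. This is not the tool's implementation. The function-name sets, the string category labels, and the formulas duration = tend - tstart and bandwidth = size / duration are assumptions made for illustration. This page does not state how the tool computes these values or which integer codes it stores in io_cat.

```python
# Hypothetical sketch: the name sets, labels, and formulas below are
# assumptions, not taken from dfanalyzer-recorder2parquet itself.
READ_FUNCS = {"read", "pread", "fread", "MPI_File_read"}
WRITE_FUNCS = {"write", "pwrite", "fwrite", "MPI_File_write"}
METADATA_FUNCS = {"open", "close", "stat", "lseek", "MPI_File_open"}


def classify_io(func_name):
    """Return an assumed I/O category label for a traced function name."""
    if func_name in READ_FUNCS:
        return "read"
    if func_name in WRITE_FUNCS:
        return "write"
    if func_name in METADATA_FUNCS:
        return "metadata"
    return "other"


def derive_fields(tstart, tend, size):
    """Compute duration and bandwidth under the assumed formulas."""
    duration = tend - tstart
    # Guard against division by zero for zero-length operations.
    bandwidth = size / duration if duration > 0 else 0.0
    return {"duration": duration, "bandwidth": bandwidth}


# Example record using field names from the schema above.
record = {"func_id": "write", "tstart": 1.0, "tend": 1.5, "size": 4096}
record["io_cat"] = classify_io(record["func_id"])
record.update(derive_fields(record["tstart"], record["tend"], record["size"]))
print(record["io_cat"], record["duration"], record["bandwidth"])
```

Under these assumptions, bandwidth is expressed in bytes per timestamp unit. The unit used by the tool itself is not specified here.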

Usage

The dfanalyzer-recorder2parquet tool is typically built as part of the DFAnalyzer project, within the recorder subproject. To use it directly, invoke the compiled executable with the path to the directory containing the Recorder trace files:

mpirun -n 8 dfanalyzer-recorder2parquet <input_recorder_trace_directory>

The tool processes the traces from the specified <input_recorder_trace_directory>. It outputs one or more .parquet files into a subdirectory named _parquet, which is automatically created within the <input_recorder_trace_directory>. These resulting Parquet files can then be used as input for the DFAnalyzer recorder analyzer.
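
For scripted workflows, the invocation and the output location described above can be wrapped in a small Python helper. This is a minimal sketch under stated assumptions: the helper names build_command and list_parquet_outputs are hypothetical, the executable is assumed to be on PATH, and the .parquet files are assumed to sit directly inside _parquet rather than in nested subdirectories. Only the command shape and the _parquet subdirectory name come from this page.

```python
from pathlib import Path


def build_command(trace_dir, nprocs=8):
    """Build the mpirun invocation shown in the Usage section."""
    return ["mpirun", "-n", str(nprocs), "dfanalyzer-recorder2parquet", str(trace_dir)]


def list_parquet_outputs(trace_dir):
    """Return the .parquet files in <trace_dir>/_parquet, sorted by name."""
    out_dir = Path(trace_dir) / "_parquet"
    if not out_dir.is_dir():
        # The tool has not been run yet, or it wrote no output.
        return []
    return sorted(out_dir.glob("*.parquet"))


# "traces/run1" is a placeholder path used only for illustration.
print(" ".join(build_command("traces/run1")))
print(list_parquet_outputs("traces/run1"))

# To actually launch the conversion, pass the command to subprocess:
#   import subprocess
#   subprocess.run(build_command("traces/run1"), check=True)
```

Returning an empty list when _parquet is missing lets a calling script distinguish "not yet converted" from a failed run without raising an exception.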