Tools
This page provides an overview of the supplementary tools distributed with DFAnalyzer.
dfanalyzer-recorder2parquet
The dfanalyzer-recorder2parquet tool is a command-line utility designed to
convert I/O trace files generated by the Recorder tracing tool into the Apache
Parquet format. This conversion is beneficial for efficient storage and
subsequent analysis, as Parquet is a columnar storage format optimized for
analytical workloads.
Functionality
Input: Takes raw trace files generated by the Recorder tool. These files typically contain detailed records of I/O operations performed by an application.
Processing:
- Parses individual trace records, extracting information such as function calls (e.g., open, read, write, POSIX I/O, MPI I/O calls), timestamps, file identifiers, process/rank information, and data transfer sizes.
- Categorizes I/O operations (e.g., read, write, metadata).
- Extracts metadata from the input trace file paths, such as hostname, application name, and process ID.
Output: Generates Parquet files containing the structured I/O trace data. The schema of the Parquet files includes the following fields:
Field Name   Data Type     Description
index        Int64         Record index
level        Int32         Call stack level (if available)
tstart       Float32       Start timestamp
tmid         Int64         Timestamp midpoint
tend         Float32       End timestamp
duration     Float32       Duration of the operation
hostname     UTF8 String   Hostname where the operation occurred
app          UTF8 String   Application name
rank         Int32         MPI rank
proc_name    UTF8 String   Process name
proc_id      Int64         Unique process identifier
thread_id    Int32         Thread identifier
cat          Int32         Operation category
io_cat       Int32         I/O category (Read, Write, Metadata)
func_id      UTF8 String   Function name/identifier
acc_pat      Int32         Access pattern (e.g., sequential, random)
file_id      Int64         Unique file identifier
file_name    UTF8 String   Name of the file involved in the operation
size         Int64         Size of the I/O operation (bytes)
bandwidth    Float32       Calculated bandwidth for the operation
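As a rough illustration of the processing step and of how the derived fields relate to each other, the following Python sketch categorizes a function name and computes duration and bandwidth for a single record. It is not the tool's implementation: the IOCategory integer codes, the function-name lookup sets, and the Record class are all hypothetical, and the formulas duration = tend - tstart and bandwidth = size / duration are assumptions based only on the field descriptions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class IOCategory(IntEnum):
    # Hypothetical integer codes; the real io_cat values are not documented here.
    OTHER = 0
    READ = 1
    WRITE = 2
    METADATA = 3


# Illustrative, incomplete lookup sets (an assumption, not the tool's tables).
READ_FUNCS = {"read", "pread", "fread", "MPI_File_read"}
WRITE_FUNCS = {"write", "pwrite", "fwrite", "MPI_File_write"}
METADATA_FUNCS = {"open", "close", "stat", "lseek", "MPI_File_open"}


def categorize(func_name: str) -> IOCategory:
    """Map a traced function name to a coarse I/O category."""
    if func_name in READ_FUNCS:
        return IOCategory.READ
    if func_name in WRITE_FUNCS:
        return IOCategory.WRITE
    if func_name in METADATA_FUNCS:
        return IOCategory.METADATA
    return IOCategory.OTHER


@dataclass
class Record:
    """Hypothetical subset of the fields in the schema above."""
    func_id: str
    tstart: float
    tend: float
    size: int

    @property
    def io_cat(self) -> IOCategory:
        return categorize(self.func_id)

    @property
    def duration(self) -> float:
        # Assumed definition: end timestamp minus start timestamp.
        return self.tend - self.tstart

    @property
    def bandwidth(self) -> float:
        # Assumed definition: bytes per timestamp unit; 0.0 if duration is not positive.
        return self.size / self.duration if self.duration > 0 else 0.0


rec = Record(func_id="write", tstart=1.0, tend=3.0, size=4096)
print(rec.io_cat.name, rec.duration, rec.bandwidth)
```

Guarding against a non-positive duration avoids a division by zero for operations whose start and end timestamps coincide; how the actual tool handles that case is not stated here.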
Usage
The dfanalyzer-recorder2parquet tool is typically built as part of the
DFAnalyzer project, specifically within the recorder subproject. To run it,
invoke the compiled executable with the path to the directory containing the
Recorder trace files, for example under mpirun with 8 processes:
mpirun -n 8 dfanalyzer-recorder2parquet <input_recorder_trace_directory>
The tool processes the traces from the specified
<input_recorder_trace_directory>. It outputs one or more .parquet files
into a subdirectory named _parquet, which is automatically created within
the <input_recorder_trace_directory>. These resulting Parquet files can then
be used as input for the DFAnalyzer recorder analyzer.
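To check what a conversion run produced, the output files can be collected from the _parquet subdirectory described above. The following Python sketch uses only the standard library; the helper name find_parquet_outputs and the example path are hypothetical, and the code assumes the files sit directly inside _parquet rather than in nested directories.

```python
from pathlib import Path
from typing import List


def find_parquet_outputs(trace_dir: str) -> List[Path]:
    """Return the .parquet files in <trace_dir>/_parquet, sorted by name."""
    out_dir = Path(trace_dir) / "_parquet"
    if not out_dir.is_dir():
        # The conversion has not run yet, or it wrote its output elsewhere.
        return []
    return sorted(out_dir.glob("*.parquet"))


if __name__ == "__main__":
    # Hypothetical trace directory; replace with the path passed to the tool.
    for path in find_parquet_outputs("recorder-traces"):
        print(path)
```

An empty result means either that the _parquet directory does not exist or that it contains no .parquet files.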