Utilities¶
dftracer-utils provides a collection of composable utilities for trace file processing. These utilities can be used standalone or combined into pipelines.
Overview¶
Utilities follow a consistent pattern:
Input types: Configuration structs with fluent builder API
Output types: Result structs with success status and data
process() method: Main entry point that transforms input to output
Tags: Metadata such as Parallelizable for thread-safe utilities
graph TB
subgraph Base["Utility Pattern"]
Utility["Utility<I, O, Tags...><br/>process(I) → CoroTask<O>"]
end
subgraph Categories["Utility Categories"]
FileIO["File I/O<br/>FileReader, StreamingReader"]
Compression["Compression<br/>Compressor, Decompressor"]
Text["Text<br/>LineSplitter, LineFilter"]
Hash["Hash<br/>FNV1a, Std, MurmurHash3"]
Indexer["Indexer<br/>Checkpoint, BloomFilter"]
Reader["Reader<br/>Stream, LineProcessor"]
Common["Common<br/>JSON, DDSketch, Statistic, Distributions, Mixture"]
Composites["Composites<br/>DFTracer-specific pipelines"]
Dlio["DLIO<br/>BarrierSimulator, TraceLoader, Optimizer, YAML emit"]
end
Utility --> FileIO
Utility --> Compression
Utility --> Text
Utility --> Hash
Utility --> Indexer
Utility --> Reader
Utility --> Common
Utility --> Composites
Utility --> Dlio
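The pattern above can be sketched in a few lines of C++. This is a minimal illustration, not the library's actual declarations: LineCountInput, LineCountOutput, LineCounter, and has_tag are hypothetical names, and process() is synchronous here for brevity, whereas the documented pattern returns CoroTask<O>.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Tag type marking a utility as safe to run from several threads.
// (Name taken from the text; its real definition is an assumption.)
struct Parallelizable {};

// Simplified base: the documented process() returns CoroTask<O>;
// this sketch returns O directly to stay self-contained.
template <typename I, typename O, typename... Tags>
struct Utility {
    using Input = I;
    using Output = O;

    // True when Tag is one of this utility's Tags...
    template <typename Tag>
    static constexpr bool has_tag = (std::is_same_v<Tag, Tags> || ...);

    virtual ~Utility() = default;
    virtual O process(const I& input) = 0;
};

// Input type: configuration struct with a fluent builder API.
struct LineCountInput {
    std::string text;
    bool skip_empty = false;

    LineCountInput& with_text(std::string t) {
        text = std::move(t);
        return *this;
    }
    LineCountInput& with_skip_empty(bool v) {
        skip_empty = v;
        return *this;
    }
};

// Output type: result struct with success status and data.
struct LineCountOutput {
    bool success = false;
    std::size_t lines = 0;
};

// Hypothetical utility: counts newline-delimited lines. It keeps no
// state between calls, so it carries the Parallelizable tag.
struct LineCounter
    : Utility<LineCountInput, LineCountOutput, Parallelizable> {
    LineCountOutput process(const LineCountInput& in) override {
        LineCountOutput out;
        std::size_t start = 0;
        while (start < in.text.size()) {
            std::size_t end = in.text.find('\n', start);
            if (end == std::string::npos) end = in.text.size();
            const bool empty = (end == start);
            if (!(in.skip_empty && empty)) ++out.lines;
            start = end + 1;
        }
        out.success = true;
        return out;
    }
};
```

A caller would build the input with the fluent setters, for example LineCountInput().with_text(...).with_skip_empty(true), pass it to process(), and check the success field before reading the data. A scheduler could consult LineCounter::has_tag<Parallelizable> at compile time to decide whether to fan the utility out across threads.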
File I/O¶
The fileio utilities support both synchronous and asynchronous file operations:
Synchronous readers: Full in-memory or streaming chunk-based reading
Async generators: Non-blocking line/byte generators using co_await and coroutines
Plain and indexed files: Support for both raw text files and compressed archives with sidecar indexes
Streaming decompression: On-the-fly decompression of .gz files without building indexes
See File I/O for detailed usage.
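To make the streaming chunk-based mode concrete, here is a minimal sketch built only on the standard library. The function for_each_line_chunked is a hypothetical helper, not a dftracer-utils API; it shows one way a reader might hold a single fixed-size chunk in memory and still deliver whole lines when a line straddles a chunk boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: reads `in` in chunks of `chunk_size` bytes and
// calls `on_line` once per line (without the trailing '\n').
// Returns the number of lines delivered.
inline std::size_t for_each_line_chunked(
    std::istream& in, std::size_t chunk_size,
    const std::function<void(std::string_view)>& on_line) {
    assert(chunk_size > 0);
    std::vector<char> buffer(chunk_size);
    std::string carry;  // partial line left over from earlier chunks
    std::size_t count = 0;

    while (in) {
        in.read(buffer.data(),
                static_cast<std::streamsize>(buffer.size()));
        const auto got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;

        const std::string_view chunk(buffer.data(), got);
        std::size_t start = 0;
        for (;;) {
            const std::size_t nl = chunk.find('\n', start);
            if (nl == std::string_view::npos) break;
            // Complete the pending line and hand it to the callback.
            carry.append(chunk.substr(start, nl - start));
            on_line(carry);
            ++count;
            carry.clear();
            start = nl + 1;
        }
        // Keep the unterminated tail for the next chunk.
        carry.append(chunk.substr(start));
    }

    // A final line without a trailing newline is still a line.
    if (!carry.empty()) {
        on_line(carry);
        ++count;
    }
    return count;
}
```

Memory use is bounded by the chunk size plus the longest single line, which is the usual trade-off against full in-memory reading. The string_view passed to the callback is only valid during the call, so a caller that wants to keep a line must copy it.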
Statistics¶
Enhanced statistics collection and distribution fitting for trace analysis:
DDSketch: Deterministic, merge-order-independent percentile estimation with bounded relative error
Log2Histogram: Fixed 65-bin logarithmic histogram for duration and size distributions
Statistic: Min/max/mean/count accumulator that optionally delegates to an attached DDSketch for quantile queries
Distributions: MLE fitting + KS / BIC scoring for Normal, Lognormal, Gamma, Exponential, Weibull; sampler factory backed by <random> and Boost.Math standalone
Mixture: Univariate Gaussian Mixture EM (K=2, K=3) with log-sum-exp responsibilities and BIC-based selection across single + mixture models
Chunk statistics: Per-chunk event tracking with online variance calculation and per-name duration sketches
These are used in indexing and aggregation pipelines to compute event distributions and percentiles efficiently, and by the DLIO config generator to fit per-component timing distributions.
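The DDSketch properties listed above (bounded relative error, merge-order independence) follow from logarithmic bucketing. The sketch below is a simplified illustration of that general technique, not the library's class: TinyDDSketch is a hypothetical name, it handles positive values only, and it never collapses buckets to cap memory, which a production sketch typically does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <map>

// Hypothetical, simplified DDSketch-style quantile sketch.
class TinyDDSketch {
public:
    // alpha is the target relative accuracy, e.g. 0.01 for 1%.
    explicit TinyDDSketch(double alpha)
        : gamma_((1.0 + alpha) / (1.0 - alpha)),
          log_gamma_(std::log(gamma_)) {
        assert(alpha > 0.0 && alpha < 1.0);
    }

    // Bucket i covers (gamma^(i-1), gamma^i]. Non-positive values are
    // ignored in this sketch.
    void add(double x) {
        if (!(x > 0.0)) return;
        const int idx =
            static_cast<int>(std::ceil(std::log(x) / log_gamma_));
        ++buckets_[idx];
        ++count_;
    }

    // Merging adds integer bucket counts, so the result does not
    // depend on the order in which sketches are merged.
    // Both sketches must have been built with the same alpha.
    void merge(const TinyDDSketch& other) {
        for (const auto& [idx, n] : other.buckets_) buckets_[idx] += n;
        count_ += other.count_;
    }

    // Walks buckets in ascending order until the target rank is
    // covered, then returns that bucket's representative value.
    double quantile(double q) const {
        if (count_ == 0) return std::numeric_limits<double>::quiet_NaN();
        q = std::clamp(q, 0.0, 1.0);
        const double rank = q * static_cast<double>(count_ - 1);
        std::uint64_t seen = 0;
        int idx = buckets_.begin()->first;
        for (const auto& [i, n] : buckets_) {
            idx = i;
            seen += n;
            if (static_cast<double>(seen) > rank) break;
        }
        // 2*gamma^i / (gamma + 1) lies within alpha (relative) of
        // every value in bucket i.
        return 2.0 * std::pow(gamma_, idx) / (gamma_ + 1.0);
    }

    std::uint64_t count() const { return count_; }

private:
    double gamma_;
    double log_gamma_;
    std::map<int, std::uint64_t> buckets_;
    std::uint64_t count_ = 0;
};
```

Because each bucket spans a fixed ratio rather than a fixed width, the same structure stays accurate for both microsecond-scale and second-scale durations. This is presumably why a sketch of this kind suits per-name duration tracking, where chunk-level sketches are merged later.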
DLIO Config Generation¶
End-to-end pipeline that converts a directory of raw DFTracer logs into a DLIO training-loop YAML configuration:
trace_loader: pulls the AGGREGATION column family (re-attaches the merge operator at open time) and synthesizes per-rank sample arrays from per-(pid, time_bucket) entries.
BarrierSimulator: simulates one DLIO training run across the captured ranks/steps, scoring an end-to-end duration, rank variance, and fetch.block CDF similarity against the empirical trace.
optimizer: sequential momentum loop refining the max_bound percentile on the fitted sampler to minimize simulator E2E error.
yaml_emit: renders single distributions or Gaussian mixtures into the DLIO train.computation_time / reader.preprocess_time schema.
See DLIO Config Generation for the API, and dftracer_gen_dlio_config in Command-Line Tools for the user-facing binary.
Indexing¶
Advanced indexing utilities for fast trace queries:
Bloom filter cache: Thread-safe bounded cache for deserialized bloom filters with file-level and chunk-level keys
Chunk statistics: Per-chunk aggregates including event counts, timestamp ranges, and duration distributions
Predicate filtering: Efficient multi-dimensional filtering for view queries on dimensions like time range and duration bounds
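A thread-safe bounded cache with file-level and chunk-level keys, as described for the bloom filter cache, could look roughly like the following. Everything here is an assumption made for illustration: the names FilterKey, FilterBits, and BoundedFilterCache are hypothetical, a byte vector stands in for a deserialized bloom filter, and least-recently-used eviction is one plausible policy; the source does not say which policy the library uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Stand-in for a deserialized bloom filter (assumption).
using FilterBits = std::vector<std::uint8_t>;

// chunk == std::nullopt denotes a file-level key; otherwise the key
// addresses one chunk of that file.
struct FilterKey {
    std::string file;
    std::optional<std::uint32_t> chunk;

    bool operator<(const FilterKey& o) const {
        return std::tie(file, chunk) < std::tie(o.file, o.chunk);
    }
};

// Hypothetical bounded cache with LRU eviction guarded by one mutex.
class BoundedFilterCache {
public:
    explicit BoundedFilterCache(std::size_t capacity)
        : capacity_(capacity) {
        assert(capacity > 0);
    }

    void put(const FilterKey& key,
             std::shared_ptr<const FilterBits> value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it != index_.end()) {
            // Existing key: replace the value and mark it most recent.
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() >= capacity_) {
            // Evict the least recently used entry.
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

    // Returns nullptr on a miss. A hit refreshes the entry's recency.
    std::shared_ptr<const FilterBits> get(const FilterKey& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return it->second->second;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return order_.size();
    }

private:
    using Entry = std::pair<FilterKey, std::shared_ptr<const FilterBits>>;

    std::size_t capacity_;
    mutable std::mutex mutex_;
    std::list<Entry> order_;  // front = most recently used
    std::map<FilterKey, std::list<Entry>::iterator> index_;
};
```

Handing out shared_ptr<const FilterBits> lets a reader keep using a filter after it has been evicted, so the lock only needs to cover the bookkeeping rather than the lookup work done with the filter itself.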
Views and Predicates¶
Query views on DFTracer traces with multi-dimensional filtering:
PredicateFilter: Efficiently filters events by dimension sets, time ranges, and duration bounds
Supports multiple predicates: Match events against OR’d lists of predicates
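The OR'd predicate semantics can be illustrated with a small sketch. The Event and Predicate structs and the matches_any function are hypothetical and cover only time range and duration bounds; within one predicate every bound that is set must hold (AND), and an event is accepted if any predicate in the list matches (OR). Treating the bounds as inclusive is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical event record: start timestamp and duration, in
// whatever unit the trace uses.
struct Event {
    std::uint64_t ts;
    std::uint64_t dur;
};

// One predicate: unset bounds are ignored, set bounds are inclusive
// and must all hold.
struct Predicate {
    std::optional<std::uint64_t> ts_min;
    std::optional<std::uint64_t> ts_max;
    std::optional<std::uint64_t> dur_min;
    std::optional<std::uint64_t> dur_max;

    bool matches(const Event& e) const {
        if (ts_min && e.ts < *ts_min) return false;
        if (ts_max && e.ts > *ts_max) return false;
        if (dur_min && e.dur < *dur_min) return false;
        if (dur_max && e.dur > *dur_max) return false;
        return true;
    }
};

// OR across the list: true if at least one predicate matches.
// An empty list matches nothing in this sketch.
inline bool matches_any(const std::vector<Predicate>& predicates,
                        const Event& e) {
    return std::any_of(
        predicates.begin(), predicates.end(),
        [&](const Predicate& p) { return p.matches(e); });
}
```

For instance, a list holding one predicate with ts_max = 100 and another with dur_min = 50 accepts events that are either early or slow, and rejects events that are neither.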
See Utilities API for the full API reference.