Utilities
=========
dftracer-utils provides a collection of composable utilities for trace file processing. These utilities can be used standalone or combined into pipelines.

.. toctree::
   :maxdepth: 2
   :caption: Available Utilities:

   utilities/filesystem
   utilities/fileio
   utilities/compression
   utilities/text
   utilities/composites
   utilities/replay
   utilities/hash
   utilities/indexer
   utilities/reader
   utilities/common
   utilities/dlio
   call-tree

Overview
--------
Utilities follow a consistent pattern:

- **Input types**: Configuration structs with a fluent builder API
- **Output types**: Result structs with a success status and data
- **process() method**: Main entry point that transforms input to output
- **Tags**: Metadata such as ``Parallelizable`` for thread-safe utilities

.. mermaid::

   graph TB
       subgraph Base["Utility Pattern"]
           Utility["Utility&lt;I, O, Tags...&gt;<br/>process(I) → CoroTask&lt;O&gt;"]
       end

       subgraph Categories["Utility Categories"]
           FileIO["File I/O<br/>FileReader, StreamingReader"]
           Compression["Compression<br/>Compressor, Decompressor"]
           Text["Text<br/>LineSplitter, LineFilter"]
           Hash["Hash<br/>FNV1a, Std, MurmurHash3"]
           Indexer["Indexer<br/>Checkpoint, BloomFilter"]
           Reader["Reader<br/>Stream, LineProcessor"]
           Common["Common<br/>JSON, DDSketch, Statistic, Distributions, Mixture"]
           Composites["Composites<br/>DFTracer-specific pipelines"]
           Dlio["DLIO<br/>BarrierSimulator, TraceLoader, Optimizer, YAML emit"]
       end

       Utility --> FileIO
       Utility --> Compression
       Utility --> Text
       Utility --> Hash
       Utility --> Indexer
       Utility --> Reader
       Utility --> Common
       Utility --> Composites
       Utility --> Dlio

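
The pattern above can be illustrated with a minimal sketch. Everything in it is hypothetical and is not the dftracer-utils API: ``UtilitySketch``, ``LineCountInput``, ``LineCountOutput``, ``ParallelizableTag``, and ``LineCounter`` are names invented for this example, and ``process()`` returns its output directly instead of a ``CoroTask<O>`` so that the example stays self-contained.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical input type: a configuration struct with a fluent builder API.
struct LineCountInput {
    std::string text;
    bool skip_empty = false;

    LineCountInput& with_text(std::string value) {
        text = std::move(value);
        return *this;
    }
    LineCountInput& with_skip_empty(bool value) {
        skip_empty = value;
        return *this;
    }
};

// Hypothetical output type: a success status plus the computed data.
struct LineCountOutput {
    bool success = false;
    std::size_t lines = 0;
};

// Hypothetical tag type standing in for metadata such as Parallelizable.
struct ParallelizableTag {};

// Simplified stand-in for the Utility<I, O, Tags...> base. The real base
// returns CoroTask<O>; this sketch is synchronous (an assumption made here).
template <typename I, typename O, typename... Tags>
class UtilitySketch {
public:
    using Input = I;
    using Output = O;
    virtual ~UtilitySketch() = default;
    virtual O process(const I& input) = 0;
};

// A toy utility that counts newline-delimited lines in the input text.
class LineCounter
    : public UtilitySketch<LineCountInput, LineCountOutput, ParallelizableTag> {
public:
    LineCountOutput process(const LineCountInput& input) override {
        LineCountOutput out;
        const std::string& t = input.text;
        std::size_t start = 0;
        while (start < t.size()) {
            std::size_t end = t.find('\n', start);
            if (end == std::string::npos) end = t.size();
            const bool is_empty = (end == start);
            if (!(input.skip_empty && is_empty)) ++out.lines;
            start = end + 1;
        }
        out.success = true;
        return out;
    }
};
```

A caller would build the input with the fluent setters, for example ``input.with_text(...).with_skip_empty(true)``, pass it to ``process()``, and check ``success`` before reading ``lines``.
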
File I/O
--------
The ``fileio`` utilities support both synchronous and asynchronous file operations:

- **Synchronous readers**: Full in-memory or streaming chunk-based reading
- **Async generators**: Non-blocking line/byte generators using ``co_await`` and coroutines
- **Plain and indexed files**: Support for both raw text files and compressed archives with sidecar indexes
- **Streaming decompression**: On-the-fly decompression of ``.gz`` files without building indexes

See :doc:`/utilities/fileio` for detailed usage.

Statistics
----------
Statistics collection and distribution fitting for trace analysis:

- **DDSketch**: Deterministic, merge-order-independent percentile estimation with bounded relative error
- **Log2Histogram**: Fixed 65-bin logarithmic histogram for duration and size distributions
- **Statistic**: Min/max/mean/count accumulator that optionally delegates to an attached DDSketch for quantile queries
- **Distributions**: MLE fitting plus KS / BIC scoring for Normal, Lognormal, Gamma, Exponential, and Weibull; a sampler factory backed by ``<random>`` and standalone Boost.Math
- **Mixture**: Univariate Gaussian Mixture EM (K=2, K=3) with log-sum-exp responsibilities and BIC-based selection across single and mixture models
- **Chunk statistics**: Per-chunk event tracking with online variance calculation and per-name duration sketches

These are used in indexing and aggregation pipelines to compute event distributions and percentiles efficiently, and by the DLIO config generator to fit per-component timing distributions.

DLIO Config Generation
----------------------
End-to-end pipeline that converts a directory of raw DFTracer logs into a DLIO
training-loop YAML configuration:

- **trace_loader**: pulls the ``AGGREGATION`` column family (re-attaches the
  merge operator at open time) and synthesizes per-rank sample arrays from
  per-(pid, time_bucket) entries.
- **BarrierSimulator**: simulates one DLIO training run across the captured
  ranks/steps, scoring an end-to-end duration, rank variance, and ``fetch.block``
  CDF similarity against the empirical trace.
- **optimizer**: sequential momentum loop refining the ``max_bound`` percentile
  on the fitted sampler to minimize simulator E2E error.
- **yaml_emit**: renders single distributions or Gaussian mixtures into the
  DLIO ``train.computation_time`` / ``reader.preprocess_time`` schema.

See :doc:`/utilities/dlio` for the API and ``dftracer_gen_dlio_config`` in
:doc:`/cli` for the user-facing binary.

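
To make the end-to-end (E2E) score concrete, the sketch below shows one simple way a barrier-synchronized run could be scored. It is not the ``BarrierSimulator`` implementation. It assumes every rank waits at a barrier at the end of each step, so a step costs as much as its slowest rank, and it compares the simulated total with an observed duration. The names ``simulate_barrier_e2e`` and ``relative_e2e_error`` and the ``step_times`` layout are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// step_times[r][s]: sampled duration (seconds) of step s on rank r.
// Assumption of this sketch: all ranks synchronize after every step, so the
// step duration is the maximum over ranks and the run is the sum over steps.
inline double simulate_barrier_e2e(
    const std::vector<std::vector<double>>& step_times) {
    if (step_times.empty()) return 0.0;
    const std::size_t steps = step_times.front().size();
    double total = 0.0;
    for (std::size_t s = 0; s < steps; ++s) {
        double slowest = 0.0;
        for (const auto& rank : step_times) {
            assert(rank.size() == steps);  // every rank runs every step
            slowest = std::max(slowest, rank[s]);
        }
        total += slowest;
    }
    return total;
}

// Relative error of the simulated E2E duration against the observed one.
inline double relative_e2e_error(double simulated, double observed) {
    assert(observed > 0.0);
    return std::fabs(simulated - observed) / observed;
}
```

An optimizer loop of the kind described above would repeatedly resample ``step_times`` from the fitted sampler, recompute this error, and adjust its parameter to reduce it.
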
Indexing
--------
Indexing utilities for fast trace queries:

- **Bloom filter cache**: Thread-safe bounded cache for deserialized bloom filters with file-level and chunk-level keys
- **Chunk statistics**: Per-chunk aggregates including event counts, timestamp ranges, and duration distributions
- **Predicate filtering**: Multi-dimensional filtering for view queries on dimensions such as time range and duration bounds

Views and Predicates
--------------------
Query views on DFTracer traces with multi-dimensional filtering:

- **PredicateFilter**: Filters events by dimension sets, time ranges, and duration bounds
- **Multiple predicates**: An event matches when it satisfies at least one predicate in an OR'd list

See :doc:`cpp_api/utilities` for the full API reference.

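
The OR'd predicate semantics can be sketched as follows. This is a hedged illustration, not the ``PredicateFilter`` API: ``EventSketch``, ``PredicateSketch``, and ``matches_any`` are hypothetical names, all bounds are treated as inclusive, and an empty predicate list is treated as "no filter". Each of these choices is an assumption of the example. Within a single predicate every constraint that is set must hold (AND); across the list one matching predicate is enough (OR).

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, minimal event record.
struct EventSketch {
    std::string name;
    std::uint64_t ts = 0;   // start timestamp
    std::uint64_t dur = 0;  // duration
};

// Hypothetical predicate: unset fields place no constraint on the event.
struct PredicateSketch {
    std::set<std::string> names;            // empty set: any name
    std::optional<std::uint64_t> ts_min;    // inclusive bounds
    std::optional<std::uint64_t> ts_max;
    std::optional<std::uint64_t> dur_min;
    std::optional<std::uint64_t> dur_max;

    // All constraints that are set must hold.
    bool matches(const EventSketch& e) const {
        if (!names.empty() && names.count(e.name) == 0) return false;
        if (ts_min && e.ts < *ts_min) return false;
        if (ts_max && e.ts > *ts_max) return false;
        if (dur_min && e.dur < *dur_min) return false;
        if (dur_max && e.dur > *dur_max) return false;
        return true;
    }
};

// An event passes when at least one predicate matches. In this sketch an
// empty list accepts every event.
inline bool matches_any(const std::vector<PredicateSketch>& predicates,
                        const EventSketch& e) {
    if (predicates.empty()) return true;
    return std::any_of(predicates.begin(), predicates.end(),
                       [&e](const PredicateSketch& p) { return p.matches(e); });
}
```

For example, a list holding "``read`` events lasting at least 10" and "any event starting at or before timestamp 50" accepts a long ``read`` at any time and any early event, but rejects a short, late ``read``.
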