Reader Components
See also
For complete class and member documentation, see the API Reference.
This module provides trace file reading functionality.
All public classes are in the dftracer::utils::utilities::reader namespace; implementation classes live in its nested internal namespace.
classDiagram
class dftracer__utils__utilities__reader__JsonLine["JsonLine"]
class dftracer__utils__utilities__reader__ReadConfig["ReadConfig"]
dftracer__utils__utilities__reader__ReadConfig : +has_line_range() bool
dftracer__utils__utilities__reader__ReadConfig : +has_byte_range() bool
class dftracer__utils__utilities__reader__TraceReader["TraceReader"]
dftracer__utils__utilities__reader__TraceReader : +read_lines() AsyncGenerator
dftracer__utils__utilities__reader__TraceReader : +read_json() AsyncGenerator
dftracer__utils__utilities__reader__TraceReader : +read_raw() AsyncGenerator
class dftracer__utils__utilities__reader__TraceReaderConfig["TraceReaderConfig"]
class dftracer__utils__utilities__reader__internal__CLineProcessor["CLineProcessor"]
dftracer__utils__utilities__reader__internal__CLineProcessor : +process() CoroTask
class dftracer__utils__utilities__reader__internal__LineProcessor["LineProcessor"]
<<abstract>> dftracer__utils__utilities__reader__internal__LineProcessor
dftracer__utils__utilities__reader__internal__LineProcessor : +process() CoroTask
dftracer__utils__utilities__reader__internal__LineProcessor : +begin() void
dftracer__utils__utilities__reader__internal__LineProcessor : +end() void
class dftracer__utils__utilities__reader__internal__Reader["Reader"]
<<abstract>> dftracer__utils__utilities__reader__internal__Reader
dftracer__utils__utilities__reader__internal__Reader : +get_max_bytes() size_t
dftracer__utils__utilities__reader__internal__Reader : +get_num_lines() size_t
dftracer__utils__utilities__reader__internal__Reader : +get_archive_path() string &
class dftracer__utils__utilities__reader__internal__ReaderError["ReaderError"]
dftracer__utils__utilities__reader__internal__ReaderError : +get_type() Type
class dftracer__utils__utilities__reader__internal__ReaderFactory["ReaderFactory"]
dftracer__utils__utilities__reader__internal__ReaderFactory : +create() shared_ptr
dftracer__utils__utilities__reader__internal__ReaderFactory : +create() shared_ptr
dftracer__utils__utilities__reader__internal__ReaderFactory : +is_format_supported() bool
class dftracer__utils__utilities__reader__internal__ReaderStream["ReaderStream"]
<<abstract>> dftracer__utils__utilities__reader__internal__ReaderStream
dftracer__utils__utilities__reader__internal__ReaderStream : +read_async() CoroTask
dftracer__utils__utilities__reader__internal__ReaderStream : +read() span
dftracer__utils__utilities__reader__internal__ReaderStream : +read_async() CoroTask
class dftracer__utils__utilities__reader__internal__StreamConfig["StreamConfig"]
dftracer__utils__utilities__reader__internal__StreamConfig : +extend_to_line_boundary() bool
dftracer__utils__utilities__reader__internal__StreamConfig : +extend_to_line_boundary() StreamConfig &
dftracer__utils__utilities__reader__internal__StreamConfig : +stream_type() StreamType
dftracer__utils__utilities__reader__internal__LineProcessor <|-- dftracer__utils__utilities__reader__internal__CLineProcessor
Overview
The reader module provides streaming access to compressed trace files,
supporting both sequential and indexed random access modes. When an .idx
sidecar file exists, the reader automatically uses checkpoint-based random
access for line and byte ranges. Otherwise it falls back to sequential
decompression.
The reader also supports query-based event filtering: when a query string is
provided and an index exists, non-matching chunks are pruned entirely, and
per-event filtering is applied to the remaining chunks. Conjunctions of
equality predicates (cat == 'io' AND name == 'read') are compiled into
a vectorized predicate evaluator that runs against the index bloom dimensions
before any line is decompressed.
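The pruning idea can be illustrated with a small, self-contained sketch. It is not the library's implementation: ChunkSummary, Predicate, chunk_may_match, and surviving_chunks are hypothetical names, and exact value sets stand in for the index's bloom dimensions. It only shows how a conjunction of equality predicates can rule out a whole chunk as soon as one required value is absent from that chunk's summary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-chunk summary: for each dimension (e.g. "cat", "name"),
// the set of values seen in the chunk. A bloom filter would answer the same
// "may contain?" question, with possible false positives but no false negatives.
struct ChunkSummary {
    std::size_t first_line;
    std::size_t last_line;
    std::map<std::string, std::set<std::string>> dims;
};

// One equality predicate of the form: field == value.
struct Predicate {
    std::string field;
    std::string value;
};

// Returns false only when the summary proves the chunk cannot match.
bool chunk_may_match(const ChunkSummary& chunk,
                     const std::vector<Predicate>& conjunction) {
    assert(chunk.first_line <= chunk.last_line);
    for (const Predicate& p : conjunction) {
        auto it = chunk.dims.find(p.field);
        if (it == chunk.dims.end()) continue;  // no summary for this field: cannot prune on it
        if (it->second.count(p.value) == 0) return false;  // value never seen in this chunk
    }
    return true;
}

// Indices of chunks that still need to be decompressed and filtered per event.
std::vector<std::size_t> surviving_chunks(const std::vector<ChunkSummary>& chunks,
                                          const std::vector<Predicate>& conjunction) {
    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (chunk_may_match(chunks[i], conjunction)) keep.push_back(i);
    }
    return keep;
}
```

Chunks that survive this pass can still contain non-matching events, which is why per-event filtering remains necessary afterwards.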
TraceReader also accepts a directory as file_path: when given a
directory, it enumerates trace files inside it, opens one indexed reader per
file, and yields lines / Arrow batches in file order. Batch chunk pruning is
delegated to ChunkPrunerUtility, which evaluates the compiled query
against all candidate chunks in one pass and feeds the resulting line-range
work items back to the per-file readers.
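As a rough illustration of how surviving chunks might become line-range work items, the sketch below merges overlapping or directly adjacent ranges so that each work item covers one contiguous span. LineRange and merge_line_ranges are hypothetical names, the bounds are assumed to be 1-based and inclusive, and the text does not say whether ChunkPrunerUtility merges ranges this way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical work item: a 1-based, inclusive line range within one file.
struct LineRange {
    std::size_t start_line;
    std::size_t end_line;
};

// Sorts ranges by start line and merges those that overlap or touch,
// so contiguous surviving chunks are read as a single span.
std::vector<LineRange> merge_line_ranges(std::vector<LineRange> ranges) {
    std::sort(ranges.begin(), ranges.end(),
              [](const LineRange& a, const LineRange& b) {
                  return a.start_line < b.start_line;
              });
    std::vector<LineRange> merged;
    for (const LineRange& r : ranges) {
        assert(r.start_line <= r.end_line);
        if (!merged.empty() && r.start_line <= merged.back().end_line + 1) {
            merged.back().end_line = std::max(merged.back().end_line, r.end_line);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}
```

Each merged range could then be passed to a per-file reader as a start_line / end_line window.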
When DFTRACER_UTILS_ENABLE_ARROW is set, TraceReader::read_arrow()
exports record batches via the Arrow C Data Interface
(ArrowExportResult), which can be sent directly across the FFI boundary
to Python / DuckDB / Polars without a copy. The ReadConfig::flatten_objects
flag expands one level of nested JSON objects (e.g. args) into
parent.child columns with native Arrow types instead of serializing them
as JSON strings.
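The effect of flattening one level of nesting can be sketched without Arrow. In the example below, Scalar, Nested, Field, Record, and flatten_one_level are hypothetical names, and std::variant stands in for typed Arrow columns. It shows only the naming scheme (parent.child) and that child values keep their own types instead of being serialized.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>

// Hypothetical, simplified event model: a field is either a scalar or
// an object holding scalars (exactly one level of nesting).
using Scalar = std::variant<std::int64_t, double, std::string>;
using Nested = std::map<std::string, Scalar>;
using Field = std::variant<Scalar, Nested>;
using Record = std::map<std::string, Field>;

// Expands each nested object into "parent.child" columns.
// Scalar fields are copied through unchanged.
std::map<std::string, Scalar> flatten_one_level(const Record& record) {
    std::map<std::string, Scalar> columns;
    for (const auto& entry : record) {
        const std::string& key = entry.first;
        if (const Scalar* scalar = std::get_if<Scalar>(&entry.second)) {
            columns[key] = *scalar;
        } else {
            const Nested& children = std::get<Nested>(entry.second);
            for (const auto& child : children) {
                columns[key + "." + child.first] = child.second;
            }
        }
    }
    return columns;
}
```

With this model, an event carrying args = {size: 4096} produces an integer column named args.size rather than a string column named args.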
Getting Started
Read all lines from a trace file sequentially:
#include <dftracer/utils/utilities/reader/trace_reader.h>
using namespace dftracer::utils::utilities::reader;
TraceReaderConfig config;
config.file_path = "trace.pfw.gz";
config.index_dir = "/tmp/indexes";
TraceReader reader(config);
// Stream lines as an async generator
auto gen = reader.read_lines();
while (auto line = co_await gen.next()) {
// line->content is a string_view (valid until next iteration)
// line->line_number is 1-based
process(line->content);
}
Read a specific line range using an index:
TraceReaderConfig config;
config.file_path = "trace.pfw.gz";
config.index_dir = "/tmp/indexes";
TraceReader reader(config);
ReadConfig rc;
rc.start_line = 1000;
rc.end_line = 2000;
auto gen = reader.read_lines(rc);
while (auto line = co_await gen.next()) {
process(line->content);
}
Read with query-based chunk pruning:
ReadConfig rc;
rc.query = "name == 'read' AND cat == 'io'";
auto gen = reader.read_lines(rc);
while (auto line = co_await gen.next()) {
// Only lines matching the query are yielded.
// Non-matching chunks are skipped entirely when an index exists.
process(line->content);
}
Read raw byte chunks instead of parsed lines:
ReadConfig rc;
rc.start_byte = 0;
rc.end_byte = 1024 * 1024; // first 1 MiB
auto gen = reader.read_raw(rc);
while (auto chunk = co_await gen.next()) {
// chunk is std::span<const char>
write(output_fd, chunk->data(), chunk->size());
}
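When raw chunks are aligned to line boundaries (the line_aligned option of ReadConfig), a requested byte range generally has to be adjusted so that no line is split. The sketch below shows one common convention, assumed here and not confirmed by the text: a line belongs to the range in which it starts. line_aligned_slice is a hypothetical helper operating on an in-memory buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the part of `data` covering every line that STARTS inside
// [start_byte, end_byte). An end_byte of 0 means "to the end of the data".
std::string_view line_aligned_slice(std::string_view data,
                                    std::size_t start_byte,
                                    std::size_t end_byte) {
    if (end_byte == 0 || end_byte > data.size()) end_byte = data.size();
    if (start_byte >= end_byte) return {};

    // If the range begins mid-line, skip ahead to the next line start;
    // the partial line belongs to the preceding range.
    if (start_byte > 0 && data[start_byte - 1] != '\n') {
        const std::size_t nl = data.find('\n', start_byte);
        if (nl == std::string_view::npos) return {};
        start_byte = nl + 1;
    }

    // If the range ends mid-line, extend it through that line's newline.
    if (end_byte < data.size() && data[end_byte - 1] != '\n') {
        const std::size_t nl = data.find('\n', end_byte);
        end_byte = (nl == std::string_view::npos) ? data.size() : nl + 1;
    }

    if (start_byte >= end_byte) return {};
    return data.substr(start_byte, end_byte - start_byte);
}
```

Under this convention, adjacent byte ranges yield adjacent, non-overlapping slices, so splitting a file into byte ranges neither loses nor duplicates lines.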
TraceReaderConfig
File-level configuration for constructing a TraceReader. Fields:
- file_path – path to the trace file (.pfw.gz or plain text)
- index_dir – directory where .idx sidecar files are stored
- checkpoint_size – checkpoint interval for index building (default 32 MB)
- auto_build_index – automatically build an index if one is missing (default false)
- index_threshold – minimum file size before auto-indexing kicks in
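The interaction between auto_build_index and index_threshold can be sketched as a simple decision function. IndexPolicy, should_build_index, and estimated_checkpoints are hypothetical names. The sketch assumes that the threshold comparison is inclusive, that "32 MB" means 32 MiB, and that checkpoints are spaced evenly over the indexed bytes; the text confirms none of these.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the index-related TraceReaderConfig fields.
struct IndexPolicy {
    std::uint64_t checkpoint_size = 32ull * 1024 * 1024;  // assumes "32 MB" means 32 MiB
    bool auto_build_index = false;                        // default stated in the text
    std::uint64_t index_threshold = 0;                    // real default not stated; placeholder
};

// Build an index only if the caller opted in, none exists yet,
// and the file is at least as large as the threshold.
bool should_build_index(const IndexPolicy& policy, bool index_exists,
                        std::uint64_t file_size) {
    return policy.auto_build_index && !index_exists &&
           file_size >= policy.index_threshold;
}

// Rough number of checkpoints needed to cover `indexed_bytes`
// at one checkpoint per interval (ceiling division).
std::uint64_t estimated_checkpoints(const IndexPolicy& policy,
                                    std::uint64_t indexed_bytes) {
    assert(policy.checkpoint_size > 0);
    return (indexed_bytes + policy.checkpoint_size - 1) / policy.checkpoint_size;
}
```

A smaller checkpoint interval gives finer-grained random access at the cost of a larger index.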
ReadConfig
Per-read configuration controlling range selection, buffering, and query
filtering. All fields have sensible defaults; pass a default-constructed
ReadConfig{} for a full sequential read.
- start_line/end_line – line range (1-indexed; 0 means beginning / end)
- start_byte/end_byte – byte range (0 means beginning / end)
- line_aligned – align raw byte chunks to line boundaries (default true)
- multi_line – allow multiple lines per raw chunk (default true)
- buffer_size – internal read buffer size (default 4 MB)
- query – query DSL string for event filtering (empty = no filter)
- chunk_prune_only – when true, the query is used only for chunk-level pruning via the index; per-line filtering is skipped (caller handles it)
- skip_pruning – skip the reader’s own chunk pruner pass; the caller’s start_line/end_line window is trusted (used by the checkpoint-level work-item dispatcher to avoid re-running ChunkPrunerUtility per item)
- flatten_objects – expand one level of nested JSON objects into parent.child columns with native Arrow types in read_arrow()
Helper methods: has_line_range() and has_byte_range() test whether
non-default range bounds have been set.
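The zero-as-sentinel convention can be sketched as follows. RangeConfig, ResolvedLines, and resolve_lines are hypothetical stand-ins for a subset of ReadConfig. The exact logic of the real has_line_range() and has_byte_range() is not shown in the text, and the end bound is assumed to be inclusive.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical subset of ReadConfig: 0 means "unset" for every bound.
struct RangeConfig {
    std::size_t start_line = 0;
    std::size_t end_line = 0;
    std::size_t start_byte = 0;
    std::size_t end_byte = 0;

    bool has_line_range() const { return start_line != 0 || end_line != 0; }
    bool has_byte_range() const { return start_byte != 0 || end_byte != 0; }
};

// Concrete 1-based line bounds; first > last denotes an empty range.
struct ResolvedLines {
    std::size_t first;
    std::size_t last;
    bool empty() const { return first > last; }
};

// Replaces the 0 sentinels with the file's actual bounds and clamps
// the end to the total line count (end assumed inclusive).
ResolvedLines resolve_lines(const RangeConfig& rc, std::size_t total_lines) {
    const std::size_t first = (rc.start_line == 0) ? 1 : rc.start_line;
    const std::size_t last =
        (rc.end_line == 0 || rc.end_line > total_lines) ? total_lines : rc.end_line;
    return ResolvedLines{first, last};
}
```

A default-constructed configuration therefore resolves to the whole file, which matches the full sequential read described above.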
TraceReader
High-level reader with automatic format detection and index support.
Constructed from a TraceReaderConfig, it probes for an .idx sidecar
at construction time and selects the optimal read strategy (sequential or
indexed) based on whether an index exists and what range the caller requests.
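The strategy choice can be pictured as a small decision function. ReadStrategy, ReadRequest, and choose_strategy are hypothetical names, and the rule shown (use the index only when it exists and the request has a range or a query) is an assumption based on the description in this section, not the library's actual logic.

```cpp
#include <cassert>

// Hypothetical labels for the two read paths described in the text.
enum class ReadStrategy { Sequential, Indexed };

// What the caller asked for; all false corresponds to a full read.
struct ReadRequest {
    bool has_line_range = false;
    bool has_byte_range = false;
    bool has_query = false;
};

// Assumed rule: the index only helps when there is something to seek to
// or prune; without an index the reader must decompress sequentially.
ReadStrategy choose_strategy(bool has_index, const ReadRequest& request) {
    const bool benefits_from_index =
        request.has_line_range || request.has_byte_range || request.has_query;
    return (has_index && benefits_from_index) ? ReadStrategy::Indexed
                                              : ReadStrategy::Sequential;
}
```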
Async generators:
- read_lines(config) – yields Line structs (content + line_number) with optional query filtering and chunk pruning
- read_json(config) – yields JsonLine records (parsed once with simdjson) for callers that would otherwise re-parse each line
- read_raw(config) – yields std::span<const char> byte chunks
- read_arrow(config, batch_size) – yields ArrowExportResult record batches via the Arrow C Data Interface (requires DFTRACER_UTILS_ENABLE_ARROW)
Metadata queries:
- has_index() – true if an .idx sidecar was found
- get_max_bytes() – decompressed size (0 if no index for compressed files)
- get_num_lines() – total line count (0 if no index)
TraceReader reader(config);
if (reader.has_index()) {
std::size_t total = reader.get_num_lines();
std::size_t bytes = reader.get_max_bytes();
}
// Full sequential read with default config
auto gen = reader.read_lines();
while (auto line = co_await gen.next()) {
process(line->content);
}