Arrow Data Infrastructure¶
See also
For complete class and member documentation, see the API Reference.
Arrow data interchange infrastructure using nanoarrow. All classes are in
the dftracer::utils::utilities::common::arrow namespace.
Guarded by DFTRACER_UTILS_ENABLE_ARROW (ON by default).
graph LR
subgraph Build["Building"]
RBB["RecordBatchBuilder"]
end
subgraph Transport["Transport"]
AER["ArrowExportResult"]
end
subgraph Write["File Output"]
IPC["IpcWriter"]
PW["PartitionWriter"]
PR["PartitionRouter"]
end
subgraph Read["File Input"]
IRD["IpcReader"]
end
RBB -->|"finish()"| AER
AER -->|"write_batch()"| IPC
AER -->|"route()"| PR
PR -->|"per-partition"| PW
AER -->|"PyCapsule"| Python["Python ArrowBatch"]
IRD -->|"read_batch()"| AER
RecordBatchBuilder¶
Type-safe columnar builder with two modes:
Static schema:
declare_schema()upfront, direct index append, no hash lookups. Best for utilityto_arrow()methods with known schemas.Dynamic schema:
add_or_get_column()discovers columns from data,end_row()backfills nulls for missing columns. Best forTraceReader.iter_arrow()with arbitrary JSON.
Once the first row has been finalized the schema is locked: subsequent
rows may only append values into the already-discovered columns, and
attempts to add new columns after the lock are rejected. This makes
batches produced by the dynamic path safe to concatenate across a
TraceReader::read_arrow() stream without re-keying.
String columns store string_view into source data for zero-copy during
build; bulk copy only at finish(). Caller must keep source data alive
until finish() returns.
ArrowExportResult¶
Move-only RAII wrapper holding nanoarrow::UniqueSchema and
nanoarrow::UniqueArray. Self-contained and safe to send across
threads and channels.
IpcWriter¶
Streaming Arrow IPC file writer. Writes .arrows files that can be
read by pyarrow, polars, DuckDB, and any Arrow-compatible tool. Supports
buffer-level compression: when built with DFTRACER_UTILS_ENABLE_ZSTD,
IpcCompression::ZSTD is available and used by default for new files,
producing pyarrow-compatible compressed IPC streams.
Guarded by DFTRACER_UTILS_ENABLE_ARROW_IPC.
IpcReader¶
Streaming Arrow IPC file reader. Mirrors IpcWriter and yields one
ArrowExportResult per record batch in the file. Supports buffer-level
ZSTD decompression compatible with pyarrow / polars / DuckDB outputs.
Guarded by DFTRACER_UTILS_ENABLE_ARROW_IPC.
PartitionWriter¶
Single-partition Arrow IPC sink with PartitionWriteStats tracking
(bytes, rows, batches). Used as the per-partition output of
PartitionRouter and individually as a thin wrapper around
IpcWriter when only one output stream is needed.
PartitionRouter¶
Multi-partition Arrow router. Takes an inbound ArrowExportResult plus
a PartitionConfig (partition key columns, output template, target
batch rows) and dispatches rows into one PartitionWriter per
partition value. Aggregates RouterWriteStats across all partitions.
Usage Example¶
#include <dftracer/utils/utilities/common/arrow/arrow.h>
using namespace dftracer::utils::utilities::common::arrow;
// Build a batch
RecordBatchBuilder builder;
builder.declare_schema({
{"id", ColumnType::INT64},
{"name", ColumnType::STRING},
{"value", ColumnType::DOUBLE},
});
builder.append_int64(0, 42);
builder.append_string(1, "hello");
builder.append_double(2, 3.14);
builder.end_row();
auto batch = builder.finish();
// Write to IPC file
IpcWriter writer;
writer.open("output.arrows");
writer.write_batch(batch);
writer.close();