Arrow Data Infrastructure

See also

For complete class and member documentation, see the API Reference.

Arrow data interchange infrastructure using nanoarrow. All classes are in the dftracer::utils::utilities::common::arrow namespace.

Guarded by DFTRACER_UTILS_ENABLE_ARROW (ON by default).

        graph LR
    subgraph Build["Building"]
        RBB["RecordBatchBuilder"]
    end

    subgraph Transport["Transport"]
        AER["ArrowExportResult"]
    end

    subgraph Write["File Output"]
        IPC["IpcWriter"]
        PW["PartitionWriter"]
        PR["PartitionRouter"]
    end

    subgraph Read["File Input"]
        IRD["IpcReader"]
    end

    RBB -->|"finish()"| AER
    AER -->|"write_batch()"| IPC
    AER -->|"route()"| PR
    PR -->|"per-partition"| PW
    AER -->|"PyCapsule"| Python["Python ArrowBatch"]
    IRD -->|"read_batch()"| AER
    

RecordBatchBuilder

Type-safe columnar builder with two modes:

  • Static schema: declare_schema() upfront, direct index append, no hash lookups. Best for utility to_arrow() methods with known schemas.

  • Dynamic schema: add_or_get_column() discovers columns from data, end_row() backfills nulls for missing columns. Best for TraceReader.iter_arrow() with arbitrary JSON.

Once the first row has been finalized the schema is locked: subsequent rows may only append values into the already-discovered columns, and attempts to add new columns after the lock are rejected. This makes batches produced by the dynamic path safe to concatenate across a TraceReader::read_arrow() stream without re-keying.

String columns store string_view into source data for zero-copy during build; bulk copy only at finish(). Caller must keep source data alive until finish() returns.

ArrowExportResult

Move-only RAII wrapper holding nanoarrow::UniqueSchema and nanoarrow::UniqueArray. Self-contained and safe to send across threads and channels.

IpcWriter

Streaming Arrow IPC file writer. Writes .arrows files that can be read by pyarrow, polars, DuckDB, and any Arrow-compatible tool. Supports buffer-level compression: when built with DFTRACER_UTILS_ENABLE_ZSTD, IpcCompression::ZSTD is available and used by default for new files, producing pyarrow-compatible compressed IPC streams.

Guarded by DFTRACER_UTILS_ENABLE_ARROW_IPC.

IpcReader

Streaming Arrow IPC file reader. Mirrors IpcWriter and yields one ArrowExportResult per record batch in the file. Supports buffer-level ZSTD decompression compatible with pyarrow / polars / DuckDB outputs.

Guarded by DFTRACER_UTILS_ENABLE_ARROW_IPC.

PartitionWriter

Single-partition Arrow IPC sink with PartitionWriteStats tracking (bytes, rows, batches). Used as the per-partition output of PartitionRouter and individually as a thin wrapper around IpcWriter when only one output stream is needed.

PartitionRouter

Multi-partition Arrow router. Takes an inbound ArrowExportResult plus a PartitionConfig (partition key columns, output template, target batch rows) and dispatches rows into one PartitionWriter per partition value. Aggregates RouterWriteStats across all partitions.

Usage Example

#include <dftracer/utils/utilities/common/arrow/arrow.h>

using namespace dftracer::utils::utilities::common::arrow;

// Build a batch
RecordBatchBuilder builder;
builder.declare_schema({
    {"id", ColumnType::INT64},
    {"name", ColumnType::STRING},
    {"value", ColumnType::DOUBLE},
});
builder.append_int64(0, 42);
builder.append_string(1, "hello");
builder.append_double(2, 3.14);
builder.end_row();
auto batch = builder.finish();

// Write to IPC file
IpcWriter writer;
writer.open("output.arrows");
writer.write_batch(batch);
writer.close();