Arrow Data Infrastructure
=========================

.. seealso::

   For complete class and member documentation, see the :doc:`API Reference `.

Arrow data interchange infrastructure using nanoarrow. All classes are in the
``dftracer::utils::utilities::common::arrow`` namespace.

Guarded by ``DFTRACER_UTILS_ENABLE_ARROW`` (ON by default).

.. mermaid::

   graph LR
       subgraph Build["Building"]
           RBB["RecordBatchBuilder"]
       end
       subgraph Transport["Transport"]
           AER["ArrowExportResult"]
       end
       subgraph Write["File Output"]
           IPC["IpcWriter"]
           PW["PartitionWriter"]
           PR["PartitionRouter"]
       end
       subgraph Read["File Input"]
           IRD["IpcReader"]
       end

       RBB -->|"finish()"| AER
       AER -->|"write_batch()"| IPC
       AER -->|"route()"| PR
       PR -->|"per-partition"| PW
       AER -->|"PyCapsule"| Python["Python ArrowBatch"]
       IRD -->|"read_batch()"| AER

RecordBatchBuilder
------------------

Type-safe columnar builder with two modes:

- **Static schema**: ``declare_schema()`` upfront, direct index append, no hash
  lookups. Best for utility ``to_arrow()`` methods with known schemas.
- **Dynamic schema**: ``add_or_get_column()`` discovers columns from data,
  ``end_row()`` backfills nulls for missing columns. Best for
  ``TraceReader.iter_arrow()`` with arbitrary JSON.

Once the first row has been finalized the schema is **locked**: subsequent rows
may only append values into the already-discovered columns, and attempts to add
new columns after the lock are rejected. This makes batches produced by the
dynamic path safe to concatenate across a ``TraceReader::read_arrow()`` stream
without re-keying.

String columns store ``string_view`` into source data for zero-copy during
build; bulk copy only at ``finish()``. Caller must keep source data alive until
``finish()`` returns.

ArrowExportResult
-----------------

Move-only RAII wrapper holding ``nanoarrow::UniqueSchema`` and
``nanoarrow::UniqueArray``. Self-contained and safe to send across threads and
channels.

IpcWriter
---------

Streaming Arrow IPC file writer.
Writes ``.arrows`` files that can be read by pyarrow, polars, DuckDB, and any
Arrow-compatible tool.

Supports buffer-level compression: when built with
``DFTRACER_UTILS_ENABLE_ZSTD``, ``IpcCompression::ZSTD`` is available and used
by default for new files, producing pyarrow-compatible compressed IPC streams.

Guarded by ``DFTRACER_UTILS_ENABLE_ARROW_IPC``.

IpcReader
---------

Streaming Arrow IPC file reader. Mirrors ``IpcWriter`` and yields one
``ArrowExportResult`` per record batch in the file. Supports buffer-level ZSTD
decompression compatible with pyarrow / polars / DuckDB outputs.

Guarded by ``DFTRACER_UTILS_ENABLE_ARROW_IPC``.

PartitionWriter
---------------

Single-partition Arrow IPC sink with ``PartitionWriteStats`` tracking (bytes,
rows, batches). Used as the per-partition output of ``PartitionRouter`` and
individually as a thin wrapper around ``IpcWriter`` when only one output stream
is needed.

PartitionRouter
---------------

Multi-partition Arrow router. Takes an inbound ``ArrowExportResult`` plus a
``PartitionConfig`` (partition key columns, output template, target batch rows)
and dispatches rows into one ``PartitionWriter`` per partition value.
Aggregates ``RouterWriteStats`` across all partitions.

Usage Example
-------------

.. code-block:: cpp

   #include

   using namespace dftracer::utils::utilities::common::arrow;

   // Build a batch
   RecordBatchBuilder builder;
   builder.declare_schema({
       {"id", ColumnType::INT64},
       {"name", ColumnType::STRING},
       {"value", ColumnType::DOUBLE},
   });

   builder.append_int64(0, 42);
   builder.append_string(1, "hello");
   builder.append_double(2, 3.14);
   builder.end_row();

   auto batch = builder.finish();

   // Write to IPC file
   IpcWriter writer;
   writer.open("output.arrows");
   writer.write_batch(batch);
   writer.close();
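
The lifetime rule for string columns described under ``RecordBatchBuilder``
(views during build, one bulk copy at ``finish()``) can be illustrated with a
small stand-alone sketch. ``ToyStringColumn``, ``Finished`` and
``build_from_temporary_source()`` are hypothetical names used only for
illustration; they are not part of the dftracer API and only mirror the
behaviour described above.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <string>
   #include <string_view>
   #include <vector>

   // Hypothetical toy column, NOT the real RecordBatchBuilder.
   class ToyStringColumn {
   public:
       // Owning result: one contiguous byte buffer plus offsets,
       // similar in spirit to an Arrow string array.
       struct Finished {
           std::string data;
           std::vector<std::int32_t> offsets;

           std::size_t size() const {
               return offsets.empty() ? 0 : offsets.size() - 1;
           }

           std::string_view at(std::size_t i) const {
               assert(i + 1 < offsets.size());
               const auto begin = static_cast<std::size_t>(offsets[i]);
               const auto end = static_cast<std::size_t>(offsets[i + 1]);
               return std::string_view(data).substr(begin, end - begin);
           }
       };

       // Build phase: only a view is stored, no bytes are copied.
       // The caller must keep the viewed memory alive until finish().
       void append(std::string_view value) { views_.push_back(value); }

       // Finish phase: a single bulk copy of every viewed value.
       Finished finish() {
           Finished out;
           std::size_t total = 0;
           for (std::string_view v : views_) total += v.size();
           out.data.reserve(total);
           out.offsets.reserve(views_.size() + 1);
           out.offsets.push_back(0);
           for (std::string_view v : views_) {
               out.data.append(v.data(), v.size());
               out.offsets.push_back(static_cast<std::int32_t>(out.data.size()));
           }
           views_.clear();
           return out;
       }

   private:
       std::vector<std::string_view> views_;
   };

   // The source buffer lives only inside this function. That is safe here
   // because finish() runs before the buffer is destroyed.
   ToyStringColumn::Finished build_from_temporary_source() {
       std::string source = "hello,world";
       std::string_view line(source);

       ToyStringColumn column;
       column.append(line.substr(0, 5));  // view of "hello"
       column.append(line.substr(6));     // view of "world"

       return column.finish();            // bytes copied while source is alive
   }

Calling ``finish()`` after ``source`` has gone out of scope would read dangling
views in this sketch, which is the situation the lifetime rule is meant to
prevent.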