DFAnalyzer Module¶
The dftracer.utils.dfanalyzer module bridges the C++ aggregation index to
dfanalyzer. It provides
the index-build, Arrow-IPC marshalling, and distributed high-level-metrics
(HLM) helpers that dfanalyzer drives over a Dask cluster.
These helpers only depend on the Indexer and the Arrow
plumbing, so they live in dftracer-utils rather than being vendored inside
dfanalyzer.
Dask is an optional dependency – the distributed helpers require
dask.distributed to be installed.
Index Building¶
Arrow IPC Marshalling¶
The C extension yields Arrow data as PyCapsules. These helpers convert between capsules, Arrow IPC byte streams (the wire format moved between Dask workers), and pandas frames.
Distributed High-Level Metrics¶
The HLM pipeline aggregates the per-worker aggregation column family into a Dask DataFrame. Each worker owns a disjoint PID set, so per-worker partials have disjoint keys and need no cross-worker merge.
View Groupby Partials¶
These helpers implement mergeable per-partition view aggregation: each partition emits partial aggregates (sum, count, min, max, sum-of-squares) that are combined and finalized into mean/std without a global shuffle.
Dtype Coercion¶
Utilities that normalize Arrow-backed dtypes into the pandas-native dtypes
expected by dfanalyzer’s downstream metrics.py.