DFAnalyzer Module

The dftracer.utils.dfanalyzer module bridges the C++ aggregation index to dfanalyzer. It provides the index-build, Arrow-IPC marshalling, and distributed high-level-metrics (HLM) helpers that dfanalyzer drives over a Dask cluster.

These helpers depend only on the Indexer and the Arrow plumbing, so they live in dftracer-utils rather than being vendored inside dfanalyzer.

Dask is an optional dependency: the distributed helpers require dask.distributed to be installed.
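A caller can check for the optional dependency before invoking the distributed helpers. The following is a minimal sketch of that pattern, not the module's own guard: the function name load_dask_distributed and the error message are hypothetical.

```python
import importlib


def load_dask_distributed():
    """Return the dask.distributed module, or raise a descriptive ImportError.

    Hypothetical helper for illustration; the module may guard its
    optional import differently.
    """
    try:
        return importlib.import_module("dask.distributed")
    except ImportError as exc:
        raise ImportError(
            "The distributed helpers need dask.distributed; "
            "install it with: pip install 'dask[distributed]'"
        ) from exc
```

Deferring the import to call time keeps the non-distributed helpers usable on installations without Dask.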

Index Building

Arrow IPC Marshalling

The C extension yields Arrow data as PyCapsules. These helpers convert between capsules, Arrow IPC byte streams (the wire format moved between Dask workers), and pandas frames.

Distributed High-Level Metrics

The HLM pipeline aggregates the per-worker aggregation column family into a Dask DataFrame. Each worker owns a disjoint PID set, so per-worker partials have disjoint keys and need no cross-worker merge.
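The disjoint-key property can be illustrated in plain Python. This sketch makes assumptions the text does not state: PIDs are assigned to workers by modulo, the aggregation key is (pid, name), and the only metric is a summed duration. All function names are hypothetical.

```python
from collections import defaultdict


def assign_pids(pids, n_workers):
    """Split PIDs into disjoint per-worker sets (modulo rule is an assumption)."""
    owned = [set() for _ in range(n_workers)]
    for pid in pids:
        owned[pid % n_workers].add(pid)
    return owned


def worker_partial(events, owned_pids):
    """Sum durations per (pid, name) for the PIDs this worker owns."""
    totals = defaultdict(int)
    for pid, name, dur in events:
        if pid in owned_pids:
            totals[(pid, name)] += dur
    return dict(totals)


def concat_partials(partials):
    """Concatenate partials, failing loudly if two workers share a key."""
    combined = {}
    for partial in partials:
        overlap = combined.keys() & partial.keys()
        if overlap:
            raise ValueError(f"keys owned by more than one worker: {overlap}")
        combined.update(partial)
    return combined


events = [(100, "read", 5), (101, "read", 7), (100, "read", 3), (102, "write", 4)]
owned = assign_pids({pid for pid, _, _ in events}, n_workers=2)
partials = [worker_partial(events, pids) for pids in owned]
combined = concat_partials(partials)
```

Because every key contains the PID and each PID belongs to exactly one worker, the final step is a concatenation rather than a merge.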

View Groupby Partials

These helpers implement mergeable per-partition view aggregation: each partition emits partial aggregates (sum, count, min, max, sum-of-squares) that are combined and finalized into mean/std without a global shuffle.
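The combine-and-finalize step can be sketched in plain Python for a single group. The Partial class and function names are hypothetical, and the sample standard deviation (ddof=1, the pandas default) is an assumption; the text does not say which convention the helpers use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    total: float
    count: int
    minimum: float
    maximum: float
    sum_sq: float


def partial_from_values(values):
    """Build the partial aggregate for one non-empty partition."""
    values = list(values)
    return Partial(
        total=sum(values),
        count=len(values),
        minimum=min(values),
        maximum=max(values),
        sum_sq=sum(v * v for v in values),
    )


def combine(a, b):
    """Merge two partials; associative and commutative, so order is free."""
    return Partial(
        total=a.total + b.total,
        count=a.count + b.count,
        minimum=min(a.minimum, b.minimum),
        maximum=max(a.maximum, b.maximum),
        sum_sq=a.sum_sq + b.sum_sq,
    )


def finalize(p, ddof=1):
    """Derive mean and std from the merged sums (ddof=1 is an assumption)."""
    mean = p.total / p.count
    if p.count - ddof <= 0:
        return mean, math.nan
    # Clamp at zero: rounding can push the variance slightly negative.
    variance = max((p.sum_sq - p.total * p.total / p.count) / (p.count - ddof), 0.0)
    return mean, math.sqrt(variance)


merged = combine(partial_from_values([1, 2, 3, 4]), partial_from_values([5, 6]))
mean, std = finalize(merged)
```

Each partial is a fixed-size record regardless of partition size, which is why no global shuffle is needed. The sum-of-squares formula can lose precision when the mean is large relative to the spread.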

Dtype Coercion

Utilities that normalize Arrow-backed dtypes into the pandas-native dtypes expected by dfanalyzer’s downstream metrics.py.