AI/DL Logging¶

In this document, we detail how to use DFTracer AI Logging style.

Motivations¶

Since DFTracer’s release, we’ve successfully traced numerous AI/DL pipelines. However, analysis revealed that the resulting traces differ widely across workloads.

This inconsistency is largely due to varied naming schemes used by different users. Even when the intent is similar, the lack of a standard makes it hard to build analysis tools that work reliably across use cases.

This API aims to introduce consistent annotation conventions to help users instrument their code more uniformly. With these standards in place, tools like DFAnalyzer can operate more effectively — they will just work™, reducing fatigue for researchers and developers analyzing AI/DL workloads.

Import¶

from dftracer.logger import ai

AI/DL Conventions¶

We currently define six categories of logging. Each category, along with its subcategories (children), is implemented as a wrapper around dft_fn (see Function decorator style profiling).

This means you can use these categories (along with its children) in your codebase in the same way you would use dft_fn directly.

The table below provides a breakdown of the conventions and how they can be applied in your code:

AI/DL Logging Conventions¶
Category	Name	Access Path	Description
Compute	Forward	`ai.compute.forward`	Forward pass of the network
	Backward	`ai.compute.backward`	Backward pass / gradient computation
	Step	`ai.compute.step`	Optimizer step (parameter update)
Data	Preprocess	`ai.data.preprocess`	Dataset-level preprocessing
	Item	`ai.data.item`	Per-item transformation or loading
DataLoader	Fetch	`ai.dataloader.fetch`	Fetch a batch from DataLoader
Comm	Send	`ai.comm.send`	Point-to-point send
	Receive	`ai.comm.receive`	Point-to-point receive
	Barrier	`ai.comm.barrier`	Synchronization barrier
	Broadcast	`ai.comm.bcast`	Broadcast (one-to-many)
	Reduce	`ai.comm.reduce`	Reduce (many-to-one)
	All-Reduce	`ai.comm.all_reduce`	All-reduce (many-to-many)
	Gather	`ai.comm.gather`	Gather (many-to-one)
	All-Gather	`ai.comm.all_gather`	All-gather (many-to-many)
	Scatter	`ai.comm.scatter`	Scatter (one-to-many)
	Reduce-Scatter	`ai.comm.reduce_scatter`	Reduce-scatter (many-to-many)
	All-to-All	`ai.comm.all_to_all`	All-to-all (many-to-many)
Device	Transfer	`ai.device.transfer`	Host-to-device or device-to-host memory transfer
Checkpoint	Capture	`ai.checkpoint.capture`	Capture a model checkpoint
	Restart	`ai.checkpoint.restart`	Restart a model checkpoint
Pipeline	Epoch	`ai.pipeline.epoch`	An entire training or evaluation epoch
	Train	`ai.pipeline.train`	Training phase
	Evaluate	`ai.pipeline.evaluate`	Evaluation or validation phase
	Test	`ai.pipeline.test`	Testing or inference phase

Usage¶

To use these conventions, you can annotate your code as follows:

from dftracer.logger import ai, dftracer

@ai.compute.forward
def forward(model, x):
    loss = model(x)
    return loss

@ai.compute.backward
def backward(model, loss):
    with ai.comm.all_reduce:
        loss.backward()

@ai.compute # or @ai.compute.step if you want to be specific
def compute(model, x, optimizer):
    loss = forward(model, x)
    backward(model, loss)

@ai.data.preprocess
def preprocess(data):
    # Preprocessing logic
    pass

@ai.dataloader.fetch
def transfer_to_gpu(batch, device):
    batch = batch.to(device)
    pass

@ai.pipeline.train
def train(model, dataloader, optimizer, device, num_epoch):
    for epoch in ai.pipeline.epoch.iter(range(num_epoch)):
        for batch in ai.dataloader.fetch.iter(dataloader):
            x, y = transfer_to_gpu(batch, device)
            compute(model, x, optimizer)
            # Additional training logic

def main():
    model = ...  # Initialize your model
    dataloader = ...  # Initialize your DataLoader
    optimizer = ...  # Initialize your optimizer
    device = ...  # Set your device (CPU/GPU)
    num_epoch = 10  # Set number of epochs

    # initialize dftracer
    df_logger = dftracer.initialize_log(...)
    train(model, dataloader, optimizer, device, num_epoch)
    df_logger.finalize()

Extra APIs¶

Context Manager, decorator, and iterable APIs, you name it!¶

DFTracer AI Logging provides flexible APIs to match different coding styles. You can use decorators, context managers, or iterable wrappers to annotate your code cleanly and consistently.

Decorator Style¶

Without arguments — use it directly to wrap a function:

@ai.compute.forward
def forward(model, x):
    loss = model(x)
    return loss

With arguments — pass metadata to the event:

@ai.compute.forward(args={"arg1": "value1", "arg2": "value2"})
def forward(model, x):
    loss = model(x)
    return loss

Context Manager Style¶

Use it to wrap blocks of code inside a with statement:

Without arguments:

with ai.compute.forward:
    loss = model(x)

With arguments:

with ai.compute.forward(args={"arg1": "value1", "arg2": "value2"}):
    loss = model(x)

Iterable Style¶

You can also wrap iterators like data loaders:

for batch in ai.dataloader.fetch.iter(dataloader):
    ...

Constructor Hooking¶

You can annotate constructors directly using category-specific hooks:

class MyDataset:
    @ai.data.item.init  # special `init` event for this category
    def __init__(self, ...):
        # Initialization logic
        pass

Updating Arguments¶

Updating arguments is simple. Every profiler (like ai.compute.forward) provides an update method to dynamically change metadata. These updates apply to the entire subtree of that event.

@ai.compute.forward
def forward(model, x):
    loss = model(x)
    return loss

for epoch in ai.pipeline.epoch.iter(range(num_epoch)):
    for step, batch in ai.dataloader.fetch.iter(enumerate(dataloader)):
        # Update metadata for the current context
        ai.compute.forward.update(epoch=epoch, step=step)
        forward(model, batch)

Disabling/Enabling Logging¶

You can turn logging off or on for the entire tree or individual categories:

# Disable the entire logging tree
ai.disable()

# Disable specific categories
ai.compute.disable()
ai.data.disable()
ai.dataloader.disable()
ai.comm.disable()
ai.device.disable()
ai.pipeline.disable()

# Re-enable specific categories
ai.compute.enable()
ai.data.enable()
ai.dataloader.enable()
ai.comm.enable()
ai.device.enable()
ai.pipeline.enable()

# Enable everything back
ai.enable()

Force Enable or Disable Specific Events¶

You can override the global or category-level logging state for individual events by setting the enable flag explicitly.

This is useful when you want to selectively trace or skip certain operations, regardless of whether the category is globally enabled or disabled.

ai.compute.disable()  # Disable all compute events

@ai.compute.forward(enable=True)  # Force-enable this specific event
def forward(model, x):
    loss = model(x)
    return loss

with ai.compute.backward(enable=True):  # Force-enable this block
    loss.backward()

ai.compute.enable()  # Enable all compute events

@ai.compute.forward(enable=False)  # Force-disable this one
def forward(model, x):
    loss = model(x)
    return loss

with ai.compute.backward(enable=False):  # Force-disable this block
    loss.backward()

Updating arguments¶

Updating arguments is simple. Each profiler (like ai.compute.forward) has an update method that lets you change its arguments — similar to how you would pass arguments to dft_fn. These updates automatically apply to the whole category or its subtree.

Example:

@ai.compute.forward
def forward(model, x):
    loss = model(x)
    return loss

for epoch in ai.pipeline.epoch.iter(range(num_epoch)):
    for step, batch in ai.dataloader.fetch.iter(enumerate(dataloader)):
        # Add context to the forward trace
        ai.compute.forward.update(epoch=epoch, step=step)
        forward(model, batch)

Hook/Checkpoint Style¶

Sometimes you need to attach profilers to hooks (e.g., TensorFlow SessionHook) where you can’t use decorators or context managers directly.

For these cases, you can manually call the profiler methods:

class DFTracerProfilingHook(tf.train.SessionRunHook):
    def begin(self):
        self._global_step_tensor = training_util._get_or_create_global_step_read()
        if self._global_step_tensor is None:
            raise RuntimeError("Global step should be created to use ProfilerHook.")
        ai.pipeline.epoch.start()

    def end(self, session):
        ai.pipeline.epoch.stop()

    def before_run(self, run_context):
        global_step = run_context.session.run(self._global_step_tensor)
        ai.update(step=global_step)
        ai.compute.start()

    def after_run(self, run_context, run_values):
        ai.compute.stop()

Derivation¶

Since sometimes our logging needs to be more dynamic, you can derive new profilers from existing ones. This is useful when you want to create a specialized profiler with the same context as an existing one.

The derived profiler becomes a child of the original profiler, inheriting its context and metadata. All methods like update, enable, and disable work on the derived profiler as expected.

Example:

class Dataset:
    def __getitem__(self, idx: int):
        data = ...
        with ai.data.preprocess:
            # do something with data
            ...
        return data

# this will become name="preprocess.collate" with cat="data"
@ai.data.preprocess.derive(name="collate")
def collate(batch):
    # Collate the batch
    return batch

# OR (context-manager style)

profiler_collate = ai.data.preprocess.derive(name="collate")

def collate_fn(batch):
    with profiler_collate:
        return collate(batch)

# Update
profiler_collate.update(epoch=epoch)
# This also works:
## this will update all children of ai.data.preprocess
## including the derived profiler such as `collate`
ai.data.preprocess.update(epoch=epoch)

As metadata / streaming style¶

By default, DFTracer logs events with a start and end time (duration-based logging). But sometimes you want to log events immediately as they happen, without waiting for them to finish.

This is useful for real-time monitoring or when you need immediate feedback.

To enable metadata mode, use metadata=True:

# Regular
for epoch in ai.pipeline.epoch.iter(range(num_epochs)):
    for step in range(num_steps):
        # Do some work
        ...

# As metadata
for epoch in range(num_epochs):
    ai.pipeline.epoch.start(metadata=True)
    for step in range(num_steps):
        # Do some work
    ai.pipeline.epoch.stop(metadata=True)

Regular mode output:

{"id":27,"name":"epoch.block","cat":"pipeline","pid":2877353,"tid":2877353,"ts":1753123213646764,"dur":828765,"ph":"X","args":{"hhash":"2a702c695247d487","p_idx":6,"count":"1","level":2}}
...
{"id":69,"name":"epoch.block","cat":"pipeline","pid":2877353,"tid":2877353,"ts":1753123215361535,"dur":819403,"ph":"X","args":{"hhash":"2a702c695247d487","p_idx":6,"count":"3","level":2}}

Metadata mode output:

{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.end","value":"1753123070219202"}}
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.start","value":"1753123070219648"}}
...
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.end","value":"1753123071041297"}}
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.start","value":"1753123071041678"}}

The key difference: metadata mode logs events instantly, while regular mode waits until the event completes to log the duration.

Init event¶

Sometimes you need to log the initialization of a process or component.

This is useful for tracking startup phases and initialization overhead.

To log an init event, use the init method:

class Checkpoint:
    @ai.checkpoint.init
    def __init__(self):
        # Initialize something
        ...

# or

with ai.checkpoint.init:
    # Initialize something
    ...

This will output:

{"id":7,"name":"checkpoint.init","cat":"checkpoint","pid":444541,"tid":444541,"ts":1753136835509693,"dur":100583,"ph":"X","args":{"hhash":"2a702c695247d487","p_idx":6,"level":2}}
...

The event name becomes checkpoint.init (not just init) to avoid conflicts with other events. This namespacing keeps events organized under their proper categories. We could add a separate init category to our tree, but that would be overkill for something that may not be needed in most codebases.

Caveats¶

Call ordering matters for enable/disable

The order of calls can affect whether events get logged or not.

This works as expected:

class Checkpoint:
    @ai.checkpoint.init # <-- this instance is tracked internally
    def __init__(self):
        # Initialize something
        ...

if __name__ == "__main__":
    ai.checkpoint.disable()  # Disables all checkpoint events

This syntax sugar doesn’t work as expected:

class Checkpoint:
    @ai.checkpoint.init()  # Notice the parentheses
    def __init__(self):
        ...

if __name__ == "__main__":
    ai.checkpoint.disable()  # This won't disable the event above

Why?

When you add parentheses (), the decorator creates a new instance immediately. Since disable() is called later, it can’t affect the already-created instance.

Solution:

Use the decorator without parentheses, or call disable() before defining your class.
Only use parentheses () when you need to force enable/disable a specific event
To add metadata, use the update() method instead
To create variations of an event, use the derive() method instead