AI/DL Logging¶
In this document, we detail how to use DFTracer AI Logging style.
Motivations¶
Since DFTracer’s release, we’ve successfully traced numerous AI/DL pipelines. However, analysis revealed that the resulting traces differ widely across workloads.
This inconsistency is largely due to varied naming schemes used by different users. Even when the intent is similar, the lack of a standard makes it hard to build analysis tools that work reliably across use cases.
This API aims to introduce consistent annotation conventions to help users instrument their code more uniformly. With these standards in place, tools like DFAnalyzer can operate more effectively — they will just work™, reducing fatigue for researchers and developers analyzing AI/DL workloads.
Import¶
from dftracer.logger import ai
AI/DL Conventions¶
We currently define six categories of logging. Each category, along with its subcategories (children), is implemented as a wrapper around dft_fn (see Function decorator style profiling).
This means you can use these categories (along with its children) in your codebase in the same way you would use dft_fn directly.
The table below provides a breakdown of the conventions and how they can be applied in your code:
Category |
Name |
Access Path |
Description |
|---|---|---|---|
Compute |
Forward |
|
Forward pass of the network |
Backward |
|
Backward pass / gradient computation |
|
Step |
|
Optimizer step (parameter update) |
|
Data |
Preprocess |
|
Dataset-level preprocessing |
Item |
|
Per-item transformation or loading |
|
DataLoader |
Fetch |
|
Fetch a batch from DataLoader |
Comm |
Send |
|
Point-to-point send |
Receive |
|
Point-to-point receive |
|
Barrier |
|
Synchronization barrier |
|
Broadcast |
|
Broadcast (one-to-many) |
|
Reduce |
|
Reduce (many-to-one) |
|
All-Reduce |
|
All-reduce (many-to-many) |
|
Gather |
|
Gather (many-to-one) |
|
All-Gather |
|
All-gather (many-to-many) |
|
Scatter |
|
Scatter (one-to-many) |
|
Reduce-Scatter |
|
Reduce-scatter (many-to-many) |
|
All-to-All |
|
All-to-all (many-to-many) |
|
Device |
Transfer |
|
Host-to-device or device-to-host memory transfer |
Checkpoint |
Capture |
|
Capture a model checkpoint |
Restart |
|
Restart a model checkpoint |
|
Pipeline |
Epoch |
|
An entire training or evaluation epoch |
Train |
|
Training phase |
|
Evaluate |
|
Evaluation or validation phase |
|
Test |
|
Testing or inference phase |
Usage¶
To use these conventions, you can annotate your code as follows:
from dftracer.logger import ai, dftracer
@ai.compute.forward
def forward(model, x):
loss = model(x)
return loss
@ai.compute.backward
def backward(model, loss):
with ai.comm.all_reduce:
loss.backward()
@ai.compute # or @ai.compute.step if you want to be specific
def compute(model, x, optimizer):
loss = forward(model, x)
backward(model, loss)
@ai.data.preprocess
def preprocess(data):
# Preprocessing logic
pass
@ai.dataloader.fetch
def transfer_to_gpu(batch, device):
batch = batch.to(device)
pass
@ai.pipeline.train
def train(model, dataloader, optimizer, device, num_epoch):
for epoch in ai.pipeline.epoch.iter(range(num_epoch)):
for batch in ai.dataloader.fetch.iter(dataloader):
x, y = transfer_to_gpu(batch, device)
compute(model, x, optimizer)
# Additional training logic
def main():
model = ... # Initialize your model
dataloader = ... # Initialize your DataLoader
optimizer = ... # Initialize your optimizer
device = ... # Set your device (CPU/GPU)
num_epoch = 10 # Set number of epochs
# initialize dftracer
df_logger = dftracer.initialize_log(...)
train(model, dataloader, optimizer, device, num_epoch)
df_logger.finalize()
Extra APIs¶
Context Manager, decorator, and iterable APIs, you name it!¶
DFTracer AI Logging provides flexible APIs to match different coding styles. You can use decorators, context managers, or iterable wrappers to annotate your code cleanly and consistently.
Decorator Style¶
Without arguments — use it directly to wrap a function:
@ai.compute.forward
def forward(model, x):
loss = model(x)
return loss
With arguments — pass metadata to the event:
@ai.compute.forward(args={"arg1": "value1", "arg2": "value2"})
def forward(model, x):
loss = model(x)
return loss
Context Manager Style¶
Use it to wrap blocks of code inside a with statement:
Without arguments:
with ai.compute.forward:
loss = model(x)
With arguments:
with ai.compute.forward(args={"arg1": "value1", "arg2": "value2"}):
loss = model(x)
Iterable Style¶
You can also wrap iterators like data loaders:
for batch in ai.dataloader.fetch.iter(dataloader):
...
Constructor Hooking¶
You can annotate constructors directly using category-specific hooks:
class MyDataset:
@ai.data.item.init # special `init` event for this category
def __init__(self, ...):
# Initialization logic
pass
Updating Arguments¶
Updating arguments is simple. Every profiler (like ai.compute.forward) provides an update method to
dynamically change metadata. These updates apply to the entire subtree of that event.
@ai.compute.forward
def forward(model, x):
loss = model(x)
return loss
for epoch in ai.pipeline.epoch.iter(range(num_epoch)):
for step, batch in ai.dataloader.fetch.iter(enumerate(dataloader)):
# Update metadata for the current context
ai.compute.forward.update(epoch=epoch, step=step)
forward(model, batch)
Disabling/Enabling Logging¶
You can turn logging off or on for the entire tree or individual categories:
# Disable the entire logging tree
ai.disable()
# Disable specific categories
ai.compute.disable()
ai.data.disable()
ai.dataloader.disable()
ai.comm.disable()
ai.device.disable()
ai.pipeline.disable()
# Re-enable specific categories
ai.compute.enable()
ai.data.enable()
ai.dataloader.enable()
ai.comm.enable()
ai.device.enable()
ai.pipeline.enable()
# Enable everything back
ai.enable()
Force Enable or Disable Specific Events¶
You can override the global or category-level logging state for individual events by setting the enable flag explicitly.
This is useful when you want to selectively trace or skip certain operations, regardless of whether the category is globally enabled or disabled.
ai.compute.disable() # Disable all compute events
@ai.compute.forward(enable=True) # Force-enable this specific event
def forward(model, x):
loss = model(x)
return loss
with ai.compute.backward(enable=True): # Force-enable this block
loss.backward()
ai.compute.enable() # Enable all compute events
@ai.compute.forward(enable=False) # Force-disable this one
def forward(model, x):
loss = model(x)
return loss
with ai.compute.backward(enable=False): # Force-disable this block
loss.backward()
Updating arguments¶
Updating arguments is simple. Each profiler (like ai.compute.forward) has an update method
that lets you change its arguments — similar to how you would pass arguments to dft_fn.
These updates automatically apply to the whole category or its subtree.
Example:
@ai.compute.forward
def forward(model, x):
loss = model(x)
return loss
for epoch in ai.pipeline.epoch.iter(range(num_epoch)):
for step, batch in ai.dataloader.fetch.iter(enumerate(dataloader)):
# Add context to the forward trace
ai.compute.forward.update(epoch=epoch, step=step)
forward(model, batch)
Hook/Checkpoint Style¶
Sometimes you need to attach profilers to hooks (e.g., TensorFlow SessionHook) where you can’t use decorators or context managers directly.
For these cases, you can manually call the profiler methods:
class DFTracerProfilingHook(tf.train.SessionRunHook):
def begin(self):
self._global_step_tensor = training_util._get_or_create_global_step_read()
if self._global_step_tensor is None:
raise RuntimeError("Global step should be created to use ProfilerHook.")
ai.pipeline.epoch.start()
def end(self, session):
ai.pipeline.epoch.stop()
def before_run(self, run_context):
global_step = run_context.session.run(self._global_step_tensor)
ai.update(step=global_step)
ai.compute.start()
def after_run(self, run_context, run_values):
ai.compute.stop()
Derivation¶
Since sometimes our logging needs to be more dynamic, you can derive new profilers from existing ones. This is useful when you want to create a specialized profiler with the same context as an existing one.
The derived profiler becomes a child of the original profiler, inheriting its context and metadata.
All methods like update, enable, and disable work on the derived profiler as expected.
Example:
class Dataset:
def __getitem__(self, idx: int):
data = ...
with ai.data.preprocess:
# do something with data
...
return data
# this will become name="preprocess.collate" with cat="data"
@ai.data.preprocess.derive(name="collate")
def collate(batch):
# Collate the batch
return batch
# OR (context-manager style)
profiler_collate = ai.data.preprocess.derive(name="collate")
def collate_fn(batch):
with profiler_collate:
return collate(batch)
# Update
profiler_collate.update(epoch=epoch)
# This also works:
## this will update all children of ai.data.preprocess
## including the derived profiler such as `collate`
ai.data.preprocess.update(epoch=epoch)
As metadata / streaming style¶
By default, DFTracer logs events with a start and end time (duration-based logging). But sometimes you want to log events immediately as they happen, without waiting for them to finish.
This is useful for real-time monitoring or when you need immediate feedback.
To enable metadata mode, use metadata=True:
# Regular
for epoch in ai.pipeline.epoch.iter(range(num_epochs)):
for step in range(num_steps):
# Do some work
...
# As metadata
for epoch in range(num_epochs):
ai.pipeline.epoch.start(metadata=True)
for step in range(num_steps):
# Do some work
ai.pipeline.epoch.stop(metadata=True)
Regular mode output:
{"id":27,"name":"epoch.block","cat":"pipeline","pid":2877353,"tid":2877353,"ts":1753123213646764,"dur":828765,"ph":"X","args":{"hhash":"2a702c695247d487","p_idx":6,"count":"1","level":2}}
...
{"id":69,"name":"epoch.block","cat":"pipeline","pid":2877353,"tid":2877353,"ts":1753123215361535,"dur":819403,"ph":"X","args":{"hhash":"2a702c695247d487","p_idx":6,"count":"3","level":2}}
Metadata mode output:
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.end","value":"1753123070219202"}}
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.start","value":"1753123070219648"}}
...
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.end","value":"1753123071041297"}}
{"id":6,"name":"CM","cat":"dftracer","pid":2876815,"tid":2876815,"ph":"M","args":{"hhash":"2a702c695247d487","name":"epoch.start","value":"1753123071041678"}}
The key difference: metadata mode logs events instantly, while regular mode waits until the event completes to log the duration.
Init event¶
Sometimes you need to log the initialization of a process or component.
This is useful for tracking startup phases and initialization overhead.
To log an init event, use the init method:
class Checkpoint:
@ai.checkpoint.init
def __init__(self):
# Initialize something
...
# or
with ai.checkpoint.init:
# Initialize something
...
This will output:
{"id":7,"name":"checkpoint.init","cat":"checkpoint","pid":444541,"tid":444541,"ts":1753136835509693,"dur":100583,"ph":"X","args":{"hhash":"2a702c695247d487","p_idx":6,"level":2}}
...
The event name becomes checkpoint.init (not just init) to avoid conflicts with other events.
This namespacing keeps events organized under their proper categories.
We could add a separate init category to our tree, but that would be overkill for something that may not be needed in most codebases.
Caveats¶
Call ordering matters for enable/disable
The order of calls can affect whether events get logged or not.
This works as expected:
class Checkpoint:
@ai.checkpoint.init # <-- this instance is tracked internally
def __init__(self):
# Initialize something
...
if __name__ == "__main__":
ai.checkpoint.disable() # Disables all checkpoint events
This syntax sugar doesn’t work as expected:
class Checkpoint:
@ai.checkpoint.init() # Notice the parentheses
def __init__(self):
...
if __name__ == "__main__":
ai.checkpoint.disable() # This won't disable the event above
Why?
When you add parentheses (), the decorator creates a new instance immediately.
Since disable() is called later, it can’t affect the already-created instance.
Solution:
Use the decorator without parentheses, or call
disable()before defining your class.Only use parentheses
()when you need to force enable/disable a specific eventTo add metadata, use the
update()method insteadTo create variations of an event, use the
derive()method instead