Running DFAnalyzer with Dask Distributed

Getting Started

  1. Create a Python virtual environment (Python version > 3.7).

  2. Source the Python virtual environment.

  3. Clone the DFTracer GitHub repository to get the source code.

  4. Navigate into /path/to/dftracer/dfanalyzer/examples/dfanalyzer.

  5. Build DFTracer as recommended in Build DFTracer.

  6. Install the requirements from the terminal:

pip install -r requirements.txt

  7. Create a .yaml file in /path/to/dftracer/dfanalyzer/dask/conf if this is a new system. Please refer to Running Dask Distributed in a new system.
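
The first two steps can be sanity-checked from Python before building anything. The sketch below is illustrative and not part of DFTracer: the function name environment_problems is hypothetical, it assumes "Python version > 3.7" means 3.8 or newer, and it assumes the environment was created with the standard venv module.

```python
import sys

# Assumption: "Python version > 3.7" is read as "3.8 or newer".
MIN_VERSION = (3, 8)


def environment_problems(version_info=None, prefix=None, base_prefix=None):
    """Return a list of problems found; an empty list means the checks passed.

    The arguments default to the running interpreter's values and exist
    only so the checks can be exercised with other inputs.
    """
    version_info = sys.version_info if version_info is None else version_info
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix

    problems = []
    if tuple(version_info[:2]) < MIN_VERSION:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python {found} is older than the assumed minimum")
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    if prefix == base_prefix:
        problems.append("no virtual environment appears to be active")
    return problems


# Report on the interpreter that is running this snippet.
for problem in environment_problems():
    print("warning:", problem)
```

Running this with the interpreter from the sourced environment should print nothing if both checks pass.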

Starting a Dask Distributed Cluster

In the terminal:

cd /path/to/dftracer/dfanalyzer/dask/conf
./install_dask_env.sh

This will create the configuration.yaml in ~/.dftracer. Update the application and environment path in configuration.yaml. You may need to create run_dir and logs folders if they aren’t there already.

cd /path/to/dftracer/dfanalyzer/dask/
# if logs folder is not present
mkdir logs
# if run_dir is not present
mkdir run_dir
./scripts/start_dask_distributed.sh
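
The same preparation can be scripted. This is a minimal sketch, not a DFTracer utility: prepare_dask_dirs is a hypothetical helper that creates the logs and run_dir folders if they are missing and reports whether ~/.dftracer/configuration.yaml exists. It does not read or validate the file's contents.

```python
from pathlib import Path


def prepare_dask_dirs(dask_dir, home=None):
    """Create logs/ and run_dir/ under dask_dir if they are missing.

    Returns True if <home>/.dftracer/configuration.yaml exists.
    Raises FileNotFoundError if dask_dir itself does not exist, so a
    mistyped path is reported instead of being silently created.
    """
    dask_dir = Path(dask_dir)
    home = Path.home() if home is None else Path(home)

    for name in ("logs", "run_dir"):
        # exist_ok=True makes this safe to call repeatedly.
        (dask_dir / name).mkdir(exist_ok=True)

    config = home / ".dftracer" / "configuration.yaml"
    return config.is_file()


# Example call (path is the placeholder used throughout this page):
# has_config = prepare_dask_dirs("/path/to/dftracer/dfanalyzer/dask")
```

A False return value suggests that install_dask_env.sh has not been run yet for the current user.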

Note

Wait several seconds after launching the script, because it reserves the compute nodes for you through the job scheduler.

Note

Please check the log file /path/to/dftracer/dfanalyzer/dask/logs/worker_<jobid>.log for any issues with running the workers on the compute nodes.
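
When several jobs have run, scanning every worker log by hand gets tedious. The sketch below collects suspicious lines from all worker_*.log files in the logs folder. The keyword list is an assumption chosen for illustration and does not reflect a documented DFTracer or Dask log format; find_worker_issues is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumption: these keywords are illustrative, not a documented log format.
SUSPICIOUS = re.compile(r"error|exception|traceback", re.IGNORECASE)


def find_worker_issues(logs_dir):
    """Return (file name, line number, text) for each suspicious log line."""
    issues = []
    for log_file in sorted(Path(logs_dir).glob("worker_*.log")):
        # errors="replace" keeps the scan going if a log has odd bytes.
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if SUSPICIOUS.search(line):
                    issues.append((log_file.name, number, line.rstrip()))
    return issues


# Example call (path is the placeholder used throughout this page):
# for name, number, text in find_worker_issues("/path/to/dftracer/dfanalyzer/dask/logs"):
#     print(f"{name}:{number}: {text}")
```

An empty result only means none of the assumed keywords appeared; it is still worth reading the log if the workers do not come up.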

Warning

For errors related to port usage, please check if you already have any Dask distributed instances running. You can do so by checking the jobs already running in your scheduler queue or by running the following command in the terminal:

ps -aef | grep dask

Then kill those jobs or processes using kill -9 <pid>. You may also need to change the port number in the .yaml files located at /path/to/dftracer/dfanalyzer/dask/conf. For more details about these configurations, refer to Running Dask Distributed in a new system.
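
A port conflict can also be confirmed directly, before starting the cluster, by trying to bind the port. This is a minimal sketch using only the standard library; port_is_free is a hypothetical helper. The port 8786 in the example is Dask's default scheduler port and is used only as an assumption, so substitute the value from your .yaml file.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port.

    Also returns False when binding is not permitted (for example a
    privileged port), since the scheduler could not use it either.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


# Assumption: 8786 is Dask's default scheduler port; your .yaml may differ.
print("port 8786 free:", port_is_free(8786))
```

The check must run on the node where the scheduler will start, since a port that is free on a login node says nothing about a compute node.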

Use DFAnalyzer

To use the DFAnalyzer Jupyter notebook, navigate to /path/to/dftracer/examples and locate dfanalyzer_distributed.ipynb.

Accessing the Dask Dashboard

It is recommended to run the notebook inside VSCode because it supports port forwarding natively. In VSCode, navigate to the bottom panel (where the terminal is) and click on the PORTS tab. Click Forward Port to add a new port, and enter the port that was used when setup_dask_cluster() was run in your dfanalyzer.ipynb notebook. Then open http://localhost:PORT in a browser to view the Dask scheduler dashboard.
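
If the page does not load, it helps to know whether the forwarded port answers at all. The sketch below sends one HTTP request to the forwarded port and reports whether anything responded. dashboard_reachable is a hypothetical helper, not part of DFAnalyzer or Dask, and it only checks that an HTTP server replies, not that the reply comes from Dask.

```python
import http.client


def dashboard_reachable(port, host="localhost", timeout=3.0):
    """Return True if an HTTP server answers on host:port without a 5xx error."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and malformed replies.
        return False
    finally:
        conn.close()
    return status < 500


# Example call, with PORT replaced by the port you forwarded:
# print(dashboard_reachable(PORT))
```

A False result usually means the port is not forwarded yet or the scheduler is not running; checking the PORTS tab and the worker logs is a reasonable next step.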

Stopping Dask Distributed Workers

cd /path/to/dftracer/dfanalyzer/dask/scripts
./stop_dask_distributed.sh

Note

Wait for several seconds as this script will terminate the workers and deallocate the compute nodes.