Running DFAnalyzer with Dask Distributed¶
Getting Started¶
Create a Python virtual environment (Python version > 3.7).
Source the Python virtual environment.
Git clone the GitHub repo to get the source code of DFTracer.
Navigate into /path/to/dftracer/dfanalyzer/examples/dfanalyzer.
Build DFTracer as recommended in Build DFTracer.
Get all of the requirements as follows in the terminal:
pip install -r requirements.txt
Create a .yaml file in /path/to/dftracer/dfanalyzer/dask/conf if this is a new system. Please refer to Running Dask Distributed in a new system.
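The setup steps above can be sketched as a small shell helper. This is only a sketch under stated assumptions: VENV_DIR, DFTRACER_DIR, and DFTRACER_REPO_URL are hypothetical variables (no repository URL is given here, so you must supply it), "Python version>3.7" is read as 3.8 or newer, and the build step is left out because it is covered in Build DFTracer.

```shell
# Hypothetical settings, not defined by DFTracer; adjust them for your system.
VENV_DIR="${VENV_DIR:-$HOME/venvs/dfanalyzer}"
DFTRACER_DIR="${DFTRACER_DIR:-/path/to/dftracer}"

# Succeed only if the given interpreter is newer than Python 3.7
# (assumption: "version>3.7" means 3.8 or later).
python_is_new_enough() {
    "$1" -c 'import sys; sys.exit(0 if sys.version_info[:2] > (3, 7) else 1)'
}

# Create and activate the virtual environment, fetch the sources if they are
# missing, and install the DFAnalyzer requirements.
setup_dfanalyzer_env() {
    python_is_new_enough python3 || { echo "Python > 3.7 is required" >&2; return 1; }
    python3 -m venv "$VENV_DIR" || return 1
    . "$VENV_DIR/bin/activate" || return 1
    if [ ! -d "$DFTRACER_DIR" ]; then
        # The URL must be provided by you; none is assumed here.
        [ -n "${DFTRACER_REPO_URL:-}" ] || { echo "set DFTRACER_REPO_URL first" >&2; return 1; }
        git clone "$DFTRACER_REPO_URL" "$DFTRACER_DIR" || return 1
    fi
    cd "$DFTRACER_DIR/dfanalyzer/examples/dfanalyzer" || return 1
    pip install -r requirements.txt
}
```

Source the snippet and call setup_dfanalyzer_env from your current shell, so that the activated environment and the changed directory persist after the function returns.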
Starting a Dask Distributed Cluster¶
In the terminal:
cd /path/to/dftracer/dfanalyzer/dask/conf
./install_dask_env.sh
This will create the configuration.yaml in ~/.dftracer. Update the application and environment path in configuration.yaml. You may need to create run_dir and logs folders if they aren’t there already.
cd /path/to/dftracer/dfanalyzer/dask/
# if logs folder is not present
mkdir logs
# if run_dir is not present
mkdir run_dir
install
./scripts/start_dask_distributed.sh
Note
Wait for several seconds as this script will reserve the compute nodes for you using the job scheduler.
Note
Please check the log file /path/to/dftracer/dfanalyzer/dask/logs/worker_<jobid>.log for any issues with running the workers on the compute nodes.
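The folder preparation and log check described above can be sketched with a few shell helpers. The function names are hypothetical, and the search pattern (error or traceback) is an assumption about what a failing worker writes to its log, not something DFTracer guarantees.

```shell
# Create logs/ and run_dir/ under the given dask directory.
# mkdir -p leaves existing folders untouched, so this is safe to repeat.
ensure_dask_dirs() {
    mkdir -p "$1/logs" "$1/run_dir"
}

# Print the most recently modified worker_*.log in the given logs directory.
# Prints nothing if no worker log exists yet.
latest_worker_log() {
    ls -t "$1"/worker_*.log 2>/dev/null | head -n 1
}

# Print the lines of a log file that mention an error (case-insensitive),
# each prefixed with its line number.
show_log_errors() {
    grep -n -i -E 'error|traceback' "$1"
}
```

For example, run ensure_dask_dirs /path/to/dftracer/dfanalyzer/dask before starting the cluster, then show_log_errors "$(latest_worker_log /path/to/dftracer/dfanalyzer/dask/logs)" once the workers have had time to start.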
Warning
For errors related to port usage, please check if you already have any Dask distributed instances running. You can do so by checking the jobs already running in your scheduler queue or by running the following command in the terminal:
ps -aef | grep dask
Then kill those jobs/processes using kill -9 <pid>. You may also need to change the port number in the .yaml files located at /path/to/dftracer/dfanalyzer/dask/conf. For more details about these configurations refer to here.
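To collect the PIDs without reading the ps output by eye, a filter along these lines may help. It is a minimal sketch: dask_pids is a hypothetical helper, and it assumes the ps -aef column layout in which the PID is the second field.

```shell
# Read `ps -aef` style lines on stdin and print the PID (second column) of
# every line that mentions dask, skipping the grep/awk helper processes.
dask_pids() {
    awk '/dask/ && !/grep|awk/ { print $2 }'
}

# Example:
#   ps -aef | dask_pids
```

Review the list before killing anything, since the match is on the whole command line and can include unrelated processes whose arguments contain "dask". Trying a plain kill <pid> first gives a process the chance to shut down cleanly; kill -9 cannot be caught or handled by the process.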
Use DFAnalyzer¶
To use the DFAnalyzer Jupyter notebook, navigate to /path/to/dftracer/examples and open dfanalyzer_distributed.ipynb.
Accessing the Dask Dashboard¶
Running the notebook inside VSCode is recommended because it supports port forwarding natively. In VSCode, go to the bottom panel (where the terminal is) and click the PORTS tab. Click Forward Port and enter the port that was used when setup_dask_cluster() was run in your dfanalyzer.ipynb notebook. Then open http://localhost:PORT to view the Dask scheduler dashboard.
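Outside VSCode, the same forwarding can usually be done with a plain SSH tunnel. The helper below is a minimal sketch that only prints the command to run. The function name, the port 8787, and the destination user@login-node are hypothetical, and it assumes the scheduler listens on the machine you SSH into. If the scheduler runs on a compute node instead, replace localhost with that node's hostname.

```shell
# Print an ssh command that forwards a local port to the same port on the
# remote side. Arguments: PORT, SSH destination.
#   -N  do not run a remote command, only forward
#   -L  local port forwarding (local_port:target_host:target_port)
dashboard_tunnel_cmd() {
    printf 'ssh -N -L %s:localhost:%s %s\n' "$1" "$1" "$2"
}

# Example with hypothetical values:
#   dashboard_tunnel_cmd 8787 user@login-node
```

Run the printed command on your local machine, keep it open while you work, and then browse to http://localhost:PORT as described above.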
Stopping Dask Distributed Workers¶
cd /path/to/dftracer/dfanalyzer/dask/scripts
./stop_dask_distributed.sh
Note
Wait for several seconds as this script will terminate the workers and deallocate the compute nodes.