Running Dask Distributed on a new system

This section describes how to configure DFAnalyzer to run on your cluster.

Before you begin, make sure you have completed the steps required to build DFAnalyzer. See the Build DFTracer documentation for details.

Initializing Dask Configurations

cd <dftracer>/dfanalyzer/dask/conf
./install_dask_env.sh

Note

This will create a new directory $HOME/.dftracer/ with files: $HOME/.dftracer/configuration.sh and $HOME/.dftracer/configuration.yaml

Editing $HOME/.dftracer/configuration.yaml

cd $HOME/.dftracer
<EDITOR> configuration.yaml

By default, $HOME/.dftracer/configuration.yaml will contain this entry:

app: /usr/WS2/haridev/dftracer
env: ${DFTRACER_APP}/venv

Set app to the path of your cloned <dftracer> directory and env to the Python virtual environment in which you installed DFAnalyzer. Refer to the Build DFTracer documentation for details.
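As a rough illustration, the edited file might end up with entries like the ones written below. This is only a sketch: /path/to/dftracer is a placeholder for your own clone, DEMO_DIR is a hypothetical scratch location, and the snippet deliberately writes to a temporary directory rather than to $HOME/.dftracer so that it cannot overwrite a real configuration.

```shell
# Hypothetical sketch: write an example configuration to a scratch file.
# DEMO_DIR and /path/to/dftracer are placeholders, not real defaults.
DEMO_DIR="$(mktemp -d)"

# The quoted 'EOF' keeps ${DFTRACER_APP} literal, as it appears in the default file.
cat > "${DEMO_DIR}/configuration.yaml" <<'EOF'
app: /path/to/dftracer
env: ${DFTRACER_APP}/venv
EOF

# Show the result for inspection.
cat "${DEMO_DIR}/configuration.yaml"
```

Here env keeps the default ${DFTRACER_APP}/venv form. If your virtual environment lives elsewhere, an absolute path to it would go there instead.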

Editing Dask Configurations depending on your System

In the <dftracer>/dfanalyzer/dask/conf/ folder create a new .yaml file for the system you want to use. The .yaml file should consist of the following fields:

config: this field contains the locations of the directories that hold the files for the dask distributed cluster.
- script_dir: the scripts to run dask distributed
- conf_dir: the .yaml configuration files
- run_dir: the folder which will contain the scheduler_{$USER}.pid and scheduler_{$USER}.json files used to store information for the scheduler.
- log_dir: the folder which will contain the logs for the dask distributed scheduler and workers.
job: information about the job you want to run to create the dask distributed cluster.
- num_nodes: number of nodes which are going to be used to run the dask distributed cluster.
- wall_time_min: time (in minutes) which the dask distributed cluster is going to run for.
- env_id: the name of the job which will run.
- queue: the queue in which the job will run.
scheduler: information used to run the scheduler of the dask distributed cluster.
- cmd: command used to run the scheduler. Depending on the system you are using, you might need to use FLUX, SLURM, or another scheduler. Examples can look like this: srun -N ${DFTRACER_JOB_NUM_NODES} -t ${DFTRACER_JOB_WALL_TIME_MIN} for SLURM scheduler or flux run -N ${DFTRACER_JOB_NUM_NODES} -t ${DFTRACER_JOB_WALL_TIME_MIN} for FLUX scheduler.
- port: the <port> used to run dask distributed.
- kill: command used to kill the cluster. Examples can look like this: scancel ${SLURM_JOB_ID} for SLURM scheduler or flux cancel --all for FLUX scheduler.
worker: information used to run the workers of the dask distributed cluster.
- ppn: processes per node for the dask distributed cluster.
- cmd: command used to run the workers. Depending on the system you are using, you might need to use FLUX, SLURM, or another scheduler. Examples can look like this: srun -N ${DFTRACER_JOB_NUM_NODES} --ntasks-per-node=${DFTRACER_WORKER_PPN} for SLURM scheduler or flux run -N ${DFTRACER_JOB_NUM_NODES} --tasks-per-node=${DFTRACER_WORKER_PPN} for FLUX scheduler.
- per_core: number of processes per core.
- threads: number of threads used.
- local_dir: a local directory that dask uses to cache data frames. It can be set to local storage or shared memory.
- kill: command used to kill the cluster. Examples can look like this: scancel ${SLURM_JOB_ID} for SLURM scheduler or flux cancel --all for FLUX scheduler.
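The variable names in the example commands above appear to mirror the section and field names of the .yaml file (job.num_nodes as DFTRACER_JOB_NUM_NODES, worker.ppn as DFTRACER_WORKER_PPN, and so on). Assuming that mapping holds, the sketch below shows how the SLURM command prefixes would read once the shell substitutes the values. The numbers are made up for illustration, and SCHEDULER_CMD and WORKER_CMD are hypothetical helper names, not variables defined by DFAnalyzer.

```shell
# Assumed values, standing in for job.num_nodes, job.wall_time_min and worker.ppn.
DFTRACER_JOB_NUM_NODES=2
DFTRACER_JOB_WALL_TIME_MIN=60
DFTRACER_WORKER_PPN=8

# Command prefixes as they would read after variable substitution.
SCHEDULER_CMD="srun -N ${DFTRACER_JOB_NUM_NODES} -t ${DFTRACER_JOB_WALL_TIME_MIN}"
WORKER_CMD="srun -N ${DFTRACER_JOB_NUM_NODES} --ntasks-per-node=${DFTRACER_WORKER_PPN}"

# Print them instead of running them, so no job is actually launched.
echo "scheduler: ${SCHEDULER_CMD}"
echo "worker:    ${WORKER_CMD}"
```

Printing the expanded commands before submitting anything is a cheap way to catch a misspelled variable, which would otherwise expand to an empty string.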

cd <dftracer>/dfanalyzer/dask/conf
<EDITOR> <system>.yaml

Below is an example of a .yaml file that can be used for LC Ruby:
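A minimal sketch of such a file for a SLURM-based system follows, assuming the four sections nest their fields as described above. Every value here (paths, node count, queue name, job name, port, ppn) is a placeholder rather than the actual LC Ruby settings, and SKETCH_DIR and my_system.yaml are hypothetical names. The file is written to a scratch directory so it does not touch <dftracer>/dfanalyzer/dask/conf.

```shell
# Hypothetical sketch: every path, number and name below is a placeholder.
SKETCH_DIR="$(mktemp -d)"

# The quoted 'EOF' keeps the ${...} references literal inside the YAML.
cat > "${SKETCH_DIR}/my_system.yaml" <<'EOF'
config:
  script_dir: /path/to/scripts
  conf_dir: /path/to/conf
  run_dir: /path/to/run
  log_dir: /path/to/logs
job:
  num_nodes: 1
  wall_time_min: 60
  env_id: my_dask_job
  queue: my_queue
scheduler:
  cmd: srun -N ${DFTRACER_JOB_NUM_NODES} -t ${DFTRACER_JOB_WALL_TIME_MIN}
  port: 10001
  kill: scancel ${SLURM_JOB_ID}
worker:
  ppn: 8
  cmd: srun -N ${DFTRACER_JOB_NUM_NODES} --ntasks-per-node=${DFTRACER_WORKER_PPN}
  per_core: 1
  threads: 1
  local_dir: /dev/shm/dask-workspace
  kill: scancel ${SLURM_JOB_ID}
EOF

# Show the result for inspection.
cat "${SKETCH_DIR}/my_system.yaml"
```

On a FLUX system, the cmd and kill entries would use the flux run and flux cancel forms listed in the field descriptions instead.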

Run DFAnalyzer

Navigate to <dftracer>/examples/dfanalyzer/dfanalyzer_distributed.ipynb and run your notebook.