Dask cluster with distributed
There are multiple ways to create a Dask cluster; the following is only an example, so please consult the official documentation for details. The Dask library is installed and available in all of the Python 3 kernels in JupyterHub. Of course, you can also use your own Python environment.
The simplest way to create a Dask cluster is to use the distributed module:
from dask.distributed import Client
client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
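A minimal sketch of verifying that the cluster works, by running a small computation through the client. It uses an in-process cluster (`processes=False`) to keep the example lightweight; on a real node you would typically use the `Client` arguments shown above.

```python
from dask.distributed import Client
import dask.array as da

# An in-process cluster keeps this sketch lightweight; memory_limit
# applies per worker.
client = Client(processes=False, n_workers=2, threads_per_worker=2,
                memory_limit='1GB')

# A 4000 x 4000 chunked random array; the mean is computed on the workers.
x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
result = x.mean().compute()
print(f"mean = {result:.4f}")  # close to 0.5 for uniform random data

client.close()
```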
Visualize the cluster: Dashboard
If you are looking for a nice visualization tool, the Dask JupyterLab extension is already installed in JupyterHub and available to all users.
Sometimes it is necessary to update the dashboard link for the Dask cluster. This can be done either directly in your code or in one of the Dask configuration files. For example, if you use distributed to start the cluster, you can update the dashboard link in the distributed.yaml file in the ~/.config/dask directory:
distributed:
  dashboard:
    link: "{JUPYTERHUB_SERVICE_PREFIX}/proxy/{port}/status"
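After editing the file, you can check that the value was picked up, since Dask loads the YAML files in ~/.config/dask at import time:

```python
import dask
import dask.distributed  # registers the distributed defaults in dask.config

# Returns the currently active dashboard link template (the configured
# value if the YAML file was read, otherwise the built-in default).
link = dask.config.get("distributed.dashboard.link")
print(link)
```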
If you still have issues with the dashboard link, you can update the link manually before starting the cluster:
import dask
from dask.distributed import Client

# Set the dashboard link template before creating the Client.
dask.config.set({"distributed.dashboard.link": "{JUPYTERHUB_SERVICE_PREFIX}/proxy/{port}/status"})
Best practices
You can also use the following code to check the available resources on the node and size the cluster accordingly:
import multiprocessing
from dask.distributed import Client

# Detect how many CPUs the node provides and split them across workers.
ncpu = multiprocessing.cpu_count()
processes = False  # run workers as threads in the current process
nworker = 2
threads = ncpu // nworker
print(
    f"Number of CPUs: {ncpu}, number of threads: {threads}, "
    f"number of workers: {nworker}, processes: {processes}",
)
client = Client(
    processes=processes,
    threads_per_worker=threads,
    n_workers=nworker,
    memory_limit="16GB",
)
client
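The memory limit above is hard-coded. A sketch (assuming a POSIX node, where `os.sysconf` exposes the page count) of deriving `memory_limit` from the node's total RAM instead:

```python
import os
import multiprocessing

ncpu = multiprocessing.cpu_count()
nworker = 2
threads = max(1, ncpu // nworker)

# Total physical memory in bytes, split evenly across the workers.
total_mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
mem_per_worker = total_mem // nworker

print(f"{nworker} workers x {threads} threads, "
      f"{mem_per_worker / 2**30:.1f} GiB per worker")
```

The per-worker value can then be passed as `memory_limit=mem_per_worker` when creating the Client, since `memory_limit` accepts a byte count as well as a string.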