Parallel Python

Fast and Easy

Easy Parallel Python that does what you need

What you can do with Dask

Big Pandas

Dask DataFrames use pandas under the hood, so your current code likely just works. It’s faster than Spark and easier too.

import dask.dataframe as dd

df = dd.read_parquet("s3://data/uber/")

# How much did NYC pay Uber?
df.base_passenger_fare.sum().compute()

# And how much did drivers make?
df.driver_pay.sum().compute()

Parallel For Loops

Parallelize your Python code, no matter how complex. Dask is flexible and supports arbitrary dependencies and fine-grained task scheduling.

from dask.distributed import Client

client = Client()

# Define your own code
def f(x):
    return x + 1

# Run your code in parallel
futures = client.map(f, range(100))
results = client.gather(futures)
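
To illustrate what "arbitrary dependencies" means, here is a minimal sketch of a small task graph in which later tasks consume the results of earlier ones. It uses the standard library's `concurrent.futures` as a stand-in so it runs without a cluster. The functions `load`, `total`, and `combine` are hypothetical. With Dask, `client.submit` has a similar shape and also accepts futures directly as arguments, so the scheduler can track the dependencies for you.

```python
from concurrent.futures import ThreadPoolExecutor

def load(start):
    # Hypothetical stand-in for reading one chunk of input
    return list(range(start, start + 5))

def total(chunk):
    return sum(chunk)

def combine(a, b):
    return a + b

with ThreadPoolExecutor(max_workers=4) as pool:
    # Independent tasks: both loads can run at the same time
    chunks = [pool.submit(load, start) for start in (0, 10)]
    # Each total depends on exactly one loaded chunk
    totals = [pool.submit(total, c.result()) for c in chunks]
    # The final task depends on both totals
    final = pool.submit(combine, totals[0].result(), totals[1].result())
    result = final.result()

print(result)
```

In this stand-in the dependencies are resolved by hand with `.result()`; a task scheduler removes that bookkeeping.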

Big Arrays

Use Dask and NumPy/Xarray to churn through terabytes of multi-dimensional array data in formats like HDF, NetCDF, TIFF, or Zarr.

import xarray as xr

# Open image/array files natively
ds = xr.open_mfdataset("data/*.nc")

# Process across dimensions
ds.mean(dim=["lat", "lon"]).compute()
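
A reduction such as a mean over chunked data cannot simply average the per-chunk means when chunks differ in size. The sketch below shows one common way to combine partial results, in plain Python with made-up chunk values. It is a conceptual illustration, not how Dask or Xarray implement the reduction.

```python
# Hypothetical chunks of one large array, e.g. one chunk per file
chunks = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0],
]

def chunk_stats(chunk):
    # Small per-chunk summary that can be computed independently
    return sum(chunk), len(chunk)

def combine_mean(stats):
    # Combine the summaries: total sum divided by total count
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count

stats = [chunk_stats(c) for c in chunks]
mean = combine_mean(stats)

# Averaging the per-chunk means instead gives a different answer
naive = sum(s / n for s, n in stats) / len(stats)

print(mean, naive)
```

Only the small `(sum, count)` pairs need to be gathered in one place, which is what makes this kind of computation practical on data larger than memory.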

Machine Learning

Use Dask with common machine learning libraries to train or predict on large datasets, increasing model accuracy by using all of your data.

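import dask.dataframe as dd
from dask.distributed import Client
from xgboost import dask as dxgb

client = Client()

df = dd.read_parquet("s3://my-data/")
# DaskDMatrix takes the client first; pass label=... with the
# target column for supervised training
dtrain = dxgb.DaskDMatrix(client, df)

# Returns a dict holding the trained "booster" and its "history"
output = dxgb.train(
    client,
    {"tree_method": "hist", ...},
    dtrain,
    ...
)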

Performance at Scale

Fast on Machines

Dask is lightweight, and runs your raw code on your machines without getting in the way. No virtualization or compilers.

As the Python stack matures, your code matures with it. Today Dask is 50% faster than Spark on standard benchmarks.

# pandas: one file on one machine
import pandas as pd

df = pd.read_parquet("s3://mybucket/myfile.parquet")

df = df[df.value >= 0]
df.groupby("account")["value"].sum()

# Dask: the same code across many files
import dask.dataframe as dd

df = dd.read_parquet("s3://mybucket/myfile.*.parquet")

df = df[df.value >= 0]
df.groupby("account")["value"].sum().compute()
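
The two snippets above differ only in the import, the glob path, and the final `.compute()`. Dask DataFrame operations are lazy, and `.compute()` triggers the actual work. As a rough mental model, and not Dask's actual implementation, the filter and groupby-sum can run on each partition independently, with the small partial results merged at the end. The partition contents below are hypothetical.

```python
from collections import Counter

# Hypothetical partitions: each is a list of (account, value) rows
partitions = [
    [("a", 5), ("b", -2), ("a", 1)],
    [("b", 7), ("c", 3), ("a", -4)],
]

def partial_sums(rows):
    # Per-partition step: keep value >= 0, then sum by account
    sums = Counter()
    for account, value in rows:
        if value >= 0:
            sums[account] += value
    return sums

def combine(partials):
    # Merge step: add up the per-partition sums
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return dict(totals)

totals = combine(partial_sums(p) for p in partitions)
print(totals)
```

Each call to `partial_sums` touches only its own partition, so the calls can run in parallel on different cores or machines.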

Made for Humans

Computers are cheap. Humans are expensive.

Fortunately, humans already know how to use Dask.

It’s just Python. It’s just pandas. It’s just NumPy.

Dask’s dashboard guides you towards efficiency, quickly teaching you to become a distributed computing expert.

Cheap and Efficient

Fast humans + Fast machines = Cheap Computing

Dask users often process cloud data at $0.10 per TiB
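
To make the quoted rate concrete, here is a small worked calculation. It assumes the $0.10 per TiB figure above applies linearly to the data volume. The function name and the 10 TiB input are illustrative, and real costs depend on instance types, region, and workload.

```python
TIB = 2**40  # bytes in one tebibyte
PRICE_PER_TIB_USD = 0.10  # rate quoted above, assumed to scale linearly

def estimated_cost_usd(n_bytes, price_per_tib=PRICE_PER_TIB_USD):
    # Convert bytes to TiB, then apply the per-TiB rate
    return n_bytes / TIB * price_per_tib

cost = estimated_cost_usd(10 * TIB)
print(f"${cost:.2f}")
```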

Where you can run Dask

Open Source Deployment

Run Dask on your laptop (it’s trivial) or deploy it on any resource manager, such as Kubernetes, HPC job schedulers, cloud SaaS services, or even legacy Hadoop/Spark clusters.

from dask.distributed import LocalCluster

cluster = LocalCluster(
    processes=False,  # run workers as threads in this process
)
client = cluster.get_client()

# Use Dask locally
import dask.dataframe as dd
df = dd.read_parquet("/path/to/data.parquet")
df.value.mean().compute()

Managed Cloud

Run Dask in the cloud with open source Kubernetes, or with an easy SaaS solution. Coiled is free for individuals with modest use and easy for anyone with a cloud account.

from coiled import Cluster

cluster = Cluster(
    n_workers=100, region="us-east-2",
)
client = cluster.get_client()

# Use Dask on the cloud
import dask.dataframe as dd
df = dd.read_parquet("s3://data/*.parquet")
df.value.mean().compute()

What users say about Dask

People use Dask and like it! You won’t be alone!

It’s easy

It’s massive

It solved my problem

“Dask has been a trailblazer in making distributed and out-of-memory computing in Python easy and accessible for everyone.”

Wes McKinney, Pandas

“At Capital One, early implementations of Dask have reduced model training times by 91% within a few months of development effort.”

Ryan McEntee, Capital One

“My climate science research has been made possible by Dask. Dask integrates seamlessly with Xarray, making it easy to run large-scale computations on multi-dimensional datasets. I can focus on my research instead of thinking about parallelism.”

Paige Martin, Pangeo

“Dask shines when dealing with generic data structures which don’t conform to table-like structures. PySpark has RDDs, but who wants to code in Python and debug verbose Java logs?”

Ajith Aravind, Simeio

“Dask has transformed how the world interacts with weather, climate, and geospatial data by making it super easy to scale up data processing pipelines on HPC or cloud. Things that seemed impossible five years ago are now routine thanks to Dask.”

Ryan Abernathey, Earthmover

“With Dask, I can easily adapt code that runs on a single machine and scale it across an entire cluster. Very few other tools let you get going that quickly—across any language.”

Jacqueline Nolis, Fanatics Inc.

“Dask also makes it easy to deploy distributed work locally using multiple Python processes in a way that is nearly identical to how full production load is distributed.”