Parallel Python

Fast and Easy

Easy Parallel Python that does what you need

Get started

What you can do with Dask

Big Pandas

Use Dask with pandas to intuitively process terabytes of tabular data.

It’s faster than Spark and easier too.

Dask dataframes use pandas under the hood, so your current code likely just works.

Documentation One Trillion Row Challenge

import dask.dataframe as dd

df = dd.read_parquet("s3://data/uber/")

# How much did NYC pay Uber?
df.base_passenger_fare.sum().compute()

# And how much did drivers make?
df.driver_pay.sum().compute()

Parallel For Loops

Use Dask to parallelize your own Python functions and scripts, no matter how complex.

Parallelize existing scripts
Supports arbitrary dependencies
Fine-grained task scheduling

Documentation Process 5,000 files in parallel

from dask.distributed import Client

client = Client()

# Define your own code
def f(x):
    return x + 1

# Run your code in parallel
futures = client.map(f, range(100))
results = client.gather(futures)

Big Arrays

Use Dask and NumPy/Xarray to churn through terabytes of multi-dimensional array data in formats like HDF, NetCDF, TIFF, or Zarr.

Documentation Aggregate 250 TB of Water Model Data

import xarray as xr

# Open image/array files natively
ds = xr.open_mfdataset("data/*.nc")

# Process across dimensions
ds.mean(dims=["lat", "lon"]).compute()

Production ETL

Run regular jobs on a schedule. Know that they’ll finish. Dask backs modern workflow managers like Prefect and Dagster.

Documentation Production ETL Pipeline with TPC-H

from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task()
def process(file):
    ...

@flow(task_runner=DaskTaskRunner(
    address="http://my-dask-cluster"
))
def workflow():
    files = ...
    results = process.map(files)

Machine Learning

Use Dask with common machine learning libraries to train or predict on large datasets, increasing model accuracy by using all of your data.

Documentation Example: XGBoost Model Training

import xgboost as xgb
import dask.dataframe as dd

df = dd.read_parquet("s3://my-data/")
dtrain = xgb.dask.DaskDMatrix(df)

model = xgb.dask.train(
    dtrain,
    {"tree_method": "hist", ...},
    ...
)

Performance at Scale

Fast on Machines

Dask is lightweight, and runs your raw code on your machines without getting in the way. No virtualization or compilers.

As the Python stack matures your code matures. Today Dask is 50% faster than Spark on standard benchmarks.


import pandas as pd     

df = pd.read_parquet("s3://mybucket/myfile.parquet/")

df = df[df.value >= 0]
df.groupby("account")["value"].sum()


import dask.dataframe as dd

df = dd.read_parquet("s3://mybucket/myfile.*.parquet/")

df = df[df.value >= 0]
df.groupby("account")["value"].sum().compute()

Made for Humans

Computers are expensive. Humans are really expensive.

Fortunately, humans already know how to use Dask.

It’s just Python. It’s just pandas. It’s just NumPy.

Dask’s dashboard also guides you towards efficient computation, quickly teaching humans to become expert in distributed computing.

Cheap and Efficient

Fast humans + Fast machines = Cheap Computing

Rows of Data Computed

1000000000000

Cost

0.00

Dask users often process cloud data at $0.10 per TiB

Where you can run Dask

Open Source Deployment

Run Dask on your laptop (it’s trivial) or deploy it on any resource manager from HPC job schedulers, to Kubernetes, to cloud SaaS services.

Documentation Dask deployment video

# Start on your laptop
from dask.distributed import LocalCluster
cluster = LocalCluster()       
client = cluster.get_client()

# Scale later to many machines
from dask_kubernetes import KubeCluster
cluster = KubeCluster()        
client = cluster.get_client() 

# Your code works the same either way

Where you can run Dask

Managed Cloud

Run Dask in the cloud with an easy SaaS solution. Coiled is free individuals with modest use and easy for anyone with a cloud account.

Documentation Six Coiled Features for Dask Users

# Connect to your cloud account
$ pip install coiled
$ coiled setup                   

# Get many machines in about a minute
import coiled
cluster = coiled.Cluster(
    n_workers=100,               
    region="us-east-2",
)
client = cluster.get_client()

What users say about Dask

People use Dask and like it! You won’t be alone!

It’s easy

It’s massive

It solved my problem

“Dask has been a trailblazer in making distributed and out-of-memory computing in Python easy and accessible for everyone.”

Wes McKinney, Pandas

“Dask shines when dealing with generic data structures which don’t conform to table-like structures. PySpark has RDDs, but who wants to code in Python and debug verbose Java logs?”

Ajith Aravind, Simeio

“Dask integrates seamlessly with Xarray, making it easy to run large-scale computations on multi-dimensional datasets. I can focus on my research instead of thinking about parallelism.”

Paige Martin, NASA

“At Capital One, early implementations of Dask have reduced model training times by 91% within a few months of development effort.”

Ryan McEntee, Capital One

“Dask has transformed how the world interacts with weather, climate, and geospatial data by making it super easy to scale up data processing pipelines on HPC or cloud. Things that seemed impossible five years ago are now routine thanks to Dask.”

Ryan Abernathy, Pangeo

“With Dask, I can easily adapt code that runs on a single machine and scale it across an entire cluster. Very few other tools let you get going that quickly—across any language.”

Jacqueline Nolis, Fanatics Inc.

“Dask also makes it easy to deploy distributed work locally using multiple Python processes in a way that is nearly identical to how full production load is distributed.”

Hugues Demers, Grubhub

“To further accelerate our users’ ability to scale easily on the cloud, we expanded this by setting up pre-configured Horovod and Dask clusters.”

Meenakshi Sharma, Wayfair

“I used to work in Spark all the time. The sooner you move to Dask, the sooner you’ll be grateful you did.”

John Renken, Rebuy

Submit New Event

Submit News Feature

Parallel Python

Fast and Easy

What you can do with Dask

Big Pandas

Parallel For Loops

Big Arrays

Production ETL

Machine Learning

Performance at Scale

Fast on Machines

Made for Humans

Cheap and Efficient

Where you can run Dask

Open Source Deployment

Where you can run Dask

Managed Cloud

What users say about Dask

Organizations That Use Dask

Use Cases

NYC Uber/Lyft

Scalable ETL Pipeline

National Water Model

TPC-H Benchmark

NASA Satellite Imagery

XGBoost Model Tuning

Want more examples?

Super easy to get started

Most people use Dask just on their laptops to scale out to 100 GiB datasets. Dask works great on a single machine.

Submit New Event

Submit News Feature

Sign up for Newsletter

Parallel Python

Fast and Easy

What you can do with Dask

Big Pandas

Parallel For Loops

Big Arrays

Production ETL

Machine Learning

Performance at Scale

Fast on Machines

Made for Humans

Cheap and Efficient

Where you can run Dask

Open Source Deployment

Where you can run Dask

Managed Cloud

What users say about Dask

Organizations That Use Dask

Use Cases

NYC Uber/Lyft

Scalable ETL Pipeline

National Water Model

TPC-H Benchmark

NASA Satellite Imagery

XGBoost Model Tuning

Want more examples?

Super easy to get started

Most people use Dask just on their laptops to scale out to 100 GiB datasets. Dask works great on a single machine.