Integrates with existing projects
Built with the broader community
Dask is open source and freely available. It is developed in coordination with other community projects like Numpy, Pandas, and Scikit-Learn.
Dask arrays scale Numpy workflows, enabling multi-dimensional data analysis in earth science, satellite imagery, genomics, biomedical applications, and machine learning algorithms.
Dask dataframes scale Pandas workflows, enabling applications in time series, business intelligence, and general data munging on big data.
Familiar for Python users
and easy to get started
Dask uses existing Python APIs and data structures, making it easy to switch from Numpy, Pandas, and Scikit-Learn to their Dask-powered equivalents.
You don't have to completely rewrite your code or retrain to scale up.
# Arrays implement the Numpy API
import dask.array as da
x = da.random.random(size=(10000, 10000), chunks=(1000, 1000))
x + x.T - x.mean(axis=0)

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
df.groupby(df.account_id).balance.sum()
# Dask-ML implements the Scikit-Learn API
from dask_ml.linear_model import LogisticRegression
lr = LogisticRegression()
# fit takes training features and training labels
lr.fit(X_train, y_train)
Scale up to clusters
or just use it on your laptop
Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.
But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.
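As a rough illustration of the single-machine case, the sketch below runs a chunked array computation on Dask's local threaded scheduler. The array size, chunk size, and the choice of `scheduler="threads"` are assumptions made for this example, not recommendations from the text above.

```python
import dask.array as da

# A 4000 x 4000 array of ones, split into 1000 x 1000 chunks (illustrative sizes).
x = da.ones((4000, 4000), chunks=(1000, 1000))

# This only builds a lazy task graph; no work happens yet.
total = (x + x.T).sum()

# Execute the graph with the local threaded scheduler on this machine.
result = total.compute(scheduler="threads")
print(result)
```

Because the work is expressed as a graph of per-chunk tasks, the same code could in principle be pointed at a different scheduler without changing the computation itself.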
Enabling you to parallelize internal systems
Not all computations fit into a big dataframe.
Dask exposes lower-level APIs letting you build custom systems for in-house applications. This helps open source leaders parallelize their own packages and helps business leaders scale custom business logic.
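One of these lower-level interfaces is `dask.delayed`, which wraps ordinary Python functions into lazy tasks. The sketch below is a minimal example of that idea; the `load`, `process`, and `combine` functions are hypothetical stand-ins for custom in-house logic.

```python
from dask import delayed


@delayed
def load(n):
    # Hypothetical stand-in for reading one batch of records.
    return list(range(n))


@delayed
def process(batch):
    # Hypothetical per-batch business logic.
    return sum(batch)


@delayed
def combine(partials):
    # Reduce the per-batch results to a single value.
    return max(partials)


# Calling delayed functions builds a task graph instead of running them.
partials = [process(load(n)) for n in range(1, 5)]
best = combine(partials)

# The independent load/process branches can run in parallel when computed.
result = best.compute()
print(result)
```

Nothing here requires the data to fit a dataframe or array shape; the structure of the computation comes entirely from how the function calls depend on one another.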
Powered by Dask
These software projects are well-integrated with Dask, or use Dask to power components of their infrastructure.
Gradient boosted trees for machine learning
XGBoost can use Dask to bootstrap itself for distributed training
Brings the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
Manage tabular data in a blob store
Library for reading and manipulating meteorological remote sensing data and writing it to various image and data file formats
A package to help build pipelines to manage continuous streams of data
Provides utilities for exploratory analysis of large scale genetic variation data