Dask in one line: it works with data that's too big for memory, while still using Pandas syntax.
Dask is a Python library for data wrangling and parallelization: it helps scale programs that deal with big data or have long computation times by making it easier to distribute those workloads across multiple computer cores, either on a single machine or across a compute cluster.
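As a minimal sketch of what that looks like in practice, the example below uses dask.delayed to turn ordinary Python functions into a lazy task graph that Dask then executes in parallel (the functions themselves are made up for illustration):

```python
import dask

@dask.delayed
def square(x):
    # Stand-in for a slow, independent piece of work.
    return x * x

@dask.delayed
def total(values):
    return sum(values)

# Calling delayed functions only builds a task graph; nothing runs yet.
squares = [square(i) for i in range(10)]
result = total(squares)

# compute() executes the graph, distributing the independent tasks
# across the machine's cores (or across a cluster, if one is configured).
print(result.compute())  # 285
```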
Dask aims to be both simple and powerful. It implements the same APIs as Pandas, NumPy, and scikit-learn, which many programmers already know well. This means Dask code is often faster to write and easier to read than code for other frameworks, such as Spark or Hadoop, which are also often used to scale software across compute clusters.
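For instance, here is a minimal sketch of the Pandas-style dataframe API; the file pattern and column names (data-*.csv, category, amount) are hypothetical:

```python
import dask.dataframe as dd

# Read many CSV files as one logical dataframe, split into partitions.
df = dd.read_csv("data-*.csv")

# Familiar Pandas syntax; this only builds a task graph.
summary = df.groupby("category")["amount"].mean()

# compute() runs the graph in parallel and returns an ordinary Pandas object.
print(summary.compute())
```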
When choosing a framework for a software development or data analysis task, you usually have to make tradeoffs. Frameworks like Spark often bring you more power and scalability, but at the cost of added complexity. You could call this “making the easy things hard and the hard things easy.” You wouldn’t want to use Spark to read a CSV file of 1000 rows: it would create a lot of overhead and complexity, and it wouldn’t offer any benefits. You’d use Pandas instead. But you wouldn’t want to use Pandas to read and analyze 1 million CSV files per hour. It wasn’t designed to be used at that scale.
Dask aims to find a middle way in this tradeoff: the simplicity and familiarity of Pandas, NumPy, and scikit-learn, with Spark-like power to parallelize across cores.
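As a rough illustration of that middle way, the same dataframe code can run on all the cores of one machine or be pointed at a cluster by changing only how the client connects; the scheduler address and file names below are placeholders:

```python
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Use every core on the local machine...
client = Client(LocalCluster())
# ...or connect the same code to a remote scheduler instead:
# client = Client("tcp://scheduler-address:8786")

df = dd.read_csv("data-*.csv")        # hypothetical files
print(df["amount"].sum().compute())   # executed by the workers
client.close()
```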