Apache Airflow in one line: it schedules, manages, and monitors your data pipelines.
Apache Airflow is a workflow orchestration system. It lets you define pipelines of interdependent tasks as Directed Acyclic Graphs (DAGs) and schedule them for execution (think of it as an advanced crontab). Airflow monitors your tasks and automatically retries them when they fail, while properly handling any upstream or downstream dependencies. It also provides a web frontend that shows the status of your tasks.
Apache Airflow can be tricky to understand because it doesn’t really do much by itself. Instead, it acts as a glue layer for many of your other systems. For example, you can use it to define a data pipeline in which several different Python scripts are executed in complex patterns via Celery or Kubernetes.
Many machine learning workloads are structured as data pipelines: instead of a single program, many different components run at different stages, each depending on the others in complex ways.
Scheduling all these pieces to run at regular intervals with something like cron is hard to maintain, and developers often spend significant time writing integrations between a core program and other tools, such as Celery.
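Even a minimal hand-rolled runner that respects dependencies and retries failures takes real bookkeeping. The toy sketch below (plain Python, not Airflow code; all names are made up for illustration) shows the kind of logic Airflow gives you out of the box:

```python
# Toy illustration of hand-rolled orchestration: run interdependent tasks
# in dependency order, retrying failures. Airflow automates all of this
# (plus cycle detection, scheduling, logging, and a UI, omitted here).

def run_pipeline(tasks, deps, retries=2):
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        for upstream in deps.get(name, []):   # run upstream tasks first
            visit(upstream)
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:        # exhausted retries: give up
                    raise
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

# Example: load depends on transform, which depends on extract.
results = []
order = run_pipeline(
    tasks={
        "extract":   lambda: results.append("raw"),
        "transform": lambda: results.append("clean"),
        "load":      lambda: results.append("stored"),
    },
    deps={"transform": ["extract"], "load": ["transform"]},
)
print(order)  # -> ['extract', 'transform', 'load']
```

Maintaining this yourself, plus cron entries for timing and ad-hoc monitoring, is exactly the burden an orchestrator removes.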
By using Apache Airflow, you can skip cron altogether and get other features out of the box: you can visualize your task dependency graph, monitor task status, and integrate other services via plugins.