Despite Airflow’s popularity in data engineering, the start_date and execution_date concepts remain confusing to many new developers. This article aims to demystify them.
Airflow is an orchestration tool, which means that, given sufficient permissions, it can control other services in a predefined order and on a predefined schedule.
Let’s say you want to ingest some data with Fivetran, transform it with dbt, then run a notebook analysis with Databricks. You can program each of these steps as an Airflow task in that particular order and dependency, scheduling them to run as often as you need, and Airflow will execute them as per your instructions.
Needless to say, in the world of data engineering, where gigabytes of data are moved every second, having an orchestration tool like this is critical.
Say you go to work on the 1st of January, 2022 (I know, I know, it is New Year’s and nobody should be working but we do not have a labour union for fictional blog characters here). You finish your DAG at 15:00 and you want it to run regularly every day at midnight. So you set the start_date to the 1st of January 2022 with a daily schedule, expecting the pipeline to run immediately after you switch it on.
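As a rough sketch, the settings in this scenario might look like the following. The dag_id and the dictionary name are hypothetical, and the values (a start_date of 1 January 2022 and a daily schedule) are assumed from the scenario rather than copied from a real DAG. The keyword arguments are kept in a plain dictionary so that the snippet runs even without Airflow installed.

```python
from datetime import datetime

# Assumed settings for the scenario; the names are illustrative only.
dag_kwargs = {
    "dag_id": "new_year_pipeline",       # hypothetical name
    "start_date": datetime(2022, 1, 1),  # midnight on the day you deploy
    "schedule_interval": "@daily",       # one run per day, at midnight
}

# With Airflow 2.x installed, these would typically be passed on as:
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)
# (newer releases, 2.4 and later, prefer the `schedule` argument instead)
print(dag_kwargs["start_date"])
```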
If you expect the pipeline to run right away, think again. Airflow runs the tasks for a given interval at the end of that interval, so the first run will not start until the day is over: just after 11:59 pm on 01-01-2022, that is, at midnight on the following day (2nd Jan 2022).
The reason is that a run is meant to process the data belonging to its interval. If you want to ingest the data from the 1st of January 2022, you need to wait until that day has ended, so that the data source holds all of the day’s data before the ingestion starts.
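The timing rule above can be modelled in a few lines of plain Python. This is only a simplified sketch of the idea, not Airflow's scheduler code, and the function name first_run_time is made up for the illustration.

```python
from datetime import datetime, timedelta


def first_run_time(start_date: datetime, interval: timedelta) -> datetime:
    """Earliest moment at which the first run would be triggered."""
    # The first interval covers [start_date, start_date + interval);
    # the run is triggered only once that interval has closed.
    return start_date + interval


first_run = first_run_time(datetime(2022, 1, 1), timedelta(days=1))
print(first_run)  # 2022-01-02 00:00:00
```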
If you want your DAG to run today (1st of Jan in our example), do this:
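One way to achieve this, sketched under the assumption that the fix is to move start_date back by one full interval, is shown here. The variable names are illustrative, and the check at the end is a simplified model of the scheduler's decision, not its actual code.

```python
from datetime import datetime, timedelta

interval = timedelta(days=1)
now = datetime(2022, 1, 1, 15, 0)  # the moment the DAG is deployed

# Assumed fix: start one full interval before today.
start_date = datetime(2021, 12, 31)

# The first interval (31 Dec to 1 Jan) closed at midnight today,
# so its run is already due when the DAG is switched on.
first_interval_end = start_date + interval
run_is_due = first_interval_end <= now
print(first_interval_end, run_is_due)
```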
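Or, to be safe, why not go a bit overboard and push the start_date even further into the past?
Be careful, though. By default, Airflow will schedule a run for every interval between a past start_date and now that has not been executed yet. So unless you want unnecessary additional runs, do not put your start_date further in the past than you need. This behaviour, known as catchup, can be disabled by setting catchup=False on the DAG.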
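To get a feel for how many extra runs a past start_date can cause, here is a simplified model in plain Python. The function due_runs and the example start date of 25 December 2021 are hypothetical; the model assumes one run per fully elapsed interval and, with catchup disabled, only the most recent one.

```python
from datetime import datetime, timedelta


def due_runs(start_date, now, interval, catchup=True):
    """Interval start dates of the runs that would be created at `now`.

    Simplified model: one run per fully elapsed interval since start_date;
    with catchup disabled, only the most recent of them is kept.
    """
    runs = []
    interval_start = start_date
    while interval_start + interval <= now:
        runs.append(interval_start)
        interval_start += interval
    return runs if catchup else runs[-1:]


now = datetime(2022, 1, 1, 15, 0)
start = datetime(2021, 12, 25)  # hypothetical start_date, a week in the past
daily = timedelta(days=1)

print(len(due_runs(start, now, daily)))        # 7
print(due_runs(start, now, daily, catchup=False))
```

In this model, a start_date one week back produces seven runs as soon as the DAG is switched on, while disabling catchup leaves only the run for 31 December.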
You might be wondering, “Why not automate the start_date as today? We are using the datetime library after all.”
First of all, your today() is not at midnight. It could be at 13:45:32. You’ll never know the exact time of its runs.
Second, this simply will NOT run. In its FAQ, Airflow recommends against using a dynamic start_date. The reason, as stated above, is that Airflow executes the DAG only after start_date + interval (daily) has passed. Therefore, if start_date is computed dynamically, it is re-evaluated every time the DAG file is parsed and so keeps moving along with time. The start_date + interval would forever stay in the future.
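The effect can be illustrated with a small simulation in plain Python. It assumes, for the sake of the sketch, that the DAG file is re-parsed every 12 hours and that start_date is set to the current time on each parse; the parse times are made up.

```python
from datetime import datetime, timedelta

interval = timedelta(days=1)

# Pretend the DAG file is re-parsed at these moments.
parse_times = [
    datetime(2022, 1, 1, 13, 45, 32) + timedelta(hours=12 * i)
    for i in range(6)
]

results = []
for now in parse_times:
    start_date = now                    # what a dynamic "today" would yield
    first_run = start_date + interval   # end of the first interval
    # The first run is always exactly one interval away from "now".
    results.append(first_run <= now)
    print(now, "->", first_run, "| due:", first_run <= now)
```

However many times the file is parsed, the first run never becomes due in this model, because its trigger time moves forward together with the clock.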
Another tricky variable is logical_date (called execution_date in Airflow versions prior to 2.2). Its date-only form, formatted as YYYY-MM-DD, is available as ds for short. This is one of the many parameters that you can reference inside your Airflow task.
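As a minimal sketch, a task that reports this parameter could be written as a plain Python function. The function name and the message format are hypothetical; the assumption is that, in Airflow 2.x, a PythonOperator passes context values such as ds to arguments of the same name.

```python
def report_logical_date(ds: str) -> str:
    """Hypothetical task callable; Airflow would supply `ds` from the run context."""
    message = f"ds is {ds}"
    print(message)
    return message


# Inside a DAG definition, with Airflow 2.x installed, this could be
# wired up roughly as:
#   from airflow.operators.python import PythonOperator
#   PythonOperator(task_id="report_logical_date",
#                  python_callable=report_logical_date)
```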
Suppose a task in our daily DAG, with its start_date of the 1st of January 2022, prints ds. What date do you think will be printed out at the first run?
If your answer is “2022-01-02”, the date of its first run, then you are once again wrong. By definition, Airflow’s logical date points to the start of the interval, not to the moment when the DAG is actually executed. Hence, the correct answer is still “2022-01-01”.
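The relationship between the logical date and the actual run time can be tabulated with a short, simplified model in plain Python. It assumes the daily schedule and the start_date of 1 January 2022 from the example; the variable names are illustrative.

```python
from datetime import datetime, timedelta

start_date = datetime(2022, 1, 1)
interval = timedelta(days=1)

schedule = []
for n in range(3):
    logical_date = start_date + n * interval  # start of the n-th interval
    run_after = logical_date + interval       # when the run actually starts
    schedule.append((logical_date.strftime("%Y-%m-%d"), run_after))
    print(f"ds={logical_date:%Y-%m-%d}  runs at {run_after}")
```

Each run carries the date of the interval it covers, which is always one interval behind the moment the run starts.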
Scheduled DAGs in Airflow always have a data interval, and tasks are run at the end of it. While both start_date and execution_date (or logical_date) point to the beginning of an interval, start_date will be constant for all the runs, as set in the DAG definition. The execution_date, on the other hand, is passed as a parameter to the tasks, with a different value every time the DAG is executed.
If this explanation has been helpful, head to the Infinite Lambda blog for more useful content in the data and cloud space.