
Airflow start_date and execution_date Explained

Quang Anh
June 15, 2022
Read: 3 min

Despite Airflow’s popularity in data engineering, the start_date and execution_date concepts remain a source of confusion for many new developers. This article aims to demystify them.

Basic Concept

Airflow is an orchestration tool: given sufficient permissions, it can control other services in a pre-defined order and on a pre-defined schedule.

Let’s say you want to ingest some data with Fivetran, transform it with dbt, then run a notebook analysis with Databricks. You can program each of these steps as an Airflow task with the right order and dependencies, schedule them to run as often as you need, and Airflow will execute them as instructed.

Needless to say, in the world of data engineering, where gigabytes of data are moved every second, having an orchestration tool like this is critical.

The start_date

Say you go to work on the 1st of January, 2022 (I know, I know, it is New Year’s and nobody should be working, but we do not have a labour union for fictional blog characters here). You finish your DAG at 15:00 and you want it to run every day at midnight. So, you put in the final settings as below, expecting the pipeline to run immediately after you start it:
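Something along these lines, with an illustrative DAG id and a placeholder task (a sketch, not the original snippet):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A daily DAG written on 1 January 2022, with start_date set to that same day.
with DAG(
    dag_id="daily_ingestion",         # hypothetical DAG id
    start_date=datetime(2022, 1, 1),  # "today" in our example
    schedule_interval="@daily",       # run every day at midnight
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="echo 'ingesting data'",  # stand-in for the real work
    )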

If you expect the pipeline to run at that time, think again. Airflow starts running the tasks for a given interval at the end of the interval itself, so it will not start its first run until the daily interval closes at the end of 01-01-2022, i.e. at midnight going into the following day (2nd Jan 2022).

The reason is that if you want to ingest the data of the 1st of January 2022 (and before), you need to wait until the end of that (daily) interval, so that the data source has the full day’s data available when the ingestion starts.

If you want your DAG to run today (1st of Jan in our example), do this:
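One way, sketched below, is to move the start_date one interval back, so that the interval ending today has already closed:

from datetime import datetime
from airflow import DAG

# With start_date on the previous day, the first daily interval
# (31 Dec to 1 Jan) has already ended, so the first run triggers right away.
with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2021, 12, 31),
    schedule_interval="@daily",
) as dag:
    ...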

Or, to be safe, why not go a bit overboard:
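For instance, by pushing the start_date a whole year into the past (again, just a sketch):

from datetime import datetime
from airflow import DAG

# Surely a start_date far in the past guarantees an immediate run?
with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    ...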

Right? Wrong.

By default, Airflow catches up: it schedules a run for every past interval between the start_date and the present that has not been executed yet. So unless you want a pile of unnecessary additional runs, do not put your start_date in the past. This behaviour can be disabled by setting catchup=False, as sketched below.
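The same DAG with catch-up disabled:

from datetime import datetime
from airflow import DAG

# With catchup=False, Airflow only schedules the most recent interval
# instead of backfilling every missed one since start_date.
with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ...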

You might be wondering, “Why not automate the start_date as today? We are using the datetime library after all.”
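In code, the temptation looks something like this (an anti-pattern sketch; do not use it):

from datetime import datetime
from airflow import DAG

# Anti-pattern: a dynamic start_date, re-evaluated every time
# the scheduler parses the DAG file.
with DAG(
    dag_id="daily_ingestion",
    start_date=datetime.today(),
    schedule_interval="@daily",
) as dag:
    ...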

First of all, your today() is not at midnight. It could be 13:45:32, so you would never know the exact time your runs are scheduled for.

Second, this simply will NOT run. In its FAQ, Airflow strongly recommends against using a dynamic start_date. The reason, as stated above, is that Airflow executes a DAG at start_date + interval. Because the DAG file is re-parsed continuously, a dynamic start_date is re-evaluated over and over, moving along with time, so start_date + interval stays in the future forever.

The execution_date

Another tricky variable is execution_date (the name used in Airflow versions prior to 2.2; nowadays it is called logical_date, with the rendered date string available as ds). This is one of the many parameters that you can reference inside your Airflow tasks.
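Consider, as a sketch, a task that simply prints this date (the DAG and task names are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_logical_date(ds):
    # In Airflow 2.x, a callable argument named ds receives
    # the logical date rendered as a YYYY-MM-DD string.
    print(ds)

with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    print_date = PythonOperator(
        task_id="print_date",
        python_callable=print_logical_date,
    )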

What date do you think will be printed out on the first run?

If your answer is “2022-01-02”, the date of the first run, then you are once again wrong. By definition, Airflow’s logical date points to the start of the interval, not to the moment when the DAG is actually executed. Hence, the correct answer is still “2022-01-01”.

Conclusion

Scheduled DAGs in Airflow always have a date interval, and tasks run at the end of it. Both start_date and execution_date (or logical_date) point to the beginning of an interval, but start_date is a constant set once in the DAG definition and shared by all runs, whereas execution_date is passed to the tasks as a parameter and takes a different value every time the DAG is executed.

If this explanation has been helpful, head to the Infinite Lambda blog for more useful content in the data and cloud space.
