Airflow start date concepts

I found the concept of start date a little confusing, so I created a doc for my team to get familiar with it. The terminology used here may not be 100% correct, but it should give you an idea to get started and understand the concept.

  • start_date = The first dag start time. keep it STATIC
  • execution_date = max(start_date, last_run_date)
  • schedule_interval parameter accepts cron or timedelta values
  • next_dag_start_date = execution_date + schedule_interval
  • On the Home Page, Last Run is the execution_date. Hover over the ( i ) to see the actual last run time

It is always advisable to use a STATIC start_date in a dag.
Eg: 'start_date': datetime(2019, 10, 13, 15, 50)
You can use airflow.utils.dates.days_ago(7), but it is not advisable and may cause issues, as the dag gets confused at 00:00 and switches to the next day incorrectly.
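To make the difference concrete, here is a minimal sketch using only the standard library. `days_ago_like` is a hypothetical stand-in that approximates what `airflow.utils.dates.days_ago` does (midnight of the current day minus n days); it is not the Airflow function itself, and the sample clock times are made up for illustration.

```python
from datetime import datetime, timedelta

# A static start_date never changes, no matter when the dag file is parsed.
STATIC_START = datetime(2019, 10, 13, 15, 50)


def days_ago_like(n, now):
    """Hypothetical stand-in for days_ago: midnight of `now` minus n days."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=n)


# The same dag file parsed two minutes apart, on either side of midnight.
before_midnight = days_ago_like(7, datetime(2019, 10, 20, 23, 59))
after_midnight = days_ago_like(7, datetime(2019, 10, 21, 0, 1))

print(before_midnight)  # 2019-10-13 00:00:00
print(after_midnight)   # 2019-10-14 00:00:00
```

The computed start date moves forward by a day once the clock passes 00:00, while `STATIC_START` stays put.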

The schedule_interval parameter accepts cron or timedelta values. The next dag run is determined using the formula:
next_dag_start_date = max(start_date, last_run_date) + schedule_interval
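The formula can be written as a small helper. This is only a sketch of the rule as stated in this post, not Airflow's actual scheduler code; the function name `next_dag_start_date` and the use of `None` for "no previous run" are assumptions, and it handles timedelta intervals only.

```python
from datetime import datetime, timedelta


def next_dag_start_date(start_date, last_run_date, schedule_interval):
    """Apply max(start_date, last_run_date) + schedule_interval.

    last_run_date may be None when the dag has never run (an assumption
    made for this sketch).
    """
    if last_run_date is None:
        execution_date = start_date
    else:
        execution_date = max(start_date, last_run_date)
    return execution_date + schedule_interval


start = datetime(2019, 10, 13, 15, 50)
hourly = timedelta(hours=1)

first = next_dag_start_date(start, None, hourly)
second = next_dag_start_date(start, datetime(2019, 10, 13, 16, 50), hourly)
print(first)   # 2019-10-13 16:50:00
print(second)  # 2019-10-13 17:50:00
```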

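Eg - if your start_date = datetime(2019, 10, 13, 15, 50) and schedule_interval = timedelta(hours=1). (With a cron value such as 0 * * * * or @hourly, the runs would instead line up with the top of the hour rather than with minute 50.)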

Case a) current_time is before start_date, e.g. 2019-10-13 00:00. Your dag will then be scheduled at
2019-10-13 16:50, and subsequently every hour.
Please note that it will not start at start_date (2019-10-13 15:50), but rather at execution_date + schedule_interval.

Case b) current_time is after start_date, e.g. 2019-10-14 00:00. Your dag will then be scheduled at
2019-10-13 16:50, 2019-10-13 17:50, 2019-10-13 18:50 ... and will keep catching up until it reaches 2019-10-13 23:50.
Then it will wait for the strike of 2019-10-14 00:50 for the next run.
Please note that the catchup can be avoided by setting catchup=False in the dag properties.
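Both cases can be reproduced with a rough simulation. This is a sketch under stated assumptions, not the scheduler's real logic: it assumes a one-hour timedelta interval, the helper `trigger_times` is hypothetical, and catchup=False is modelled as "keep only the most recent missed run".

```python
from datetime import datetime, timedelta


def trigger_times(start_date, interval, now, catchup=True):
    """Return (runs already due at `now`, next pending trigger time)."""
    due = []
    t = start_date + interval  # the first run fires one interval after start_date
    while t <= now:
        due.append(t)
        t += interval
    if not catchup:
        due = due[-1:]  # assumption: only the latest missed interval is run
    return due, t


start = datetime(2019, 10, 13, 15, 50)
hourly = timedelta(hours=1)

# Case a) current time is before start_date: nothing is due yet.
due_a, next_a = trigger_times(start, hourly, datetime(2019, 10, 13, 0, 0))
print(len(due_a), next_a)  # 0 2019-10-13 16:50:00

# Case b) current time is after start_date: 16:50 through 23:50 are caught up.
due_b, next_b = trigger_times(start, hourly, datetime(2019, 10, 14, 0, 0))
print(len(due_b), due_b[0], due_b[-1], next_b)

# Same as case b) but with catchup disabled.
due_c, next_c = trigger_times(start, hourly, datetime(2019, 10, 14, 0, 0), catchup=False)
print(due_c, next_c)
```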


Hey, thanks for the summary here. This is great. Just to add about the execution_date: Airflow runs DAGs at the end of the schedule interval. So for a DAG with an hourly schedule starting at 8am, it will run the first DAG Run at 9am... and the execution_date of that DAG Run will be 8am. So at 9am, the 8am DAG Run is triggered. You can think of it as "at 9am, I'm ready to process the 8am data... so run the workflow with a data date of 8am". Hope that helps!
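The 8am/9am relationship can be sketched in a few lines. The helper `dag_run_for` and the calendar date are made up for illustration; only the "trigger = execution_date + interval" rule comes from the explanation above.

```python
from datetime import datetime, timedelta


def dag_run_for(interval_start, schedule_interval):
    """Pair an interval's execution_date with the time its run is triggered."""
    return {
        "execution_date": interval_start,                     # the "data date"
        "triggered_at": interval_start + schedule_interval,   # end of the interval
    }


# Arbitrary example day; hourly schedule starting at 8am.
run = dag_run_for(datetime(2019, 10, 13, 8, 0), timedelta(hours=1))
print(run["execution_date"])  # 2019-10-13 08:00:00
print(run["triggered_at"])    # 2019-10-13 09:00:00
```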


Thanks for this, @sohiljain! Really helpful.

Related post here to @AndrewHarmon's comment for anyone following: Airflow Pro-Tip: Scheduler will run your job one schedule_interval AFTER the start date