Suppose one has multiple ETL DAGs that must execute sequentially, each starting after the previous one completes. What would be the best way to define a main DAG / sub-DAG pattern that triggers each sub-DAG and waits for its completion?
Setup
- MAIN DAG
  - limits visual complexity in the graph view
  - triggers the other DAGs
  - handles branching logic
  - is scheduled
  - Example: start >> child_dag_0 >> child_dag_1
- CHILD DAGS
  - contain the ETL tasks
  - are not scheduled
Option 1 SubDags
SubDAGs have a history of problems with worker-slot consumption and deadlocks, so this is probably not the best option.
Option 2 TaskGroups
Divide each child pipeline into a TaskGroup in the MAIN DAG. This is possible, but it makes ad hoc, one-off runs of a single TaskGroup harder, since a TaskGroup cannot be triggered independently of its DAG.
Option 3 TriggerDagRunOperator
This allows the MAIN DAG to trigger the other DAGs as needed and can poke/sense for completion status. However, the wait is implemented with a call to time.sleep (as the source code shows), so there is no reschedule mode and the task occupies a worker slot for the entire wait.
(GitHub source: TriggerDagRunOperator)
Option 4 Stable REST API + HTTPHook + HTTPSensor
With the new stable API in Airflow 2.0 it is possible to call the post_dag_run
endpoint to trigger a DAG run, then poll the get_dag_run
endpoint with an HttpSensor
for its status.
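A sketch of the REST pieces involved, with the base URL, DAG id, and run id as placeholder assumptions. The predicate at the end is the kind of response_check an HttpSensor could poll with:

```python
# Assumed webserver address and identifiers; replace with real values.
BASE_URL = "http://localhost:8080/api/v1"
DAG_ID = "child_dag_0"
RUN_ID = "manual__2021-01-01T00:00:00"

# POST here (via an HttpHook/SimpleHttpOperator) triggers a new run.
trigger_endpoint = f"{BASE_URL}/dags/{DAG_ID}/dagRuns"

# GET here returns the run object, including its "state" field.
status_endpoint = f"{BASE_URL}/dags/{DAG_ID}/dagRuns/{RUN_ID}"

def run_finished(response_json: dict) -> bool:
    """response_check-style predicate: True once the run reaches a
    terminal state, so the sensor stops poking."""
    return response_json.get("state") in ("success", "failed")
```

Unlike the TriggerDagRunOperator wait, an HttpSensor can use mode="reschedule", freeing its worker slot between pokes.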
Are there any other patterns or suggestions for accomplishing this kind of orchestration?