We generally recommend using an ExternalTaskSensor in this scenario. That’d allow for a first task to wait for the Redshift load in a first DAG to finish and trigger downstream tasks in another DAG from there.
A few notes:
- Both DAGs would have to be set on the same schedule
- The DAG execution date of that sensor MUST match the DAG execution date of the task it’s sensing. That’s why they must have the same schedule.
- Consider upgrading to 1.10.2 (details below)
Sensors in Airflow 1.10.2
Sensors take up a persistent worker slot that historically has sometimes created a “deadlock” where, if you have more sensors than worker slots, nothing else gets pulled in to get executed. Airflow 1.10.2 thankfully has a “reschedule mode” (https://github.com/apache/airflow/blob/1.10.2/airflow/sensors/base_sensor_operator.py#L46-L56) that addresses that issue, so might be worth upgrading.
Now, if you have more sensors than worker slots, the sensor will now get thrown into a new
up_for_reschedule state and unblock that slot. If you can upgrade to 1.10.2, the better.
The heart of issue is inter-DAG dependency. So the better question would be how do DAGs depend on other DAGs, leading to the discussion of how to trigger said DAGs. As mentioned by the customer above, they overcome this by triggering downstream DAGs with what im guessing are TriggerDagRunOperator.
On the flip side, the DAGs could start and wait until the dependent DAG is in the success state to run the rest of the dependent tasks. @paola suggested ExternalTaskSensor, which is perfect for this scenario. I do want to clean up the notes.
- Both DAGs do no have to be set on the same schedule. They can be on different schedules given that the difference between the schedules is provided to the ExternalTaskSensor as execution_delta.
- Sensors should definitely set the mode as
rescheduled to prevent issues @paola mentioned, freeing up slots for efficient worker usage.
Final thoughts would be to implement separation of duties as it is not Airflow’s job to track inter-DAG dependencies, which is different than managing task dependencies as individual DAG units. I would explore a microservice architecture that records and returns DAG states. This would require custom operators that interact with a database at the beginning of a DAG, sensors to check state, and at the end of a DAG, operator to set state.
If you would like more information on this particular architecture, I can provide that in a separate post!