Airflow Datasets - Can they be cleared or reset?

Hi, I am using Airflow Datasets/Data-driven scheduling for a data pipeline: Data-aware scheduling — Airflow Documentation

I frequently run into an issue where the upstream Dataset dependencies for a DAG become out of sync. For example, if DAG D3 depends on Datasets from DAGs D1 and D2, and D1 publishes its Dataset but D2 does not, then D3 will show 1/2 Dataset dependencies satisfied.

If I want to “reset” the Datasets satisfied for the D3 DAG (i.e. make it show 0/2 Datasets satisfied), the only way I currently know is to publish D2’s Dataset, which triggers the D3 DAG to run and only then starts the counter over with a fresh slate of 0/2.

Is there a way that we can reset, clear, or otherwise wipe away a DAG’s upstream Dataset dependencies, so that the DAG will need ALL of the upstream dependencies satisfied, and no “partial” Dataset dependencies satisfied?

Hey @vincent99-git

As of today, there is no feature to clear or reset Dataset dependencies. But I think it would be a useful feature, and I have relayed it to the team.

Is this something you come across often in most of your pipelines, or just an edge case? If it is indeed a blocker, we can look at alternatives.

Thanks
Manmeet

Thank you @manmeet for the feedback.

The need for our pipeline to selectively reset/clear certain DAGs’ Dataset dependencies is a rather frequent occurrence for us.

It mostly arises when multiple downstream DAGs depend on the same upstream Dataset. In that case, publishing the upstream Dataset may bring 1 downstream DAG’s Dataset dependencies into sync, while knocking 4 other downstream DAGs out of sync (because those 4 DAGs may have already been at a fresh 0/X slate).

While such a feature is by no means absolutely necessary, as you mention, it would be a useful one to have! It would save time and provide a much cleaner way to fix complex Dataset dependency graphs with multiple up/downstream dependencies.

Best,

Vincent

Hey @vincent99-git, I totally agree with what you said. This has already been raised as a feature request with the OSS team and is on their radar.

As for next steps to get you unblocked: have you tried looking at the metadata DB table dataset_dag_run_queue, which tracks the Dataset events that have been satisfied for each downstream DAG but have not yet triggered it? It may be worth experimenting with if this is a blocker. Could be a good case study!
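For anyone following along, here is a minimal sketch of what inspecting that table could look like. It assumes the Airflow 2.4+ schema for dataset_dag_run_queue (columns dataset_id, target_dag_id, created_at) and uses the hypothetical downstream DAG id D3 from the example above; verify the schema against your own Airflow version before relying on it.

```sql
-- List the Dataset events currently queued for downstream DAG "D3":
-- one row per (dataset, target DAG) pair that has been satisfied but
-- has not yet caused the DAG to trigger.
SELECT dataset_id, target_dag_id, created_at
FROM dataset_dag_run_queue
WHERE target_dag_id = 'D3';
```

In this thread’s terms, the number of rows returned for a DAG is the numerator of its “1/2 Datasets satisfied” counter.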

Thanks
Manmeet

Someone came up with a workaround that involves fiddling with the Airflow metadata DB directly.
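Here is a minimal sketch of what that kind of workaround could look like. It uses an in-memory SQLite database purely to illustrate; the table and column names match the Airflow 2.4+ dataset_dag_run_queue schema, and the DAG id D3 is the hypothetical one from this thread. Against a real deployment you would run the equivalent DELETE on the actual metadata database, with a backup taken first, since this is unsupported surgery on Airflow internals.

```python
import sqlite3

# Illustration only: simulate Airflow's dataset_dag_run_queue table
# (Airflow 2.4+ schema: dataset_id, target_dag_id, created_at) in an
# in-memory SQLite DB. On a real deployment the same DELETE would run
# against the actual metadata database -- back it up first.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dataset_dag_run_queue (
           dataset_id INTEGER NOT NULL,
           target_dag_id TEXT NOT NULL,
           created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
           PRIMARY KEY (dataset_id, target_dag_id)
       )"""
)

# D3 is waiting on two Datasets; only dataset 1 (from D1) has been
# published, so D3 currently shows 1/2 Dataset dependencies satisfied.
conn.execute(
    "INSERT INTO dataset_dag_run_queue (dataset_id, target_dag_id) VALUES (1, 'D3')"
)


def reset_dataset_queue(conn, dag_id):
    """Clear any partially satisfied Dataset events queued for dag_id,
    putting it back to a fresh 0/X slate without triggering a run."""
    cur = conn.execute(
        "DELETE FROM dataset_dag_run_queue WHERE target_dag_id = ?", (dag_id,)
    )
    conn.commit()
    return cur.rowcount


removed = reset_dataset_queue(conn, "D3")
print(removed)  # 1 -> D3 is back to 0/2 Datasets satisfied
```

The key design point is that the delete is scoped by target_dag_id, so only the chosen downstream DAG is reset; other DAGs consuming the same Datasets keep their queued events untouched.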