I frequently run into an issue where the upstream Dataset dependencies for a DAG become out of sync. For example, if DAG D3 depends on Datasets from DAGs D1 and D2, and D1 publishes its Dataset but D2 does not, then D3 will show 1/2 Dataset dependencies satisfied.
If I want to “reset” the satisfied Datasets for the D3 DAG (i.e. make it show 0/2 Datasets satisfied), I currently know of no way to do this other than publishing the D2 Dataset to trigger a D3 run, which then starts the Datasets over with a fresh slate of 0/2.
Is there a way that we can reset, clear, or otherwise wipe away a DAG’s upstream Dataset dependencies, so that the DAG requires ALL of its upstream dependencies to be satisfied again, with no “partial” satisfaction carried over?
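For reference, the setup looks roughly like this (DAG ids and Dataset URIs are made up for illustration):

```python
# Illustrative sketch only; the URIs and dag_id are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

d1_out = Dataset("s3://example-bucket/d1_output")  # published by DAG D1
d2_out = Dataset("s3://example-bucket/d2_output")  # published by DAG D2

# D3 is only triggered once BOTH upstream Datasets have been updated
# since its last Dataset-triggered run.
with DAG(
    dag_id="d3",
    start_date=datetime(2023, 1, 1),
    schedule=[d1_out, d2_out],
) as dag:
    EmptyOperator(task_id="consume")
```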
As of today, there is no feature to clear or reset Dataset dependencies. But I agree it would be a useful feature, and I have relayed it to the team.
Is this something you come across often for most of your pipelines, or is it more of an edge case? If it is indeed a blocker, we can look at other alternatives.
The need to selectively reset/clear certain DAGs’ Dataset dependencies comes up rather frequently for us.
It mostly arises when multiple downstream DAGs depend on the same upstream Dataset. In that case, publishing the upstream Dataset may bring 1 of the downstream DAGs fully in sync, but leave 4 other downstream DAGs with their Dataset dependencies out of sync (because those 4 DAGs may already have been at a fresh 0/X slate).
While such a feature is by no means absolutely necessary, as you mention it would be useful to have! It would save time and provide a much cleaner way to fix complex Dataset dependency graphs with multiple upstream/downstream dependencies.
Hey @vincent99-git, I totally agree with what you said. This has already been raised as a request with the OSS team, and it is on their radar.
As for next steps to get you unblocked, have you tried looking at the metadata DB table dataset_dag_run_queue, which lists the Datasets that are queued for each downstream DAG and those that have been completed? It may be worth experimenting with if this is a blocker. Could be a good case study!
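As a starting point, here is a minimal sketch of what that experimentation could look like, assuming Airflow 2.4+ where dataset_dag_run_queue is exposed via the DatasetDagRunQueue ORM model. The dag_id is hypothetical, and deleting metadata DB rows directly is not an officially supported reset mechanism, so please try this in a non-production environment first.

```python
# Experimental sketch: inspect and clear the queued (partially satisfied)
# Dataset events for one downstream DAG by removing its rows from
# dataset_dag_run_queue. Not an official API; back up / test first.
from airflow.models.dataset import DatasetDagRunQueue
from airflow.utils.session import create_session

TARGET_DAG_ID = "d3"  # hypothetical downstream DAG id

with create_session() as session:
    queued = (
        session.query(DatasetDagRunQueue)
        .filter(DatasetDagRunQueue.target_dag_id == TARGET_DAG_ID)
        .all()
    )
    print(f"{len(queued)} queued Dataset event(s) for {TARGET_DAG_ID}")

    # Removing these rows should put the DAG back to a 0/N "fresh slate"
    # so it again waits for ALL upstream Datasets.
    for row in queued:
        session.delete(row)
```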