Transferring this question from a customer of ours as it’s a general Airflow design challenge folks in the community might have feedback on.
We currently have a pipeline which incrementally updates a large table in Redshift, transforming and then inserting all source table data which is has arrived since the DAG last ran (hourly).
Occasionally we want to do a full rebuild on this table. The table must remain accessible to queries during this process. Do you have a suggestion for a best practices way to implement this functionality?
A branch operator that is conditional on a rebuild=1 variable? A separate dag that is triggered manually? Ideally the solution would be extensible to all tables, as it’s not uncommon to have to rebuild an incremental table occasionally.
A couple of different approaches have been proposed in the past:
- The one standard way some people approach this is to use the Airflow backfill approach, which has the advantage of using the same DAG but has an efficiency tradeoff for dealing with large data sets.
- The other approach, which I personally tend to prefer, is to use a separate DAG for this “refresh” since a more efficient approach could be taken in this case.
Approach #1 might not work here, as each run (hourly) uses a predicate to only process newly arrived data in the source table:
'WHERE source_table.ds > (SELECT max(ds) FROM target_table)'. Running a backfill given that predicate would have no impact.
What does approach #2 actually look like? Is there a better way to tackle this challenge? Is the best thing to do -
- Toggle OFF the main dag in the UI
- Trigger the rebuild dag in the UI
- Wait for it to finish, then toggle ON the main DAG in the Airflow UI
Having to create this second refresh-dag for every ETL DAG is annoying. Is this generalizable into a macro or a plugin, or something else?