What's the best way to occasionally rebuild/backfill a large table in Redshift that's already being updated hourly with Airflow?

paola · May 20, 2020, 2:18am

Transferring this question from a customer of ours as it’s a general Airflow design challenge folks in the community might have feedback on.

Challenge

We currently have a pipeline which incrementally updates a large table in Redshift, transforming and then inserting all source table data which is has arrived since the DAG last ran (hourly).

Occasionally we want to do a full rebuild on this table. The table must remain accessible to queries during this process. Do you have a suggestion for a best practices way to implement this functionality?

A branch operator that is conditional on a rebuild=1 variable? A separate dag that is triggered manually? Ideally the solution would be extensible to all tables, as it’s not uncommon to have to rebuild an incremental table occasionally.

Options

A couple of different approaches have been proposed in the past:

The one standard way some people approach this is to use the Airflow backfill approach, which has the advantage of using the same DAG but has an efficiency tradeoff for dealing with large data sets.
The other approach, which I personally tend to prefer, is to use a separate DAG for this “refresh” since a more efficient approach could be taken in this case.

Approach #1 might not work here, as each run (hourly) uses a predicate to only process newly arrived data in the source table: 'WHERE source_table.ds > (SELECT max(ds) FROM target_table)'. Running a backfill given that predicate would have no impact.

The Question

What does approach #2 actually look like? Is there a better way to tackle this challenge? Is the best thing to do -

Toggle OFF the main dag in the UI
Trigger the rebuild dag in the UI
Wait for it to finish, then toggle ON the main DAG in the Airflow UI

Having to create this second refresh-dag for every ETL DAG is annoying. Is this generalizable into a macro or a plugin, or something else?

Topic		Replies	Views
Backfilling guidelines Airflow	0	18475	July 11, 2019
What's the best way to trigger a DAG with another DAG? Airflow	1	13130	June 17, 2020
Best practices on dag re-runs (even successful ones)	6	9950	March 31, 2021
How to run a pipeline with these 2 taks? Airflow	1	1400	May 13, 2022
Backfill with rest api Airflow	5	2293	October 19, 2022

What's the best way to occasionally rebuild/backfill a large table in Redshift that's already being updated hourly with Airflow?

Challenge

Options

The Question

Related Topics