I have an Airflow DAG which pulls data from an external HR system that holds employee data for companies. In our ETL we need to pull data for all companies.
Our DAG looks like this today:
- Get List of Companies
- Get list of employees for each company
- Get information about each employee
- Combine the information and push to S3
Is there a way we can run the tasks that get information about each employee in parallel instead of sequentially? Some way to generate the parallel tasks based on the result of a previous task?
You can definitely generate tasks within a DAG dynamically based on input data. See https://www.astronomer.io/guides/managing-dependencies/.
What type of operator are you using to get the list of companies? I’m still an Airflow beginner, but it seems like you could either:
- Pass the list of companies to the operator for task #2 via XCom (assuming it’s a small amount of data)
- Persist the list of companies to a shared data store and then read the list of companies in task #2.
Once task #2 has the list of companies, it’s easy to create a task for each company.
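The XCom handoff suggested above can be sketched like this. It is simulated with a plain dict standing in for Airflow's XCom backend so the flow is visible; in a real DAG, returning a value from a `PythonOperator` callable pushes it to XCom, and the downstream task retrieves it with `ti.xcom_pull(task_ids="get_companies")`. The company names here are made up for illustration.

```python
# Stand-in for Airflow's XCom table; keyed by (task_id, key).
xcom = {}

def get_companies():
    """Task #1: in the real DAG this would call the HR API."""
    companies = ["acme", "globex", "initech"]  # illustrative data
    # Returning a value from a PythonOperator callable pushes it to XCom;
    # here we store it in the stand-in dict directly.
    xcom[("get_companies", "return_value")] = companies

def list_employees():
    """Task #2: pull the company list and fetch employees per company."""
    companies = xcom[("get_companies", "return_value")]  # xcom_pull stand-in
    return {c: f"employees-of-{c}" for c in companies}

get_companies()
print(sorted(list_employees()))  # one entry per company
```

Note XCom is only appropriate for small payloads (a short list of company IDs, not the employee records themselves); large intermediate data belongs in a shared store like S3.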
To get the companies we use a custom operator which internally calls an API to get the list of companies.
So would this process still work, and what would be the shared storage we use?
It sounds like you are trying to fan out your tasks. For example: Task A runs, and based on its results you create n Task Bs to run in parallel; when they all complete, Task C runs. For each DAG run, the number of Task Bs is different and is based on the result of Task A for that particular DAG run. If this is correct, unfortunately this is not currently supported by Airflow: the DAG structure cannot dynamically change within a DAG run. I believe this is a feature Airflow is looking to include in a future release.
Hi, @AndrewHarmon. What would happen if DAG run #1 had 3 companies, and thus 3 tasks for step 2, and then on a subsequent DAG run there were 4 companies and thus 4 tasks for step 2?
Just trying to get a better handle on how the components interact.
You could make your DAG file dynamic. If you had a config file, env var, or Airflow Variable with the value 3 in it, you could use that in a loop in your DAG file to create 3 similar tasks, one for each company. Then, if sometime between DAG run 1 and 2 you edited that value to 4, your DAG would instantly reflect that and have 4 similar tasks when DAG run 2 starts. This pattern is supported.
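The loop pattern described above can be sketched as follows. Airflow specifics are deliberately left out so the structure is runnable on its own: in a real DAG file, `num_companies` would come from something like `int(Variable.get("num_companies"))`, each task id below would be an operator instance (e.g. a `PythonOperator`), and the fan-out/fan-in wiring would be set with `>>`. All names here are illustrative.

```python
num_companies = 3  # in a real DAG file: int(Variable.get("num_companies"))

def build_task_ids(n):
    """One fetch task per company; edit the variable between runs to resize."""
    return [f"fetch_company_{i}" for i in range(n)]

fetch_tasks = build_task_ids(num_companies)

# Fan-out / fan-in wiring, recorded as plain data for this sketch.
# In Airflow: get_companies >> fetch_op >> combine for each fetch_op.
dag_structure = {
    "get_companies": fetch_tasks,  # upstream fans out to n parallel tasks
    **{t: ["combine_and_push_to_s3"] for t in fetch_tasks},  # fan back in
}

print(fetch_tasks)
```

Because the loop runs every time the scheduler parses the DAG file, changing the configured count between runs reshapes the DAG automatically; the key restriction is that the count cannot change mid-run.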
Thanks, @AndrewHarmon. What’s the major difference between what @twylabs-ext is trying to do and that pattern? Is it simply that the variable is defined in a “static” location (e.g. a config file, env var, or Airflow variable) as opposed to a database table?
I think @twylabs-ext wants to change the number of tasks during the DAG run, i.e. edit the number of tasks while the DAG is actually running. That can’t be done. But you can edit the number of tasks between DAG runs.
Can I not do something like this:
- Fetch a list of companies from the API in Task 1
- Depending on the number above, say 100, generate 100 identical tasks in parallel, each fetching some data for an individual company and storing it in S3
Check the attached diagram.
@twylabs-ext Yes, that should work as long as (as @AndrewHarmon mentioned) the number of companies returned in task #1 won’t change within that DAG run. It sounds like there should be no problem if the number of companies changes between runs.
So going back to your use case, your custom operator would call the API, get the list of companies, and the list of companies would be used by a dependent operator. As for what type of persistent storage, that might be better answered by someone else, as I am still learning Airflow myself. If you’re able, could you write the list to S3 under a key tied to the ID of that specific DAG run and then read the list back from there? Or maybe have a persistent working table with a column that indicates the DAG run ID, which the next task can use in a query?
Yeah, I will try that; I’ll post back with whatever works for me.