I am new to Apache Airflow, but I have found it very handy. Currently I am trying to use it for the following scenario:
At the university we automatically check students' projects by downloading them from GitLab and processing them one by one (or in parallel) every few hours. So I created a DAG which is able to process one project. But because there are hundreds of projects (around 800), I am now looking for a way to process all of them correctly.
I can do it the following way:
# process the list of projects by their ids
for project in [1, 2, 3]:
    process_project(project)
This way it will create hundreds of running DAGs (visible in the Graph/Tree View), executed one by one. I am not sure whether this is the correct way to do it, because the list of executed DAGs gets huge. Yes, I can easily see the errors for every single processed project, but it looks messy :-/
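For context, the wiring is roughly like this (heavily simplified; process_project, the ids, the dag_id and the schedule stand in for my real code, and I am assuming the Airflow 2 import paths):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def process_project(project_id):
    # download the project from GitLab and run the automatic checks
    # (real checking logic omitted)
    ...

with DAG(
    dag_id="check_student_projects",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # in reality every few hours
    catchup=False,
) as dag:
    # one task per project, around 800 ids in reality
    for project in [1, 2, 3]:
        PythonOperator(
            task_id=f"process_project_{project}",
            python_callable=process_project,
            op_kwargs={"project_id": project},
        )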
And also: how do I process them correctly in parallel? With pools?
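What I imagine (but am not sure about) is creating a pool with a fixed number of slots and pointing every per-project task at it; the pool name and the slot count below are made up:

# the pool itself would be created once, e.g. with the Airflow 2 CLI:
#   airflow pools set student_projects 8 "limit concurrent project checks"
PythonOperator(
    task_id=f"process_project_{project}",
    python_callable=process_project,
    op_kwargs={"project_id": project},
    pool="student_projects",   # at most 8 of these tasks run at the same time
)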
One more question: how do I correctly end the processing in a running DAG? Currently I am raising an Airflow exception, which also quits the for loop. Yes, I can wrap it in a try/except, but again I am not sure whether this is the correct way.
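Just to show what I mean by the try/except workaround (simplified):

from airflow.exceptions import AirflowException

for project in [1, 2, 3]:
    try:
        process_project(project)
    except AirflowException:
        # skip the failed project and continue with the remaining ones
        continue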
Thanks for any help
mirek