Hi everyone, I’m new to Airflow. I’m looking for a way to schedule jobs stored in a database along with their corresponding cron expressions. I first looked for a "create DAG" endpoint on the Airflow REST API, but it seems that isn’t possible, so my understanding is that this can only be done via dynamic DAG generation (globals in a single file, environment variables, generated files, etc.).

Digging into that, I noticed it’s considered bad practice to use globals or single-file generation, because Airflow re-parses every file in the DAGs folder every n seconds, which could cause performance issues. For that reason, I’m thinking about multi-file generation: reading the DAG configurations from the database and then building a pipeline that publishes the generated DAG files to Airflow. However, I’m not sure that is the best approach, taking Airflow best practices into account.
I would really appreciate any suggestions.
My first suggestion would be to evaluate whether you really need this pattern. Creating Airflow DAGs the traditional, file-based way has many benefits, and generating them from something like a database has real drawbacks. If you do want to continue with the pattern, please review this guide: Dynamically Generating DAGs in Airflow - Airflow Guides
You are correct that you should avoid top-level code (such as querying your database at the top level to assemble your DAGs).
Consider using a disk-based cache, or have a DAG create your DAG files.
Here is an example of those patterns:
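Below is a minimal sketch of those two ideas combined, assuming Airflow 2.x: a generator file that builds DAGs from a local JSON cache, plus an unscheduled maintenance DAG that refreshes that cache from the database. The cache path, the JSON shape, and fetch_dag_configs_from_db() are all hypothetical placeholders to adapt to your setup.

```python
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical location for the cached configs; adjust to your deployment.
CACHE_PATH = Path("/opt/airflow/include/dag_configs.json")


def fetch_dag_configs_from_db():
    """Placeholder for your real database query -- swap in your own client."""
    return [{"name": "example_job", "cron": "0 6 * * *"}]


def refresh_cache():
    """Rewrite the cache file from the DB. Runs inside a task, not at parse time."""
    CACHE_PATH.write_text(json.dumps(fetch_dag_configs_from_db()))


# Unscheduled maintenance DAG: trigger it whenever the configs in the DB change.
with DAG(
    dag_id="refresh_dag_config_cache",
    schedule_interval=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as refresh_dag:
    PythonOperator(task_id="refresh", python_callable=refresh_cache)

# Top-level code only reads the local cache file -- no database round trip --
# so scheduler parsing stays cheap.
if CACHE_PATH.exists():
    for config in json.loads(CACHE_PATH.read_text()):
        dag_id = f"generated_{config['name']}"
        with DAG(
            dag_id=dag_id,
            schedule_interval=config["cron"],  # cron expression from the DB
            start_date=datetime(2023, 1, 1),
            catchup=False,
        ) as dag:
            PythonOperator(
                task_id="run_job",
                python_callable=lambda **context: print("running job"),
            )
        globals()[dag_id] = dag  # register the generated DAG with the parser
```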
We have some DAGs in a similar configuration to what you are looking for, I think. Hopefully this gives you some ideas.
We have some processes that consist of the same four operations in varying orders and with different parameters. For DAG creation, we have a DAG factory file that parses an Airflow Variable containing a list of JSON objects, one defining each individual DAG. The definitions are stored in a NoSQL database, and whenever we need to update or add a DAG, we trigger an unscheduled DAG that pulls the JSON objects from the NoSQL database and overwrites the existing Variable. This lets us dynamically add and remove DAGs without ever needing to push code. Obviously this isn’t practical for large-scale DAG creation, but we haven’t had any issues with the 50 or so DAGs we need; see the sketch below.
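For illustration, here is a rough sketch of that factory file, assuming Airflow 2.x. The Variable name "dag_definitions" and the JSON shape (dag_id, cron, steps) are invented for this example:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def run_step(**context):
    """Stand-in for one of the shared operations."""
    print("running step")


# Variable.get() reads Airflow's own metadata DB, which is much cheaper at
# parse time than reaching out to an external system.
definitions = json.loads(Variable.get("dag_definitions", default_var="[]"))

for definition in definitions:
    with DAG(
        dag_id=definition["dag_id"],
        schedule_interval=definition.get("cron"),
        start_date=datetime(2023, 1, 1),
        catchup=False,
    ) as dag:
        previous = None
        # "steps" lists the operations in whatever order this DAG needs.
        for step in definition["steps"]:
            task = PythonOperator(
                task_id=step["name"],
                python_callable=run_step,
                op_kwargs=step.get("params", {}),
            )
            if previous is not None:
                previous >> task
            previous = task

    globals()[definition["dag_id"]] = dag
```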
Whichever way you approach it, do not include any external calls at the top level of your DAG file. You do not want to bog down the scheduler when it parses the DAGs.
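To make that concrete, here is a hedged before/after sketch: the same database call placed at the top level (runs on every scheduler parse) versus inside a task callable (runs only when the task executes). The client call is hypothetical.

```python
# Anti-pattern: a query here would run on every scheduler parse of this file.
# rows = my_db_client.query("SELECT ...")  # hypothetical client -- avoid here

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def query_db(**context):
    """The external call belongs here, at task run time, not at parse time."""
    # rows = my_db_client.query("SELECT ...")  # hypothetical client
    print("querying the database inside a task")


with DAG(
    dag_id="deferred_db_call_example",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="query_db", python_callable=query_db)
```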