I’ve been trying to get my deployment to work for a few days now. Currently, a Celery deployment is running with the Astronomer default settings.
My DAGs are dynamically generated from a JSON file that is copied to the deployment. So a DAG file generates many DAGs in Airflow.
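The pattern described above can be sketched as follows. This is a simplified, self-contained illustration: the `DAG` class, the DAG IDs, and the JSON layout are all assumptions, and the placeholder `DAG` class stands in for `airflow.DAG` so the sketch runs without Airflow installed.

```python
import json

# Placeholder standing in for airflow.DAG -- just enough to show the pattern.
class DAG:
    def __init__(self, dag_id, schedule=None):
        self.dag_id = dag_id
        self.schedule = schedule

# Hypothetical config: one JSON entry per DAG to generate.
CONFIG = json.loads("""
[
  {"dag_id": "api_sync_customers", "schedule": "@hourly"},
  {"dag_id": "api_sync_orders",    "schedule": "@daily"}
]
""")

# One DAG file emitting many DAGs: Airflow discovers any DAG object
# bound to a module-level (global) name, so we register each one.
for entry in CONFIG:
    dag = DAG(entry["dag_id"], schedule=entry["schedule"])
    globals()[entry["dag_id"]] = dag
```

With this approach, a single file produces as many DAGs as there are entries in the JSON config.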
At first I had the problem that generating the DAGs seemed to take too long. There are currently 32 DAGs, but there will be more. Whenever I clicked on the DAGs in the UI, I got a message that the DAGs were missing.
With some customization of the environment variables, everything now works.
But now I have the problem that the tasks take a long time. Each task just queries an API (responses of 1–400 lines) and writes the result to a Google Cloud SQL (PostgreSQL) database. Work that normally takes only a few seconds.
In the logs I see that after starting the task it takes about 2 minutes before the actual work gets done. During this time, a certain line is output again and again in the logs.
When the scheduler heartbeats, it parses the DAG file but doesn’t execute it (as in execute the tasks), and that happens roughly every 5 seconds. All tasks in that DAG will have their init called on each heartbeat, which should be lightweight; the heavy lifting should be in the execute of the operator, not in the init of the operator and hooks. When your task runs, it does an initial parse of the whole DAG file, so that’s probably why you are seeing that line printed for each task using the CloudSqlDatabaseHook: it is going through the init of each task. It does seem to be an unusual pattern to init the hook in the init of the operator. Typically I see the hook being created in the execute section of the operator. I’m not sure if there’s a special reason why it was done this way.
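The difference between the two placements can be sketched with stand-in classes (these are simplified stand-ins, not the real Airflow operator or hook): a hook created in the operator’s init is constructed on every parse pass the scheduler makes, while a hook created in execute is constructed only when the task actually runs.

```python
# Stand-in for a connection hook; counts how often it is constructed.
class FakeCloudSqlHook:
    instances = 0

    def __init__(self):
        # Pretend this is expensive connection setup.
        FakeCloudSqlHook.instances += 1

# Anti-pattern: hook built at parse time, on every scheduler pass.
class EagerOperator:
    def __init__(self):
        self.hook = FakeCloudSqlHook()

    def execute(self):
        return self.hook

# Preferred pattern: hook built lazily, only when the task runs.
class LazyOperator:
    def __init__(self):
        self.hook = None

    def execute(self):
        self.hook = FakeCloudSqlHook()
        return self.hook

# Simulate five parse passes over a DAG with one task of each kind:
# the eager operator constructs five hooks, the lazy one constructs none.
for _ in range(5):
    EagerOperator()
    LazyOperator()
```

After those five simulated parses, `FakeCloudSqlHook.instances` is 5, all of them from `EagerOperator`; only calling `LazyOperator().execute()` would add a sixth.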
My concern with the slowness of your DAG may be that it is being dynamically created: each task has to parse its DAG file, but your DAG file contains a lot of DAGs. Is there anything in the top-level code of your DAG that is resource-intensive? I would think simply parsing your JSON from a local file to generate the DAGs would be relatively quick. Are you making any remote calls to anything? You can also try upping the AU on your scheduler/workers to give them a little more juice to process that dynamic DAG file.
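As a quick sanity check on the point above, you can time the top-level work your DAG file does on each parse. This sketch (the 32-entry config mirrors the DAG count mentioned earlier; the structure is an assumption) shows that parsing a small local JSON config is cheap, which is why remote calls or database queries at top level are usually the culprit instead:

```python
import json
import time

# Hypothetical local config comparable in size to the 32 generated DAGs.
config_text = json.dumps([{"dag_id": f"dag_{i}"} for i in range(32)])

# Time the work that would run on every scheduler parse pass.
start = time.perf_counter()
entries = json.loads(config_text)
elapsed = time.perf_counter() - start

# Parsing a config like this takes well under a second. By contrast,
# a remote call at top level (an HTTP request, a database query, a
# hook constructed in an operator's init) runs on every parse pass
# and can add seconds or minutes of apparent "startup" time per task.
```

If the timed section of your real DAG file is slow, whatever runs inside it is being paid on every heartbeat, not just when tasks execute.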
Thank you very much for your detailed answer.
I have modified my operator so that the CloudSqlDatabaseHook is no longer initialized in its init method. It is now created in the execute section, as you suggested.
In fact, I don’t have any problems anymore. The DAGs now load fast enough and the logs are much smaller.