Task take very long and log printing out same line over and over again

Jonnyblacklabel · August 7, 2019, 9:50am

Hello,

I’ve been trying to get my deployment to work for a few days now. Currently a Cerlery deployment is running with the Astronomer default settings.
My DAGs are dynamically generated from a JSON file that is copied to the deployment. So a DAG file generates many DAGs in Airflow.

At first I had the problem that generating the DAGs seemed to take too long. There are currently 32 DAGs, but there will be more. Whenever I clicked on the DAGs in the UI, I got a message that the DAGs were missing.
With some customization of the environment variables, everything now works.

But now I have the problem that the tasks take a long time. Each task just queries an API (answer between 1 - 400 lines) and writes it to a Google Cloud SQL (Postgresql) database. Work that normally only takes a few seconds.
In the logs I see that after starting the task it takes about 2 minutes before the actual work gets done. During this time, a certain line is output again and again in the logs.

[2019-08-07 09:10:30,352]{base_task_runner.py:101} INFO - Job 17: Subtask domain_kwcount_seo_for_example_com_mobile_true_country_de_to_gcsql [2019-08-07 09:10:30,350] {base_hook.py:83} INFO - Using connection to: id: google_cloud_sql_sistrix. Host: deft-bliss-xxxxx:europe-west3:postgresql-xxx, Port: 5432, Schema: sistrix, Login: xxxxx, Password: XXXXXXXX, extra: {'database_type': 'postgres', 'project_id': 'deft-bliss-xxxxx', 'location': 'europe-west3', 'instance': 'postgresql-xxx', 'use_proxy': True, 'sql_proxy_use_tcp': False, 'sql_proxy_binary_path': '/usr/local/airflow/include/google_cloud_sql_proxy/cloud_sql_proxy'}

Am I using the Google Cloud SQL Proxy in a wrong way? Or where does this behavior come from?

Cheers, Johannes

Jonnyblacklabel · August 8, 2019, 7:38am

I think I found the source of my problems. But i’m not sure if my assumption about how the DAG is interpreted is correct.

The operators for Google Cloud SQL are built so that in the __init__ method of classes the CloudSqlDatabaseHook is initialized.

→ Here is a link to the init method on Github:

github.com

apache/airflow/blob/cb31d08734a9873b7d16ab8325e698fde89feaa3/airflow/contrib/operators/gcp_sql_operator.py#L773


             gcp_conn_id='google_cloud_default',
             gcp_cloudsql_conn_id='google_cloud_sql_default',
             *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.sql = sql
    self.gcp_conn_id = gcp_conn_id
    self.gcp_cloudsql_conn_id = gcp_cloudsql_conn_id
    self.autocommit = autocommit
    self.parameters = parameters
    self.gcp_connection = BaseHook.get_connection(self.gcp_conn_id)
    self.cloudsql_db_hook = CloudSqlDatabaseHook(
        gcp_cloudsql_conn_id=gcp_cloudsql_conn_id,
        gcp_conn_id=gcp_conn_id,
        default_gcp_project_id=self.gcp_connection.extra_dejson.get(
            'extra__google_cloud_platform__project'))
    self.cloud_sql_proxy_runner = None
    self.database_hook = None


def _execute_query(self):
    try:
        if self.cloudsql_db_hook.use_proxy:

Airflow executes the DAG at each heartbeat. So if the CloudSqlDatabaseHook is initialized when the operator is initialized, this happens for all tasks in the DAG that use the operator.

The amount of these always the same lines in my logs corresponds pretty much to the tasks that use the operator.

Can this be the problem? This would mean that a frequent use of Google Cloud SQL operators would lead to a very long time interpreting the DAGs.

Thanks

AndrewHarmon · August 9, 2019, 12:39pm

so when the the scheduler heartbeats, it parses the dag file but doesn’t execute it (as in execute the tasks). so that happens like every 5 seconds. All tasks in that DAG will have their init called on each heartbeat. which should be light weight, the heavy lifting should be in the execute of the operator, not the init of the operator and hooks. When your task runs, it does an initial parsing of the whole dag, so that’s probably why you are seeing that line printed out for each task using the CloudSqlDatabaseHook as it is going through the init of each task. It does seem to be an unusual pattern to init the hook in the init of the operator. Typically i see the hook being created in the execute section of the operator. I’m not sure if there’s a special reason why it was done this way.

My concern with the slowness of your dag may be that it is being dynamically created. so each task has to parse it’s dag file, but your dag file contains alot of dags in theory. Is there anything in the top level code of your dag that is resource intensive. I would think simply parsing your json from a local file to generate the dags would be relatively quick. are you making any remote calls to anything? You can also try just upping the AU on your scheduler/workers to try and give them a little more juice to process that dynamic dag.

Hopefully that helps some.

Jonnyblacklabel · August 9, 2019, 12:56pm

Hey Andrew,

Thank you very much for your detailed answer.
I have modified the CloudSqlDatabaseHook so that it is no longer initialized in the init method of the operator. It will now be initialized in the execute section as you write.

In fact, I don’t have any problems anymore. The DAGs now load fast enough and the logs are much smaller.

Thanks for the help

Topic		Replies	Views
Airflow DAG Run Delay: running State Persists After Tasks Complete	0	63	November 8, 2024
Airflow New Dag File Processing Slow Airflow	1	2761	July 7, 2022
Scheduler takes long time to run a DAG Airflow	3	2750	July 13, 2021
Dynamic Dag Creation Airflow	2	2248	October 27, 2022
Scheduler takes a long time on start up Astro CLI	0	59	October 16, 2024

Task take very long and log printing out same line over and over again

Related topics