In a traditional Celery architecture, we can have one connection per worker, and that connection can be initialized when the worker is created/initialized.
Can the same functionality be achieved using the Celery executor, i.e. one connection per worker, initialized when the worker is created?
Ideally I would like to initialize a worker and give it a list of tasks to perform.
By connection, I mean a database connection. I wish to instantiate one DB connection per worker.
Is it possible to instantiate one connection per worker and use it for all the tasks that worker does?
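For reference, this is roughly what I mean by "one connection per worker" in plain Celery. A minimal sketch, assuming a PostgreSQL database reachable via psycopg2 (the broker URL, credentials, and task are illustrative): the connection is opened once per worker process via the worker_process_init signal and then reused by every task that process runs.

```python
# Minimal sketch of per-worker connection init in plain Celery.
# Broker URL, database credentials, and the task body are illustrative.
import psycopg2
from celery import Celery
from celery.signals import worker_process_init

app = Celery("tasks", broker="redis://localhost:6379/0")

db_conn = None  # one connection per worker process


@worker_process_init.connect
def init_db_connection(**kwargs):
    """Open the connection once, when the worker process starts."""
    global db_conn
    db_conn = psycopg2.connect(host="localhost", dbname="mydb", user="me")


@app.task
def run_query(sql):
    """Every task executed by this worker process reuses the same connection."""
    with db_conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()
```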
On the second point,
Do you mean you want to specify which worker your task instances are executed by?
@Alan Well, there are hooks for all the databases in Airflow… what you have provided is for metadata storage… and doesn’t cover all the databases we can connect to and use in our DAGs.
I’m sorry for my brief responses, but I can only answer based on what is given, and you have not provided much context around your infrastructure and what it is you want to accomplish.
If you are talking about hooks connecting to other services like a database, then the answer is more complicated. I think Airflow does NOT have any internal mechanism to limit the number of connections a hook creates to the service that the hook is for.
A custom hook could have some sort of connection-pooling mechanism built in, so that it acts as a pooling agent.
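A rough sketch of that idea (assuming Airflow 2.x; the class name, connection id, and pool sizes are made up and are not an existing Airflow API): a custom hook that keeps one shared SQLAlchemy engine per worker process, so every task running in that process draws from the same small connection pool instead of opening a fresh connection.

```python
# Rough sketch of a custom hook acting as a pooling agent.
# Class name, connection id, and pool sizes are illustrative only.
from airflow.hooks.base import BaseHook
from sqlalchemy import create_engine


class PooledPostgresHook(BaseHook):
    """Hands out connections from one engine shared within the worker process."""

    _engine = None  # shared across all hook instances in this process

    def __init__(self, conn_id="my_postgres"):
        super().__init__()
        self.conn_id = conn_id

    def get_conn(self):
        if PooledPostgresHook._engine is None:
            conn = self.get_connection(self.conn_id)  # Airflow Connection object
            uri = (
                f"postgresql://{conn.login}:{conn.password}"
                f"@{conn.host}:{conn.port}/{conn.schema}"
            )
            # pool_size caps how many DB connections this process will hold
            PooledPostgresHook._engine = create_engine(
                uri, pool_size=1, max_overflow=0
            )
        return PooledPostgresHook._engine.connect()
```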
Another potential solution could be to use Pools. If, for example, you only want one task at a time connecting to a database, you can require all operators that use that hook to have the pool parameter specified. When there are two task instances that use the hook, only one can run at a time. This scenario assumes that there is only one Celery worker.
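For example, something along these lines (the pool name db_pool is made up and would have to be created beforehand with a single slot, via the UI or e.g. `airflow pools set db_pool 1 "limit db connections"`). Because both tasks claim a slot from the same one-slot pool, the scheduler only lets one of them run at a time:

```python
# Sketch of limiting concurrent DB access with an Airflow Pool.
# Assumes a pool named "db_pool" with 1 slot already exists.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def query_db(**context):
    # imagine this uses a database hook
    pass


with DAG(
    dag_id="pool_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # Both tasks claim a slot from the same 1-slot pool,
    # so only one task instance can run at any given time.
    task_a = PythonOperator(task_id="query_a", python_callable=query_db, pool="db_pool")
    task_b = PythonOperator(task_id="query_b", python_callable=query_db, pool="db_pool")
```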
I don’t think this is possible to implement with multiple workers because there is nothing in the metadata database that tracks how many hook connections there are at all times.
I would encourage you to submit a feature request on Airflow’s GitHub Issues.