I’m having issues scaling Airflow beyond 700 task instances using the LocalExecutor and MySQL. The PIDs are getting killed with no other message. I’m now trying the DaskExecutor, and everything seems to run without errors, but now the DAG runs aren’t being scheduled. A run just sits there after being triggered.
I followed this:
I also had to set the queue to None on the DAG’s tasks and run the scheduler with the --do_pickle option.
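For context, switching executors is just config. This is roughly what my airflow.cfg looks like (the scheduler address below is an example value, not my real one):

```ini
[core]
executor = DaskExecutor

[dask]
# Address of an already-running Dask scheduler (example value).
cluster_address = 127.0.0.1:8786
```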
Our team at Astronomer doesn’t have much, if any, experience with the DaskExecutor or LSFCluster.
Have you considered using the CeleryExecutor with Kubernetes and KEDA? We have a lot of experience scaling that setup, and the Celery workers can scale to zero.
Also, Postgres with PgBouncer is a much better database setup than MySQL. With Airflow 2.0 the difference may become even more pronounced, as Postgres has some features that the scheduler upgrades will take advantage of.
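Pointing Airflow at PgBouncer instead of at Postgres directly is just a connection-string change; a sketch (host, port, and database names here are placeholders for whatever your deployment uses):

```ini
[core]
# Route the metadata DB connection through PgBouncer so the scheduler
# and workers share a pooled set of Postgres connections.
# (Host "pgbouncer", port 6543, and the user/db names are examples.)
sql_alchemy_conn = postgresql+psycopg2://airflow:<password>@pgbouncer:6543/airflow
```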
If you use the KEDA option, the number of Celery workers will autoscale depending on how many tasks are waiting for work. If you’d like, we could jump on a call to demo this for you.
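The scaling logic is roughly the following sketch: KEDA polls the metadata DB for running and queued task counts and sizes the worker deployment from them. The worker_concurrency of 16 is Airflow's default, and the exact query our setup uses may differ slightly:

```python
import math


def desired_workers(running: int, queued: int, worker_concurrency: int = 16) -> int:
    """Number of Celery workers needed for the current task load.

    KEDA scales the worker deployment to roughly
    ceil((running + queued) / worker_concurrency), going all the
    way down to zero when nothing is waiting for work.
    """
    return math.ceil((running + queued) / worker_concurrency)


print(desired_workers(0, 0))    # no load: scale to zero -> 0
print(desired_workers(10, 30))  # 40 tasks / 16 per worker -> 3
```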