We are running Airflow as docker containers on AWS ECS (using m5.xlarge instance, 4 vCPUs which are 4096 CPU units).
We recently upgraded docker from
19.03.6. Then immediately we noticed that it’s capping the CPU usage on the containers. For instance, the scheduler was a heavy CPU user. So I allocated 1024 CPU units in its task definition. But under docker
19.03.6 the scheduler is using much less CPU than before. In fact, all containers are using less CPU. And we keep seeing these errors on Airflow UI
Broken DAG: [/usr/local/airflow/dags/<my_dags>.py] Timeout, PID: <xxxx>
Everything else seems fine though. No errors in the logs. Only on the UI.
When we roll back to docker
18.09.9, the errors are gone. I know it sounds like a docker issue. But of all containers that have been “capped”, Airflow is the only one that generates these error messages.