We are running Airflow as docker containers on AWS ECS (using m5.xlarge instance, 4 vCPUs which are 4096 CPU units).
We recently upgraded docker from 18.09.9
to 19.03.6
. Then immediately we noticed that it’s capping the CPU usage on the containers. For instance, the scheduler was a heavy CPU user. So I allocated 1024 CPU units in its task definition. But under docker 19.03.6
the scheduler is using much less CPU than before. In fact, all containers are using less CPU. And we keep seeing these errors on Airflow UI
Broken DAG: [/usr/local/airflow/dags/<my_dags>.py] Timeout, PID: <xxxx>
Everything else seems fine though. No errors in the logs. Only on the UI.
When we roll back to docker 18.09.9
, the errors are gone. I know it sounds like a docker issue. But of all containers that have been “capped”, Airflow is the only one that generates these error messages.