Scheduler freezing/hanging without a trace

Symptoms

  • The scheduler stops scheduling work but continues to heartbeat
  • No tasks are scheduled or executed, and no new task instances are created
  • Because the scheduler continues to heartbeat, Astronomer Cloud users don’t receive platform-level email alerts (if configured)
  • No errors appear in the scheduler logs

Likely Root Cause

  • The Airflow scheduler hanging unexpectedly (on both the Celery and Local Executors) is a well-known, documented bug in core Apache Airflow itself. It affects Airflow deployments on Astronomer, and there is no confirmed fix yet

  • Unlike cases where your scheduler heartbeats more slowly than usual (which does fire an alert and may indicate that it’s under-provisioned or otherwise overloaded), this behavior misleadingly does not fire a platform-level alert: the scheduler continues to heartbeat as expected, even though it isn’t actually doing any work

Immediate “Fix”

  • (1) Re-deploy (to trigger a restart)

    • Pushing a deploy to your deployment (via astro airflow deploy or a CI/CD process) or saving a change to your deployment config via the Astronomer UI usually does the trick (see the example after this list)
    • Both of these actions “restart” your deployment’s underlying pods (including the scheduler’s)
  • (2) Delete your scheduler pod (we can do this)

    • If restarting your scheduler doesn’t work, you can reach out to us and we can try deleting your scheduler pod on our end to get it to spin up again
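
For reference, here’s a minimal command-line sketch of option (1), assuming the pre-1.0 Astro CLI syntax referenced above; the release name is a hypothetical placeholder, and the CLI will prompt you to pick a deployment if you omit it:

```
# Run from your Airflow project directory to push your image and trigger a
# rolling restart of the deployment's pods, including the scheduler.
astro airflow deploy <your-release-name>
```

For option (2), Astronomer Cloud users should reach out to us, since you won’t have direct access to the cluster. If you’re on Astronomer Enterprise (or otherwise have kubectl access to your deployment’s namespace), the equivalent operation looks roughly like this; Kubernetes recreates the pod automatically:

```
# Find the scheduler pod in your deployment's namespace, then delete it.
kubectl get pods -n <deployment-namespace> | grep scheduler
kubectl delete pod <scheduler-pod-name> -n <deployment-namespace>
```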

Suggestions to Alleviate Issue

  • (1) Make sure your scheduler has enough AUs
    • 5 AU is the default, but you can crank this up to 6-15 AU depending on your use case
  • (2) Keep your Airflow version up-to-date
  • (3) Schedule automatic scheduler restarts via an Env Var (see the sketch after this list)
  • (4) Make sure to sign up for alerts via the Astronomer UI (deployment tab) for any other scheduler issues
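
A minimal sketch of suggestion (3), assuming Airflow 1.10.x, where the [scheduler] run_duration setting is still available (it was removed in Airflow 2.0). On Astronomer you would typically set this as an Environment Variable in the UI or as an ENV line in your Dockerfile rather than exporting it in a shell, and it’s presumably the same mechanism behind the default restart interval mentioned under “Long-term solutions” below:

```
# Airflow 1.10.x: make the scheduler exit after N seconds so Kubernetes
# restarts the pod with a clean slate. 21600 seconds (6 hours) is an
# arbitrary example value; tune it to your use case.
export AIRFLOW__SCHEDULER__RUN_DURATION=21600
```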

Long-term solutions on Astronomer

  • We’ve identified this as a critical Airflow bug that inhibits Airflow’s performance for our users, and we’re working hard on a long-term fix that we can contribute back to the Airflow project
  • We’ve added a few Airflow committers to our team who will be immensely influential to this end
  • We’re continuing to iterate on our alerting functionality and build it out so that it can catch this behavior
  • We’ve added built-in scheduler restarts (every 11 hours, 31 minutes) as a default env var in our new Cloud release