How does the Astronomer architecture ‘get around’ scheduler or worker failures?

Astronomer deploys natively on Kubernetes and uses built-in Kubernetes features to keep Airflow up and stable. Each Airflow component is configured with liveness and readiness probes, so our platform always knows the status of that component. If these checks fail, our platform sends an email alert and restarts the component (when the Airflow scheduler stops working, a restart usually fixes it).
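
As a rough illustration of what such probes look like, the sketch below builds the probe section of a container spec as a plain Python dict. The field names (`httpGet`, `periodSeconds`, `failureThreshold`, and so on) are standard Kubernetes probe fields. The image name, port, thresholds, and the choice of the Airflow webserver's `/health` endpoint are assumptions made for this example, not Astronomer's actual configuration.

```python
import json


def http_probe(path, port, period=10, timeout=5, failure_threshold=3):
    """Build a Kubernetes HTTP probe definition as a plain dict."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period,
        "timeoutSeconds": timeout,
        "failureThreshold": failure_threshold,
    }


def webserver_container(image):
    """Container spec fragment with liveness and readiness probes.

    All values are illustrative assumptions, not Astronomer's settings.
    """
    return {
        "name": "webserver",
        "image": image,
        # Liveness: after 5 consecutive failures the container is restarted.
        "livenessProbe": http_probe("/health", 8080, period=30, failure_threshold=5),
        # Readiness: after 3 consecutive failures the pod stops receiving traffic.
        "readinessProbe": http_probe("/health", 8080, period=10, failure_threshold=3),
    }


if __name__ == "__main__":
    # "example/airflow:latest" is a placeholder image name.
    print(json.dumps(webserver_container("example/airflow:latest"), indent=2))
```

In this sketch the liveness probe is deliberately more tolerant than the readiness probe, so a briefly slow component is taken out of rotation before it is restarted.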

Furthermore, real-time metrics are emitted to our Grafana dashboards, so you can see why something isn’t behaving properly.

Finally, we define a PodDisruptionBudget on the scheduler pod, which limits voluntary disruptions such as evictions during a node drain, so the scheduler is protected when other pods on the cluster are being moved or removed.
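
For readers unfamiliar with PodDisruptionBudgets, here is a minimal sketch of one, again as a plain Python dict. The `policy/v1` API version and the `minAvailable` and `selector` fields are standard Kubernetes. The object name, namespace, labels, and the value of `minAvailable` are hypothetical and do not reflect the budget Astronomer actually ships. The helper `allowed_disruptions` is also hypothetical and only mirrors the basic arithmetic Kubernetes applies to an integer `minAvailable`.

```python
def scheduler_pdb(name, namespace, match_labels, min_available=1):
    """Build a PodDisruptionBudget manifest for pods matching the given labels."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Voluntary evictions are refused if they would leave fewer
            # than this many healthy matching pods.
            "minAvailable": min_available,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


def allowed_disruptions(healthy_pods, min_available):
    """Number of pods that may be evicted voluntarily right now (never negative)."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # Name, namespace, and labels below are placeholders for illustration.
    pdb = scheduler_pdb("airflow-scheduler-pdb", "airflow", {"component": "scheduler"})
    print(pdb["spec"])
    print(allowed_disruptions(healthy_pods=1, min_available=1))
```

A budget only guards against voluntary disruptions such as node drains; it does not prevent failures caused by a node crashing, which is where the probes and restarts described earlier apply.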

This can be demoed live upon request.