How does Astronomer architecture ‘get around’ scheduler or worker failures?

brad · January 15, 2019, 1:11am

Astronomer deploys natively on Kubernetes and relies on built-in Kubernetes features to keep Airflow up and stable. Liveness and Readiness Probes are configured on each Airflow component, so our platform knows the status of each one. If these checks fail for any reason, our platform sends out an email alert and restarts the affected component (when the Airflow scheduler stops working, a restart usually fixes it).
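To make the probe idea concrete, here is a minimal sketch of what liveness and readiness probes on Airflow containers could look like, written as Python dicts that serialize to a JSON manifest fragment. This is not Astronomer's actual configuration: the container names, image, timing values, and the choice of `airflow jobs check` (an Airflow 2.x CLI command) and the webserver `/health` endpoint as checks are all assumptions for illustration.

```python
import json


def exec_probe(command, period_seconds=30, failure_threshold=5):
    """Build a Kubernetes probe that runs a command inside the container."""
    return {
        "exec": {"command": command},
        "initialDelaySeconds": 10,
        "periodSeconds": period_seconds,
        "timeoutSeconds": 20,
        "failureThreshold": failure_threshold,
    }


def http_probe(path, port, period_seconds=10, failure_threshold=3):
    """Build a Kubernetes probe that issues an HTTP GET against the container."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": 10,
        "periodSeconds": period_seconds,
        "timeoutSeconds": 5,
        "failureThreshold": failure_threshold,
    }


def rough_detection_window(probe):
    """Rough estimate (seconds) of how long consecutive failures take to
    trip the probe. Ignores initial delay and per-check timeouts."""
    return probe["periodSeconds"] * probe["failureThreshold"]


# Hypothetical scheduler container: the liveness check asks Airflow whether
# a scheduler job is alive (assumed check, Airflow 2.x CLI).
scheduler_container = {
    "name": "scheduler",
    "image": "example/airflow:latest",  # placeholder image name
    "livenessProbe": exec_probe(
        ["airflow", "jobs", "check", "--job-type", "SchedulerJob"]
    ),
}

# Hypothetical webserver container: liveness restarts it when unhealthy,
# readiness keeps traffic away until it responds.
webserver_container = {
    "name": "webserver",
    "image": "example/airflow:latest",  # placeholder image name
    "livenessProbe": http_probe("/health", 8080),
    "readinessProbe": http_probe("/health", 8080),
}

if __name__ == "__main__":
    print(json.dumps([scheduler_container, webserver_container], indent=2))
```

With these assumed values, a hung scheduler would be restarted after roughly `rough_detection_window(scheduler_container["livenessProbe"])` seconds of failed checks. A failing liveness probe causes the container to be restarted, while a failing readiness probe only removes the pod from service endpoints.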

Furthermore, real-time metrics are emitted to our Grafana dashboards so you can see why something isn’t behaving properly.

Finally, we define a PodDisruptionBudget on the scheduler pod, which limits voluntary disruptions (such as evictions during a node drain) so the scheduler is protected from being taken down along with the other pods on the cluster.
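As an illustration, a PodDisruptionBudget for the scheduler might look roughly like the sketch below, again built as a Python dict that serializes to JSON. The resource name, label selector, and `maxUnavailable` value are hypothetical and not taken from Astronomer's charts. In Kubernetes terms, a PodDisruptionBudget constrains voluntary evictions of the selected pods; it does not protect against node crashes.

```python
import json


def pod_disruption_budget(name, match_labels, max_unavailable=1):
    """Build a PodDisruptionBudget manifest for pods matching the labels.

    Uses the policy/v1 API (Kubernetes 1.21+); older clusters exposed the
    same resource under policy/v1beta1.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # At most this many matching pods may be voluntarily evicted at once.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": match_labels},
        },
    }


if __name__ == "__main__":
    # Hypothetical name and labels for an Airflow scheduler pod.
    pdb = pod_disruption_budget(
        "airflow-scheduler-pdb", {"component": "scheduler"}
    )
    print(json.dumps(pdb, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.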

This can be demoed live upon request.
