Db-cleanup in Astronomer

I am moving my DAGs from Airflow hosted on a single GCP VM to Astronomer.
We have a maintenance DAG that cleans up Airflow’s backend database after a set retention period.
This is the DAG we use.

Is this something that I no longer need when using Astronomer? Does the Airflow backend have a retention policy by default? If yes, where can I find out what the retention period is?

I looked into this a couple of months ago and couldn’t find anything saying that Astronomer cleans up the metadata database automatically, so I implemented that same DAG in my deployments.

Hey @gilboa-reif and @Tgoad! Thanks for reaching out on this. Astronomer products currently do not do any automatic metadata cleanup, but it’s definitely on our roadmap.

We’ve looked at those DAGs from Google – they’re super nice. I’d recommend sticking to those for now until it’s native to our products.

Is the 30-day mark reasonable for your team? Would you like that to be configurable?

Hey @Paola, I will always vote for configurable, but 30 sounds like a solid default value.
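For anyone landing here, this is a minimal sketch of what a configurable retention period with a 30-day default could look like. It only computes the cutoff timestamp and builds an `airflow db clean` command, which is available in Airflow 2.3 and later. The helper names `cleanup_cutoff` and `db_clean_command` are hypothetical, and whether the CLI can be run in a given deployment is an assumption to verify first.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 30  # the default discussed in this thread


def cleanup_cutoff(now: datetime, retention_days: int = DEFAULT_RETENTION_DAYS) -> datetime:
    """Return the timestamp before which metadata rows count as expired (hypothetical helper)."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return now - timedelta(days=retention_days)


def db_clean_command(cutoff: datetime, dry_run: bool = True) -> list[str]:
    """Build an `airflow db clean` invocation for the given cutoff (hypothetical helper)."""
    cmd = ["airflow", "db", "clean", "--clean-before-timestamp", cutoff.isoformat()]
    # --dry-run only reports what would be deleted; --yes skips the confirmation prompt.
    cmd.append("--dry-run" if dry_run else "--yes")
    return cmd


if __name__ == "__main__":
    cutoff = cleanup_cutoff(datetime.now(timezone.utc))
    print(" ".join(db_clean_command(cutoff)))
```

Defaulting to a dry run keeps the sketch safe to try: nothing is deleted until `dry_run=False` is passed explicitly.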


Hi @paola, I came across this thread recently and was about to add teamclairvoyant/airflow-maintenance-dags as our Airflow metadata DB retention policy.

It’s been about 3 months since October 2022. Has Astronomer implemented automatic Airflow metadata cleanup? Is this in production?
If it has been implemented, can you please point me to the documentation and configuration for setting up the retention policy? Also, which version of Astronomer Runtime supports this feature?

Thank you

Hey @ltu2023, thanks for reaching out. This is still on our roadmap and we very much intend to solve this problem, but it’s going to take some more time for us to get it out. I can’t promise an expected month quite yet. The Clairvoyant maintenance DAGs seem like a fair bet to me in the meantime.

Curious, what’s the primary driver for implementing this cleanup? It’s of course a best practice, but I’m curious whether you’re hitting a pain point that makes this particularly relevant for your team.

Hi @paola
Thanks for the speedy response.

Regarding the primary driver, we have occasional scheduler issues, and I can’t pinpoint whether they’re related to data accumulating in the Airflow metadata DB. This is an effort to eliminate that worry.