I’m deploying Airflow on Azure Kubernetes Service (AKS) using Helm. The Airflow DAG and log folders are mounted on Azure Blob Storage using the NFSv3 method. I noticed a surge of transactions every 15 minutes, even when there is no activity.
I checked the configuration reference in the Airflow documentation (link) and changed min_file_process_interval, which controls how often DAGs are re-parsed, to once every 5 minutes; that accounts for the smaller peaks on the chart.
Here’s some information about my current deployment:
Executor: Kubernetes Executor
min_file_process_interval: 300 seconds
Number of scheduler replicas: 2
Number of web servers: 2
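For reference, the values the scheduler actually runs with can be confirmed with a small sketch like the one below (assuming Airflow 2.x and access to a scheduler pod; key names vary between versions), since Helm chart values, airflow.cfg, and AIRFLOW__SCHEDULER__* environment variables can override one another:

```python
# check_scheduler_config.py -- hypothetical helper, run inside a scheduler pod, e.g.
#   kubectl exec -it <scheduler-pod> -- python /tmp/check_scheduler_config.py
# Assumes Airflow 2.x; some keys were renamed in other versions.
from airflow.configuration import conf

for key in (
    "scheduler_heartbeat_sec",    # how often the scheduler heartbeats its loop
    "min_file_process_interval",  # minimum seconds between re-parses of the same DAG file
    "dag_dir_list_interval",      # how often the DAG folder is re-listed for new files
    "parsing_processes",          # number of parallel DAG-parsing processes
):
    print(f"[scheduler] {key} = {conf.get('scheduler', key)}")
```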
My questions are:
Which tasks are running constantly underneath the Airflow service?
Which parameters (airflow.cfg, helm chart values) affect the task schedule rate?
Which tasks are running constantly underneath the Airflow service?
The processes that would normally be running (depending on your Airflow version) are:
DAG parsing, which is part of the scheduler. This process regularly re-parses your DAG files to determine whether they have changed. It is good practice to keep top-level code in your DAGs to a minimum, as expensive top-level code slows down DAG parsing (see the example DAG below).
If you are using the Astro CLI, you can check how long your DAGs take to parse locally by running astro dev run dags report. Because you are self-hosting Airflow, you can instead run airflow dags report against your running Airflow instance in AKS (for example, from a shell inside a scheduler pod).
The Airflow scheduler’s scheduling loop (covered under the next question).
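To illustrate the point about top-level code, here is a minimal sketch of a DAG that keeps module-level work to imports and definitions only (the DAG id and task names are made up; assumes Airflow 2.4+ and the TaskFlow API):

```python
# minimal_parsing_example.py -- illustrative only; DAG id and task names are invented
import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="minimal_parsing_example",
    start_date=datetime.datetime(2023, 1, 1),
    schedule=None,  # "schedule" requires Airflow 2.4+; older versions use schedule_interval
    catchup=False,
)
def minimal_parsing_example():
    # Everything at module level outside the task bodies runs on every parse,
    # i.e. at least once per min_file_process_interval, so keep it cheap.

    @task
    def fetch_rows() -> list:
        # Expensive work (HTTP calls, DB queries, heavy imports) belongs inside
        # the task, where it only runs when the task instance actually executes.
        return ["row-1", "row-2"]

    @task
    def summarize(rows: list) -> None:
        print(f"got {len(rows)} rows")

    summarize(fetch_rows())


minimal_parsing_example()
```

With this structure, each re-parse only imports the module and registers the task definitions; nothing external is called until a task instance actually runs.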
Which parameters (airflow.cfg, helm chart values) affect the task schedule rate?
Airflow’s scheduler runs in a constant loop attempting to schedule tasks. The rate at which tasks are scheduled is affected by a number of parameters, some of which include: