I am using Airflow version 2.1.3. My use case is to schedule tasks one after another, and I require an orchestrator to manage those tasks. I am passing input to the Airflow DAG using Airflow's REST API. My use case requires an answer in real time (external API call), and batch processing should also happen on a daily basis (a REST API call through one DAG to trigger another DAG).
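For reference, this is roughly how I trigger the DAG through the stable REST API; the host, credentials, DAG id, and conf payload below are placeholders, not my actual values:

```python
import requests

# Trigger a DAG run through Airflow's stable REST API (Airflow 2.x).
# Host, credentials, dag_id, and the conf payload are placeholders for illustration.
response = requests.post(
    "http://localhost:8080/api/v1/dags/my_realtime_dag/dagRuns",
    auth=("airflow", "airflow"),            # basic-auth user configured in Airflow
    json={"conf": {"request_id": "1234"}},  # input passed to the DAG run
)
response.raise_for_status()
print(response.json())                      # contains dag_run_id, state, etc.
```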
For batch processing, time is not a constraint; it doesn't matter if it takes longer. But for real time, the execution time matters, and it is slow (about 6 seconds for simple Python code).
I have divided the Airflow DAG into 4 tasks (a sketch of the layout is below). The individual execution time of each task is fast, but the scheduling between the tasks is very slow. Therefore, I wrote all the code in one monolithic task, which significantly reduces the total time to about 1 second. But this doesn't let me take full advantage of Airflow.
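A minimal sketch of the 4-task layout (task names and callables are hypothetical stand-ins for my actual code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical placeholders for the four steps; each runs fast on its own,
# but the hand-off between tasks adds several seconds of scheduling latency.
def fetch():
    ...

def transform():
    ...

def enrich():
    ...

def respond():
    ...

with DAG(
    dag_id="realtime_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered via the REST API, not on a schedule
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="fetch", python_callable=fetch)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="enrich", python_callable=enrich)
    t4 = PythonOperator(task_id="respond", python_callable=respond)

    t1 >> t2 >> t3 >> t4
```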
I want to know: how can I reduce the overall execution time of the DAG while still dividing it into tasks? The configuration parameters that I am using are:
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__FERNET_KEY: ''
AIRFLOW__SCHEDULER__MAX_THREADS: 4
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 1
AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: 60
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 60
AIRFLOW__OPERATORS__DEFAULT_CPUS: 2
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'