Hi @gusostow!
If you’re not seeing logs being exposed in the Airflow UI, it’s likely tied to this Airflow bug: https://issues.apache.org/jira/browse/AIRFLOW-4922.
We’re seeing it pop up more and filed that JIRA ticket directly in the OSS project. Currently, Airflow commits a hostname to the backend database after the task completes, not before or during. If the task hard-crashes mid-process or is otherwise interrupted, the database never gets the hostname and you can’t fetch the logs. In other words, the task didn’t “finish” failing/succeeding/executing at all - it crashed halfway.
The JIRA issue above essentially asks for that process to happen upfront, so you can see those logs (if they exist) even if the task crashes or is otherwise interrupted.
**Things to Try**
- Can you try re-running that task by going into the corresponding Task Instance in Airflow and hitting “Clear”? Do you have `retries`/`retry_delay` configured? (There’s a quick sketch of that after this list.)
- That task may be crashing due to too much memory - is that job pulling a lot of data into memory? Have you tried raising the AUs allocated to your Scheduler?
- Do you have the “Worker Termination Grace Period” configured? (Celery only)
  - Workers restart after each `astro airflow deploy` to make sure they’re looking at the most up-to-date code. To minimize disruption to tasks running at the time you deploy, you can leverage the worker termination grace period: if configured, the Celery worker will wait that many minutes before restarting if it’s in the middle of executing a task.
  - Our default grace period is currently 10 mins, but that’s something you can adjust freely in the Astronomer UI (Deployments > Configure > Worker Termination Grace Period).
- Change `log_fetch_timeout_sec` to something more than 5 seconds (the default). It lives in the `[webserver]` section of `airflow.cfg`, or can be set via the `AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC` environment variable.
- Exec into the corresponding Celery worker to look for the log files there (Enterprise only)
  - You’ll need to be set up with `kubectl`, but you can run: `kubectl exec -it {worker_name} bash`
- If this is a sensor, you could set `mode=reschedule` (Airflow v1.10.2+); there’s a sketch of that below as well.
  - If you’re using Airflow 1.10.1 or a prior version, sensors run continuously and occupy a task slot in perpetuity until they find what they’re looking for, so they have a tendency to cause concurrency issues.
  - If you have more sensors than worker slots, the sensor will now get thrown into a new `up_for_reschedule` state, thus unblocking a worker slot.
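On the retries point in the first bullet, here’s a minimal sketch of what `retries`/`retry_delay` could look like in a DAG’s `default_args` - the DAG name, schedule, and task below are just placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    # Retry a failed task twice, waiting 5 minutes between attempts
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG("example_retry_dag", default_args=default_args, schedule_interval="@daily") as dag:
    # Placeholder task - swap in your own operator
    run_job = BashOperator(task_id="run_job", bash_command="echo 'running'")
```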
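And for the sensor bullet, here’s a rough sketch of a sensor running with `mode="reschedule"` - the `FileSensor`, file path, and intervals are just for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

with DAG("example_sensor_dag", start_date=datetime(2019, 1, 1), schedule_interval="@daily") as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/data_ready.csv",  # hypothetical file to wait for
        poke_interval=300,               # check every 5 minutes
        timeout=6 * 60 * 60,             # give up after 6 hours
        # With mode="reschedule" (Airflow 1.10.2+), the task frees its worker
        # slot between pokes and sits in the up_for_reschedule state instead
        # of occupying the slot for the entire wait.
        mode="reschedule",
    )
```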
**Coming Astro Improvements**
- Kubernetes Executor (coming soon to Astronomer Cloud)
  - The KubernetesExecutor will gracefully handle this process.
- Worker/Scheduler/Webserver Logging (coming soon to Astronomer Cloud)
  - Already released for Astronomer Enterprise, Worker/Scheduler/Webserver logs will all be exposed in real time within the Astronomer UI.