What’s the deal with this error? Is it a problem with my task, with Astronomer Cloud, or with Airflow?
*** Log file does not exist: /usr/local/airflow/logs/[TASKNAME]s/run/2019-07-11T11:00:00+00:00/1.log
*** Fetching from: http://:8793/log/[TASKNAME]/run/2019-07-11T11:00:00+00:00/1.log
*** Failed to fetch log file from worker. Invalid URL 'http://:8793/log/[TASKNAME]/run/2019-07-11T11:00:00+00:00/1.log': No host supplied
We’ve been seeing it pop up more often and filed that JIRA ticket directly in the OSS project. Currently, Airflow commits a hostname to the backend database after the task completes, not before or during. If the task hard-crashes mid-process or is otherwise interrupted, the database never gets the hostname and you can’t fetch the logs. In other words, the task didn’t “finish” failing/succeeding/executing at all - it crashed halfway.
The JIRA issue above essentially asks for that process to happen upfront, so that you can see those logs (if they exist) even if the task crashes or is otherwise interrupted.
Things to Try
Can you try re-running that task by going into the corresponding Task Instance in Airflow and hitting “Clear”? Do you have retries/retry_delay configured? (See the sketch below.)
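For reference, retries can be set per task or in default_args. Here’s a minimal sketch of a DAG with retries and a retry delay - the DAG ID, task, schedule, and command are placeholders, not anything from your deployment:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical DAG showing retries/retry_delay; names and values are examples only.
default_args = {
    "owner": "airflow",
    "retries": 2,                         # re-run a failed task up to 2 more times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}

dag = DAG(
    dag_id="example_retries",
    default_args=default_args,
    start_date=datetime(2019, 7, 1),
    schedule_interval="@hourly",
)

run = BashOperator(
    task_id="run",
    bash_command="echo 'your job here'",
    dag=dag,
)
```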
That task may be crashing because it’s running out of memory - is that job pulling a lot of data into memory? Have you tried raising the AUs allocated to your Scheduler?
Workers restart after each astro airflow deploy to make sure they’re looking at the most up-to-date code. To minimize disruption to tasks running at the time you deploy, you can leverage the worker termination grace period.
If configured, the Celery worker will wait a set number of minutes before restarting if it’s in the middle of executing a task.
Our default grace period is currently 10 minutes, but that’s something you can adjust freely in the Astronomer UI (Deployments > Configure > Worker Termination Grace Period).
Exec into the corresponding Celery worker to look for the log files there (Enterprise only)
You’ll need to be set up with kubectl, but then you can run: kubectl exec -it {worker_name} bash
If this is a sensor, you could set mode=reschedule (Airflow v1.10.2+); see the sketch below.
If you’re using Airflow 1.10.1 or a prior version, sensors run continuously and occupy a task slot in perpetuity until they find what they’re looking for, so they have a tendency to cause concurrency issues.
With that setting, a sensor that hasn’t found what it’s looking for yet is put into the new up_for_reschedule state between checks, freeing up its worker slot - which helps when you have more sensors than worker slots.
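To illustrate mode=reschedule, here’s a minimal sketch using a FileSensor - the sensor type, file path, and poke interval are just examples, not a recommendation for your DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

dag = DAG(
    dag_id="example_reschedule_sensor",
    start_date=datetime(2019, 7, 1),
    schedule_interval="@hourly",
)

# In mode="reschedule" (Airflow 1.10.2+), the sensor releases its worker slot between
# pokes and sits in the up_for_reschedule state instead of occupying the slot while it waits.
wait_for_file = FileSensor(
    task_id="wait_for_file",
    filepath="/tmp/data_ready.csv",  # hypothetical path
    poke_interval=300,               # check every 5 minutes
    mode="reschedule",
    dag=dag,
)
```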
Coming Astro Improvements
Kubernetes Executor (coming soon to Astronomer Cloud)
The KubernetesExecutor will gracefully handle this process.
Worker/Scheduler/Webserver Logging (coming soon to Astronomer Cloud)
Already released for Astronomer Enterprise, Worker/Scheduler/Webserver logs will all be exposed in real-time within the Astronomer UI.
My guess is that it’s memory/resources related – this just happened to two heavy task instances at the same time. Do you think the bottleneck is the total resources allocated to my deployment, or the resources allocated to individual task instances? Any suggestions for what to tweak?
Hi @gusostow, happy to help. If you suspect it’s a memory issue, I’d recommend you bump up the resources allocated to your worker (or scheduler if you’re on a LocalExecutor). An “Astronomer Unit” maps to a combination of CPU and Memory, so upping the AU allocated to a particular component will expand those limits.
It sounds like the bottleneck is at a task level, but since resources currently power your jobs at the deployment level (this will change with the KubernetesExecutor), increasing them should make a difference. You can make the adjustment directly in your deployment’s Configure page on Astronomer.
For reference, you can find the source code for that config here.
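For what it’s worth, once the KubernetesExecutor lands, per-task resources would be requested roughly like the sketch below. This follows the generic Airflow 1.10 executor_config convention rather than anything Astronomer-specific, and all names and values are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="example_k8s_resources",
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
)

def heavy_job():
    # Placeholder for the memory-hungry work this thread is about.
    pass

# Per-task resource requests/limits, following the Airflow 1.10 KubernetesExecutor
# executor_config convention. The numbers here are illustrative only.
heavy_task = PythonOperator(
    task_id="heavy_task",
    python_callable=heavy_job,
    executor_config={
        "KubernetesExecutor": {
            "request_memory": "2Gi",
            "limit_memory": "4Gi",
            "request_cpu": "1",
            "limit_cpu": "2",
        }
    },
    dag=dag,
)
```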
Still running into this issue… I strongly suspect that it’s a memory issue because this task would get killed with exit code 137 from time to time on our old deployment.
I more than doubled our Astronomer resources for the deployment without any luck. So there isn’t any memory restriction for individual tasks in the Airflow configuration that I can bump up? Do you think switching to the Celery executor could make these tasks run more robustly?