BashOperator exiting with error code 137

I have a DAG that normally runs just fine but has started failing consistently. It fails on the same task every time, returning error code 137 in the log. There have been no changes to the codebase in this timeframe.

Based on some quick research this appears to be an OOM/resource error, but there’s some internal debate here about whether that is actually the case (mostly because the Airflow worker logs don’t reference any memory errors).

Any help is greatly appreciated!

If you set it to retry, does it do so successfully (i.e., does it attempt the retry, and does that retry result in successful execution)? Can you post the complete error and any relevant info from the task log? Also, which version of Airflow are you running?

  1. Yes, it retries, but the second attempt fails in the same way (although it sometimes gets further into the execution than the first attempt)

  2. Here’s the “opening statements” (with some potentially confidential stuff redacted):

[2021-08-18 00:10:17,634] {} INFO - Executing <Task(BashOperator): task_name> on 2021-08-17T00:00:00+00:00
[2021-08-18 00:10:17,639] {} INFO - Started process 16326 to run task
[2021-08-18 00:10:17,645] {} INFO - Running: ['airflow', 'tasks', 'run', <'dag_name'>, <'task_name'>, '2021-08-17T00:00:00+00:00', '--job-id', '24995', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/', '--cfg-path', '/tmp/tmpa57gdh0c', '--error-file', '/tmp/tmpzrc0rm5a']
[2021-08-18 00:10:17,645] {} INFO - Job 24995: Subtask <task_name>
[2021-08-18 00:10:17,816] {} INFO - Running <TaskInstance: dag_name.task_name 2021-08-17T00:00:00+00:00 [running]> on host airflow-worker-4.airflow-worker.airflow.svc.cluster.local
[2021-08-18 00:10:17,878] {} INFO - Exporting the following env vars:

And here’s the error message:

[2021-08-17 19:35:37,175] {} INFO - bash: line 1:  4450 Killed   <python> <><username argument> <password argument>
[2021-08-17 19:35:37,176] {} INFO - Command exited with return code 137
[2021-08-17 19:35:37,200] {} ERROR - Bash command failed. The command returned a non-zero exit code.

**Note: I copied these messages from a couple of different runs, so disregard the mismatched timestamps.

  1. Airflow version is 2.0.1.

Your worker’s OS is killing your task. If it’s doing this consistently, it’s very likely a resource issue. I suspect the OOM logs are getting squelched because you’re running Python from the BashOperator.
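To make the 137 → kill connection concrete: bash reports the exit status of a signal-terminated child as 128 plus the signal number, so a SIGKILL (signal 9) — which is what the Linux OOM killer sends — surfaces as 128 + 9 = 137. You can reproduce that status without any memory pressure at all:

```shell
# Background a long-running process, SIGKILL it, and inspect the exit
# status the shell reports for it: 128 + 9 (SIGKILL) = 137.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit status: $?"   # prints: exit status: 137
```

This matches the log above: the child shows up as "Killed" and the wrapping bash command exits with 137, even though the Python process itself never got a chance to log a MemoryError.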

Try this: create a task that allocates a gigantic NumPy matrix of random values that will definitely OOM, and see what the error looks like. If it looks the same, this is an OOM.
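A minimal sketch of such a canary script, assuming NumPy is installed on the worker (the file name `oom_canary.py` and the size argument are made up for illustration):

```python
import sys

import numpy as np


def allocate(gib: float) -> float:
    """Allocate a roughly `gib`-GiB random float64 matrix and return its
    sum, so the allocation can't be skipped or optimized away."""
    # float64 = 8 bytes per element; solve n*n*8 bytes ~= gib GiB for n.
    n = int((gib * 2**30 / 8) ** 0.5)
    matrix = np.random.rand(n, n)
    return float(matrix.sum())


if __name__ == "__main__":
    # Pick a size comfortably larger than the worker's memory limit,
    # e.g. `python oom_canary.py 64` on a worker with a few GiB of RAM.
    print(allocate(float(sys.argv[1])))
```

You would then point a BashOperator's `bash_command` at it the same way your failing task runs its Python script. If the canary also dies with "Killed" and return code 137, that's strong evidence the original failure is the OOM killer.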
