Hi,
This is driving me a little crazy and seems to be related to airflow so hopefully someone has insight.
Our project uses Astronomer cloud, but I can duplicate this error when running my deployment locally.
I have a DAG that uses a PostgresOperator to execute a SQL statement on RedShift. The SQL creates a materialized view and takes about 12-15 minutes. There is no issue with the SQL itself; I can execute it directly without error. Also, the behavior is the same if I substitute a different, long-running SQL statement.
When I run my DAG the node that executes the SQL failes, and I see this error via the UI:
[2020-10-10 17:32:29,517] {taskinstance.py:1145} ERROR - SSL SYSCALL error: EOF detected
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/postgres_operator.py", line 65, in execute
self.hook.run(self.sql, self.autocommit, parameters=self.parameters)
File "/usr/local/lib/python3.7/site-packages/airflow/hooks/dbapi_hook.py", line 175, in run
cur.execute(s)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
All that really indicates is that the connection to RedShift was lost.
If I view the docker container logs of the scheduled (docker exec -it <container_id> logs -f) I see this:
[2020-10-10 17:32:36,698] {scheduler_job.py:1313} INFO - Executor reports execution of create_view.create_dl_view execution_date=2020-10-10 17:26:46.238038+00:00 exited with status success for try_number 1
So the schedule thinks things were successful, but the webserver doesn’t. The SQL is still running on RedShift and will successfully complete after the normal 12-15 minutes.
I thought it might have something to do with psycopg2 (the library that PostgresOperator uses), but if I take airflow out of the equation and run the SQL via psycopg2 then everything works. The other strange behavior is that the error (or the “success” if the scheduler logs are to be believed) always happens about 6 mins after the query begins.
Is there some airflow process that would kill a process after 6 minutes?
Any thoughts on how I can further debug?
Thanks very much for any help.