Very odd error behavior when using PostgresOperator with RedShift

Hi,

This is driving me a little crazy and seems to be related to airflow so hopefully someone has insight.

Our project uses Astronomer cloud, but I can duplicate this error when running my deployment locally.

I have a DAG that uses a PostgresOperator to execute a SQL statement on RedShift. The SQL creates a materialized view and takes about 12-15 minutes. There is no issue with the SQL itself; I can execute it directly without error. Also, the behavior is the same if I substitute a different, long-running SQL statement.

When I run my DAG the node that executes the SQL failes, and I see this error via the UI:

[2020-10-10 17:32:29,517] {taskinstance.py:1145} ERROR - SSL SYSCALL error: EOF detected
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/postgres_operator.py", line 65, in execute
self.hook.run(self.sql, self.autocommit, parameters=self.parameters)
File "/usr/local/lib/python3.7/site-packages/airflow/hooks/dbapi_hook.py", line 175, in run
cur.execute(s)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected

All that really indicates is that the connection to RedShift was lost.

If I view the docker container logs of the scheduled (docker exec -it <container_id> logs -f) I see this:

[2020-10-10 17:32:36,698] {scheduler_job.py:1313} INFO - Executor reports execution of create_view.create_dl_view execution_date=2020-10-10 17:26:46.238038+00:00 exited with status success for try_number 1

So the schedule thinks things were successful, but the webserver doesn’t. The SQL is still running on RedShift and will successfully complete after the normal 12-15 minutes.

I thought it might have something to do with psycopg2 (the library that PostgresOperator uses), but if I take airflow out of the equation and run the SQL via psycopg2 then everything works. The other strange behavior is that the error (or the “success” if the scheduler logs are to be believed) always happens about 6 mins after the query begins.

Is there some airflow process that would kill a process after 6 minutes?

Any thoughts on how I can further debug?

Thanks very much for any help.

It looks like this was what I was running into: https://github.com/apache/airflow/pull/2513

Hi @balexander! I’m sorry to hear you ran into this. Were you able to get that task to run successfully in the end? Did you follow some of the comments in the PR on different ways to potentially get around this?

Let us know if you’re still having trouble.

Hi,

Thanks for reaching out.

I was able to sort of get things to work. The tldr is that now DB transactions that last less than about 45 mins work fine, as long as I add {"keepalives_idle" : "180"} to the connection string. The interval doesn’t have to be 180, that’s just what I settled on. Anything less than 300 seems to do the trick.

However tasks that run longer than that will fail with the same unhelpful error: psycopg2.OperationalError: SSL SYSCALL error: EOF detected

It definitely seems to be related to Airflow though. The queries work and don’t tie out if I execute them directly on redshift or with psycopg2 by itself.