Very odd error behavior when using PostgresOperator with RedShift

balexander · October 10, 2020, 5:53pm

Hi,

This is driving me a little crazy and seems to be related to airflow so hopefully someone has insight.

Our project uses Astronomer cloud, but I can duplicate this error when running my deployment locally.

I have a DAG that uses a PostgresOperator to execute a SQL statement on RedShift. The SQL creates a materialized view and takes about 12-15 minutes. There is no issue with the SQL itself; I can execute it directly without error. Also, the behavior is the same if I substitute a different, long-running SQL statement.

When I run my DAG the node that executes the SQL failes, and I see this error via the UI:

[2020-10-10 17:32:29,517] {taskinstance.py:1145} ERROR - SSL SYSCALL error: EOF detected
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/postgres_operator.py", line 65, in execute
self.hook.run(self.sql, self.autocommit, parameters=self.parameters)
File "/usr/local/lib/python3.7/site-packages/airflow/hooks/dbapi_hook.py", line 175, in run
cur.execute(s)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected

All that really indicates is that the connection to RedShift was lost.

If I view the docker container logs of the scheduled (docker exec -it <container_id> logs -f) I see this:

[2020-10-10 17:32:36,698] {scheduler_job.py:1313} INFO - Executor reports execution of create_view.create_dl_view execution_date=2020-10-10 17:26:46.238038+00:00 exited with status success for try_number 1

So the schedule thinks things were successful, but the webserver doesn’t. The SQL is still running on RedShift and will successfully complete after the normal 12-15 minutes.

I thought it might have something to do with psycopg2 (the library that PostgresOperator uses), but if I take airflow out of the equation and run the SQL via psycopg2 then everything works. The other strange behavior is that the error (or the “success” if the scheduler logs are to be believed) always happens about 6 mins after the query begins.

Is there some airflow process that would kill a process after 6 minutes?

Any thoughts on how I can further debug?

Thanks very much for any help.

balexander · October 10, 2020, 9:34pm

It looks like this was what I was running into: https://github.com/apache/airflow/pull/2513

paola · October 15, 2020, 12:30pm

Hi @balexander! I’m sorry to hear you ran into this. Were you able to get that task to run successfully in the end? Did you follow some of the comments in the PR on different ways to potentially get around this?

Let us know if you’re still having trouble.

balexander · October 15, 2020, 12:52pm

Hi,

Thanks for reaching out.

I was able to sort of get things to work. The tldr is that now DB transactions that last less than about 45 mins work fine, as long as I add {"keepalives_idle" : "180"} to the connection string. The interval doesn’t have to be 180, that’s just what I settled on. Anything less than 300 seems to do the trick.

However tasks that run longer than that will fail with the same unhelpful error: psycopg2.OperationalError: SSL SYSCALL error: EOF detected

It definitely seems to be related to Airflow though. The queries work and don’t tie out if I execute them directly on redshift or with psycopg2 by itself.

Topic		Replies	Views
Airflow upgrade_check fails on some instances but not others Airflow	0	1870	April 28, 2021
Error loading shared library libstdc++.so.6: No such file or directory (needed by /usr/lib/python3.6/site-packages/pandas/_libs/window.cpython-36m-x86_64-linux-gnu.so) Airflow	3	5165	August 14, 2019
Column dag.last_parsed_time does not exist Astronomer Nebula	0	2951	August 30, 2021
The Airflow UI isn't updating after I deploy, "503 Service Temporarily Unavailable" Astronomer	1	6661	July 6, 2020
No module named 'pyodbc' when running docker-compose up Airflow	0	2525	March 16, 2021

Very odd error behavior when using PostgresOperator with RedShift

Related Topics