We recently began encountering an unfamiliar error in (at least) one of our long-running DAGs:
Note that this error is on the second attempt and did not occur on the first attempt. It looks like when the first attempt receives a SIGTERM, the retry just quits as soon as it can’t locate a new log file. Any guidance as to what’s happening here?
Can you share the log from attempt 1?
I don’t know what’s causing this either, but for now I’m assuming it’s user error.
Also, I can get the dag to succeed if I simply clear and rerun the failed task. Sometimes it takes a couple of tries, but I can generally get it to work this way.
Can you check your scheduler logs for “Marked X SchedulerJob instances as failed” where X is an integer? Also, what’s your airflow version?
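For anyone following along, here is one quick way to search for that message. This is a sketch, not a definitive command: the log directory below is a throwaway demo directory, and in practice you would point grep at wherever your scheduler writes its logs (commonly `$AIRFLOW_HOME/logs/scheduler`, though this varies by deployment).

```shell
# Demo: create a sample scheduler log line, then grep for the message.
# In practice, replace /tmp/demo-scheduler-logs with your actual
# scheduler log directory (e.g. $AIRFLOW_HOME/logs/scheduler).
mkdir -p /tmp/demo-scheduler-logs
printf '%s\n' "INFO - Marked 2 SchedulerJob instances as failed" \
    > /tmp/demo-scheduler-logs/latest.log

# -r: recurse into the directory; -E: extended regex so [0-9]+ works.
grep -rE "Marked [0-9]+ SchedulerJob instances as failed" /tmp/demo-scheduler-logs
```

If the grep matches, note how often the message appears and whether the timestamps line up with the failed task attempts.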
I think the task instance is getting started twice by the scheduler, and the two runs clobber each other. That's a bug we've been dealing with; it has two fixes, one in 2.2.2 and the other in 2.2.3.
Ahh, understood - that might be the case. It makes sense considering we're running v2.1.2 at the moment.