Failed to fetch log file from worker on retry?

We recently began encountering an unfamiliar error in (at least) one of our long-running DAGs:

Note that this error is on the second attempt and did not occur on the first attempt. It looks like when the first attempt receives a SIGTERM, the retry just quits as soon as it can’t locate a new log file. Any guidance as to what’s happening here?

Can you share the log from attempt 1?

I don’t know what’s causing this either, but for now I’m assuming it’s user error.

Also, I can get the DAG to succeed if I simply clear and rerun the failed task. Sometimes it takes a couple of tries, but I can generally get it to work this way.

Can you check your scheduler logs for “Marked X SchedulerJob instances as failed”, where X is an integer? Also, what’s your Airflow version?
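If it helps, here is a rough sketch of one way to scan a scheduler log for that message. It assumes the log is a plain-text file you can read line by line; the function name `find_failed_scheduler_jobs` is made up for illustration and is not part of Airflow.

```python
import re
from typing import Iterable, List, Tuple

# The message quoted above, with X captured as an integer.
PATTERN = re.compile(r"Marked (\d+) SchedulerJob instances as failed")


def find_failed_scheduler_jobs(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, count) for every log line containing the message.

    Hypothetical helper for illustration only; not an Airflow API.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits
```

To use it, open your scheduler log (wherever your deployment writes it) and pass the file object in, e.g. `with open(path) as fh: print(find_failed_scheduler_jobs(fh))`. An empty list means the message never appeared in that file.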

I think the task instance is getting started twice by the scheduler, and the two runs then clobber each other. That’s a bug we’ve been dealing with, and it has two fixes: one in 2.2.2 and the other in 2.2.3.

Ahh understood - that might be the case. It makes sense considering we’re running v2.1.2 at the moment.

Yep, that’s it then.
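For anyone landing here later, a small sketch for checking whether a given version predates both fixes. It takes the 2.2.3 cutoff straight from the reply above, and the helper names (`parse_version`, `predates_fixes`) are made up for illustration. It only handles simple `major.minor.patch` strings.

```python
from typing import Tuple

# Per the reply above, the second of the two fixes landed in 2.2.3.
FIXED_IN = (2, 2, 3)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string like '2.1.2' or 'v2.1.2' into a tuple of three ints."""
    cleaned = version.strip().lstrip("vV").strip()
    parts = []
    for piece in cleaned.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1' or '.dev0'
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # treat '2.2' as '2.2.0'
    return tuple(parts)


def predates_fixes(version: str) -> bool:
    """True if the version is older than the release carrying both fixes."""
    return parse_version(version) < FIXED_IN
```

You can get the installed version from `airflow version` on the command line or from `airflow.__version__` in Python. With the version from this thread, `predates_fixes("2.1.2")` returns `True`.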