Tasks failed due to lack of resources, looking to re-run

Hi Everyone,

I’m troubleshooting an issue on a system I’ve inherited.

On 12/20 one of our tasks failed with `INFO - Task exited with return code Negsignal.SIGKILL`, which I believe was caused by a lack of memory on our Airflow worker. I've since updated the environment to allocate more memory to our Airflow workers, but I'm unsure how to get the scheduler to re-run these tasks.

From 12/20 to 12/31 most of the tasks succeeded, but one task failed due to the lack of resources and a few other tasks failed because they depend on that failed task.

From 1/1 onward a different issue (a date setting on the reports being queried) is causing these tasks to hang indefinitely, so I haven't been able to re-run last year's failed tasks with the increased resources.

I was hoping someone here could help me figure out how to cancel and re-run these tasks so I can verify that the increase in resources resolves my original issue. Then I can switch over my date setting and re-run the flows that were scheduled for 1/1 and onward.

Thank you in advance to anyone who is able to help me with this!

Here’s a screenshot of our scheduler:

So maybe I am missing something here, but if you want to re-run the older ones, why not just select the DAG, or the task that failed, and clear it?
You could even list all the task instances and add filtering until you have a list of only the tasks in a failed or upstream_failed state, and clear just those; that way you don't have to clear each run individually.
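For example, assuming an Airflow 2 CLI, something along these lines should clear just the failed (and upstream-failed) instances in the affected window in one go — the DAG id and dates below are placeholders for your own:

```shell
# Clear only failed task instances for the window where the OOM kills
# happened; the scheduler will then re-run them with the new resources.
# --only-failed restricts the clear to failed/upstream_failed instances,
# and --yes skips the interactive confirmation prompt.
airflow tasks clear my_dag_id \
    --only-failed \
    --start-date 2022-12-20 \
    --end-date 2022-12-31 \
    --yes
```

You can add `--task-regex` to narrow it further to specific task ids if you don't want to touch the whole DAG.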


Hi @BSwaine, thanks for reaching out!

As @Tgoad mentioned, you can clear these runs - I'd suggest going to the Airflow UI, then clicking on Browse > DAG Runs, where you can filter DAG Runs (for example by DAG id) using Search > Add Filter on the left-hand side.

@Tgoad and @magdagultekin

Thanks for your responses! I guess I'm nervous because I'm new to the environment. Will clearing also drop any data that was ingested by this DAG, or is it best practice to generally not commit any data if a task errors?

I tried pressing Run, but because several DAGs are already running, it won't run the older ones. Sounds like if I clear all running DAGs and the previously failed ones, I'll be able to run each task individually?

Thank you again!

@BSwaine, no data will be dropped by default, so you need to keep avoiding duplicates in mind while designing your DAG - ideally, make each task idempotent so a cleared run can safely repeat its work.
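A minimal sketch of what idempotent ingestion can look like: if the target table has a key and the load uses an upsert, re-running a cleared task rewrites the same rows instead of duplicating them. The table and column names here are made up for illustration, using SQLite so it's self-contained:

```python
import sqlite3

# In-memory database standing in for your real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (report_date TEXT PRIMARY KEY, total INTEGER)")

def ingest(rows):
    """Upsert rows keyed on report_date: a re-run overwrites, never duplicates."""
    conn.executemany(
        "INSERT INTO report (report_date, total) VALUES (?, ?) "
        "ON CONFLICT(report_date) DO UPDATE SET total = excluded.total",
        rows,
    )
    conn.commit()

ingest([("2022-12-20", 100)])
ingest([("2022-12-20", 100)])  # simulated cleared-and-re-run task
print(conn.execute("SELECT COUNT(*) FROM report").fetchone()[0])  # → 1
```

With a pattern like this, clearing and re-running failed DAG Runs is safe even if some of the data was already committed before the task died.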

To your second question - if you have catchup=True and you clear these tasks, the scheduler will kick off a DAG Run for any data interval that has not been run since the last data interval, or that has been cleared. You can read more about re-running tasks, catchup and backfill here.
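Roughly, catchup means the scheduler creates one run per missed data interval, and an interval is only eligible once it has ended. A toy sketch of that bookkeeping for a daily schedule (the dates are hypothetical, and this is only an illustration of the idea, not Airflow's actual scheduler code):

```python
from datetime import date, timedelta

def missed_intervals(last_run: date, today: date) -> list[date]:
    """Daily data intervals that have ended but never ran.

    Mimics what catchup=True does: one run per missed interval.
    """
    out = []
    d = last_run + timedelta(days=1)
    while d < today:  # an interval only runs after it has fully elapsed
        out.append(d)
        d += timedelta(days=1)
    return out

# If the last successful run covered 2022-12-31 and today is 2023-01-04,
# the scheduler would kick off runs for Jan 1, Jan 2 and Jan 3.
print(missed_intervals(date(2022, 12, 31), date(2023, 1, 4)))
```

Clearing a run puts its interval back into this "never ran" set, which is why the scheduler picks it up again.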