We are using the Databricks Operator a lot. This operator polls the status of the job run periodically. When the scheduler is restarted (e.g. during a deployment), the operator's task is restarted as well (I guess this is expected?). Since the restarted operator knows nothing about the run it already started, it triggers a new run instead of picking up the job run that is still in progress.
Unlike e.g. BigQuery, Databricks job runs do not have an id that you can set from the outside; the run id is generated as soon as a new job run is triggered.
The BigQueryJobOperator creates the job id itself, and if the operator is restarted (e.g. due to a scheduler restart), it checks whether a job with this id is already running and, if so, “attaches” to the running job.
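To make the reattach-by-id pattern concrete, here is a minimal sketch of the idea: derive a deterministic job id from the task instance, so a restarted operator computes the same id and can attach instead of starting a duplicate. All names (`FakeJobService`, `build_job_id`, etc.) are illustrative, not the actual BigQuery operator code.

```python
import hashlib


def build_job_id(dag_id: str, task_id: str, logical_date: str) -> str:
    # Deterministic id: the same task instance always yields the same id,
    # so a restarted operator can find the run it already started.
    key = f"{dag_id}__{task_id}__{logical_date}"
    return hashlib.sha1(key.encode()).hexdigest()[:16]


class FakeJobService:
    """Stand-in for a job API that accepts a caller-supplied job id."""

    def __init__(self):
        self.running = {}

    def start(self, job_id: str) -> bool:
        # Returns False if a job with this id already exists,
        # i.e. the caller should attach instead of starting a new run.
        if job_id in self.running:
            return False
        self.running[job_id] = "RUNNING"
        return True


def execute(service, dag_id, task_id, logical_date):
    job_id = build_job_id(dag_id, task_id, logical_date)
    started = service.start(job_id)
    return job_id, ("started" if started else "reattached")
```

Running `execute` twice with the same task instance yields the same id, with the second call reattaching instead of launching a new job.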
For Databricks this is not possible, as the run id is not configurable.
A few options I see to mitigate this:
1. Save the job run id somewhere it survives a scheduler restart (not sure where), so the operator can pick it up and reattach to the run. The stored run id would be deleted once the job has finished.
2. Always reattach if there is an active run for a given job. This would work in my case, but not in general: Databricks allows concurrent runs for a job, so there will definitely be cases where a run already exists and a new run should still be triggered.
3. Currently the operator polls. One could give the operator an `async=True` flag that makes it exit as soon as the Databricks run has been started, and then use a sensor to poll for the run status (the run id is already available via XCom).
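Option 3 could be sketched roughly like this: the operator only submits the run and pushes the run id, and a separate sensor polls until the run terminates. `FakeDatabricksApi`, the method names, and the plain dict standing in for XCom are all illustrative assumptions, not the real provider API.

```python
class FakeDatabricksApi:
    """Illustrative stand-in for the Databricks runs API."""

    def __init__(self):
        self._runs = {}
        self._next_id = 1

    def submit_run(self) -> int:
        run_id = self._next_id
        self._next_id += 1
        self._runs[run_id] = "RUNNING"
        return run_id

    def get_run_state(self, run_id: int) -> str:
        return self._runs[run_id]

    def finish(self, run_id: int):
        # Test helper: pretend the Databricks job completed.
        self._runs[run_id] = "TERMINATED"


def start_operator(api, xcom: dict) -> None:
    # With async=True the operator returns right after submitting the run
    # and pushes the run id (here: a dict standing in for XCom).
    xcom["run_id"] = api.submit_run()


def sensor_poke(api, xcom: dict) -> bool:
    # Idempotent: safe to call again after a scheduler restart, because
    # the run id lives in XCom (metadata DB), not in operator memory.
    return api.get_run_state(xcom["run_id"]) == "TERMINATED"
```

The sensor keeps returning `False` while the run is in progress and `True` once it has terminated, so a restart only re-runs the cheap poke, never the submission.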
Imho only 1) and 3) are feasible solutions, but I am not sure where to store the job run id so that it survives a scheduler restart. 3) is imho the cleanest solution.
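For option 1, one candidate storage location is XCom itself, since it is backed by the Airflow metadata DB and survives scheduler restarts. A minimal sketch of the reattach logic, with `FakeDatabricksApi` and a plain dict as illustrative stand-ins for the real API and XCom:

```python
class FakeDatabricksApi:
    """Illustrative stand-in for the Databricks runs API."""

    def __init__(self):
        self._runs = {}
        self._next_id = 1

    def submit_run(self) -> int:
        run_id = self._next_id
        self._next_id += 1
        self._runs[run_id] = "RUNNING"
        return run_id

    def get_run_state(self, run_id: int) -> str:
        return self._runs[run_id]


def execute_with_reattach(api, xcom: dict) -> str:
    # On a retry after a scheduler restart, the persisted run id is still
    # there, so we reattach instead of submitting a duplicate run.
    run_id = xcom.get("run_id")
    if run_id is not None and api.get_run_state(run_id) == "RUNNING":
        return f"reattached:{run_id}"
    run_id = api.submit_run()
    xcom["run_id"] = run_id
    return f"started:{run_id}"
```

The first call submits and records the run; a second call with the same (surviving) XCom value finds the run still active and reattaches.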
Let me know what you think. I definitely aim to contribute these changes back to the provider packages.