Hi all! We’re doing something (seemingly) a little unique, and I’m wondering whether our approach is feasible and/or a good idea. We’re planning to move an existing system onto Airflow, which will result in hundreds, possibly more than a thousand, DAGs once it’s all done. As part of that we’re writing an in-house library that encapsulates a lot of the common work these DAGs will do (basically a bunch of hooks, operators, sensors, and pre-canned common task groups). This library will change semi-frequently (as new DAGs are written), and we want new DAGs to be able to use new versions of the library without introducing risk to the existing DAGs already happily running in prod. We’d rather not have to maintain significant backwards compatibility in the library, as that would impede the work we’re trying to do; and ultimately, a shared-library approach introduces a certain amount of risk even when we do try to maintain backwards compatibility within it.
So that’s the goal…
To accomplish this, we were thinking we’d have every DAG add its own “site-packages” dir to sys.path and load its own packages from there. I’ve tested this and it seems to work fine: two DAGs each adding their own unique subdir don’t appear to conflict with each other. OK, great — if it were just our own library, it would probably all work fine. But what about our library’s transitive dependencies (and any other dependencies a DAG needs that aren’t installed globally)? We’d like each DAG to be able to provide its own requirements.txt, which would be expanded into the per-DAG “site-packages” dir at deployment time. Is there any easy/good way to exclude the packages that are already installed globally? Airflow itself, for example: our library obviously depends on airflow, so by default, pip installing our library into the per-DAG packages dir also installs airflow into that dir, which I assume would cause all kinds of problems at runtime. Do I basically have to walk through all the packages in that dir and delete the ones I know are provided by the Airflow environment?
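For reference, the per-DAG path setup I tested looks roughly like this (the `vendor` directory name and layout are just what I picked; nothing Airflow-specific about them):

```python
# dags/my_dag/my_dag.py -- hypothetical per-DAG layout
import os
import sys

# Each DAG folder carries its own "site-packages"-style directory,
# e.g. dags/my_dag/vendor/, populated at deploy time from the DAG's
# requirements.txt.
VENDOR_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "vendor")

# Prepend so this DAG's copies win over anything installed globally.
if VENDOR_DIR not in sys.path:
    sys.path.insert(0, VENDOR_DIR)

# From here on, imports of our in-house library resolve from VENDOR_DIR
# first, falling back to the global environment for everything else.
# from our_lib.operators import SomeOperator  # hypothetical import
```

Since each DAG prepends a path unique to its own folder, two DAGs can carry different versions of the same library without seeing each other's copies (within the caveats of a shared interpreter, of course — anything imported globally first still wins).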
I guess this is maybe more of a pip/Python question than an Airflow-specific one. Does this seem like an OK direction to pursue? It feels a little hacky, but I’m not sure how else to accomplish the goal…
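To make the exclusion question concrete, here’s the kind of deploy-time pruning step I’m imagining — purely a sketch, with made-up names. After a `pip install --target <per-dag-dir> -r requirements.txt`, it compares the `*.dist-info` metadata dirs pip wrote against what’s already in the global environment, so the duplicates (airflow and its dependency tree) could then be deleted:

```python
# Hypothetical deploy-time helper: find packages in a per-DAG target dir
# that duplicate distributions already installed globally.
from importlib.metadata import distributions
from pathlib import Path


def globally_installed() -> set[str]:
    """Normalized names of distributions in the global environment."""
    return {d.metadata["Name"].replace("_", "-").lower() for d in distributions()}


def duplicate_dists(target: Path, global_names: set[str]) -> list[str]:
    """Names of *.dist-info dirs in `target` that shadow global installs."""
    dupes = []
    for dist_info in sorted(target.glob("*.dist-info")):
        # dist-info dirs are named <name>-<version>.dist-info
        name = dist_info.name.rsplit(".dist-info", 1)[0].rsplit("-", 1)[0]
        if name.replace("_", "-").lower() in global_names:
            dupes.append(dist_info.name)
    return dupes
```

Actually deleting a duplicate would mean removing the files listed in its dist-info RECORD, which is where this starts to feel hacky — hence my question about whether there’s a cleaner way (e.g. installing our library with `--no-deps` and listing only the genuinely extra dependencies explicitly).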