Thanks for the post as well as the kind words.
Think there are a few things to consider here!
From what we’ve seen, Fargate works much better as the compute layer for your tasks than as the layer that actually hosts Airflow. We haven’t had much success running the long-lived Airflow services (schedulers, webservers, etc.) on Fargate, mostly because of awkwardness around things like logging. Not saying it can’t be done, just not a pattern we’ve seen widely adopted. The other piece to consider with Fargate is the type of workload you are running: do you expect light, numerous, low-SLA workloads (e.g. queries that deliver summary tables to business users) or longer-lived workloads (e.g. a container running a data science operator)?
Python with VirtualEnvOperator
I think this is a great approach for getting started, but depending on how your use case evolves, it may not scale. We’ve found the PythonVirtualenvOperator works better as a “bail out” option for tasks whose requirements conflict with the Python dependencies installed in the Airflow environment. If you think users are going to bring their own dependencies as part of their regular usage, it might make more sense to look at containerized execution through the KubernetesPodOperator or the KubernetesExecutor.
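To make the “bail out” pattern concrete, here is a rough sketch of a single task isolated in its own virtualenv. It assumes Airflow 2.4 or later (for the `schedule` argument and the `airflow.operators.python` import path). The DAG id, task name, sample data, and pinned requirement are all made up for illustration. The Airflow imports sit inside `build_dag()` only so the callable can be exercised without Airflow installed; in a real DAG file you would more likely import at the top and define the DAG at module level.

```python
from datetime import datetime


def summarize_orders(amounts):
    """Task body. It runs in a separate virtualenv, so it must be
    self-contained: every import it needs lives inside the function."""
    # Stand-in for a library that conflicts with Airflow's own dependencies.
    import statistics

    return {"count": len(amounts), "mean": statistics.mean(amounts)}


def build_dag():
    # Lazy imports (illustration only) so this file loads without Airflow.
    from airflow import DAG
    from airflow.operators.python import PythonVirtualenvOperator

    with DAG(
        dag_id="venv_bail_out_example",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually
        catchup=False,
    ) as dag:
        PythonVirtualenvOperator(
            task_id="summarize_orders",
            python_callable=summarize_orders,
            # Hypothetical pin: whatever this one task needs that the
            # shared Airflow environment cannot provide.
            requirements=["pandas==1.5.3"],
            system_site_packages=False,  # do not inherit Airflow's packages
            op_args=[[10, 20, 30]],
        )
    return dag


# In a real DAG file, assign the result at module level so the scheduler
# discovers it, e.g.:  dag = build_dag()
```

The virtualenv is built when the task runs, which is fine for an occasional conflicting task but adds up if every user task needs its own environment. That overhead is a big part of why containerized execution tends to fit better once bring-your-own-dependencies becomes the norm.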
Lastly, we’ve very recently changed our tune from NEVER USE AIRFLOW TO PROCESS DATA to something a little more nuanced. Paired with tools like Ray, Airflow can actually be a great data processing tool.
@jbc I hope this helps. These are topics that we’re currently working through with a few of our customers. If you’d like a deeper dive here, I’m happy to put together a demo/webinar for your team. Shoot me a note at firstname.lastname@example.org