Python workloads

jbc · May 19, 2021, 10:12pm

I am new to Airflow and loving the tutorials you have on YouTube. Plus, your guides are top-notch. Thanks!

Taking to heart your comments about keeping heavy-lifting workloads off Airflow workers, do you have a reference guide for doing that on AWS? We are contemplating MWAA (2.0 is imminent). So, something that leverages Fargate?

The other choice is to just let Python workloads run on the workers, since scale-out is easier, and manage package dependencies with PythonVirtualenvOperator, although its support seems spotty.

Currently we are using AWS Data Pipeline, which runs on dedicated EC2 instances with everything we need, and we control the environment using AMIs.

virajparekh · May 21, 2021, 12:41am

Hey!

Thanks for the post as well as the kind words.

Think there are a few things to consider here!

Fargate
From what we’ve seen, it’s much better to use Fargate as the compute layer rather than as the layer that actually hosts Airflow. We haven’t had much success running the long-lived Airflow services (schedulers, webservers, etc.) on Fargate, due to some weird things around how you’d handle logging and such. Not saying it can’t be done, just not a pattern we’ve seen widely adopted. The other piece to consider with Fargate is the type of workload you are running: do you expect light, numerous, low-SLA workloads (e.g. queries that deliver summary tables to business users) or longer-lived workloads (a container running a data science operator)?
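As a rough illustration of "Fargate as the compute layer": the task does nothing on the worker except ask ECS to start a container. Below is a minimal sketch, assuming you call the ECS RunTask API through boto3. The cluster, task definition, container name, subnet and security group values are all hypothetical placeholders. The Amazon provider package for Airflow also ships an ECS operator that wraps the same API call, which is probably the better fit inside a DAG.

```python
from typing import Any, Dict, List


def build_fargate_run_task_kwargs(
    cluster: str,
    task_definition: str,
    container_name: str,
    command: List[str],
    subnets: List[str],
    security_groups: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for the ECS RunTask API (boto3 `run_task`)."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        # Fargate tasks use awsvpc networking, so subnets must be given.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
        # Override the container command so one image can serve many tasks.
        "overrides": {
            "containerOverrides": [{"name": container_name, "command": command}]
        },
    }


def run_on_fargate(**run_task_kwargs: Any) -> str:
    """Start the task and return its ARN. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    response = boto3.client("ecs").run_task(**run_task_kwargs)
    if not response["tasks"]:
        raise RuntimeError(f"ECS did not start the task: {response['failures']}")
    return response["tasks"][0]["taskArn"]


# Hypothetical values, for illustration only.
example_kwargs = build_fargate_run_task_kwargs(
    cluster="example-cluster",
    task_definition="example-etl:1",
    container_name="etl",
    command=["python", "etl.py"],
    subnets=["subnet-00000000"],
    security_groups=["sg-00000000"],
)
```

Calling `run_on_fargate(**example_kwargs)` from a task would hand the heavy lifting to Fargate, so the Airflow worker only submits the request and waits.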

Python with PythonVirtualenvOperator
I think this is a great approach for getting started, but depending on how your use case changes and evolves, it may not be a scalable solution. We’ve found PythonVirtualenvOperator to be better as a “bail out” option for tasks that conflict with the set of Python dependencies available within the Airflow environment. If you think users are going to be bringing their own dependencies as part of their regular usage, it might make more sense to look at containerized execution through KubernetesPodOperator and KubernetesExecutor.
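For the "bail out" case, here is a minimal sketch of what a PythonVirtualenvOperator task might look like, assuming Airflow 2.x. The DAG id, task id, the `summarize` callable and the `pandas==1.2.4` pin are hypothetical. The Airflow imports sit inside `build_dag()` so the rest of the snippet runs without Airflow installed.

```python
from typing import Any, Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Task body. Imports live inside the function because the operator
    serializes the callable and runs it in a separate interpreter."""
    import statistics  # a conflicting third-party package would be imported here

    return {"mean": statistics.mean(values), "max": max(values)}


def virtualenv_task_kwargs(requirements: List[str]) -> Dict[str, Any]:
    """Arguments for PythonVirtualenvOperator, kept as plain data."""
    return {
        "task_id": "summarize_in_venv",
        "python_callable": summarize,
        # Installed into a fresh virtualenv created for the task run.
        "requirements": requirements,
        # False keeps the Airflow environment's packages out of the venv.
        "system_site_packages": False,
        "op_kwargs": {"values": [1.0, 2.0, 6.0]},
    }


def build_dag():
    """Build the DAG. Needs Airflow 2.x installed."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonVirtualenvOperator

    with DAG(
        dag_id="venv_bail_out_example",
        start_date=datetime(2021, 5, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Hypothetical pin that conflicts with the base environment.
        PythonVirtualenvOperator(**virtualenv_task_kwargs(["pandas==1.2.4"]))
    return dag
```

In a real DAG file you would assign `dag = build_dag()` at module level so the scheduler can discover it. Because the virtualenv is built at task run time, this adds overhead to every run, which is part of why a pre-built image with KubernetesPodOperator tends to scale better for regular use.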

Lastly, we’ve very recently changed our tune from NEVER USE AIRFLOW TO PROCESS DATA to something a little more nuanced. Paired with tools like Ray, Airflow can actually be a great data processing tool.

@jbc I hope this helps. These are topics that we’re currently working through with a few of our customers. If you’d like a deeper dive here, happy to put together a demo/webinar for your team. Shoot me a note at viraj@astronomer.io.
