Hello, I am quite new to Airflow and would like to ask for some feedback on my situation and use case. I have not found many examples online, so I have no reference point for what I’m doing.
I am working on a data analysis pipeline where I read in a dataset from a source, run several transformations, run a couple of statistical algorithms, and then do some more transformations. At various points along the way, the idea is to store the intermediate results in a database (currently just .csv files).
Previously, I was doing all of this in Python and pandas, working from RAM. Since moving to Airflow, I have been forced (for the better!) to be more modular in my approach. Each step in my pipeline is now its own task instance with its own data inputs and outputs, so I no longer work from RAM, since passing large datasets between DAG tasks is clearly a bad idea.
My approach: I have a simple DAG with a few tasks. Each task is a Python function. Each function reads in the dataset, performs its work, and then writes the result out separately. Additionally, it pushes a dictionary to XCom that contains the task instance name (grabbed from the XCom variable itself) and the output path(s) of the file. The second DAG task then gets the previous task instance from XCom and extracts the dict stored by that task. It now has the path to the file that it will read in and process further, before pushing a similar dictionary to XCom.
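To make the approach concrete, here is a minimal sketch of the pattern, not my actual code. Everything named here is made up for illustration: `FakeTaskInstance` is a stand-in that only mimics the `xcom_push` / `xcom_pull` methods of Airflow's task instance so the snippet runs without Airflow, and `clean_data`, `summarise`, the `"outputs"` key and the `value` column are all hypothetical.

```python
import csv
import os


class FakeTaskInstance:
    """Stand-in for Airflow's task instance so this sketch runs without Airflow."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the XCom backend

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


def clean_data(input_path, output_dir, ti):
    """First task: read the dataset, drop rows with no value, write the result."""
    with open(input_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if row["value"] != ""]

    output_path = os.path.join(output_dir, f"{ti.task_id}.csv")
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Only this small dict goes through XCom, never the data itself.
    ti.xcom_push(key="outputs", value={ti.task_id: output_path})


def summarise(upstream_task_id, ti):
    """Second task: look up the upstream file path in XCom, then read that file."""
    outputs = ti.xcom_pull(task_ids=upstream_task_id, key="outputs")
    with open(outputs[upstream_task_id], newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]
    return sum(values) / len(values)
```

In this sketch the upstream task id is passed in as an argument, which is a simplification of how I actually look it up.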
My question: this feels really clunky to me. I didn’t want to hardcode task names in my functions, hence grabbing the task name from XCom and using that as a key in the output dict. However, what I really wanted was a set of functions that I could use either in Airflow or completely separately from it. Using XCom to pass parameters around gets in the way of that goal, and I now have a logic step that checks whether a `**context` parameter was provided (if so, the function must be running in Airflow and can therefore grab the dictionary). This was entirely my own idea for getting around the issues I was facing, so I am sure it is not a typical approach. I wonder if somebody more experienced can offer some advice?
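For reference, the check I am describing looks roughly like the sketch below. The function name `resolve_input_path`, the `upstream_task_id` argument and the `"outputs"` XCom key are hypothetical names for illustration; the only thing assumed about Airflow is that the context it passes in contains a task instance under the `"ti"` key with an `xcom_pull` method.

```python
def resolve_input_path(input_path=None, upstream_task_id=None, **context):
    """Return the file to read, either from an explicit path or from XCom."""
    ti = context.get("ti")

    if ti is None:
        # No Airflow context was passed: assume a standalone run,
        # so the caller has to supply the path directly.
        if input_path is None:
            raise ValueError("input_path is required when running outside Airflow")
        return input_path

    # A context was passed: assume we are inside Airflow and pull the
    # dict of output paths that the upstream task stored in XCom.
    outputs = ti.xcom_pull(task_ids=upstream_task_id, key="outputs")
    return outputs[upstream_task_id]
```

Standalone I would call `resolve_input_path(input_path="data.csv")`, while in Airflow the same function receives the context as keyword arguments. It is this branch inside every function that feels clunky.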