Has anyone managed to set up data lineage? Any real-world example of its use (with Atlas + Redshift, not HDFS)?
Good discussion today in Airflow Slack:
Hi all, many organisations are interested in data lineage (that is, understanding how data moved from point A to point B). Often this is for compliance reasons (GDPR, CCPA), but it can also significantly help data quality, reproducibility, and data discovery. As a workflow orchestrator, Airflow has a key role in connecting the dots between point A and point B.
In the 1.10 series, Airflow gained some support for obtaining lineage information; however, it didn’t really work well. Since then I have been reworking it to make it more developer-friendly and more robust. Together with people from Lyft and DailyMotion, we had some discussions on what a design could look like, but they were not conclusive yet.
What I would like to do is create an Airflow Improvement Proposal, but to do that I need some more input. What do you expect out of lineage support? What does it mean to you? What backends should we support out of the box (Atlas, Amundsen, etc.)? When will you use it as a developer?
If you could reply in the thread, that would be very much appreciated. Questions are welcome too, of course, and I will also forward this to the mailing list.
Thanks and have a happy new year!