We want to send data from our Snowflake Data Warehouse to 3rd party systems. The basic workflow is this:
Run a query in Snowflake to cache a table of data
Get the data from the cache table
Post the data to 3rd party systems such as our marketing automation system (Eloqua) or to our in-app messaging tool (Mixpanel)
This data is data about users and tied to specific emails so we’d like the most secure way to pass this data.
We could use an XCom for this but I’m not sure that’s the best way. How long is the data in the XCom stored? If we use an XCom is there a way to purge this data from the task history? Will using an XCom unnecessarily drive up our storage?
Is there a better and more secure best practice for this operation and passing data between 2 systems.
Typically you should not use xcom to pass datasets. Xcom is intended to pass metatdata about datasets between tasks. So instead of passing the data, you would typically store the data in s3 or gcs and then pas the key for that data via xcom to downstream tasks. Xcoms are store in the backend database and are pemanently persisted. So it would last forever unless you cleaned it up.
For your scenario you can have a DAG that
pulls from snowflake and stores in S3
a task for each 3rd party system that loads from s3 to 3rd party system (custom operator or PythonOperator)
then a cleanup task that removes the data from s3. This sould probably be set to all_complete instead of all_success so it runs even if a prior tasks fails so data is never left in s3
If you don’t want to stage the data in s3 then you can just build a custom operator for each of your 3rd party systems such as a SnowflakeToEloquaOperator and a SnowflakeToMixpanelOperator. (you wound end up querying snowflake once per system) Have a look at the operators on Airflow’s git hub that move data between systems. We describe these as transfer operators and are the intended pattern for Airflow.