Common patterns for EMR workflows?

What are the common patterns to follow when writing external compute workflows?

[Image: mockup of the workflow]

Here, the job runs after a set of files arrives, and if the “condition check” passes, it proceeds to send a report.

If it doesn’t pass, we’ll usually see some sort of task or DAG triggered to handle that scenario (this could be anything from sending a notification to reloading source data into S3).

If the check does pass, some sort of notification is usually sent out (our customers tend to use Slack quite liberally).

Finally, the cluster spins down.

This is just a mockup, and some customers have several levels of validation depending on the use case, but the overall structure is a good starting point.
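To make the branching concrete, here is a minimal plain-Python sketch of the control flow described above. It is not tied to any scheduler or to the EMR API: `run_workflow` and every step name in it are hypothetical placeholders, and the explicit "start_cluster" step is an assumption (the description only mentions the cluster spinning down).

```python
from typing import Callable, List


def run_workflow(
    files_arrived: Callable[[], bool],
    condition_check: Callable[[], bool],
) -> List[str]:
    """Return the ordered list of steps taken (all step names are hypothetical)."""
    steps: List[str] = []

    if not files_arrived():
        # A real scheduler would keep waiting (e.g. a sensor); this sketch just stops.
        steps.append("waiting_for_files")
        return steps

    steps.append("start_cluster")  # assumed step, not stated in the description
    try:
        steps.append("run_job")
        if condition_check():
            # Happy path: send the report, then notify (e.g. via Slack).
            steps.append("send_report")
            steps.append("notify_success")
        else:
            # Failure path: e.g. notify someone or reload the source data.
            steps.append("handle_failed_check")
    finally:
        # The cluster spins down on both branches, and also if a step raises.
        steps.append("terminate_cluster")
    return steps


if __name__ == "__main__":
    print(run_workflow(lambda: True, lambda: True))
    print(run_workflow(lambda: True, lambda: False))
```

In an orchestrator such as Airflow, this shape would typically map to a sensor for the file arrival, a branching task for the condition check, and a teardown task configured to run regardless of which branch was taken. The exact operators and settings depend on your setup.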