First of all, Synergy Flow follows the "write once, run everywhere" principle, where "write" applies to the workflow definition and "run" to the execution cluster. For instance, Synergy Flow introduces a set of environment-agnostic filesystem actions: declared once, they run equally well on a developer's desktop and on an EMR cluster.
To achieve this, Synergy Flow abstracts the cluster and its filesystem. While it sounds like a Goliath task, it only takes 6 methods to add a new type of execution cluster, and 7 methods for a new filesystem.
In fact, most of these methods are pretty trivial - cp, launch, terminate, etc.
Currently, local, EMR and Qubole clusters are supported, as well as local and S3 filesystems.
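To give a feel for how small such an abstraction can be, here is a minimal sketch. It is not Synergy Flow's actual code: the class names `AbstractFilesystem`, `LocalFilesystem` and `CopyAction`, as well as the `exists` method, are my assumptions for illustration, and only `cp` is named above. The point is that the action talks to the abstract interface only, so the same declaration would work with any filesystem that implements it.

```python
import os
import shutil
from abc import ABC, abstractmethod


class AbstractFilesystem(ABC):
    """Hypothetical interface; only `cp` is a method name taken from the post."""

    @abstractmethod
    def cp(self, uri_source: str, uri_target: str) -> None:
        ...

    @abstractmethod
    def exists(self, uri_path: str) -> bool:
        ...


class LocalFilesystem(AbstractFilesystem):
    """Implementation for the developer's desktop, built on the standard library."""

    def cp(self, uri_source: str, uri_target: str) -> None:
        # Create the target directory if it is missing, then copy the file.
        os.makedirs(os.path.dirname(uri_target) or '.', exist_ok=True)
        shutil.copy(uri_source, uri_target)

    def exists(self, uri_path: str) -> bool:
        return os.path.exists(uri_path)


class CopyAction:
    """Environment-agnostic action: it never touches a concrete filesystem directly."""

    def __init__(self, uri_source: str, uri_target: str) -> None:
        self.uri_source = uri_source
        self.uri_target = uri_target

    def run(self, filesystem: AbstractFilesystem) -> None:
        filesystem.cp(self.uri_source, self.uri_target)
```

An S3-backed class would implement the same two methods with an S3 client, and `CopyAction` would not need to change.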
And finally, I wanted to avoid a skein of hundreds or even thousands of small steps that are difficult, and sometimes impossible, to inspect without specialized tools such as Cytoscape [3]. To address this issue, Synergy Flow introduces the concepts of Action and Step. A Step is an atomic element of the workflow. It contains three categories of actions: a list of pre-actions, a single main action and a list of post-actions.
The idea is simple:
- Each step is centered on the main action - whether it is a Pig query, a Spark script or any other process
- Pre-actions are actions that have to be completed before the main action can take place
For instance: an RDBMS export and further data cleansing, FTP download, etc.
- Post-actions are actions that have to be completed after the main action
For instance: RDBMS import, FTP upload, etc.
You can define how you want the system to behave in case of a failure: either retry the Step from the beginning, or continue from the last known successful action.
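The Step structure and the two failure modes can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the Synergy Flow API: the `Step` class below, its `resume` flag and the three toy actions are all hypothetical. Actions are plain callables, and the step remembers how many of them have completed so that it can either restart or pick up where it stopped.

```python
from typing import Callable, List, Optional

Action = Callable[[], None]


class Step:
    """Hypothetical step: pre-actions, one main action, post-actions."""

    def __init__(self, name: str, main_action: Action,
                 pre_actions: Optional[List[Action]] = None,
                 post_actions: Optional[List[Action]] = None) -> None:
        self.name = name
        self.actions = list(pre_actions or []) + [main_action] + list(post_actions or [])
        self.completed = 0  # number of actions that finished successfully

    def run(self, resume: bool = False) -> None:
        if not resume:
            self.completed = 0  # retry the Step from the beginning
        # With resume=True, skip everything up to the last successful action.
        for action in self.actions[self.completed:]:
            action()
            self.completed += 1


log: List[str] = []
attempts = {'main': 0}


def export_rdbms() -> None:
    log.append('export')


def pig_query() -> None:
    # Fails on the first attempt only, to simulate a transient error.
    attempts['main'] += 1
    if attempts['main'] == 1:
        raise RuntimeError('transient failure')
    log.append('pig')


def import_rdbms() -> None:
    log.append('import')


step = Step('daily_report', pig_query, [export_rdbms], [import_rdbms])
try:
    step.run()
except RuntimeError:
    step.run(resume=True)  # continue from the last known successful action
```

With `resume=True` the export is not repeated after the failure; calling `step.run()` again instead would rerun the pre-action as well, which is the safer choice when pre-actions are not idempotent-sensitive and the data may have gone stale.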
BTW: Synergy Flow allows you to execute the workflow on multiple concurrent EMR clusters, thus eliminating the concurrency bottleneck.
Give it a try!
Cheers!
[1] Synergy Scheduler
https://github.com/mushkevych/scheduler
[2] Synergy Flow
https://github.com/mushkevych/synergy_flow
[3] Cytoscape
http://www.cytoscape.org/