Saturday, September 10, 2016

Synergy Scheduler with workflows

As of release 1.18, Synergy Scheduler [1] comes with a workflow engine - Synergy Flow [2]. While designing it, I kept in mind the kind of system an engineer like myself would want to use.

First of all, Synergy Flow follows the "write once, run everywhere" principle, where "write" applies to the workflow definition and "run" to the execution cluster. For instance, Synergy Flow introduces a set of environment-agnostic filesystem actions - declared once, they run equally well on a developer's desktop and on an EMR cluster.

To achieve this, Synergy Flow abstracts the cluster and its filesystem. While it sounds like a Goliath task, it only takes 6 methods to add a new type of execution cluster, and 7 methods for a new filesystem.
Currently, local, EMR and Qubole clusters are supported, as well as local and S3 filesystems.
In fact, most of these methods are pretty trivial - cp, launch, terminate, etc.
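
To illustrate the idea, here is a minimal sketch of what such a filesystem abstraction could look like. Only cp is named in this post; the class names, the other method names and the archive_report helper are hypothetical and are not Synergy Flow's actual API. The point is that an action written against the interface never needs to know which filesystem it runs on.

```python
import os
import shutil
from abc import ABC, abstractmethod


class AbstractFilesystem(ABC):
    """Hypothetical interface, for illustration only (not Synergy Flow's real API)."""

    @abstractmethod
    def mkdir(self, path: str) -> None: ...

    @abstractmethod
    def cp(self, src: str, dst: str) -> None: ...

    @abstractmethod
    def rm(self, path: str) -> None: ...

    @abstractmethod
    def exists(self, path: str) -> bool: ...


class LocalFilesystem(AbstractFilesystem):
    """Local-disk implementation built on the standard library."""

    def mkdir(self, path: str) -> None:
        os.makedirs(path, exist_ok=True)

    def cp(self, src: str, dst: str) -> None:
        shutil.copy(src, dst)

    def rm(self, path: str) -> None:
        if os.path.isdir(path):
            shutil.rmtree(path)
        elif os.path.exists(path):
            os.remove(path)

    def exists(self, path: str) -> bool:
        return os.path.exists(path)


def archive_report(fs: AbstractFilesystem, src: str, archive_dir: str) -> str:
    """Environment-agnostic action: it depends only on the interface."""
    fs.mkdir(archive_dir)
    dst = os.path.join(archive_dir, os.path.basename(src))
    fs.cp(src, dst)
    return dst
```

An S3-backed class would implement the same handful of methods, and archive_report would run unchanged against it.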

And finally, I wanted to avoid a skein of hundreds or even thousands of small steps that are difficult and sometimes impossible to inspect without specialized tools such as [3]. To address this issue, Synergy Flow introduces the concepts of Action and Step. A Step is an atomic element of the workflow. It contains three categories of actions: a list of pre-actions, a single main action and a list of post-actions.
The idea is simple:
  • Each step is centered on the main action - whether it is a Pig query, a Spark script or any other process
  • Pre-actions are actions that have to be completed before the main action can take place
    For instance: an RDBMS export and further data cleansing, an FTP download, etc.
  • Post-actions are actions that have to be completed after the main action
    For instance: an RDBMS import, an FTP upload, etc.
Synergy Flow is fully integrated with the Synergy Scheduler: job life-cycle, run modes of the workflow, run-time UI dashboard.
You can define how you want the system to behave in case of a failure - to retry the Step from the beginning, or to continue from the last known successful action.
BTW: Synergy Flow allows you to execute the workflow on multiple concurrent EMR clusters, thus eliminating the concurrency bottleneck.
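
The Step structure and the two recovery modes described above can be sketched roughly as follows. This is an illustration under my own assumptions, not Synergy Flow's actual classes: an action is modeled as a plain zero-argument callable, and the Step class, its `completed` counter and the `resume` flag are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: an action is any zero-argument callable.
Action = Callable[[], None]


@dataclass
class Step:
    """Hypothetical Step: pre-actions, one main action, post-actions."""
    name: str
    main_action: Action
    pre_actions: List[Action] = field(default_factory=list)
    post_actions: List[Action] = field(default_factory=list)
    completed: int = 0  # number of actions that have already succeeded

    def actions(self) -> List[Action]:
        return [*self.pre_actions, self.main_action, *self.post_actions]

    def run(self, resume: bool = False) -> None:
        """resume=False restarts the Step from the beginning;
        resume=True continues after the last successful action."""
        if not resume:
            self.completed = 0
        actions = self.actions()
        while self.completed < len(actions):
            actions[self.completed]()  # an exception leaves `completed` untouched
            self.completed += 1


log: List[str] = []
attempts = {"main": 0}


def flaky_main() -> None:
    # Fails on the first attempt only, to simulate a transient error.
    attempts["main"] += 1
    if attempts["main"] == 1:
        raise RuntimeError("transient failure")
    log.append("main")


step = Step(
    name="daily_report",
    pre_actions=[lambda: log.append("export"), lambda: log.append("cleanse")],
    main_action=flaky_main,
    post_actions=[lambda: log.append("upload")],
)

try:
    step.run()
except RuntimeError:
    step.run(resume=True)  # the pre-actions are not repeated
```

With `resume=True` the export and cleansing run only once; calling `step.run()` again instead would repeat them before retrying the main action.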

Give it a try!

Cheers!

[1] Synergy Scheduler
https://github.com/mushkevych/scheduler

[2] Synergy Flow
https://github.com/mushkevych/synergy_flow

[3] Cytoscape
http://www.cytoscape.org/