Wednesday, November 02, 2016

sdpl: schema driven processing language

Practically every company I have worked for had a "meta-information" reader that altered workflow behavior: changing the data source, wrapping certain columns with a UDF, or passing performance tunings to the underlying engine.

Such a system would then grow further by injecting source code headers and footers, managing primary keys, and so on. Soon afterwards it would reach its limits: every new requirement meant reopening the rigid core, and every data or database schema change triggered cascading updates, from the initial input through every JOIN and GROUP BY, causing costly and elusive bugs.

SDPL does not try to fix all of that. For one, it is not a workflow engine and has nothing to do with performance tuning. It was written to address one particular issue: introducing schema manipulation and schema versioning to existing tools such as Pig, Spark or Hive.

SDPL stands for Schema Driven Processing Language. It introduces data schemas to Apache Pig and covers the most common use cases: LOAD DATA, STORE DATA, JOIN BY, GROUP BY and FILTER BY. Goodies, such as CUBE, can be written as quotations. The snippet below illustrates how to load data from a given data repository, associate it with a schema, join, project and store:

A schema is described in a YAML file. For instance:

A data repository is a YAML file representing a connection string to your data storage, whether it is a DB, an S3 bucket or a local filesystem. For instance:

SDPL compiles into Apache Pig; Spark and Hive targets are planned. Typical schema operations are shown below. For example, given two schemas:

we can compute field addition and removal:
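As a rough illustration of the idea in plain Python rather than SDPL syntax, a schema can be modeled as an ordered mapping of field name to type, with addition and removal returning new schemas. The helper names `add_fields` and `remove_fields`, the sample field names and the type strings are all hypothetical and are not taken from SDPL itself:

```python
def add_fields(schema, extra):
    """Return a new schema with the fields of `extra` appended to `schema`."""
    result = dict(schema)
    for name, field_type in extra.items():
        # Assumption: a field present in both schemas must have the same type.
        if name in result and result[name] != field_type:
            raise ValueError(f"conflicting types for field {name!r}")
        result[name] = field_type
    return result


def remove_fields(schema, *names):
    """Return a new schema without the named fields."""
    missing = [name for name in names if name not in schema]
    if missing:
        raise KeyError(f"unknown fields: {missing}")
    return {name: t for name, t in schema.items() if name not in names}


# Hypothetical schemas: field name -> type name.
schema_a = {"a": "integer", "b": "chararray"}
schema_b = {"b": "chararray", "c": "float"}

combined = add_fields(schema_a, schema_b)  # fields a, b, c
trimmed = remove_fields(combined, "a")     # fields b, c

print(list(combined))
print(list(trimmed))
```

Both helpers leave their inputs untouched, so the same source schemas can feed several derived schemas in one script.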

If you have a sizable body of existing code, you can still benefit from SDPL through schema loading, implicit projections and explicit EXPAND:

Schema versioning is performed in a minimalistic way: every field of the schema is declared with a version attribute. While loading the schema, all fields with field.version > ${VERSION} are skipped. Thus, the same schema can be reused by multiple scripts that support different versions of the same data.
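The version rule above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not SDPL's actual loader: the `Field` class, the `load_fields` function and the sample fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str
    version: int  # schema version in which the field was introduced


def load_fields(fields, target_version):
    """Keep only the fields visible at `target_version`.

    Fields with field.version > target_version are skipped,
    mirroring the rule described in the text.
    """
    return [f for f in fields if f.version <= target_version]


# Hypothetical schema whose fields were added over three versions.
schema = [
    Field("id", "integer", 1),
    Field("name", "chararray", 1),
    Field("email", "chararray", 2),
    Field("score", "float", 3),
]

v1_names = [f.name for f in load_fields(schema, 1)]
v2_names = [f.name for f in load_fields(schema, 2)]

print(v1_names)
print(v2_names)
```

A script written against version 1 of the data and a script written against version 2 can therefore share one schema file and differ only in the version they request.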

SDPL is available under the BSD 3-Clause License from GitHub [1] and PyPI [2]. It uses Python 3 and ANTLR4 under the hood.
Give it a try!

Cheers!


[1] https://github.com/mushkevych/sdpl

[2] https://pypi.python.org/pypi/sdpl