The system kept expanding: it injected source code headers and footers, managed primary keys, and so on. It soon reached its limits, as every new requirement meant reopening the rigid core, and every data or database schema change triggered cascading updates, from the initial input through every JOIN and GROUP BY, causing costly and elusive bugs.
SDPL is not trying to fix all of that. For one, it is not a workflow engine and has nothing to do with performance tuning. It was written to address one particular issue: introducing schema manipulation and schema versioning to existing tools such as Pig, Spark or Hive.
SDPL stands for Schema Driven Processing Language. It introduces data schemas to Apache Pig and covers the most common use cases: LOAD DATA, STORE DATA, JOIN BY, GROUP BY and FILTER BY. Goodies, such as CUBE, can be written as quotations. The snippet below illustrates how to load data from a given data repository, associate it with a schema, join, project and store:
A = LOAD TABLE 'table_a' FROM 'repo_a.yaml' WITH SCHEMA 'schema_a.yaml' VERSION 1 ;
B = LOAD TABLE 'table_b' FROM 'repo_a.yaml' WITH SCHEMA 'schema_b.yaml' VERSION 1;
C = LOAD TABLE 'table_c' FROM 'repo_a.yaml' WITH SCHEMA 'schema_c.yaml' VERSION 1;
D = JOIN A BY (A.a), B BY (B.b)
    WITH SCHEMA PROJECTION (A.*, B.column AS new_column);
E = JOIN C BY (C.c), D BY (D.new_column)
    WITH SCHEMA PROJECTION (C.*, D.new_column, D.column AS renamed_column);
STORE D INTO TABLE 'table_d' FROM 'repo_a.yaml';
!Schema
fields:
- !Field {data_type: CHARARRAY, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, name: a, version: 1}
- !Field {data_type: CHARARRAY, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, name: aa, version: 1}
- !Field {data_type: CHARARRAY, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, name: aaa, version: 1}
- !Field {data_type: BOOLEAN, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, name: column, version: 1}
- !Field {data_type: BOOLEAN, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, name: another_column, version: 1}
- !Field {data_type: BOOLEAN, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, name: yet_another_column, version: 1}
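To make the shape of such a schema file a bit more tangible, here is a minimal Python sketch of how it could be modelled in memory. It is not SDPL's actual implementation: the class layout and the `for_version` helper are hypothetical, and the versioning rule (a field tagged with version N is visible in schema version N and later) is my assumption, not something the schema file itself states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    # Attribute names mirror the keys of the !Field entries in the schema file.
    name: str
    data_type: str
    version: int = 1
    default: Optional[str] = None
    is_nullable: Optional[bool] = None
    is_primary_key: Optional[bool] = None
    is_unique: Optional[bool] = None
    max_length: Optional[int] = None


@dataclass
class Schema:
    fields: List[Field] = field(default_factory=list)

    def for_version(self, version: int) -> List[Field]:
        # ASSUMPTION: a field is part of every schema version at or after its own.
        return [f for f in self.fields if f.version <= version]


schema = Schema(fields=[
    Field(name="a", data_type="CHARARRAY"),
    Field(name="column", data_type="BOOLEAN"),
    Field(name="added_later", data_type="INTEGER", version=2),  # hypothetical field
])
print([f.name for f in schema.for_version(1)])
```

Under that assumption, asking for version 1 skips the hypothetical `added_later` field, while asking for version 2 includes it.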
!DataRepository
db: mydb
host: host.the_company.xyz
kwargs: {}
name: repo_a
password: the_password
port: '6789'
user: the_user
A = {
    "a" : {data_type: CHARARRAY, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, version: 1},
    "aa" : {data_type: CHARARRAY, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, version: 1},
    "aaa" : {data_type: CHARARRAY, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, version: 1}
}
B = {
    "b" : {data_type: INTEGER, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, version: 1},
    "bb" : {data_type: INTEGER, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, version: 1},
    "bbb" : {data_type: INTEGER, default: null, is_nullable: null, is_primary_key: null, is_unique: null, max_length: null, version: 1}
}
A = LOAD SCHEMA 'A.yaml' VERSION 1;
B = LOAD SCHEMA 'B.yaml' VERSION 1;
Z = SCHEMA PROJECTION (A.*, B.b, -A.aaa);
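For readers who prefer to see the projection spelled out, here is a rough Python sketch of what `SCHEMA PROJECTION (A.*, B.b, -A.aaa)` could resolve to. The `project` helper is hypothetical and not part of SDPL, schemas are reduced to plain name-to-type maps, and I am assuming that `A.*` takes every field of A and that a leading `-` drops a field from the result.

```python
def project(schemas, terms):
    """Hypothetical helper: resolve projection terms into an ordered field map."""
    result = {}
    for term in terms:
        exclude = term.startswith("-")
        alias, _, name = term.lstrip("-").partition(".")
        if exclude:
            # ASSUMPTION: "-A.aaa" removes field "aaa" from the projection.
            result.pop(name, None)
        elif name == "*":
            # ASSUMPTION: "A.*" expands to all fields of schema A, in order.
            result.update(schemas[alias])
        else:
            result[name] = schemas[alias][name]
    return result


# Simplified stand-ins for schemas A and B: field name -> data type.
A = {"a": "CHARARRAY", "aa": "CHARARRAY", "aaa": "CHARARRAY"}
B = {"b": "INTEGER", "bb": "INTEGER", "bbb": "INTEGER"}

Z = project({"A": A, "B": B}, ["A.*", "B.b", "-A.aaa"])
print(list(Z))  # ['a', 'aa', 'b']
```

Under these assumptions Z ends up with fields `a`, `aa` and `b`, and any later change to schema A would flow into Z without touching the projection itself.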
A = LOAD SCHEMA 'schema_a.yaml' VERSION 1 ;
``` -- vanilla Pig
X = LOAD 'data' AS (``` EXPAND SCHEMA A; ```);
STORE X INTO 'db://db_connection_string';
```
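As an illustration of what the `EXPAND SCHEMA A;` quotation might turn into, the Python sketch below renders a schema as the field list of a Pig `AS (...)` clause. It is a guess at the idea rather than SDPL's real code generator: the `expand_schema` helper and the `PIG_TYPES` mapping are hypothetical, and the fields are taken from the schema A example shown earlier.

```python
# ASSUMPTION: mapping from schema data types to Pig type names.
PIG_TYPES = {"CHARARRAY": "chararray", "INTEGER": "int", "BOOLEAN": "boolean"}


def expand_schema(fields):
    """Hypothetical helper: render (name, data_type) pairs as a Pig field list."""
    return ", ".join(f"{name}:{PIG_TYPES[data_type]}" for name, data_type in fields)


# Fields of schema A as (name, data_type) pairs.
schema_a = [("a", "CHARARRAY"), ("aa", "CHARARRAY"), ("aaa", "CHARARRAY")]

pig = f"X = LOAD 'data' AS ({expand_schema(schema_a)});"
print(pig)  # X = LOAD 'data' AS (a:chararray, aa:chararray, aaa:chararray);
```

The rest of the quoted Pig code stays untouched, so only the schema-dependent part needs regenerating when the schema changes.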
SDPL is available under the BSD 3-Clause License from GitHub [1] and PyPI [2]. It uses Python 3 and ANTLR4 under the hood.
Give it a try!
Cheers!
[1] https://github.com/mushkevych/sdpl
[2] https://pypi.python.org/pypi/sdpl