Declarative data pipeline definition

May 2020

Many data pipelines are defined in a large, more or less imperative script written in Python or Scala. The script contains the logic of the individual steps as well as the code chaining the steps together. When faced with a similar situation in Selenium tests, developers discovered the Page Object pattern, and later many behavior-driven development (BDD) frameworks implemented a split between step definitions and their composition. Some teams are now experimenting with bringing the same thinking to data engineering. A separate declarative data pipeline definition, maybe written in YAML, contains only the declaration and sequence of steps. It states input and output data sets but refers to scripts if and when more complex logic is needed. With A La Mode, we're seeing the first open source tool appear in this space.