Since we last wrote about dbt, we've used it in a few projects and like what we've seen. For example, we like that dbt makes the transformation part of ELT pipelines more accessible to consumers of the data as opposed to just the data engineers building the pipelines. It does this while encouraging good engineering practices such as versioning, automated testing and deployment. SQL continues to be the lingua franca of the data world (including databases, warehouses, query engines, data lakes and analytical platforms) and most of these systems support it to some extent. This allows dbt to be used against these systems for transformations by just building adaptors. The number of native connectors has grown to include Snowflake, BigQuery, Redshift and Postgres, as has the range of community plugins. We see tools like dbt helping data platforms become more "self service" capable.
Data transformation is an essential part of data-processing workflows: filtering, grouping or joining multiple sources into a format that is suitable for analyzing data or feeding machine-learning models. dbt is an open-source tool and a commercial SaaS product that provides simple and effective transformation capabilities for data analysts. The current frameworks and tooling for data transformation fall either into the group of powerful and flexible — requiring intimate understanding of the programming model and languages of the framework such as Apache Spark — or in the group of dumb drag-and-drop UI tools that don't lend themselves to reliable engineering practices such as automated testing and deployment. dbt fills a niche: it uses SQL — an interface widely understood — to model simple batch transformations, while it provides command-line tooling that encourages good engineering practices such as versioning, automated testing and deployment; essentially it implements SQL-based transformation modeling as code. dbt currently supports multiple data sources, including Snowflake and Postgres, and provides various execution options, such as Airflow and Apache's own cloud offering. Its transformation capability is limited to what SQL offers, and it doesn't support real-time streaming transformations at the time of writing.
