dbt continues to be our tool of choice for data transformations in the ELT workflow. We like that it lends itself to engineering rigor and enables practices like modularity, testability and reusability of SQL-based transformations. dbt is available both as an open-source tool and a commercial SaaS product, and it has a healthy ecosystem, including a community hub with packages for unit testing, data quality and data observability, to name a few. Packages worth highlighting include dbt-expectations and dbt-unit-testing, which facilitate data quality checks and unit testing of transformations, respectively. dbt integrates well with a variety of cloud data warehouses, lakehouses and databases, including Snowflake, BigQuery, Redshift, Databricks and Postgres. When working with structured data where transformations can be expressed in SQL, our teams prefer dbt, which is why we're moving it to Adopt.
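To give a flavor of how such packages plug in, the sketch below declares dbt-expectations checks in a model's schema file, assuming the package has been installed via packages.yml; the `orders` model and its `order_total` column are hypothetical, but the test names come from the package itself.

```yaml
# models/schema.yml (model and column names are hypothetical)
version: 2

models:
  - name: orders
    tests:
      # table-level check from dbt-expectations
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: order_total
        tests:
          # column-level range check: order totals should never be negative
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```

Because these checks live next to the model definitions and run via the same CLI as the transformations, they are versioned and executed like any other code.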
Since we last wrote about dbt, we've used it in a few projects and like what we've seen. For example, we like that dbt makes the transformation part of ELT pipelines more accessible to consumers of the data, as opposed to just the data engineers building the pipelines. It does this while encouraging good engineering practices such as versioning, automated testing and deployment. SQL continues to be the lingua franca of the data world (including databases, warehouses, query engines, data lakes and analytical platforms), and most of these systems support it to some extent. This allows dbt to be used against these systems for transformations simply by building adapters. The number of native adapters has grown to include Snowflake, BigQuery, Redshift and Postgres, as has the range of community plugins. We see tools like dbt helping data platforms become more "self-service" capable.
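As a minimal sketch of what this accessibility looks like in practice: a dbt model is just a SELECT statement in a version-controlled file, and the Jinja `ref()` function wires models together so dbt can infer the dependency graph and run them in order. The `stg_orders` staging model referenced here is hypothetical.

```sql
-- models/marts/daily_orders.sql (model names are hypothetical)
{{ config(materialized='table') }}

select
    order_date,
    count(*)          as order_count,
    sum(order_total)  as revenue
from {{ ref('stg_orders') }}  -- ref() resolves the upstream model and records the dependency
group by order_date
```

An analyst who can write this query can own the transformation, while the surrounding project still gets code review, CI and automated tests.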
Data transformation is an essential part of data-processing workflows: filtering, grouping or joining multiple sources into a format that is suitable for analyzing data or feeding machine-learning models. dbt is both an open-source tool and a commercial SaaS product that provides simple and effective transformation capabilities for data analysts. The current frameworks and tooling for data transformation fall into one of two groups: powerful and flexible frameworks such as Apache Spark, which require an intimate understanding of their programming models and languages, or dumb drag-and-drop UI tools that don't lend themselves to reliable engineering practices such as automated testing and deployment. dbt fills a niche: it uses SQL, an interface widely understood, to model simple batch transformations, while providing command-line tooling that encourages good engineering practices such as versioning, automated testing and deployment; essentially, it implements SQL-based transformation modeling as code. dbt currently supports multiple databases and warehouses, including Snowflake and Postgres, and provides various execution options, such as scheduling via Airflow or dbt's own cloud offering. Its transformation capability is limited to what SQL offers, and it doesn't support real-time streaming transformations at the time of writing.
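The command-line workflow is deliberately simple; assuming a project with a configured connection profile, two commands cover the core loop and slot naturally into a CI pipeline:

```shell
dbt run   # compile the SQL models and execute them against the target warehouse
dbt test  # run the schema and data tests defined in the project
```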