When you build data products using data product thinking, it's essential to consider data lineage, data discoverability and data governance. Our teams have found that DataHub can provide particularly useful support here. Although earlier versions of DataHub required you to fork and manage the sync from the main product (if there was a need to update the metadata model), improvements in recent releases have introduced features that allow our teams to implement custom metadata models with a plugin-based architecture. Another useful feature of DataHub is the robust end-to-end data lineage from source to processing to consumption. DataHub supports both push-based integration as well as pull-based lineage extraction that automatically crawls the technical metadata across data sources, schedulers, orchestrators (scanning the Airflow DAG), processing pipeline tasks and dashboards, to name a few. As an open-source option for a holistic data catalog, DataHub is emerging as a default choice for our teams.
Since we first mentioned data discoverability in the Radar, LinkedIn has evolved WhereHows to DataHub, the next generation platform that addresses data discoverability via an extensible metadata system. Instead of crawling and pulling metadata, DataHub adopts a push-based model where individual components of the data ecosystem publish metadata via an API or a stream to the central platform. This push-based integration shifts ownership from the central entity to individual teams, making them accountable for their metadata. As a result, we've used DataHub successfully as an organization-wide metadata repository and entry point for multiple autonomously maintained data products. When taking this approach, be sure to keep it lightweight and avoid the slippery slope leading to centralized control over a shared resource.
Since we first mentioned data discoverability in the Radar, LinkedIn has evolved WhereHows to DataHub, the next generation platform that addresses data discoverability via an extensible metadata system. Instead of crawling and pulling metadata, DataHub adopts a push-based model where individual components of the data ecosystem publish metadata via an API or a stream to the central platform. This push-based integration shifts the ownership from the central entity to individual teams making them accountable for their metadata. As more and more companies are trying to become data driven, having a system that helps with data discovery and understanding data quality and lineage is critical, and we recommend you assess DataHub in that capacity.