Data mesh defines the architectural quantum of a modern data platform as the data product. Supporting this logical view, data pipelines play a core role in moving large amounts of data around an organization. During the development, deployment and maintenance lifecycle of these pipelines, there are several traits engineers should consider as indicators of healthy, robust and scalable pipelines.
Allowing developers to quickly test and prototype on a small scale enables the rapid refinement of ideas and solutions. To support this, developers need low-friction access to environments in which to experiment and evaluate results. One of the easiest ways to provide this is a local environment that parallels production, using tools such as LocalStack and the Serverless Framework. These tools replicate AWS services (as far as possible) and are easy to spin up and tear down, effectively giving each developer a personal production environment with none of the associated cost or risk.
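To make this concrete, here is a minimal sketch of how a pipeline might switch between LocalStack and real AWS. It assumes LocalStack's default edge port (4566); the `PIPELINE_ENV` variable and the helper itself are hypothetical names for illustration, not part of either tool.

```python
import os

def client_kwargs(service, env=None):
    """Build boto3 client arguments, targeting LocalStack when running locally.

    Hypothetical helper: when PIPELINE_ENV is "local", point at LocalStack's
    default edge endpoint with obviously fake credentials (LocalStack accepts
    any); otherwise return only the service name so boto3 uses real AWS config.
    """
    env = env or os.environ.get("PIPELINE_ENV", "local")
    kwargs = {"service_name": service}
    if env == "local":
        kwargs["endpoint_url"] = "http://localhost:4566"
        kwargs["aws_access_key_id"] = "test"
        kwargs["aws_secret_access_key"] = "test"
        kwargs["region_name"] = "us-east-1"
    return kwargs

# Usage (requires boto3 and a running LocalStack container):
#   import boto3
#   s3 = boto3.client(**client_kwargs("s3"))
```

Because the environment switch lives in one place, the same pipeline code runs unchanged against the developer's personal stack and against production.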
CI/CD and infrastructure as code are practices that minimize downtime and de-risk deployments, and a natural extension of this thinking is Phoenix Servers. Though it may seem extreme to completely tear down and redeploy your architecture, the design thinking and discipline required to be able to do so at any point in time are invaluable. Codifying configuration to avoid configuration drift, building tooling for data replay and making every operation against your datastores idempotent are just some of the hurdles that need to be overcome. Even subtle concerns, such as minting deterministic record identifiers, are challenged when considering how a pipeline can be restarted with no impact on downstream (and upstream) consumers.
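One way to mint deterministic record identifiers, sketched below using the standard library's name-based UUIDs: hashing a record's natural key under a fixed namespace yields the same ID on every run, so a restarted or replayed pipeline upserts records rather than duplicating them. The namespace name and key fields are assumptions for illustration.

```python
import uuid

# Fixed namespace for this (hypothetical) pipeline; any stable UUID works.
PIPELINE_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "pipeline.example.com")

def record_id(source, natural_key):
    """Mint a deterministic ID from a record's source and natural key.

    uuid5 is a pure function of (namespace, name), so the same input record
    always maps to the same identifier, no matter when or where it is
    processed.
    """
    return str(uuid.uuid5(PIPELINE_NS, f"{source}:{natural_key}"))
```

Contrast this with random IDs (`uuid4`) or database sequences, both of which would assign a fresh identifier on replay and break downstream consumers that have already seen the record.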
Once your pipeline is deployed and serving production users, the next area of concern is scaling. The ability to scale the throughput of the pipeline based on processing demand leads to a more responsive product with fresher, more up-to-date data. One issue that arises when ingesting data from third parties is handling the problems inherent in that data, such as corrupt records. Being able to identify records you cannot process and defer their processing via dead letter queues minimizes downtime and increases overall throughput. Tooling to replay specific events (often out of order) helps when manually fixing the data, allowing the datastore to reach a complete state sooner. Retry loops with intelligent backoff make contention and intermittent downtime of downstream systems non-issues, while also making the pipeline less brittle.
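The retry-then-defer pattern can be sketched as follows. This is a minimal illustration, not a production implementation: `process` and `send_to_dlq` are hypothetical callables, and a real pipeline would typically hand off to something like an SQS dead letter queue.

```python
import random
import time

def process_with_retries(record, process, send_to_dlq,
                         max_attempts=5, base_delay=0.5):
    """Try to process a record, backing off between failures.

    After max_attempts failures the record is parked on a dead letter
    queue for later (possibly out-of-order) replay, so one bad record
    cannot block the rest of the batch.
    """
    for attempt in range(max_attempts):
        try:
            return process(record)
        except Exception:
            if attempt == max_attempts - 1:
                send_to_dlq(record)
                return None
            # Exponential backoff with jitter, so many workers retrying a
            # contended downstream system do not hammer it in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Note the two distinct failure modes: transient errors are absorbed by the backoff loop, while persistent ones (e.g. corrupt data) end up on the dead letter queue where replay tooling can pick them up after a manual fix.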
Having a holistic understanding of the health of a pipeline is imperative when it is serving production requests. With the rise of serverless components and microservice-based architectures, the need for accurate real-time monitoring increases. Periodic end-to-end tests in real environments can surface blockers in the pipeline, while tracing services allow you to track the flow of data through the system. Correlation identifiers help put the data in perspective, as they can show the processing time at each node in the pipeline and highlight retry loops and bad data poisoning the pipeline. As the products of these pipelines are quite often consumed by other teams, data quality should always be a concern. Frequent automated profiling should be in place to ensure the shape, distribution and volume of data stay within specified thresholds, alerting individuals when an operation causes them to sway outside of these.
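A simple profiling check might look like the sketch below, which compares a batch's row count and null rate for one field against expected thresholds. Real setups would typically reach for a dedicated tool (Great Expectations is one example); the function and field names here are illustrative assumptions.

```python
def profile_batch(rows, field, min_rows, max_null_rate):
    """Profile a batch of records and return a list of alert messages.

    Checks two simple health signals: the batch volume and the null rate
    of a required field. An empty alert list means the batch is within
    its specified thresholds.
    """
    alerts = []
    if len(rows) < min_rows:
        alerts.append(f"volume below threshold: {len(rows)} < {min_rows}")
    nulls = sum(1 for r in rows if r.get(field) is None)
    # Treat an empty batch as fully null so it always trips the check.
    null_rate = nulls / len(rows) if rows else 1.0
    if null_rate > max_null_rate:
        alerts.append(f"null rate for '{field}' too high: {null_rate:.2f}")
    return alerts
```

In practice these checks run on a schedule against each new batch, and a non-empty alert list is what pages the owning team before a downstream consumer notices the bad data.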
These traits are just some indicators of a healthy data pipeline. The order in which they are implemented should be based on your specific needs and the priority areas that are critical to your individual use case.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.