The solution
With data pipelines at the core of PSMA’s business, we sought to build a custom solution. In just 10 months, the combined ThoughtWorks and PSMA team went from discovery and inception to launching the streaming data platform for real-time customer consumption, alongside a suite of Quality Assurance (QA) tools. The team focussed equally on building the platform and lifting the capability of PSMA to extend the platform.
Our solution outcomes included:
Streaming Data Platform
We built the streaming data platform to leverage cloud technologies and minimise operational work.
The platform consists of a series of data pipelines that automatically ingest, validate, sanitise and standardise data as soon as it’s made available by suppliers. From here, internal product teams consume the data at various points in the pipelines and use it to power the products they deliver to their customers and value-added resellers.
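As a rough illustration of the ingest, validate, sanitise and standardise stages described above, the sketch below chains them as plain Python functions. The record shape (`supplier`, `feature_id`, `name`) and all function names are hypothetical, chosen only for illustration; the source does not describe the real schema or implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names; the real supplier schema is not given in the source.
REQUIRED_FIELDS = ("supplier", "feature_id", "name")


@dataclass
class StandardRecord:
    supplier: str
    feature_id: str
    name: str


def validate(raw: dict) -> bool:
    """Accept a record only if every required field is present and non-empty."""
    return all(str(raw.get(key) or "").strip() for key in REQUIRED_FIELDS)


def sanitise(raw: dict) -> dict:
    """Trim surrounding whitespace and collapse internal runs of whitespace."""
    return {key: " ".join(str(value).split()) for key, value in raw.items()}


def standardise(clean: dict) -> StandardRecord:
    """Map a cleaned record onto one common shape and casing."""
    return StandardRecord(
        supplier=clean["supplier"].lower(),
        feature_id=clean["feature_id"],
        name=clean["name"].upper(),
    )


def process(raw: dict) -> Optional[StandardRecord]:
    """Run one supplier record through the stages; return None if rejected."""
    if not validate(raw):
        return None
    return standardise(sanitise(raw))
```

Keeping each stage as a separate function mirrors the idea that product teams can consume data at different points in the pipeline, for example after sanitising but before standardising.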
We used a streaming solution based on Amazon Kinesis so that multiple customers could tap into time series data, with AWS Lambda for compute. Amazon Kinesis scales up and down with the workload, while AWS Lambda reduces operational effort and provides significant horizontal scaling capability.
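To make the Kinesis and Lambda pairing concrete, here is a minimal sketch of a Lambda handler that reads a batch of Kinesis records. Lambda delivers Kinesis payloads base64-encoded under `Records[].kinesis.data`; the assumption that each payload is a JSON document, and the names `decode_records` and `handler`, are illustrative and not taken from the PSMA platform.

```python
import base64
import json


def decode_records(event: dict) -> list:
    """Decode the base64 payload of every Kinesis record in a Lambda event."""
    payloads = []
    for record in event.get("Records", []):
        data = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers write JSON documents to the stream.
        payloads.append(json.loads(data))
    return payloads


def handler(event, context):
    """Lambda entry point: decode the batch and report how many records arrived."""
    payloads = decode_records(event)
    # A real pipeline stage would validate or transform each payload here.
    return {"processed": len(payloads)}
```

Because each invocation handles one batch independently, adding throughput is largely a matter of the stream and function configuration rather than servers to manage, which is the operational saving the paragraph above refers to.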
Data QA
In addition to the technical solution, we developed a process for data quality assurance. The process enables PSMA to quickly verify an entire dataset on an ongoing basis. We also built a system for converting exploratory analysis (in Jupyter notebooks) into ongoing monitoring and alerting, so that the quality of the data remains consistent.
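One way to picture the step from exploratory analysis to ongoing monitoring is to take a metric first computed by hand in a notebook and wrap it as a named check with a threshold. The sketch below does this for a missing-value rate; the check name, threshold and helper functions are hypothetical and are not the tooling PSMA uses.

```python
from typing import Callable, Dict, List

# A check takes the dataset rows and returns a single numeric metric.
Check = Callable[[List[dict]], float]


def null_rate(field: str) -> Check:
    """Build a check measuring the share of rows where `field` is missing or empty."""
    def check(rows: List[dict]) -> float:
        if not rows:
            return 0.0
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        return missing / len(rows)
    return check


def run_checks(rows: List[dict], checks: Dict[str, Check],
               thresholds: Dict[str, float]) -> List[str]:
    """Run every check and return an alert message for each breached threshold."""
    alerts = []
    for name, check in checks.items():
        value = check(rows)
        if value > thresholds[name]:
            alerts.append(f"{name}: {value:.2f} exceeds {thresholds[name]:.2f}")
    return alerts
```

Run on a schedule against each new dataset, a non-empty result from `run_checks` could feed whatever alerting channel is in place.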
Tooling
We implemented tools such as centralised logging, which allows PSMA to see the state of the whole pipeline at any moment, as well as monitoring and alerting, error handling and repeatable local development environments.