Assuring Data Quality at Scale

More and more companies are becoming data- and AI-driven. Data is the lifeblood of many of the systems we build and of the business decisions made today. Customer experience and the customer journey, which in turn drive the P&L for businesses, rely on data that is captured and fed into our systems. It is imperative that this data is of the highest quality and stays that way.

The quality of data has a direct impact on the quality, accuracy and relevance of ML model output. It also has a proportional impact on the cost of running data engineering pipelines, whether stream or batch. Following the Data Mesh pattern of building platform capabilities that power decentralized data products, this talk lays out an approach to implementing Data Quality at scale and the key steps in providing confidence and trust in the data produced and consumed by data product teams.

This presentation covers:
* Data Quality challenges in modern data-driven enterprises, from both stream and batch perspectives
* Dimensions and metrics of Data Quality
* Key parts of, and an approach to building, a Data Quality platform at scale that provides near-real-time visibility into DQ issues
* Fitting this capability into the data ecosystem, including triggering remediation actions such as stopping a data pipeline
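
To make the last two ideas concrete, here is a minimal sketch, not taken from the talk, of how two commonly cited quality dimensions (completeness and uniqueness) might be measured on a batch and used to stop a pipeline. All names here (`QualityReport`, `DataQualityError`, `measure`, `enforce`), the sample records, and the thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    completeness: float  # share of records with every required field non-null
    uniqueness: float    # distinct key values divided by record count


class DataQualityError(Exception):
    """Raised to halt a pipeline when a quality threshold is breached."""


def measure(records, required_fields, key_field):
    """Compute completeness and uniqueness for a batch of dict records."""
    total = len(records)
    if total == 0:
        # Assumption: an empty batch is treated as failing both dimensions.
        return QualityReport(completeness=0.0, uniqueness=0.0)
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    distinct_keys = len({r.get(key_field) for r in records})
    return QualityReport(complete / total, distinct_keys / total)


def enforce(report, min_completeness=0.95, min_uniqueness=1.0):
    """Remediation hook: raise so the calling pipeline step stops."""
    if report.completeness < min_completeness:
        raise DataQualityError(
            f"completeness {report.completeness:.2f} < {min_completeness}"
        )
    if report.uniqueness < min_uniqueness:
        raise DataQualityError(
            f"uniqueness {report.uniqueness:.2f} < {min_uniqueness}"
        )


# Hypothetical batch: one null email and one duplicated id.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]

report = measure(batch, required_fields=["id", "email"], key_field="id")

try:
    enforce(report)
    pipeline_stopped = False
except DataQualityError as exc:
    # In a real pipeline this is where the downstream load would be skipped.
    pipeline_stopped = True
    print(f"Pipeline halted: {exc}")
```

In a real platform the same check would more likely run inside an orchestrator or stream processor and publish the metrics for monitoring; raising an exception is simply the smallest way to show a quality gate blocking downstream steps.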
