Recently I read a very informative article by Stephen Catanzano in TechTarget (Avoid data sprawl with a single source of truth). To be honest, this is an age-old challenge, and it's getting worse. IDC states that by 2025 the global datasphere will grow to 175 zettabytes and that 90% of the data in the world is a replica. Why does this matter? As Stephen points out in his article, a single source of truth is a fundamental concept in data management, meaning there should only ever be one place where a specific data item is available.
Does a single source of truth (SSOT) mean everything is stored in a single location? No, it’s really all about availability.
In fact, if we examine the principles of data fabric or, in particular, data mesh, they absolutely advocate the SSOT concept in terms of access and availability.
So can we ever achieve a single source of truth? I'd personally turn the question around: what does the single source of truth help with, and how do we achieve it?
Stephen states, "The purpose of an SSOT is to ensure that all stakeholders in a system have access to consistent, accurate, and up-to-date information. This helps reduce errors, inconsistencies, and misunderstandings that can arise when multiple versions of the same data exist." That makes sense.
As we know, in the real world, data is replicated, data is transformed, and data is moved, for example, from one business application to another. Each time one of these events occurs on an organization's data, a data pipeline is involved.
Data observability at the source monitors the data pipelines from within. For example, if your data pipeline is built using Apache Spark, then data observability agents will be embedded into your Apache Spark pipeline, observing data at run-time.
With data observability now built into your data architecture, you will begin to benefit from the observations sent to the platform.
Data observability provides not only insights into the health of your data but also automates metadata collection. This metadata can be used to automatically update your data catalogs as well as to provide end-to-end data lineage. This is key to understanding your single source of truth. By accepting that data is replicated (which I think we all accept), having the lineage at hand means that you can be sure of its provenance.
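To make the lineage point concrete, here is a minimal sketch of how recorded lineage lets you trace a replicated dataset back to its origins. The dataset names and the `lineage` mapping are hypothetical and purely illustrative; this is not the API of any particular observability platform, and it assumes the lineage graph has no cycles.

```python
# Hypothetical lineage metadata: each dataset maps to the datasets it was derived from.
lineage = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["crm_replica", "orders_clean"],
    "crm_replica": ["crm"],
    "orders_clean": ["orders"],
}


def origins(dataset, lineage):
    """Walk upstream until reaching datasets with no recorded parents."""
    parents = lineage.get(dataset, [])
    if not parents:
        # Nothing upstream is recorded, so treat this dataset as an origin.
        return {dataset}
    found = set()
    for parent in parents:
        found |= origins(parent, lineage)
    return found


sources = sorted(origins("sales_dashboard", lineage))
print(sources)
```

With lineage like this collected automatically, a replica such as `crm_replica` is no longer a threat to the single source of truth, because anyone can see which system it ultimately came from.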
Data observability can also be set to measure other key attributes of your data that help you make critical business decisions on how “truthful” the data may be. These measures include freshness, automated anomaly detection, and cardinality.
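As a rough illustration of two of these measures, the sketch below computes freshness and cardinality over a small batch of records. The function names, field names, and sample data are assumptions made for this example; a real observability agent would collect equivalent metrics from inside the pipeline rather than through code like this.

```python
from datetime import datetime, timedelta, timezone


def freshness(records, ts_field, now):
    """Age of the most recent record: now minus the latest timestamp."""
    latest = max(r[ts_field] for r in records)
    return now - latest


def cardinality(records, field):
    """Number of distinct non-null values observed in a field."""
    return len({r[field] for r in records if r[field] is not None})


# Hypothetical batch of records flowing through a pipeline step.
now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
batch = [
    {"customer_id": "c1", "updated_at": now - timedelta(hours=30)},
    {"customer_id": "c2", "updated_at": now - timedelta(hours=2)},
    {"customer_id": "c1", "updated_at": now - timedelta(hours=5)},
]

observations = {
    "freshness_hours": freshness(batch, "updated_at", now).total_seconds() / 3600,
    "customer_id_cardinality": cardinality(batch, "customer_id"),
}
print(observations)
```

Tracked over successive runs, values like these give a baseline, and a sudden change in either one is the kind of signal that anomaly detection can flag.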
Finally, data observability can be implemented within the data pipeline so that if a “rule” is violated, for example a freshness threshold, a “circuit breaker” kicks in and halts the execution of the pipeline, thus protecting the data and the single source of truth.
In conclusion, whether you can ever reach the utopia of a single source of truth is debatable. However, data observability at the source, implemented within the data pipelines, will provide alerts and events (such as circuit breakers) that enhance the quality of, and trust in, the data, which is essential for the concept of a single source of truth.
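The circuit-breaker idea can be sketched in a few lines. Everything here is hypothetical: `CircuitBreakerTripped`, `check_freshness`, and `run_pipeline` are names invented for this example, and the pipeline steps are stand-in functions, so read it as one possible shape under those assumptions rather than how any specific platform implements it.

```python
from datetime import datetime, timedelta, timezone


class CircuitBreakerTripped(Exception):
    """Raised to stop the pipeline when an observed rule is violated."""


def check_freshness(latest_update, now, max_age):
    """Freshness rule: the newest input data must not be older than max_age."""
    age = now - latest_update
    if age > max_age:
        raise CircuitBreakerTripped(f"data is {age} old, limit is {max_age}")


def run_pipeline(latest_update, now, max_age, steps):
    """Run the steps in order, but only if the freshness rule passes first."""
    completed = []
    try:
        check_freshness(latest_update, now, max_age)
        for step in steps:
            completed.append(step())
    except CircuitBreakerTripped as exc:
        # Halt before anything downstream is written from stale inputs.
        return {"status": "halted", "reason": str(exc), "completed": completed}
    return {"status": "ok", "reason": None, "completed": completed}


# Stand-in pipeline steps for illustration only.
steps = [lambda: "transformed", lambda: "loaded"]
now = datetime(2023, 5, 1, tzinfo=timezone.utc)
max_age = timedelta(days=1)

stale_result = run_pipeline(now - timedelta(days=3), now, max_age, steps)
fresh_result = run_pipeline(now - timedelta(hours=1), now, max_age, steps)
print(stale_result["status"], fresh_result["status"])
```

The design choice worth noting is that the rule is checked before any step runs, so stale inputs never reach the downstream tables that consumers treat as the truth.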
If you want to experience the Kensu data observability platform and validate for yourself the power of Kensu and its ability to help you build your single source of truth, do not hesitate to join our monthly introduction to data observability or directly reach out to our team here.