Fundamentals of Data Observability Driven Development (DODD)
Genesis of DODD
Before today’s data era, data was already available for use—but it wasn’t as voluminous, ubiquitous, or promising. Data engineers and data scientists had few resources, technologies, or even data to work on, but they also had little corporate pressure to use data to create new value for the organization. Therefore, it was easier for data teams to find the time to write, verify, and implement their applications—and they had the advantage of better quality control because the volume of data was minimal. This made it easier to keep projects under control, as problems were simpler to detect and solve.
Then, the world of big data and AI emerged.
Suddenly, the volume of data was growing exponentially, and new technologies and algorithms were emerging to support its heavy use. Companies saw this “data evolution” as a new source of potential revenue and, under the promise of significant monetization, invested heavily in resources and staff. Gone were the days when “data projects” were scattered across the company and managed by a few analysts. Companies grasped the value data could deliver, seized the moment, and began to move their data efforts up several gears. Data teams came under constant pressure to deliver new projects that would enable their company to make smarter purchase recommendations to customers, identify new segments, create new products, and better understand their customers.
This acceleration of data usage has progressively pushed companies into a situation where, as the expectations of business teams increase, data teams are launching more and more applications to handle an ever-growing volume of data-related projects. However, unlike a few years ago, it has become impossible for data teams to take the time to verify data manually. This inevitably leads to quality issues and, without data quality defenses, to wrong decisions based on flawed data.
To avoid this situation, companies need data monitoring at scale, which requires automation.
Principles of DODD
There are numerous ways to automate data monitoring, but most do not scale very well. To define a solution that helps data teams scale data usage efficiently and fosters a data-driven culture, we identified 3 key principles that data observability must follow to be effective and sustainable.
#1 Contextual observability
Data observability should not only provide data teams with information on the data itself; it must also provide the context of its usage. This includes when, how, and, above all, which applications consume, use, and produce data. In fact, we believe that understanding and monitoring how data is used is as important as monitoring the data itself.
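To make this concrete, here is a minimal Python sketch of in-application contextual observability. Everything in it is illustrative: the application name, the observe collector, and read_with_context are hypothetical stand-ins, not the API of any particular tool.

```python
import json
import time

import pandas as pd

APP_NAME = "daily_sales_report"  # hypothetical application name


def observe(event: dict) -> None:
    # Stand-in for a real collector: emit the event as one JSON line.
    print(json.dumps(event))


def read_with_context(path: str) -> pd.DataFrame:
    # Read a dataset and record which application read it, when,
    # and what the data looked like at that exact moment.
    df = pd.read_csv(path)
    observe({
        "app": APP_NAME,
        "operation": "read",
        "dataset": path,
        "rows": len(df),
        "columns": list(df.columns),
        "timestamp": time.time(),
    })
    return df


if __name__ == "__main__":
    # Tiny demo: produce a file, then read it with context attached.
    pd.DataFrame({"sku": ["a", "b"], "qty": [1, 2]}).to_csv("sales.csv", index=False)
    read_with_context("sales.csv")
```

The point of the sketch is that every event carries the application context alongside the data metrics, so the question “which application touched this dataset?” is answered by construction.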
#2 Synchronized observability
Data observability must be performed at the exact moment of data use to avoid any lag between monitoring and use. This is the most reliable approach to ensuring the quality of the data consumed and used by applications. It also helps ensure that assumptions about the data are still valid.
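As a sketch of what this can look like in practice (the column names and checks below are assumptions chosen for illustration, not a prescribed rule set), the application validates its expectations at the very moment it consumes the data:

```python
import io

import pandas as pd


def read_orders(source) -> pd.DataFrame:
    # Checks run inline, at the exact moment of use, so there is no lag
    # between monitoring and consumption.
    df = pd.read_csv(source)
    assert {"order_id", "amount"} <= set(df.columns), "schema drifted"
    assert df["amount"].ge(0).all(), "assumption violated: negative amounts"
    assert df["order_id"].is_unique, "assumption violated: duplicate order ids"
    return df


# Demo with an in-memory file standing in for the real dataset.
sample = io.StringIO("order_id,amount\n1,10.5\n2,3.0\n")
print(read_orders(sample))
```

If an assumption stops holding, the application fails at the point of use rather than silently propagating bad data downstream.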
#3 Continuous validation
Data observability must be continuously executed during the successive implementation phases (e.g., development, testing, acceptance, and production), along with the validation of the integrity of the code. Continuous integration (CI) guarantees the quality of the code from the very beginning of the development cycle through the acceptance phase. Similarly, data applications should continuously validate the data even after deployment in production.
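One way to picture this (a hedged sketch; validate_orders and the fixture are hypothetical) is a single validation function shared between the CI test suite and the production job, so the very same checks run in every phase:

```python
# test_orders_data.py: runs under pytest in CI; the production job calls
# validate_orders on live data, so the same checks survive deployment.
import pandas as pd


def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    # Shared validation used by both CI tests and the production pipeline.
    assert {"order_id", "amount"} <= set(df.columns), "schema drifted"
    assert df["amount"].ge(0).all(), "negative amount"
    return df


def test_orders_meet_expectations():
    # CI phase: exercise the production checks against a small fixture.
    fixture = pd.DataFrame({"order_id": [1, 2], "amount": [10.5, 3.0]})
    validate_orders(fixture)
```

Because validation lives in one place, there is no drift between what CI verifies before release and what the application verifies in production.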
From these 3 essential principles, we’ve concluded that the most efficient approach to automating data monitoring is to apply data observability from within the applications themselves. While this method might seem a bit more intrusive than the usual approaches that scan files, databases, and other storage, it comes with more benefits than disadvantages for data teams.
Benefits of DODD
We have grouped the benefits of the Data Observability Driven Development (DODD) method into 5 main categories:
#1 Improved analysis, troubleshooting, and prevention
Contextual and synchronized observability provides precise information about both the data quality and the applications using the data. This context makes it easier for data teams to understand data and its uses, troubleshoot issues, and avoid datastrophes. For example, data quality rules could detect that a problem has arisen in a dashboard, where some of the displayed values are obviously wrong. The data team could then quickly troubleshoot the problem by tracing backward through the lineage to identify which of the incoming datasets or applications is the source of the problem.
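Because contextual observability records which application produced which dataset, this backward analysis can be as simple as walking a lineage graph. The graph and names below are a made-up example for illustration only:

```python
# Hypothetical lineage recorded by contextual observability: each dataset
# maps to the application that produced it and to its input datasets.
LINEAGE = {
    "sales_dashboard": {"app": "dashboard_builder", "inputs": ["sales_agg"]},
    "sales_agg": {"app": "aggregate_job", "inputs": ["raw_sales", "fx_rates"]},
    "raw_sales": {"app": "ingest_job", "inputs": []},
    "fx_rates": {"app": "fx_loader", "inputs": []},
}


def upstream(dataset: str) -> list[tuple[str, str]]:
    # Walk backward from the failing dataset, collecting every upstream
    # dataset and the application that produced it.
    stack, seen, suspects = [dataset], set(), []
    while stack:
        current = stack.pop()
        if current in seen or current not in LINEAGE:
            continue
        seen.add(current)
        node = LINEAGE[current]
        suspects.append((current, node["app"]))
        stack.extend(node["inputs"])
    return suspects


# Wrong values in the dashboard? List every upstream dataset and the
# application that produced it as candidates to investigate.
print(upstream("sales_dashboard"))
```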
#2 Stronger involvement and accountability
Continuous observability from within applications also increases the involvement and accountability of the teams developing them. Indeed, because they implement data observability in the code as they write it, they must understand not only how the data is supposed to be used but also how its quality must be controlled. This ownership by the development team leads to better coding, faster debugging, and more creativity.
#3 Easier maintenance
Integrating data observability into applications makes it easier to maintain and debug code, since no additional modules are required. Because all the logging and tracing information comes directly from the applications, discovered issues can be linked to the applications requiring maintenance. Moreover, maintenance stays under control because there is visibility into the potential impact of any change made to an application.
#4 Automated, complete documentation
Contextual observability helps better document issues, as it provides insight not only about the data but, more importantly, about how the data is used and about the multiple applications processing and producing it. This information is essential because it shows how the data can be reused.
#5 Higher reliability
Continuous validation significantly improves the reliability of applications since, as in the Test-Driven Development (TDD) method, data teams must validate the quality of the data during the various development phases.
What’s next?
Data Observability Driven Development is a paradigm shift for data teams. Just as continuous integration has changed the way applications are developed, we believe DODD will change how data quality is managed during development and production cycles.
If you have any questions or comments regarding the method, if you would like to challenge this approach, or if you would like to join our community of DODD enthusiasts, we invite you to connect with our CEO Andy Petrella or to get in touch with our team.