Genesis of DODD
Before the data era, data engineers and data scientists had few resources, few technologies, and little data to build on. But they also faced little pressure from the business to create new value and, above all, they could more easily find time to write, check, and implement their applications. This made quality easier to control. Projects were easier to keep in hand and, when problems arose, it was simpler to find the right person to troubleshoot them.
Then the world of data emerged: not only did data volumes grow exponentially, but new technologies and algorithms also appeared to support their intensive usage. Companies saw this evolution as a new source of potential revenue and, on the promise of significant monetization, invested heavily in resources and staff. The time when “data projects” were scattered across the company and handled by a couple of analysts was over. It was now time to move up a gear, with data teams under constant pressure to deliver new projects to make smarter purchase recommendations, to identify new segments, to create new products, or even to better understand clients.
This acceleration of data usage has progressively led companies into a situation that harms data-based decision-making. While the expectations of business teams keep increasing, data teams are releasing more and more applications that handle an ever-growing volume of projects, which inevitably leads to quality issues if safeguards for data quality are not in place. And unlike a few years ago, it is no longer possible to take the time to control that quality manually.
Data monitoring needs to be extended and automated to scale up, which leads to data observability. But how can you implement it?
Principles of DODD
Data observability can be done in different ways, but not all of them are optimal. To define a solution that helps data teams and data usage scale up efficiently while also nurturing a data-driven culture, we have identified three main principles that data observability must follow to be effective and sustainable.
#1 Contextual observability
Data observability should provide data teams with information not only about the data itself, but also about the context of its usage: when, how, and by which applications it is consumed and produced. In fact, we believe that understanding and monitoring the way data is used is as important as monitoring the data itself.
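As a minimal sketch of what a contextual observation might look like, the Python snippet below records a few statistics about a data set together with the application and operation that touched it. The `Observation` class, the `observe` function, and the field names are hypothetical and chosen for illustration only; they are not part of any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Observation:
    """One record tying data statistics to the context of their usage."""
    application: str   # which application touched the data
    operation: str     # how it was used, e.g. "read" or "write"
    data_source: str   # logical name of the data set
    row_count: int
    null_counts: dict  # column name -> number of missing values
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def observe(application, operation, data_source, rows):
    """Summarize `rows` (a list of dicts) together with its usage context."""
    columns = {key for row in rows for key in row}
    null_counts = {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in sorted(columns)
    }
    return Observation(application, operation, data_source, len(rows), null_counts)


# Illustrative data: the second order has no amount.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
]
obs = observe("recommendation-job", "read", "orders", orders)
print(obs.application, obs.operation, obs.row_count, obs.null_counts)
```

The point of the sketch is that the statistics (row count, missing values) never travel without the "who, how, and when" that produced them.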
#2 Synchronized observability
Data observability should be executed at the exact moment of data usage to avoid any lags between monitoring and usage. This is the most trustworthy approach to guarantee the quality of data being consumed and used by applications.
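One possible way to keep observation and usage in sync is to wrap the function that reads the data, so that metrics are computed on exactly the rows the application receives. The sketch below assumes a hypothetical `observed` decorator and an in-memory `observations` list standing in for a real observability backend.

```python
import functools

# Stand-in for a real observability backend (assumption for this sketch).
observations = []


def observed(data_source):
    """Decorator that records metrics at the moment the data is read."""
    def decorator(read_fn):
        @functools.wraps(read_fn)
        def wrapper(*args, **kwargs):
            rows = read_fn(*args, **kwargs)
            # Metrics are computed on the rows actually returned to the
            # caller, so there is no lag between monitoring and usage.
            observations.append({
                "data_source": data_source,
                "reader": read_fn.__name__,
                "row_count": len(rows),
            })
            return rows
        return wrapper
    return decorator


@observed("customers")
def load_customers():
    # Placeholder for a real database or file read.
    return [{"id": 1}, {"id": 2}, {"id": 3}]


customers = load_customers()
print(observations)
```

Because the observation happens inside the call, a separate scanner running minutes or hours later cannot report on a different state of the data than the one the application used.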
#3 Continuous validation
Data observability should be executed continuously throughout the successive implementation phases (e.g. development, testing, acceptance, production), alongside the validation of the integrity of the code. Just as Test Driven Development requires the quality of the code to be guaranteed from the very beginning of the development cycle, the same should hold for the data.
We therefore concluded that the most efficient way to follow these principles is to perform data observability from within the applications. While this method might seem a bit more invasive than the usual scanning approach, it offers more benefits than drawbacks for data teams.
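To illustrate the parallel with Test Driven Development, here is a rough sketch in which the same data rule is exercised by tests during development and enforced by the application at run time. The function names (`check_amounts`, `total_revenue`) and the rule itself are assumptions made up for this example.

```python
def check_amounts(rows):
    """Return a list of violations; an empty list means the data passed."""
    violations = []
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if amount is None:
            violations.append(f"row {i}: amount is missing")
        elif amount < 0:
            violations.append(f"row {i}: amount is negative ({amount})")
    return violations


def total_revenue(rows):
    """Application logic that validates its input before using it."""
    violations = check_amounts(rows)
    if violations:
        raise ValueError("; ".join(violations))
    return sum(row["amount"] for row in rows)


# Tests written alongside the code (runnable with a test runner such as
# pytest): the data rule is checked in development just as it is in production.
def test_total_revenue_accepts_valid_data():
    assert total_revenue([{"amount": 10.0}, {"amount": 5.0}]) == 15.0


def test_total_revenue_rejects_negative_amounts():
    try:
        total_revenue([{"amount": -1.0}])
    except ValueError:
        return
    raise AssertionError("expected a ValueError for a negative amount")
```

Since `check_amounts` lives inside the application, the rule follows the code through every phase instead of being added after deployment.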
Benefits of DODD
We have grouped the benefits of the Data Observability Driven Development method into five main categories:
#1 Improved analysis, troubleshooting, and prevention
Contextual and synchronized observability provide precise information about both the quality of the data and the applications using it, making it easier for data teams to understand data and its usage, to troubleshoot issues, and to prevent datastrophes. For instance, data quality rules could detect that a problem has arisen in a dashboard, with some figures being off the charts. The data team could then quickly troubleshoot the problem by tracing it backward to determine which incoming data set or application is its source.
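The backward analysis described above can be sketched as a walk through lineage information. The example below is a simplified illustration under stated assumptions: the `lineage` mapping, the data set names, and the `failed` set are invented, the lineage is assumed to be acyclic, and only inputs that themselves failed a rule are followed.

```python
# Hypothetical lineage: each data set maps to the inputs it was built from.
lineage = {
    "sales_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders", "exchange_rates"],
    "orders": [],
    "exchange_rates": [],
}

# Hypothetical outcome of the data quality rules: these data sets failed.
failed = {"sales_dashboard", "daily_revenue", "exchange_rates"}


def root_causes(dataset, lineage, failed):
    """Return the furthest-upstream failing data sets reachable from `dataset`.

    Assumes the lineage graph has no cycles.
    """
    failing_inputs = [d for d in lineage.get(dataset, []) if d in failed]
    if not failing_inputs:
        # No failing input: this data set is where the problem starts.
        return [dataset] if dataset in failed else []
    causes = []
    for upstream in failing_inputs:
        for cause in root_causes(upstream, lineage, failed):
            if cause not in causes:
                causes.append(cause)
    return causes


print(root_causes("sales_dashboard", lineage, failed))
```

In this made-up scenario the walk ends at `exchange_rates`, which tells the team where to start looking rather than leaving them to inspect every input of the dashboard.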
#2 Stronger involvement and accountability
Continuous validation increases both the involvement in data observability and the accountability of the teams developing the applications. Indeed, because they implement data observability within the code as they write it, they have to understand not only how the information will be used afterwards, but also how its quality should be controlled. This sense of ownership leads to better coding, faster debugging, and more creativity.
#3 Easier maintenance
Embedding data observability within the applications facilitates the maintenance and debugging of code since no other module is required. All the material is self-contained while remaining connected to the other applications. There is therefore no need to combine information coming from various sources (e.g. data profiling, application observability, …).
#4 Automated complete documentation
Contextual observability improves documentation because it provides information not only about the data but also about its usage and about the multiple applications processing and producing it. This information is key, as it offers a good understanding of the whole system.
#5 Higher reliability
Continuous validation can noticeably improve the reliability of the applications since, as with Test Driven Development, data teams have to validate the quality of the data during the different phases of development.
Data Observability Driven Development is a paradigm shift for data teams. Just as Test Driven Development changed how applications are developed, we believe that DODD will change the way data quality is managed during the development and production cycles.
If you have any questions or comments regarding the method, or if you wish to challenge this approach, we invite you to connect with our CEO Andy Petrella, to get in touch with our team, but also to join our community of DODD enthusiasts!