
eBook | A Guide to Understanding Data Observability

Disclaimer: This post is a preview of the eBook. If you wish to directly download it, click here


Introduction

Have you ever gotten a message from your CFO asking, "What's going on? Why did sales drop almost 30% in one day?" While the data may say sales are rapidly tanking, you both know that it's unlikely your sales took that dramatic of a nosedive in a single day. More likely, it's your data that's wrong.

You start by looking at your monitoring dashboards for your data pipelines. As far as you can tell, everything seems to be running correctly. However, what you're not currently monitoring, and don't have insight into, is the data processed by your applications. Moreover, without any context as to why this sudden change in the data is occurring or where to start looking for the issue, you know getting to the root cause will be difficult.

Because you must handle the issue manually, your main strategy is to rely on basic tools (e.g. a SQL command-line interface) to troubleshoot and to seek help from other data engineers and analysts. But none of these methods guarantees you'll find the issue quickly.

You start your investigation by running a data dump to view what the data looks like at each stage of the process, from ingestion to analysis. However, once you get the data, you realize it isn't helpful because you have no information about what the data used to look like, so you can't pinpoint what recently changed to cause the issue. You need a history of the data, especially of how it was used to create the final sales analysis, but this history doesn't exist.

Add to these issues the fact that someone else wrote the code for the data pipeline months ago. You track them down to help, but between both of you, all you've got to help resolve the issue is an outdated document and the other person's vague memory of what they did. Consequently, your confidence in changing anything in the process logic is very low. Meanwhile, your CFO is anxiously waiting for you to fix the issue so they can deliver accurate sales numbers to the CEO.

You've got what we call a full-blown datastrophe on your hands.

What is a datastrophe?

A datastrophe results from a gap in data management that leads to a "catastrophic" impact on the business. Unfortunately, it's also becoming an increasingly common problem. In one study, 42% of businesses said they had struggled with data issues such as inaccurate data.

As data ecosystems have become increasingly complex, with multiple systems of data pipelines, applications, and numerous inputs of large volumes of data, the chances of a datastrophe have also increased. Not only has it become much more difficult to troubleshoot an issue in a complex system, it's also much more time-consuming to do so.

A datastrophe can be caused by an inaccurate, incomplete, or inconsistent data set; by unanticipated data drifts (e.g. null values, missing columns, wrong formats, or high variations); by data logic failures in applications (e.g. wrong SQL syntax, wrong grouping, or wholesale row-dropping); or by all of these combined. The only certainty is that whatever is causing the problem, it's having a negative impact on your business.
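To make these failure modes concrete, here is a minimal sketch, in Python with pandas, of the kind of checks that would catch them. The column names, thresholds, and baseline are hypothetical, purely for illustration, not a production implementation.

```python
import pandas as pd

# Columns the downstream sales analysis expects (hypothetical).
EXPECTED_COLUMNS = {"order_id", "order_date", "amount", "region"}

def check_batch(df: pd.DataFrame, baseline_mean: float, tolerance: float = 0.3) -> list:
    """Return a list of human-readable issues found in an incoming batch."""
    issues = []

    # Missing columns: e.g. a field dropped upstream without warning.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")

    if "amount" in df.columns:
        # Null values in a critical field.
        null_rate = df["amount"].isna().mean()
        if null_rate > 0.01:
            issues.append(f"'amount' null rate {null_rate:.1%} exceeds 1%")

        # High variation: the batch mean is far from a historical baseline.
        batch_mean = df["amount"].mean()
        if baseline_mean and abs(batch_mean - baseline_mean) / baseline_mean > tolerance:
            issues.append(
                f"'amount' mean {batch_mean:.2f} drifted more than "
                f"{tolerance:.0%} from baseline {baseline_mean:.2f}"
            )

    return issues
```

A check like this, run after each pipeline step, would flag a sudden 30% drop at ingestion or transformation time rather than at the CFO's dashboard.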

Data issues can be caused for multiple reasons, but the most common are:

Regulatory

Changes in data privacy or other data regulations may require modifications in how data is collected, ingested, transformed, or stored, which can create unforeseen issues.

Human error

Often, data issues are caused by simple human error: someone accidentally deletes a field or column without realizing it, or introduces a regression when updating an application with untested logic.

Business demands

Different business use cases may require different configurations of the data. One business use case may not need addresses, for example, so a business user may request that the "address" column be removed. However, someone else using the same dataset may need addresses for their use case, and their analysis becomes incorrect when this information is left out of the data.
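One lightweight way to guard against this kind of conflict is a declared contract between a dataset and its consumers. The sketch below is illustrative Python; the consumer names and columns are hypothetical.

```python
# Each downstream consumer declares the columns it depends on, so dropping a
# column for one use case is flagged before it silently breaks another.
REQUIRED_COLUMNS = {
    "sales_report": {"order_id", "order_date", "amount"},
    "delivery_routing": {"order_id", "address", "region"},
}

def missing_columns_by_consumer(dataset_columns):
    """Map each consumer to the columns it needs that the dataset no longer has."""
    return {
        consumer: needed - set(dataset_columns)
        for consumer, needed in REQUIRED_COLUMNS.items()
        if needed - set(dataset_columns)
    }

# The "address" column was removed to satisfy one business request...
print(missing_columns_by_consumer(["order_id", "order_date", "amount", "region"]))
# {'delivery_routing': {'address'}} -- the other use case would have broken silently.
```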

One of the most challenging aspects of data issues is that those who change the data or the application often don't realize the implications of the changes they've made. And, unfortunately, the issue usually isn't discovered until the end of the data value chain.

Usually, it's business users, like the CFO in our story above, who are running reports and realize, through gut feeling and previous experience with the data, that the numbers "don't look right." But at that point, it's already too late: business decisions may already have been made based on faulty information before the inaccuracies were discovered. Or the data may be needed in near real time, such as for deciding whether to run end-of-day promos.

With no time to fix data issues, IT and DevOps teams scramble to figure out who has the knowledge and skills to help resolve the issue. Yet, it’s often not even clear who is responsible or what knowledge and skills you need to address the issue. Analysts? Engineers? DevOps?

And the responsibility can change from one moment to the next. Perhaps the analyst made a change in how some information is calculated that is now impacting the sales reports, but perhaps even before that, the Data Platform team adjusted one of the connections supporting the sales analytics tool the CFO uses to run the sales reports.

Like in the example at the start, everyone is relying on everyone else's memory of what they did or didn't do and is manually trying to find and fix the issue. No one has a clear understanding of which fields or tables affect downstream data consumers, and the only notifications they have set up are basic failure alerts.

The expense and time involved in resolving a datastrophe, and its negative impact on business productivity, sales, overall revenue, and even reputation, can be significant. According to Dun & Bradstreet, almost one in five businesses say they've lost a customer due to incomplete or inaccurate data, and nearly a quarter of companies say poor-quality data has led to inaccurate financial forecasts. Bad data is estimated to cost most companies 15 to 25 percent of their revenue, amounting to roughly $3.1 trillion each year taken off businesses' bottom lines.

Finally, constant datastrophes can lead to a lack of confidence in making business decisions based on data insights. In a recent study of 1,300 executives, 70 percent of respondents said they aren’t confident the data they use for analysis and forecasting is accurate.

So, what can be done to prevent datastrophes from occurring? Similar to how DevOps uses observability to monitor key business and system metrics, data-driven organizations need an automated and scalable method to observe and monitor their data usage and data applications.
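As a rough illustration of what that could look like in practice, the sketch below records a few basic metrics (row count, columns, null rates) at each pipeline step. The step names and the metrics sink are hypothetical; a real setup would ship these records to an observability backend rather than print them.

```python
import json
import time
import pandas as pd

def observe(step: str, df: pd.DataFrame) -> None:
    """Emit a small metrics record describing the data at this pipeline step."""
    metrics = {
        "step": step,
        "timestamp": time.time(),
        "rows": len(df),
        "columns": sorted(df.columns),
        "null_rates": {c: round(float(df[c].isna().mean()), 4) for c in df.columns},
    }
    print(json.dumps(metrics))  # stand-in for sending to a metrics store

# Usage inside a pipeline (hypothetical step functions):
# raw = load_orders();       observe("ingest", raw)
# clean = transform(raw);    observe("transform", clean)
# report = aggregate(clean); observe("aggregate", report)
```

With a history of such records, a sudden change in row counts or null rates at a specific step gives you the context and starting point that was missing in the opening story.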

Now that we've covered datastrophes, let's see in the eBook how you can prevent them with Data Observability.