In our last article, we introduced the topic of SLAs (Service Level Agreements) and how they are necessary within organizations to help both consumers and producers agree on expectations around data usage and quality. Not only do SLAs provide visibility into what needs to be achieved to ensure data reliability and avoid surprises, but SLAs also create communication flows between consumers and producers that help ensure an alignment on expectations.
While SLAs define the producer's overall promise to a consumer, SLOs (Service Level Objectives) provide a critical buffer to ensure that a producer can meet the SLA expectations for a data project. With SLOs in place, there is greater assurance that internal alarms will be raised early enough to resolve data issues before they impact the SLA.
For instance, if an SLA outlines specific agreements around data timeliness (e.g., data must be refreshed every 24 hours), a stricter SLO could be defined (e.g., data must be refreshed every 12 hours) to create a buffer that allows the producer to manage an incident before it impacts the consumer.
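To make the buffer concrete, here is a minimal sketch of a freshness check that uses the 24-hour and 12-hour thresholds from the example. The function and constant names are hypothetical, and the three status labels are an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the example above; the names are hypothetical.
SLA_MAX_AGE = timedelta(hours=24)  # promise made to the consumer
SLO_MAX_AGE = timedelta(hours=12)  # stricter internal objective


def freshness_status(last_refresh: datetime, now: datetime) -> str:
    """Classify the age of the data against the SLO and the SLA."""
    age = now - last_refresh
    if age > SLA_MAX_AGE:
        return "sla_breached"
    if age > SLO_MAX_AGE:
        return "slo_breached"  # internal alarm; the SLA is still met
    return "ok"


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(freshness_status(now - timedelta(hours=15), now))
```

A refresh that is 15 hours old trips the SLO but not the SLA, which leaves the producer roughly nine hours to fix the pipeline before the consumer is affected.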
Beyond simply keeping the producer safe from SLA violations, there are several other important ways SLOs help to further ensure and improve data reliability. Let’s take a look.
Manage corner cases
Corner cases are instances when a problem or situation occurs only outside of normal operating parameters. During the design phase of a data project, such as when designing the business logic, a data engineer may run across corner cases.
For example, they may see that the ages received from a source fall between 25 and 75, but that there are also a few zero values. While there was no specific communication around the age parameters, the engineer realizes that zero values cannot be allowed because a downstream calculation divides by the age. Thus, the engineer decides to remove the rows where the age is zero. By implementing an SLO to ensure that the share of removed rows remains low, within an initial empirical threshold of 3%, for example, the engineer ensures that they don’t delete too many rows, which might skew the accuracy of the output data. Hence, the SLO ensures that they will still meet the specific expectations for accuracy outlined in the SLA.
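The age example can be sketched as follows. This is only an illustration: the row layout, the function name, and the sample values are assumptions, and rows with a missing age are treated the same way as rows with a zero age.

```python
ZERO_AGE_SLO = 0.03  # initial empirical threshold from the example (3%)


def drop_zero_ages(rows):
    """Remove rows whose age is zero or missing and report the removed share."""
    kept = [row for row in rows if row.get("age")]
    removed_share = (len(rows) - len(kept)) / len(rows) if rows else 0.0
    return kept, removed_share


# Hypothetical sample: ten rows, one of them with a zero age.
rows = [{"age": a} for a in [25, 40, 0, 75, 33, 51, 62, 48, 29, 70]]
kept, removed_share = drop_zero_ages(rows)
slo_met = removed_share <= ZERO_AGE_SLO

if not slo_met:
    print(f"SLO breached: {removed_share:.0%} of rows removed")
```

In this sample one row in ten is removed, so the objective is missed and the engineer is alerted before the cleaning step quietly distorts the output.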
Another example might be when an engineer is joining customer data with contract data. The engineer can assume the data has good integrity, but that is only an assumption. To validate it, an SLO can be defined to check the completeness of the joined data.
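One possible way to express such a completeness check is the share of customers that have at least one matching contract. The sketch below assumes hypothetical field names (`id`, `customer_id`) and a hypothetical 95% objective; a real project would choose its own definition of completeness.

```python
COMPLETENESS_SLO = 0.95  # hypothetical objective


def join_match_rate(customers, contracts):
    """Share of customers that have at least one matching contract."""
    if not customers:
        return 1.0
    contract_customer_ids = {contract["customer_id"] for contract in contracts}
    matched = sum(1 for customer in customers if customer["id"] in contract_customer_ids)
    return matched / len(customers)


customers = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
contracts = [
    {"customer_id": 1},
    {"customer_id": 2},
    {"customer_id": 2},
    {"customer_id": 4},
]

rate = join_match_rate(customers, contracts)
slo_met = rate >= COMPLETENESS_SLO
```

Here customer 3 has no contract, so the match rate falls below the objective and the integrity assumption is shown to be wrong for this data.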
Because corner cases don’t happen often, it’s easy to assume that they won’t happen at all. Thus, the initial inclination may be to not worry about them. However, an SLO provides an easy way to manage corner cases and put safety alerts in place without changing the SLAs.
Aggregate several SLAs
A producer, such as a data team, is likely to support several consumers simultaneously. Thus, it’s likely that they will have multiple sets of SLAs. Some of these SLAs will have overlapping dimensions (e.g., freshness, completeness) and the same thresholds (weekly, 95%). Some SLAs will overlap on dimensions but with different thresholds (weekly vs. monthly, 95% vs. 99%). And, of course, some dimensions of the various SLAs won’t overlap at all.
SLOs can be useful in aggregating these common SLAs. In the case where all SLAs are using the same data, SLOs can be used to define the highest level of constraints for data reliability across all SLAs. You can then define a global SLA based on the most constraining key performance indicators (KPIs) for each dimension and threshold in the SLO. In this way, the SLO provides another kind of buffer (for all SLAs but the most constraining) to help producers meet SLAs and gives the additional benefit of better visibility across all SLAs.
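As a rough sketch of this aggregation, the snippet below keeps the most constraining threshold per dimension across several SLAs on the same data. The consumer names, dimensions, and values are all hypothetical; completeness is modeled as a minimum ratio (higher is stricter) and freshness as a maximum age in days (lower is stricter).

```python
# Hypothetical SLAs agreed with three consumers of the same data.
slas = {
    "marketing": {"completeness": 0.95, "freshness_days": 7},
    "finance": {"completeness": 0.99, "freshness_days": 30},
    "operations": {"freshness_days": 7},
}

# How to pick the most constraining value for each dimension.
STRICTEST = {"completeness": max, "freshness_days": min}


def strictest_objectives(slas):
    """Merge several SLAs into one set of objectives, keeping the strictest threshold."""
    merged = {}
    for terms in slas.values():
        for dimension, value in terms.items():
            pick = STRICTEST[dimension]
            merged[dimension] = pick(merged[dimension], value) if dimension in merged else value
    return merged


objectives = strictest_objectives(slas)
print(objectives)
```

With these sample values the merged objectives are 99% completeness and weekly freshness, so every SLA except the most constraining one on each dimension gains a buffer.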
At times, you may have the same dimension (completeness) but may need two separate SLOs due to a significant threshold constraint of one use case. For instance, if one use case requires an exceptionally high threshold (99.9% completeness vs. 95% for all other use cases), it may not make sense to conform to the highest threshold across all use cases. Instead, you will want to create two separate SLOs. Doing so will help develop an understanding of priority. If you are meeting the 95% completeness rate but not the 99.9% rate, you will be alerted, but you will also know you are meeting most of your clients' SLAs, which allows you to better prioritize how quickly you need to resolve the issue for the SLO that requires 99.9% completeness.
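The two-SLO setup can be sketched as a pair of named tiers checked against the same measurement. The tier names are hypothetical; the thresholds are the 95% and 99.9% figures from the example.

```python
# Two separate completeness SLOs; the tier names are hypothetical.
SLO_TIERS = [("standard", 0.95), ("strict", 0.999)]


def breached_tiers(completeness):
    """Return the names of the SLO tiers that the measured completeness misses."""
    return [name for name, threshold in SLO_TIERS if completeness < threshold]


print(breached_tiers(0.97))  # only the strict tier is missed
print(breached_tiers(0.90))  # both tiers are missed
```

Knowing which tier is breached tells the team whether one demanding use case or most consumers are at risk, which is what drives the priority of the fix.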
Improve the definition of SLAs
IT and DevOps teams have been around for some time, and SLA key performance indicators (KPIs), such as 99.9% uptime or 500 requests per second, are fairly well established and understood by all key stakeholders. Perhaps in five or ten years, DataOps will be established enough that consumers will also know how to define SLAs and what KPIs matter, but we’re not there yet.
The world of DataOps is new, and consumers may have little awareness or knowledge of what should be in an SLA. This has become one of the biggest sticking points in developing more trust with consumers.
By making SLOs visible to the consumer, producers can help them see the types of objectives the producers are holding themselves to internally. For instance, a consumer may not know until they see it in an SLO that measuring the standard deviation of column H could be important. This metric then becomes something that the consumer will want to be notified about and include in an SLA.
By making at least some SLOs visible to consumers, producers can build better relationships with consumers and further strengthen trust and creativity.
Strengthen trust with data consumers
Producers often prefer SLOs to SLAs, but SLOs can create just as many issues if their terms are vague, too complicated, or near-impossible to measure. It’s essential to create SLOs that provide simplicity and clarity. To do this, SLOs must focus only on critical metrics, and their objectives must be stated in plain language.
One of the most significant benefits of having SLOs is that they allow the producer to be proactive and show that they pay attention to the quality of the data delivered. By showing this willingness to have an open dialogue with the consumer and to help them learn more about the data, the producer builds greater trust and rapport.
Stay tuned for our last article in this series, where we’ll discuss SLIs (service level indicators) and how they further support data reliability. In the meantime, if you have any questions regarding data quality management and data observability, feel free to contact our team.