In this article, we discuss how you can avoid data pipeline breakdowns by making your pipelines fully observable with dbt complemented by Kensu.
Data quality problems tend to manifest in many ways. Here is an example.
Imagine the following situation. You have an important meeting with the planning department within the hour. As you reread the latest quarterly report, you realize that the figures reported don’t make sense. New categories of products appear in the `type` column, and the report seems to fail to take them into account when calculating the turnover. Until recently, that report had always been manually created and checked. But the latest one comes from an automated report generator. So, somewhere in the pipeline between the database and the report generator, the data is either corrupted or not processed correctly. Unfortunately, an hour is not enough to find the root causes of the problem, address them and then rerun the report generator!
Observability to the rescue
That problem could have been avoided if data observability had been implemented throughout the pipeline. Alerts would have been triggered, warning the pipeline stakeholders (data engineers, for instance) that an issue had been detected.
And even if a problem were to slip through the net, Kensu’s observability platform would allow data analysts and engineers to perform a systematic and effective root cause analysis. In such a context, detecting and solving data issues is straightforward.
Now, let’s dig deeper into how to reach full observability by complementing the use of dbt with Kensu.
dbt stands for “data build tool” and performs the T in ELT (Extract, Load, Transform) processes. It is a toolset that simplifies the life of analytics engineers: they transform data in their warehouses with plain SQL SELECT statements, which dbt then turns into tables and views. Note that dbt neither extracts nor loads data; it transforms data that is already available in a warehouse. dbt also lets analysts test the data being transformed, which helps bring software engineering standards to a data pipeline.
The underlying data in SQL databases changes frequently, which is one of the reasons these databases can be challenging to test. dbt provides a way to improve the integrity of the SQL in each model by making assertions about the results the model generates. Out of the box, dbt enables analysts to test whether a specified column in a model only contains non-null values, unique values, values that have a corresponding value in another model, or values from a specified list.
Further details about testing with dbt can be found here.
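To make the idea of a column assertion concrete, here is a minimal sketch in Python, using the standard-library `sqlite3` module as a stand-in for a warehouse. It is not dbt code: the tables, columns, and check names are assumptions chosen to echo the quarterly report scenario. It only illustrates the general pattern in which a test is a SELECT that returns the offending rows, so an empty result means the assertion holds.

```python
import sqlite3

# In-memory database standing in for a warehouse (hypothetical tables and values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (type TEXT);
    INSERT INTO categories VALUES ('book'), ('toy');
    CREATE TABLE sales (id INTEGER, type TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'book', 10.0),
        (2, 'toy', 5.0),
        (2, 'toy', 7.5),       -- duplicate id
        (3, 'gadget', NULL);   -- unknown type, missing amount
""")

# Each check is a SELECT returning the ids of offending rows; no rows means "pass".
# The check names mirror the four kinds of column assertion, but are only labels here.
CHECKS = {
    "not_null(amount)": "SELECT id FROM sales WHERE amount IS NULL",
    "unique(id)": "SELECT id FROM sales GROUP BY id HAVING COUNT(*) > 1",
    "accepted_values(type)": "SELECT id FROM sales WHERE type NOT IN ('book', 'toy')",
    "relationships(type)": (
        "SELECT s.id FROM sales s "
        "LEFT JOIN categories c ON s.type = c.type "
        "WHERE c.type IS NULL"
    ),
}


def run_checks(connection):
    """Run every check and return {check name: list of failing ids}."""
    return {
        name: [row[0] for row in connection.execute(sql)]
        for name, sql in CHECKS.items()
    }


results = run_checks(conn)
for name, failing in results.items():
    print(name, "PASS" if not failing else f"FAIL {failing}")
```

In this toy data set, the new `gadget` category is caught by both the accepted-values and the relationship checks, which is the kind of signal that would have flagged the report problem described earlier.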
dbt natively supports connections to these data warehouses:
Note that there are also several community-supported adapters for other types of data warehouses.
A detailed description of dbt can be found here and in the dbt documentation.
Our solution, Kensu, enhances the data lifecycle of dbt with its monitoring capabilities.
The integration of Kensu and dbt allows you to:
Kensu complements the test capabilities of dbt thanks to monitoring and continuous validation enforced by internally defined rules. Moreover, our platform enables you to run custom tests in addition to those performed by dbt.
In short, complementing dbt data pipelines with Kensu makes these pipelines observable.
If a data issue is detected automatically by Kensu, or surfaces when the `dbt test` command is reused in production, our solution provides the data observability needed to investigate it. It greatly facilitates root cause analysis thanks to the three DODD principles: contextual observability, synchronized observability, and continuous validation.
Let’s see what lies behind these concepts.
A Kensu data observability agent is a small piece of Python code included in the dbt job, i.e., a set of commands written in Python to be run on a schedule. Our observability agents enable:
The observability information available on our platform can easily be used by data engineers or by any stakeholders to:
Any of these three actions can be undertaken to solve the problem we mentioned at the beginning of this article, i.e., find why the automatically generated quarterly report is corrupted.
The easily accessible observability insights allow any technical team member to understand at a glance:
Caption: The figure above shows a view of the “Project Application” tab on Kensu’s platform. See core text for details.
The following contextual information is displayed on Kensu’s observability platform:
Each time `dbt run`, the execution command of dbt, is invoked, the following observability information is sent to Kensu’s platform:
None of this requires:
In short, no additional action or intervention is needed with our approach, as Kensu’s agents are embedded within the dbt Python code and hence have the same access privileges as the code itself. Once integrated into the dbt transformation setup, the agents automatically collect metadata from the transformation during each execution.
Caption: The figure above is an example of data statistics that can be found on Kensu’s platform. The displayed metrics have been collected from the dbt job. They reveal data anomalies within the pipeline, which is especially helpful when a root cause analysis is needed. Conversely, metrics showing no anomalies give data consumers and engineers the confidence to trust the delivered information.
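As an illustration of how such statistics can expose an anomaly, the sketch below computes a few per-column metrics for two runs and flags the ones that moved sharply. It is a simplified assumption of what metric-based monitoring might look like, not a description of Kensu’s agents: the function names, the chosen metrics, and the 50% tolerance are all hypothetical.

```python
from statistics import mean


def column_stats(rows, column):
    """Compute simple metrics for one column of a list of row dicts."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    stats = {"nrows": len(values), "nulls": len(values) - len(present)}
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric:
        stats.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return stats


def drifted(previous, current, tolerance=0.5):
    """Return the names of metrics whose relative change exceeds `tolerance`."""
    flagged = []
    for name, old in previous.items():
        new = current.get(name)
        if new is None:
            continue
        base = abs(old) or 1  # avoid dividing by zero when the old value is 0
        if abs(new - old) / base > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical data from two consecutive runs of the same transformation.
last_run = [{"amount": 10.0}, {"amount": 12.0}, {"amount": 11.0}]
this_run = [{"amount": 10.0}, {"amount": None}, {"amount": None}, {"amount": 95.0}]

previous = column_stats(last_run, "amount")
current = column_stats(this_run, "amount")
anomalies = drifted(previous, current)
print(anomalies)
```

With this sample data, the null count, maximum, and mean are flagged while the row count and minimum are not. A real monitoring setup would typically compare against a longer history rather than a single previous run.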
Kensu’s agents extract the tests from the dbt projects and convert them into internal rules. Data teams can also monitor the tests in a production environment. If any test fails, Kensu will automatically create a ticket and dispatch an alert. That allows constant monitoring throughout the whole data lifecycle.
Caption: The figure above shows a rule that Kensu automatically created from the dbt run.
The integration of our solution into the dbt pipeline is based on the open-source kensu-py library and the publicly accessible dbt Python libraries.
All you have to do is add Kensu’s libraries, and you are done. These enhanced versions of the libraries will send the required observability information to Kensu’s database at run time. Our approach is unique as Kensu’s agents lie within the dbt Python code itself, as shown by the diagram below.
Caption: The figure above shows the integration of Kensu within the dbt pipeline itself. This approach is unique as Kensu’s agents are directly embedded in the dbt code. They allow the extraction of all the necessary information to make the dbt transformation pipeline completely observable.
The observability agents nested in the Kensu+dbt application will process the SQL queries to:
The Kensu+dbt integration makes it possible to investigate the root causes of any failing tests by using the end-to-end lineage and monitoring metrics that were collected.
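To show how lineage supports this kind of investigation, here is a minimal sketch that derives model-to-model lineage from dbt’s `ref()` macro and walks upstream from a failing model. It is an illustration only, not how Kensu’s agents process SQL: the model names and SQL are hypothetical, and a regular expression is a deliberately naive substitute for real SQL parsing.

```python
import re

# Hypothetical dbt models: model name -> SQL text using dbt's ref() macro.
MODELS = {
    "stg_sales": "SELECT id, type, amount FROM raw.sales",
    "stg_products": "SELECT type, label FROM raw.products",
    "sales_by_type": (
        "SELECT s.type, SUM(s.amount) AS turnover "
        "FROM {{ ref('stg_sales') }} s "
        "JOIN {{ ref('stg_products') }} p ON s.type = p.type "
        "GROUP BY s.type"
    ),
    "quarterly_report": "SELECT * FROM {{ ref('sales_by_type') }}",
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def direct_parents(sql):
    """Return the models referenced directly by a SQL statement."""
    return REF_PATTERN.findall(sql)


def upstream(model, models):
    """Return every upstream model, nearest first (breadth-first walk)."""
    seen = []
    queue = list(direct_parents(models[model]))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        queue.extend(direct_parents(models.get(current, "")))
    return seen


suspects = upstream("quarterly_report", MODELS)
print(suspects)
```

Starting from the failing report, the walk lists the candidate models to inspect in order of proximity, which is where the collected metrics for each of them would then be compared.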
This observability data is then available on Kensu’s platform, and the metadata can be shared across a company’s entire ecosystem through Kensu’s SDK. This integration allows data catalogs to stay up to date with every dbt data transformation.
In this article, we discussed how you can avoid data pipeline breakdowns by making the pipeline fully observable through the integration of our solution, Kensu, in dbt. Our unique approach consists of embedding Kensu’s observability agents directly in the dbt code. The agents extract all the information needed to render the dbt transformation pipeline fully observable. This facilitates root cause analysis thanks to the three DODD principles: contextual observability, synchronized observability, and continuous validation. No external tools, actions, or interventions are required to obtain total observability. Only Kensu’s integration within dbt is necessary.
If you have any questions or suggestions, or want to try the new dbt integration, please contact us or sign up for our Kensu Free Trial.