“Without clean data, or clean enough data, your data science is worthless.”
Michael Stonebraker, adjunct professor, MIT
AI is one of the fastest-growing and most popular data-driven technologies in use. Nine in ten Fortune 1000 companies currently have ongoing investments in AI.
So you may be wondering: how could there possibly be another AI winter?
Sure, the hype around AI and the current investment in it are real. But there's also room for concern about AI's long-term growth trajectory. Of the 90% of companies that have invested in AI, fewer than 2 out of 5 report business gains from it in the past three years. A 2020 Big Data and AI Executive Survey showed a 39.7% decline in organizations that say they are accelerating their AI investments.
If returns don't improve soon, businesses look increasingly likely to pull back on their AI investments, and that could cause another AI winter.
A brief history of AI winters
Before diving deeper into the risks of another AI winter, it’s important to look briefly at what has caused previous AI winters. So far in AI’s history, there have been two distinct periods where excitement and investment in AI have ramped up rapidly—and then died. These two periods are known as the first and second AI winters.
These "winters," where investment in AI fell significantly, were caused by mathematical and technological limitations. These limitations have been overcome, and since the late 1990s, AI has been enjoying its third and longest hype cycle.
Given the current enthusiasm for AI and its proliferation in all types of applications, it may be hard to imagine a third AI winter could be on the horizon, especially when technical limitations don’t seem to be halting innovation. However, despite AI’s promise, it’s clear organizations are struggling to get enough return on their investment (ROI).
Since data is what fuels AI, it’s also what is most often the issue when AI applications and innovations fall short of expectations. The issue could be the amount of data required, but more often it’s data quality. As Veda Bawo, Director of data governance at Raymond James, has noted, “You can have all of the fancy tools, but if the data quality is not good, you're nowhere.”
The impact of data variation on AI innovation
Because AI depends upon data sets to feed its models, the reliability of the data can directly impact the success of AI models, and more broadly, the innovation and progression of AI. The issue isn't that these AI models need more data to be successful, but that they need high-quality data. Even then, a model built on high-quality data can fail simply because the world has changed too much for the model.
For instance, if a retailer's e-commerce audience suddenly shifts from teens to pregnant women because the store has started carrying several lines of maternity clothes, the AI model might no longer be able to make the right recommendations for site visitors, because it is essentially serving a new population. So, while the data itself might be correct, the real-world circumstances have changed.
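One way to notice this kind of population change is to compare the mix of visitors the model was trained on with the mix it sees now. The following is a minimal sketch, not a definitive method: the segment labels, the function names, and the 0.2 threshold are all assumptions chosen for illustration, and total variation distance is only one of several possible drift measures.

```python
from collections import Counter


def segment_shares(segments):
    """Return each segment's share of the observations as a dict."""
    counts = Counter(segments)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def total_variation_distance(reference, live):
    """Half the summed absolute difference between two share dicts.

    0.0 means the two mixes are identical, 1.0 means they do not overlap.
    """
    names = set(reference) | set(live)
    return 0.5 * sum(
        abs(reference.get(name, 0.0) - live.get(name, 0.0)) for name in names
    )


def has_drifted(training_segments, live_segments, threshold=0.2):
    """Flag drift when the audience mix moved further than the (assumed) threshold."""
    distance = total_variation_distance(
        segment_shares(training_segments), segment_shares(live_segments)
    )
    return distance > threshold


# Hypothetical audiences: mostly teens at training time, mostly expectant parents now.
training = ["teen"] * 80 + ["adult"] * 20
live = ["teen"] * 20 + ["expectant_parent"] * 70 + ["adult"] * 10

if has_drifted(training, live):
    print("Audience mix has shifted; the model may need retraining.")
```

A check like this says nothing about whether individual records are correct. It only signals that the population behind them has moved away from what the model learned.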
Still, data quality is always a pre-condition for AI. Garbage in will produce garbage out if we don’t pay attention to how the outputs are created by the application. A better approach is for the application itself to detect bad quality data (garbage in) to avoid producing a poor output (garbage out).
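As a rough illustration of an application that detects "garbage in" before it produces "garbage out," the sketch below checks each record against a few simple rules and refuses to score it when a rule fails. The field names, types, ranges, and the `model` callable are hypothetical; a real application would derive its rules from its own data.

```python
import math

# Hypothetical rules: field name -> (expected type, minimum, maximum).
RULES = {
    "age": (int, 0, 120),
    "basket_value": (float, 0.0, 100_000.0),
}


def find_problems(record):
    """Return a list of human-readable data quality problems in one record."""
    problems = []
    for field, (expected_type, low, high) in RULES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif isinstance(value, bool) or not isinstance(value, expected_type):
            # bool is a subclass of int in Python, so it is rejected explicitly.
            problems.append(f"{field}: expected {expected_type.__name__}")
        elif isinstance(value, float) and math.isnan(value):
            problems.append(f"{field}: not a number")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


def predict_if_clean(record, model):
    """Call the model only when the record passes every rule."""
    problems = find_problems(record)
    if problems:
        return None, problems
    return model(record), []


# Usage with a stand-in model (any callable that accepts a record).
prediction, problems = predict_if_clean({"age": -3}, model=lambda r: "recommendation")
print(prediction, problems)
```

Returning the list of problems alongside the missing prediction lets the caller decide whether to fall back to a default, log the record, or raise an alert.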
According to O'Reilly's The State of Data Quality in 2020 survey, over 60% of enterprises see their AI and machine learning projects fail due to too many data sources and inconsistent data. Even small-scale errors in training data can lead to large-scale errors in the output. Incomplete, inconsistent, or missing data can drastically reduce prediction accuracy.
Think of the impact of prediction degradation on something like self-driving cars. A poor prediction, such as misjudging the car's proximity to a person in a crosswalk, could lead to a fatal accident. Something similar has already happened. A Tesla Model S, reportedly being operated in Full Self-Driving mode, missed a curve in the road and hit a tree, killing two people. Uber's testing of self-driving technology has also had fatal consequences: in 2018, an Uber self-driving car killed a pedestrian crossing the road. These types of prediction-failure incidents have created considerable skepticism that self-driving cars will ever be safe. In Uber's case, the incident led the company to halt testing of the technology in Arizona, where the accident occurred.
If organizations don't implement better data control mechanisms—and at a scale that can keep pace with the ingestion of massive volumes of data—the risk of datastrophic failures within AI-driven applications will become too great for organizations to bear. They'll begin to believe there's too much at stake in deploying new products or services that rely on AI because they won't feel they can control the quality of the data feeding the AI algorithms. Investments will dwindle, and another AI winter will be upon us.
We are at a pivotal point in the AI hype cycle. If AI is to avoid the fate of earlier AI hype cycles, its success must continue to impress. By impress, we mean that AI must continue to be simultaneously creative (new and innovative) and performant (efficient and trustworthy).
So how can this be achieved?
Avoiding the next AI winter
Given the explosive growth of AI and the complexity of its models and applications, data issues can be nearly impossible to resolve at the root cause without visibility into the data and the data pipeline. But if issues aren't resolved adequately and keep recurring, they will ultimately erode confidence in the ability to create complex data-driven applications. This leads us back to our earlier concern: that this lack of confidence could hasten a third AI winter, as creative experimentation and innovation in AI decrease due to concerns about the ability to "control" or "resolve" a data failure.
The concept of monitoring key business processes has existed for quite some time, but there hasn't been a way to automate the testing and validation of data in the same way there is in the IT and DevOps environment. However, with the immense volumes of data now available and the inherent complexity of AI/ML applications, data teams now face a similar need to ensure the reliability of their data and data pipelines.
This is where data observability plays a key role. Data observability gives an organization the ability to measure the health and usage of the data within its systems, as well as health indicators for the overall system. By using automated logging and tracing information that lets an observer interpret the health of datasets and pipelines, data observability enables data engineers and data teams to identify, and be alerted to, data change issues across the overall system. This makes it much faster and easier to resolve data-induced issues at their root rather than just patching issues as they arise.
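To make the idea of measuring data health concrete, here is a small sketch that summarises each batch of data into a few metrics and flags any metric that departs sharply from its own history. The metric names, the three-standard-deviation rule, and the sample figures are assumptions for illustration only. Real data observability tools track many more signals, such as freshness, schema changes, and lineage.

```python
from statistics import mean, pstdev


def batch_metrics(rows, required_fields):
    """Summarise one batch: row count and the share of rows missing each field."""
    row_count = len(rows)
    metrics = {"row_count": float(row_count)}
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        # An empty batch is treated as fully missing.
        metrics[f"null_rate.{field}"] = missing / row_count if row_count else 1.0
    return metrics


def find_anomalies(history, current, max_z=3.0):
    """Return the names of metrics that deviate strongly from past batches.

    A metric is flagged when it lies more than max_z standard deviations from
    its historical mean, or differs at all from a perfectly constant history.
    """
    alerts = []
    for name, value in current.items():
        past = [metrics[name] for metrics in history if name in metrics]
        if len(past) < 2:
            continue  # not enough history to judge this metric
        mu, sigma = mean(past), pstdev(past)
        if sigma == 0:
            if value != mu:
                alerts.append(name)
        elif abs(value - mu) / sigma > max_z:
            alerts.append(name)
    return alerts


# Hypothetical history of daily batches, followed by a suspiciously small batch.
history = [{"row_count": 1000.0}, {"row_count": 1010.0}, {"row_count": 990.0}]
print(find_anomalies(history, {"row_count": 400.0}))
```

Because the thresholds come from each metric's own history, the same check can be applied to many datasets without hand-tuning a rule for each one.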
Having observability of your data and data pipelines is important across all data use cases, but particularly for AI/ML, where there is also a high level of complexity in the algorithms. If IT and data teams don't have visibility into data and data pipelines, the probability of data issues increases significantly. At the same time, the ability to resolve these issues in a timely manner decreases dramatically.
Data observability can act as the safety net and thus provide greater confidence in complex AI/ML models and applications. This will give data and analytics teams more confidence that they can experiment safely with data applications—even highly complex ones. This, in turn, will boost data creativity and innovation, keep the hype of AI alive, and help avoid another AI winter.