SQL vs. Machine Learning vs. Machine Learning Applied to SQL
The seed for this article was planted when Anant was struck by a headline on his Twitter feed: “You don’t need ML/AI. You need SQL.”
He had observed something similar while working through data and analytics requirements for Google Cloud’s Apigee team. The lesson was not that machine learning (ML) or artificial intelligence (AI) is unnecessary, but that good database queries can frequently accomplish the job, and that when AI is legitimately needed, its role is often to improve database design and operations, not to replace them.
The two of us got the chance to compile our thinking a bit more as Anant was preparing for a talk at VLDB 2018, a premier database conference. The slides of his talk are here.
In this post, we elaborate on some of our observations on the topic.
What We’ve Learned: ML and SQL
As a leading API management platform, Apigee processes hundreds of billions of API calls every year. These API calls generate data exhaust that can be a valuable source of insights for our customers. Because our customers’ APIs are often called by applications that their customers use, this data can drive insights not only into the performance of our customers’ APIs and how those APIs might be improved, but also into our customers’ customers’ behaviors.
SQL is an important part of this process; if one wants to know “what is the average number of API calls on Sunday following a new release of an API?”, SQL can find the answer. We call this BI-level analysis.
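As a rough illustration of such a BI-level query, the sketch below runs the Sunday question against an in-memory SQLite database from Python. The `daily_calls` and `releases` tables, their columns, and the function name are hypothetical stand-ins, not Apigee's actual schema, and "Sunday following a release" is assumed to mean the first Sunday strictly after the release day.

```python
import sqlite3

# Hypothetical schema: one row per day of traffic, one row per API release.
# date(x, '+1 day', 'weekday 0') is SQLite's way of saying
# "the first Sunday strictly after x".
SUNDAY_AFTER_RELEASE_SQL = """
SELECT AVG(c.calls)
FROM releases AS r
JOIN daily_calls AS c
  ON c.day = date(r.release_day, '+1 day', 'weekday 0')
"""


def avg_calls_on_sunday_after_release(daily_calls, release_days):
    """Average call count on the first Sunday after each release.

    daily_calls: iterable of (ISO date string, call count)
    release_days: iterable of ISO date strings
    Returns None when no release has a matching Sunday row.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE daily_calls (day TEXT PRIMARY KEY, calls INTEGER)")
        conn.execute("CREATE TABLE releases (release_day TEXT)")
        conn.executemany("INSERT INTO daily_calls VALUES (?, ?)", daily_calls)
        conn.executemany("INSERT INTO releases VALUES (?)", [(d,) for d in release_days])
        return conn.execute(SUNDAY_AFTER_RELEASE_SQL).fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # Made-up numbers; 2018-08-05, -12 and -19 are Sundays.
    calls = [("2018-08-05", 1000), ("2018-08-12", 700), ("2018-08-19", 1400)]
    print(avg_calls_on_sunday_after_release(calls, ["2018-08-01", "2018-08-15"]))
```

The whole analysis is one join and one aggregate, which is what makes it easy to read back and verify.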
But the above example is backward-facing. Many people believe that if one wants to predict things — “what will the API traffic be three months from now?”; “is this API call suggestive of bot activity?”; etc. — then applying ML/AI to the problem is the right approach. What we’ve found, however, is that this is frequently not the case! Many prediction problems can be solved by well-crafted SQL. Furthermore, SQL offers explainability that deep ML generally does not.
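To make the prediction claim concrete, here is a minimal sketch of a least-squares trend line computed entirely with SQL aggregates and then extrapolated to a future month. The `monthly_traffic` table, its columns, and the function name are hypothetical, and a straight-line trend is only one simple kind of forecast, assumed here for illustration.

```python
import sqlite3

# Ordinary least squares for calls = intercept + slope * month_index,
# expressed with plain SQL aggregates over a hypothetical table.
TREND_FORECAST_SQL = """
WITH sums AS (
    SELECT COUNT(*)                       AS n,
           SUM(month_index)               AS sx,
           SUM(calls)                     AS sy,
           SUM(month_index * calls)       AS sxy,
           SUM(month_index * month_index) AS sxx
    FROM monthly_traffic
),
fit AS (
    SELECT (n * sxy - sx * sy) * 1.0 / (n * sxx - sx * sx) AS slope, n, sx, sy
    FROM sums
)
SELECT slope,
       (sy - slope * sx) / n                         AS intercept,
       (sy - slope * sx) / n + slope * :target_month AS forecast
FROM fit
"""


def forecast_calls(history, target_month):
    """Fit a linear trend to (month_index, calls) pairs and extrapolate.

    Returns (slope, intercept, forecast) for the given target month index.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE monthly_traffic (month_index INTEGER, calls REAL)")
        conn.executemany("INSERT INTO monthly_traffic VALUES (?, ?)", history)
        return conn.execute(TREND_FORECAST_SQL, {"target_month": target_month}).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    # Six months of made-up history; forecast three months past the last one.
    history = [(m, 100 + 10 * m) for m in range(1, 7)]
    print(forecast_calls(history, target_month=9))
```

The fitted slope and intercept are ordinary numbers that can be read and sanity-checked, which is the explainability advantage the paragraph above refers to.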
SQL itself is only as good as the data it is being fed. Poor labels? The analysis is in trouble. Garbage data? Garbage output. It turns out that though ML may be less indispensable to predictions than popularly believed, it is a wonderful tool for improving the quality of the data being fed into SQL.
To be clear, we are not claiming that ML or AI is unnecessary for solving certain problems. In many cases, regression analysis (i.e., the limit of what SQL can support) can only take one so far. Recommender systems and linguistic analysis of support tickets are aspects of our work in which deeper ML/AI techniques make a difference. Our main point is to discourage reflexively turning to these newer, more hyped techniques without first exhausting what SQL can do.
The above points can be summarized by this pyramid that we find very useful:
SQL for BI-Level Analysis
One of the products we work on is designed to help users reduce the probability of bot attacks.
The product’s analysis and detection of these attacks are derived from a combination of SQL and deeper analysis based on anomaly detection, regression, and principal component analysis. We have realized that our customers want explainability, and that results that come out of SQL are easier to explain. It’s good to be able to say, “This is a bot because its traffic pattern was below a defined threshold for five consecutive time intervals during which there were several attacks on the login API.” It’s not good to say, “I don’t know why that is a bot, but the algorithm tells me that it is.”
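A rule like the quoted one can be written directly as a query. The sketch below covers only the "below a threshold for five consecutive intervals" part and leaves out the login-API condition. The `traffic` table, the `pattern_score` column, and the threshold value are hypothetical, and this is an illustration of the idea, not the product's actual detection logic.

```python
import sqlite3

# For every (client, starting interval), look at the window of run_len
# intervals that begins there. The client is flagged if the window is
# complete and every score in it is below the threshold.
SUSTAINED_LOW_SQL = """
SELECT DISTINCT a.client_id
FROM traffic AS a
JOIN traffic AS b
  ON b.client_id = a.client_id
 AND b.interval_idx BETWEEN a.interval_idx AND a.interval_idx + :run_len - 1
GROUP BY a.client_id, a.interval_idx
HAVING COUNT(*) = :run_len
   AND MAX(b.pattern_score) < :threshold
ORDER BY a.client_id
"""


def flag_sustained_low(rows, threshold, run_len=5):
    """Return client ids whose score stayed below threshold for run_len
    consecutive intervals.

    rows: iterable of (client_id, interval_idx, pattern_score), with
    interval_idx an integer that increases by 1 per time interval.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE traffic ("
            "client_id TEXT, interval_idx INTEGER, pattern_score REAL, "
            "PRIMARY KEY (client_id, interval_idx))"
        )
        conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", rows)
        params = {"run_len": run_len, "threshold": threshold}
        return [r[0] for r in conn.execute(SUSTAINED_LOW_SQL, params)]
    finally:
        conn.close()


if __name__ == "__main__":
    # Made-up scores: client "a" has five low intervals in a row, "b" only four.
    series = {
        "a": [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8],
        "b": [0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1],
    }
    rows = [(cid, i, s) for cid, scores in series.items() for i, s in enumerate(scores)]
    print(flag_sustained_low(rows, threshold=0.2))
```

Because the rule is a literal reading of the query, the explanation given to a customer and the logic that produced the flag are the same thing.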
Of course, many hidden patterns are not explainable, and they — and the ML technologies that discover them — have a role to play in bot detection. But there may always be some tradeoffs between explainability and fully trusting ML to do a job — and using SQL for many tasks can help businesses err on the side of explainability while maintaining strong protection.
Garbage In, Garbage Out
Regardless of whether SQL or ML/AI is optimal in a given situation, nothing of substance can come from either approach without good data.
A specific problem is labeling data. Building on the bot detection problem mentioned above, it is obvious that to combat bots, one needs labeled data. But there are billions of calls — and most of them are probably not bots. How do we bring order to this chaos?
Humans can be better than machines in these situations, but humans cannot look at billions of calls and label them. A good approach is to ask humans to label only the API calls that the algorithm is not sure of. In other words, we’ve found it useful to apply ML/AI to the problem of deciding which data to send to human labelers.
In the figure below, for example, green calls are obviously non-bots, red calls are identified bots (based on, say, earlier labeling), and yellow calls are ambiguous.
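The routing idea can be sketched in a few lines. This example assumes some model has already assigned each call a bot probability, and it uses that score to sort calls into the green, yellow, and red groups described above. The cutoffs of 0.2 and 0.8, the function name, and the call ids are hypothetical choices for illustration only.

```python
from typing import Dict, List


def triage_calls(
    bot_probability: Dict[str, float], low: float = 0.2, high: float = 0.8
) -> Dict[str, List[str]]:
    """Split calls into green (clearly not bots), red (clearly bots), and
    yellow (ambiguous, to be sent to human labelers).

    bot_probability maps a call id to a model score in [0, 1].
    The low/high cutoffs are assumed values, not tuned ones.
    """
    buckets: Dict[str, List[str]] = {"green": [], "yellow": [], "red": []}
    for call_id, p_bot in bot_probability.items():
        if p_bot <= low:
            buckets["green"].append(call_id)
        elif p_bot >= high:
            buckets["red"].append(call_id)
        else:
            buckets["yellow"].append(call_id)
    # Put the most uncertain calls (score closest to 0.5) at the front of
    # the human labeling queue.
    buckets["yellow"].sort(key=lambda cid: abs(bot_probability[cid] - 0.5))
    return buckets


if __name__ == "__main__":
    scores = {"c1": 0.05, "c2": 0.95, "c3": 0.45, "c4": 0.7, "c5": 0.15}
    print(triage_calls(scores))
```

Only the yellow bucket reaches people, so labeling effort goes where the model is least sure, and the new labels can feed back into the next round of scoring.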
ML/AI Applied to SQL Problems
On Google’s Apigee team, we use ML/AI extensively for our internal support tickets. We use ML/AI to learn which deployments of ours are likely to cause failures, for example, and we use it to spot complex patterns in bots that are not easily labeled. Our use of ML/AI is only increasing, but so is our use of both SQL and ML/AI applied to SQL problems.
Similarly, our customers leverage ML/AI to deliver better recommendation APIs, to build APIs that better serve developer and end user needs, and to manage their developer funnel.
Don’t Put All the Eggs in One Basket
SQL remains a powerful tool that, though perhaps not as hyped and celebrated in recent years as ML or AI, can help solve an array of business analytics challenges. ML, meanwhile, may be less crucial to predictions than some believe, but it remains one of the most exciting technologies out there: its potential future impact is profound, and enterprises can already leverage it to make a difference today. Both we on Google’s Apigee team and our customers use SQL, and we foresee ML/AI applied to SQL becoming much more prevalent.