Data Science Trends to Rule 2022
2021 was an exciting year for data science: despite COVID-19 pandemic-related layoffs and budget reductions, the field continued to flourish. According to a survey commissioned by Anaconda, only 37% of companies decreased their investment in data science. For the vast majority of companies, data science emerged as a pre-eminent tool to survive, and even thrive in, the pandemic.
2022 promises to be no less interesting.
TinyML
Massive models trained on enormous datasets, like GPT-3 or DALL-E, might grab the headlines, but TinyML is on the rise. Simply put, TinyML is the long-awaited fusion of embedded systems with machine learning. The IoT paradigm has largely relied on raw data from edge devices, from smartwatches to electricity meters, being shuttled to large conventional servers that would then execute complex machine learning algorithms. However, over the last few years, the cost (and size) of processing power has rapidly decreased, while the cost of data transfer has remained largely the same. TinyML, which runs lightweight models directly on low-power edge hardware, is a natural answer to this shift.
Bigger is not always better when it comes to models. A low-power, low-latency model run on an edge device might be a better choice when data transfer is costly or difficult (e.g. due to a lack of cellular or wired networking in the area), a rapid response is desirable, and the model can be reduced to a relatively small size. A trail camera used by wildlife researchers to photograph a particular species does not need a state-of-the-art deep learning image recognition model onboard — but it does need to operate in austere settings for prolonged periods of time. Similarly, devices used in predictive maintenance and anomaly detection for assets such as oil pipelines or overland high-voltage networks often have to operate outside the boundaries of ubiquitous wireless connectivity. TinyML is a trend that evolved in response to these challenges.
At the heart of TinyML is a growing need for machine learning developers to understand their underlying hardware. Where resources are effectively unlimited, there is little need to budget computational power. However, when a model needs to run on a dime-sized microcontroller with 256 KB (yes, kilobytes — that’s not a typo!) of RAM, developers need to be closer to the metal and have a good understanding of the power and resource implications of just about every line of code.
Fortunately, the TensorFlow developers have picked up on the potential of TinyML and created TensorFlow Lite (and TensorFlow Lite for Microcontrollers), which can quantize models down to 8-bit integers, compressing and optimizing them for constrained hardware. As edge devices become ubiquitous, from “smart” kitchen appliances to live anomaly detection and monitoring of unmanned industrial facilities, the TinyML paradigm will gain ground. Data scientists might initially find the learning curve of hardware and low-level software daunting, but the increasing proliferation of TinyML frameworks is bound to give this domain a significant uplift in 2022.
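To make the idea concrete, here is a minimal, dependency-free sketch of the kind of affine 8-bit quantization such tools apply after training. The function names and example weights are illustrative, not the TensorFlow Lite API:

```python
# Illustrative post-training quantization: map float32 weights onto
# int8 values [-127, 127] with a single per-tensor scale factor.

def quantize_int8(weights):
    """Quantize a list of float weights to int8 with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.0, -0.07]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each weight now needs 1 byte instead of 4 -- a 4x size reduction --
# at the cost of a small per-weight rounding error bounded by scale/2.
```

The accuracy loss from this rounding is usually modest, which is why 8-bit quantization has become the default compression step for microcontroller deployment.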
AI as a Service (AIaaS)
In November 2021, OpenAI made waves by announcing that their Transformer language model, GPT-3, would be made available as an API to the general public. This is but the latest step in a growing trend of providing cutting-edge models as services. The future of AIaaS will be characterized by the composition of atomic AI services. A bank may use one service to build a chatbot to which customers can report fraudulent credit card charges, and a different service for anomaly detection to risk-score those reports. Meanwhile, a clinical language model might ingest a patient’s record and history, then use a conversational language engine from a different supplier to warn the patient about drug interactions. With a growing number of composable ‘domain expert’ AI models, users can create complex algorithms that merge domain-specific best-of-breed tools.
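The bank example above can be sketched as a simple composition, where the output of one service becomes the input of the next. Both functions here are hypothetical stand-ins; in production, each would be an authenticated HTTP call to a different vendor's API:

```python
# Hedged sketch of composing 'atomic' AI services. The two functions
# below are local stand-ins for hypothetical third-party AI APIs.

def extract_report(message: str) -> dict:
    """Stand-in for a conversational-AI service that parses a customer's
    free-text fraud report into structured fields."""
    amount = next((float(tok.lstrip("$")) for tok in message.split()
                   if tok.startswith("$")), 0.0)
    return {"type": "fraud_report", "amount": amount}

def risk_score(report: dict) -> float:
    """Stand-in for a separate anomaly-detection service that scores
    how urgently the structured report should be triaged."""
    return min(1.0, report["amount"] / 5000.0)

def handle_message(message: str) -> float:
    # Composition: one vendor's output feeds another vendor's input.
    return risk_score(extract_report(message))

score = handle_message("Someone charged $2500 to my card last night")
```

The value of the composable model is precisely this loose coupling: either service can be swapped for a better vendor without rewriting the pipeline.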
AIaaS is not without its challenges: in particular, enterprises need to carefully vet their prospective service providers for reliability and security. Data privacy concerns, too, may factor into managing the risk of reliance on a third party. An enterprise that provides a solution may or may not be legally liable if an API provider is breached, but the potential for reputational harm is significant. Enterprises that consider using AIaaS APIs in a customer-facing role must be ‘picky eaters’, doing assiduous due diligence on prospective API providers, obtaining ironclad SLA warranties and ensuring they are sufficiently indemnified.
Regulated industries (e.g. banking, healthcare) must meet high burdens of compliance, which might obviate the benefits of AIaaS. For enterprises that can discharge their compliance obligations and manage the risk of relying on third parties for potentially significant customer-facing AI products, however, AIaaS is an excellent way of rapidly building AI-driven solutions without the upfront expenditure traditionally associated with an in-house AI team. The future looks bright for AIaaS in 2022, and we are likely to see household names and legacy enterprises leverage AI through AIaaS solutions.
AutoML
Machine learning is complex, with a steep learning curve and an often resource-intensive business model — and that is unlikely to change in 2022. However, automated machine learning (AutoML) might present a solution to both of these problems. By enforcing ‘blueprints’, best practices can be baked into the analytics pipeline from the beginning, preventing users from straying into the numerous pitfalls of machine learning. In addition, automation may reduce both cost-to-solution and time-to-solution by reducing the need for specially trained and expensive machine learning specialists.
The name is slightly misleading — mature AutoML solutions manage not only the machine learning part but also data preparation and preprocessing, as well as model selection and hyperparameter tuning. The leading AutoML solutions are even capable of running automated model diagnostics, advising the user as to the suitability of the final product.
AutoML may also be an attractive value proposition for experienced users and data scientists, who may wish to reuse a predefined analytical pathway for reproducibility or encapsulate individual subtasks to create composable arrays of tasks. In addition, AutoML can provide useful safeguards for the soundness of analytical outputs, such as identifying information leakage from training data that is not row-wise independent and identically distributed. Automated feature selection and hyperparameter search are also going to be welcome time-savers for data scientists and non-experts alike.
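At its core, the search loop AutoML automates can be shown in miniature. The toy model family and data below are purely illustrative; real AutoML products (and libraries such as scikit-learn's GridSearchCV) add preprocessing, cross-validation and diagnostics on top of this skeleton:

```python
# A toy sketch of what an AutoML loop automates: enumerate candidate
# hyperparameters, evaluate each candidate on a validation set, and
# return the best-performing model.

def accuracy(predict, data):
    """Fraction of (x, y) pairs the model predicts correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)

def make_threshold_model(threshold):
    # Deliberately simple model family: predict class 1 if x > threshold.
    return lambda x: int(x > threshold)

def auto_select(valid, thresholds):
    """Automated 'model selection + hyperparameter tuning' in miniature."""
    best = max(thresholds, key=lambda t: accuracy(make_threshold_model(t), valid))
    return make_threshold_model(best), best

valid = [(0.1, 0), (0.2, 0), (0.7, 1), (0.9, 1)]
model, best_t = auto_select(valid, thresholds=[0.0, 0.5, 1.0])
```

Everything a human would otherwise do by trial and error — pick a candidate, score it, keep the winner — is folded into `auto_select`, which is the essence of the AutoML value proposition.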
While AutoML tools have been around for years, 2022 is likely to put new emphasis on such solutions to alleviate the chronic shortage of data scientists. A recent report by McKinsey highlighted that money spent pursuing scarce data science talent might be better invested in recruiting and training competent, proficient users of AutoML. Similarly, AutoML competence is easier to acquire than the full panoply of a machine learning specialist’s skillset, allowing subject matter experts in operational areas to cross-train as AutoML users. Will 2022 be the year AutoML takes off? I certainly think so, and judging by the trend of major cloud providers as well as independent vendors offering AutoML solutions, I am not alone.
Self-service and augmented analytics
Self-service analytics is hardly new, but its meaning is rapidly changing. From a service-oriented notion, which focused on making analytics self-service ready, leading enterprises have transitioned towards a capabilities approach: self-service is not just about providing tools like BI platforms designed for end users, but also expecting managers to become data-driven decision-makers who leverage these tools in their day-to-day duties.
Augmented analytics has emerged to bridge the service-oriented perspective with the capabilities-oriented demands. By automating the generation of insights, augmented analytics products help decision-makers navigate the data firehose and get to results faster — while also reducing the workload of expensive specialists like data scientists, who will be able to focus on higher value-added activities.
Augmented analytics often faces initial enthusiasm but also deep-seated uneasiness from prospective users. Behind this are equal parts pride and prudence: asking seasoned managers to rely on an algorithm to curate information for them is a tall order. Even executives who do not delude themselves into believing they can navigate the vast sea of enterprise data better than an algorithm feel apprehensive at the prospect of turning over data curation to artificial intelligence.
Successful augmented analytics solutions act like a fighter jet’s heads-up display (HUD): they allow situational awareness while preventing information overload, prioritizing data points by impact and saliency — all the while taking care not to obscure the totality of information (“non-limiting curation”). Augmented analytics must also take into account the various ways in which we prefer to ingest information — natural language generation (NLG), for instance, is a valuable tool for conveying information to verbal learners, while visual learners will prefer sparklines and other visualizations. As interest in augmented analytics continues to rise, a growing number of solutions are incorporating a sound understanding of sensory psychology and best practices for communicating complex information. 2022 might well be the year augmented analytics becomes a household term among leading enterprise customers.
Data marketplaces and exchanges
If data is the new oil, data marketplaces are the new commodity exchanges. Just a few years ago, companies used to own, and jealously guard, their data. With the rise of simple, convenient platforms to share and monetize large data sets, such as Snowflake’s Data Marketplace, any company can refashion itself into a data provider. The traditional model, whereby an intermediary buys data from originators and transforms it into an analytics-ready format, is on the wane. This is bad news for intermediaries, who will have to fight hard to demonstrate their added value, but good news for everyone else: service providers who hold data on customer behavior will be able to create a lucrative side business selling analytical outputs. What you listen to, what you look at before you buy, where you order your dinner from and what weather phenomena get you to take a cab are now valuable commodities in the hands of service providers.
To succeed, they must balance their economic interests with privacy and legal rules on customer confidentiality, lest their new side business destroy the reputation of their overall business. Equally, they must become responsive to a new class of clients and expend some effort up-front to crunch their data into a digestible, analytics-ready format. Data exchanges, such as Snowflake’s Data Marketplace, can be an invaluable tool in the process.
Companies that wish to benefit from the new data marketplaces need a solid data monetization strategy. This must deal with the entire process, start to finish: from the legal ramifications and privacy obligations to building internal CI/CD platforms for data that automate transformations into an analytics-ready format without committing specialist resources. Once adequately developed, however, a data monetization strategy can continue to reap benefits with minimal additional effort. For enterprises that generate large (or highly unique) datasets as part of their business, exploring data sharing might well be worthwhile.
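One stage of such a 'CI/CD for data' pipeline might look like the sketch below: raw customer events are de-identified (a privacy obligation) and aggregated into an analytics-ready summary a buyer could consume. The field names are hypothetical, and hashing alone is a simplified stand-in for real de-identification techniques:

```python
# Hedged sketch of an automated transformation step in a data
# monetization pipeline: anonymize, then aggregate to analytics-ready.
import hashlib
from collections import defaultdict

def anonymize(event):
    # Replace the direct identifier with a one-way hash before sharing.
    # (Real pipelines need stronger de-identification than this.)
    out = dict(event)
    out["customer_id"] = hashlib.sha256(
        event["customer_id"].encode()).hexdigest()[:12]
    return out

def to_analytics_ready(events):
    """Aggregate anonymized events into per-category order counts."""
    counts = defaultdict(int)
    for e in map(anonymize, events):
        counts[e["category"]] += 1
    return dict(counts)

raw = [
    {"customer_id": "alice", "category": "dinner_delivery"},
    {"customer_id": "bob",   "category": "dinner_delivery"},
    {"customer_id": "alice", "category": "cab_ride"},
]
summary = to_analytics_ready(raw)
```

Because the transformation is code, it can be versioned, tested and run on every new batch automatically — which is what lets the side business run without committing specialist resources.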
2022 is shaping up to be an eventful year in data science, machine learning and artificial intelligence. The pandemic has refocused attention on advanced analytics as a powerful tool to manage uncertainty and respond to rapidly shifting strategic situations. The rise of AIaaS will contribute to the ubiquity of AI-driven solutions, which will now become feasible for a much wider range of markets. Similarly, AutoML and augmented analytics are gradually turning data science from a job description into a job skill — one that will be indispensable for managers seeking advancement. And 2022 might be the year the oft-cited adage about data being the New Oil comes closer to reality, with the first ‘commodity exchanges’ for data.