Resonating with readers: Our most viewed articles during the first half of 2022

Casey Doyle
Data Science at Microsoft
Aug 2, 2022

Data Science at Microsoft was established 2.5 years ago with a straightforward mission: To share Microsoft data science best practices, discuss the value of what we do as data scientists, and show the impact of our work on the overall Microsoft business. We believe that this form of social journalism serves as an accessible engine — to both us and our stakeholders — that enables us to tell the story of Microsoft data science. We purposefully decided to make this online publication available to the public because we believe that we can make a contribution to and advance the dialog of the larger data science community in positive ways.

Photo by Christina @ wocintechchat.com on Unsplash.

When we began, some were skeptical. They said we wouldn’t have enough to write about to sustain regular publication, or that there wouldn’t be enough interest among potential readers, or that we couldn’t write about how data science at Microsoft is done and still protect what is proprietary to Microsoft. More than 120 published articles later, with contributions from data science professionals throughout Microsoft and a steadily growing readership now exceeding 3,200 followers, I’m happy to say we have shown that there is indeed a desire to read what we have to share about our work in all of the data sciences.

DS@M, as we call it internally, is open to all professionals in the data sciences who work full-time at Microsoft at the time of article publication, and indeed they have heard our call: Our articles are written by data scientists, ML scientists, data engineers, and data science program managers, among others, from across the company. Although the focus of their work may vary from person to person, they all play an integral role in shaping the development and success of data science as a profession both inside Microsoft and outside of it across the wider data sciences community.

As we proceed into the second half of 2022, I’ve put together this summary to highlight the articles that have resonated the most with our readers, in terms of views, during the first half of the year. Each of these articles has achieved at least 1,000 views since being published. I’ve made no further adjustments to these metrics and done no monthly averaging, which means that the articles published earlier in the year have had more time to build their audiences than the ones published later. The vast majority of DS@M articles gain the bulk of their views in the first month, however, and so I do not believe that this presents an especially distorted view. We’ve had at least one article eclipse this threshold in each of the first six months of this year. What follows is a list of these seven articles, along with a brief synopsis and link to each one so that you can seek them out and read them if you have not had the opportunity to do so earlier.

My deep thanks to the authors of all our DS@M articles, the data science professionals who take time in their busy working and personal lives to commit to sharing what they know with our readership.

And now the highlights of the most viewed articles:

Comparing matrix factorization with transformers for MovieLens recommendations using PyTorch-accelerated by Chris Hughes, published January 4. Chris notes the importance of recommendation systems in preventing us from being overwhelmed by choice and in exposing us to content that would otherwise be difficult to discover. He demonstrates how transformers can be used to predict ratings from sequences of past behavior and compares this approach with the more widely known matrix factorization approach. Chris also notes that transformers are driving significant advancements in NLP, vision, and time series domains, so it’s about time they came to recommenders as well.

Autosuggestion services in web search by Tezan Sahu, published February 8. Tezan’s first article this year reviews the fundamentals of autosuggestion services in web search and delves into some key aspects of modern autosuggestion services, such as relevance and coverage. Tezan also discusses some techniques to improve the UX of these services, along with performance metrics to evaluate the effectiveness of these techniques.

Visual question answering with multimodal transformers by Tezan Sahu, published March 8. As Tezan explains in his second article this year, recent years have seen significant advancements not only in the respective domains of Natural Language Processing (NLP) and Computer Vision (CV) but also in tasks involving a combination of these modalities. Among these tasks, Visual Question Answering (VQA) has particularly drawn the interest of researchers. In this article, Tezan illustrates some of the basic concepts related to Visual Question Answering and the multimodal models used to perform such a task. He presents a detailed PyTorch implementation of VQA models using text and image transformers from the Hugging Face library, and also compares the performance of several models using different text and image transformers for this task.

Scalable time series forecasting and anomaly detection by Sourav Khemka, published March 15. As Sourav explains in his first article this year, time series forecasting and anomaly detection are important for many businesses in today’s data-intensive world. They find use across diverse industries including IT, manufacturing, retail, health care, banking, and finance, with applications in sales forecasting, inventory analysis, intrusion detection, fraud detection, and production system monitoring, among others. Sourav writes that his team has multiple use cases for time series forecasting and anomaly detection in the spaces of Azure finance and commerce, involving a very large number of time series. Forecasting and detecting anomalies in a univariate time series is a well-known problem, and many quite effective solutions exist. But as the number of time series increases, these solutions often do not scale well. Sourav provides a solution that scales up to a volume of 100,000 time series.

The role of a technical program manager in AI projects by Nik Sachdeva, published April 5. In this article, Nik relays a report from VentureBeat that 87 percent of data science projects fail and never move to production. He notes that Technical Program Managers (TPMs) can help change this statistic by helping data science and engineering teams build successful AI projects, and he walks through PM considerations for ML projects and a learning path for Technical PMs to increase their skill set for projects that have an ML component.

Estimating customer churn based on usage data by Haribabu Inuganti, published May 31. In this article, Hari explores how to identify churned and retained customers by looking at product usage patterns. He covers how to identify a churned customer based on usage data, discusses how to understand churn patterns using Kaplan–Meier curves and heatmaps, and shows how to use statistical testing to determine whether a churn problem exists.

Scalable time series forecasting by Moid Hassan and Sourav Khemka, published June 21. In this article, Moid and Sourav demonstrate the use of Temporal Fusion Transformers (TFT) as one way to solve the problem of scalable time series forecasting. This article builds on the one that Sourav wrote solo on scalable time series forecasting and anomaly detection, published March 15 and described above, which has also been viewed more than 1,000 times.

If you have any ideas for articles that you would like to see covered in Data Science at Microsoft, please include them in the Comments section below. We thank you for your readership!

Casey Doyle is on LinkedIn.

Casey Doyle
Data Science at Microsoft

Principal Data Scientist of a data storytelling program fostering thought leadership in information design and data visualization inside and outside Microsoft.