Faculty Spotlight | Arun Kumar on Scalable Analytics, Academia vs. Industry

Arun Kumar is an Assistant Professor of Computer Science & Engineering and Data Science (Halıcıoğlu Data Science Institute) at UC San Diego. He is currently teaching DSC 102: Systems for Scalable Analytics. His research interests are in data management and systems for machine learning. In his leisure time, he enjoys hiking, writing poetry, blogging, and watching movies.

Kumar’s interest in computers and data management began in high school, where he learned the basics of databases and programming languages and became fascinated with how we store and process information. Throughout his higher education, he grew an affinity for the systems perspective, that is, how we build algorithms and software to process and understand data. At the time, Machine Learning (ML) systems were just emerging, gaining traction through new conferences and workshops.

“There is a lot of investment by industry, by government in this field, so it’s a fast growing field,” Kumar said. “I am glad to see that growth and I’m glad I was at the head of that wave of growth for data systems for ML.”

From Kumar’s perspective, rapid changes in the data management field, coupled with investment from industry and government, have made it urgent for people to pay attention to scalable analytics, fueling its development, research, and applications.

Buzzwords have also played a role in the growing presence of scalable data systems for ML, particularly “big data” and “cloud computing.” Kumar explains that the “big” in “big data” signifies complexity and power just like in “big oil” or “big government.”

“The data is so huge and varied and might come in at a great rate, and it’s so vast and often messy,” Kumar said. “It occupies an organization’s mind-space enormously in terms of money, time, human effort, and so on.”

So no, “big data” does not solely refer to data that is big in size. The database community has been dealing with large data for decades; the emerging issue is how to handle heterogeneous, unstructured, or otherwise varied data, and how to run complex data-centric computations such as ML algorithms on such data.

Cloud computing is now in full flow as well, Kumar believes, as it promises many benefits in manageability and cost savings for application users dealing with big data. Amazon, a leading proponent and provider of cloud services, started off by realizing there was huge potential in letting people rent machines for computation, storage, and more through Infrastructure-as-a-Service (IaaS) and other offerings.

The Impact of Scalable Machine Learning

Kumar sees much potential in the future of scalable data systems for ML, with widespread applications in the commercial world, the healthcare field, and society at large.

“One: predictive power will go up; two: you can ask questions that are more fine-grained than ever before; three: you can start producing predictive applications at a much larger scale than before,” Kumar said.

Let’s break this down a bit more:

1. Predictive power

In ML, there’s a phenomenon known as the bias-variance trade-off, which is essentially a tension between accuracy on the training data seen and generalizability to unseen data for prediction. Deep learning largely sidesteps this tension: models can effectively memorize the entire training dataset (virtually no bias) while still keeping variance low enough to achieve high overall prediction accuracy, which is the ideal. Sustaining this requires larger datasets and more complex models, which scalable analytics makes possible.

Bias-variance tradeoff. Image from Learn OpenCV
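The bias side of this trade-off can be sketched numerically. The minimal example below uses NumPy polynomial fits as a stand-in for model complexity (not deep learning itself): a higher-degree polynomial achieves lower error on the training data it has seen, at the risk of fitting the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth underlying function.
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

def train_mse(degree):
    """Mean squared error of a degree-`degree` polynomial fit on the training data."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return np.mean((pred - y) ** 2)

# The more complex model fits the seen data better (lower bias),
# but may track noise and generalize worse to unseen data (higher variance).
print("degree 1:", train_mse(1))
print("degree 12:", train_mse(12))
```

Since the space of degree-12 polynomials contains all degree-1 polynomials, the training error of the complex model can only be lower; the question deep learning answers empirically is whether variance on unseen data stays low at the same time.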

2. Ask questions that are more fine-grained

Scalability makes it possible for us to collect fine-grained information and ask questions like never before. Collecting vast amounts of data allows for more detailed, precise information with powerful applications. Take bioinformatics for example, where gene sequencing provides extremely fine-grained data on an individual, thereby enabling precision medicine such as personalized recommendations in pharmacology.

3. Larger scale applications

IoT in agriculture. Image from IoT Design Pro

Systems for ML have a variety of applications and many real-world benefits. These extend to the agricultural industry: some farmers use systems of connected smart devices (the Internet of Things) to monitor the health of their farms, from animal tracking to pest control, increasing efficiency and enabling conveniences that were never possible before.

In short, scalable data systems for ML help all fields automate grunt work as much as possible, gain and retain customers, reduce costs, and increase efficiency. Kumar believes almost all companies will follow suit, asking whether applying ML is beneficial in a given setting and whether it can translate to saved costs and/or improved human productivity. This trend is not limited to the commercial sector: non-commercial fields such as the social sciences and digital humanities are also capitalizing on the benefits of scalable ML systems.

Because of the wide range of applications for scalable ML systems, Kumar sees a growing need for Data Scientists to understand business concerns and use analytical methods to optimize company metrics (e.g. customer churn prediction, saving power costs for data centers).

Academia vs. Industry, or Both

Kumar didn’t always know he wanted to become a professor. He had originally planned to pursue industry research, but in the last couple of years of his PhD, working closely with master’s students, senior undergraduates, and professors in nearby fields led him to realize how much he valued these relationships.

“Working as a faculty member, especially at a major research school like UCSD, you get to see really good students over a period of at least five years if they do their PhD, and it’s a very rewarding process that you just cannot replicate in industry, which is: they come in, they’re excited, they’re actively looking for new ideas, and they evolve into a researcher,” Kumar said. “They start proposing new things that you’ve never thought of yourself. That’s a process only academics get to see.”

Though he works in academia, Kumar’s current research focuses on building artifacts and prototype software relevant to industry. This has allowed him to obtain research gifts (unrestricted funding) from companies like Google and Oracle, awarded through highly selective processes. He enjoys researching issues rooted in principles and math, which lets him make discoveries not yet known in industry and impact end users.

For those choosing between pursuing a career in industry or academia, Kumar believes that the right choice ultimately boils down to the individual (check out his slides on Data Science career advice).

“In terms of interests, it comes down to how much risk are you willing to tolerate, how much grunt work do you want to put up with, how passionate you are in driving your own agenda,” Kumar said. “If you want to do something at the boundary of knowledge where there’s a lot of uncertainty — things might work, things might fail — research is a good fit.”

Luckily, Data Science research doesn’t have to mean publishing papers all the time. There are many opportunities for research in industry, but Kumar explains that research in academia and industry differs in a couple of aspects.

“The amount of freedom you get to pursue your research interests is vastly higher in academia. In industry, also in the last five to ten years, research has expanded, but it’s very product-centric and relevant to the company’s bottom line,” Kumar said. “In the AI area, there are some labs that are more boutique, like DeepMind and OpenAI, but most of the other research groups in industry — Google, Microsoft, Facebook research — are still driven by the company’s needs and products. But that’s not to say that’s bad. A lot of people enjoy working on stuff that’s immediately relevant. [They] can say ‘I did this research, it got shipped by this company, and now 1 billion people are using it’.”

OpenAI, ethical/safe AI research organization.

Kumar says that industry didn’t have this scale of impact 20 years ago. But now companies such as Google, Facebook, and Amazon provide this opportunity because of their sheer user base. However, these companies compete with one another, while academics can collaborate with industry arch-rivals at the same time. In Kumar’s case, he is currently collaborating with Oracle, Google, and Microsoft, something that would not be possible within industry. Universities also allow faculty to do stints in industry during summers or sabbaticals.

In this field, Kumar says, it is quite common for people to switch between academia and industry. But one thing to remember is that, generally speaking, getting back into academic research is very hard, and it’s hard to maintain a publishing record when you’ve been immersed in industry and products. On the flip side, going from academia to industry has an obvious upside in that your salary can double or triple, but with the trade-off of less academic freedom, less opportunity to mentor students, and more industry competition.

Synthesizing Cultures

In his research domain of databases and data management, Kumar enjoys a unique combination of exposure to various fields and approaches to problem-solving. He gets the union of four main cultures of intellectual inquiry, which are, as he explains: mathematical/formalist; real-world/engineering; natural science/experimental; and humanist/social science.

“In my case, it’s the combination of abstract reasoning and understanding of concepts and mathematical style ideas with concrete software artifacts that people can interact with, and this is a common philosophy in the database world,” Kumar said. “It’s cross-stack. It goes all the way from abstract math through the software systems through understanding the hardware to understanding the user.”

Just as he is able to integrate academia with real-world applications, Kumar’s research papers often tackle concepts from multiple perspectives of thinking. For instance, a paper of his might show how the same technological finding can reduce system cost as well as improve user productivity.

“This synthesis of research cultures is more possible in this field, and that’s what excites me about it,” Kumar said.
