Data Science in 2021
Where Data Science is today and where it’s going
When I first began to work as a data scientist seven years ago, the field was a different beast. It was the early days of Spark, and the term “big data” was still echoing around as people focused on how to find insights in large amounts of data in warehouses and silos. Natural language processing at scale was not widely used in business. Deep learning was on the ascent, and data scientist was said to be the “sexiest job of the 21st century.”
Things have changed massively since then. As we relaunch the IBM Data Science in Practice blog, let’s go through a few current standout trends in data science which we will cover in posts going forward. We’ll talk about the growing technical areas of data science and AI such as knowledge graphs, NLP, and data governance, while also looking at the human side of data science and practice in trustworthy and ethical AI and DEI in the data science field.
Knowledge Graphs
Coming fresh out of AKBC 2021, I am amazed at how much the use and application of knowledge graphs have grown in the last five years. While deep learning has been the primary paradigm in AI and machine learning over the last few years, the lack of contextual knowledge that has always been a key problem in knowledge-based technologies has also proven to be an issue for deep-learning-centric methods.
At KG 2021, most of the subject area talks were specifically on health care applications from companies such as Novartis and AstraZeneca, along with more general health care and medical knowledge graphs for research developed by Elsevier and UCSF. While such a wide range of applications indicates a mature technology in the health care space, the use of knowledge graphs is also increasing in the financial industry and in dataset management. One of the current issues for KGs remains how to populate them quickly and with reliable data, as discussed in this paper by IBM Research at ACL 2021.
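For readers new to the idea, a knowledge graph can be pictured as a set of (subject, predicate, object) triples that can be queried for context. The sketch below is only an illustration under simple assumptions: the triples are hand-written examples, and the helper names build_index, objects_of, and subjects_with are hypothetical, not part of any knowledge graph product or of the systems mentioned above.

```python
from collections import defaultdict

# Hand-written example facts, used for illustration only.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "is_a", "nsaid"),
    ("ibuprofen", "is_a", "nsaid"),
    ("ibuprofen", "treats", "inflammation"),
]


def build_index(triples):
    """Index objects by (subject, predicate) for fast forward lookups."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].add(obj)
    return index


def objects_of(index, subject, predicate):
    """Return all objects linked to a subject by a given predicate."""
    return sorted(index.get((subject, predicate), set()))


def subjects_with(triples, predicate, obj):
    """Return all subjects that point to an object via a given predicate."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


index = build_index(TRIPLES)
print(objects_of(index, "aspirin", "treats"))
print(subjects_with(TRIPLES, "is_a", "nsaid"))
```

Production knowledge graphs typically live in a dedicated graph store, but the shape of the data is the same, which is why populating the graph with reliable triples is the hard part.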
Trustworthy and Explainable AI
With the increasing use of machine learning in sensitive use cases such as personal finance and criminal sentencing, the impacts and disparate outcomes of models have come to the fore of data science and AI discussions. ACM’s FAccT (Fairness, Accountability, and Transparency) Conference was founded in 2018 to bring together researchers and practitioners in the field to discuss methods and means by which to reduce inequalities and improve ethical outcomes in machine learning development.
IBM has led much work in developing toolkits in ethical and explainable AI, such as AI Fairness 360 and AI Explainability 360. For those just starting in learning about ethical AI and data science, the Linux Foundation, working with Alka Roy, developed a course on Ethics in AI and Data Science, the University of Edinburgh has a course titled Data Ethics, AI, and Responsible Innovation, and data scientist Ayodele Odubela teaches an intro to machine learning course that integrates ethical considerations throughout the machine learning process.
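To make “disparate outcomes” concrete, one commonly used measure is the disparate impact ratio: the rate of favorable outcomes for one group divided by the rate for another. The sketch below is a minimal plain-Python illustration, not the API of AI Fairness 360 or any other toolkit; the function names and the loan decisions are made up for the example.

```python
def selection_rate(outcomes):
    """Fraction of favorable (1) outcomes in a list of 0/1 decisions."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def disparate_impact(unprivileged, privileged):
    """Ratio of favorable-outcome rates; 1.0 means both groups fare equally."""
    privileged_rate = selection_rate(privileged)
    if privileged_rate == 0:
        raise ValueError("privileged group has no favorable outcomes")
    return selection_rate(unprivileged) / privileged_rate


# Hypothetical loan decisions (1 = approved), invented for illustration.
group_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 3 of 10 approved
group_b = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 7 of 10 approved

ratio = disparate_impact(group_a, group_b)
print(f"Disparate impact ratio: {ratio:.2f}")
```

A ratio well below 1.0 suggests the first group receives favorable outcomes less often. A frequently cited rule of thumb treats values under 0.8 as a signal worth investigating, though the right threshold depends on the use case.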
Diversity, Equity, and Inclusion
When I first began working as a data scientist in 2014, I attended a meetup with a well-known leader in data science. I was one of only a few women attending, and I remember the horror I felt as this person in the field made sweeping generalizations about women’s behavior that made me get up and leave the room.
Fast forward to today, where diversity, equity, and inclusion are now a day-to-day part of the conversation within data science and other technical fields. With the founding of groups such as Women in Data Science (which began as a one-day conference at Stanford in 2015), Black in AI (co-founded by Timnit Gebru and Rediet Abebe in 2016), and Queer in AI, greater inclusivity is being fostered within the data science and AI community. While there are still many milestones to reach to achieve greater representational parity for traditionally marginalized groups, these groups and increasing corporate awareness (check out a job board for DEI opportunities) mark the first steps towards a more inclusive future.
Data Intelligence, Provenance, and Governance
Beginning with the passage of the EU’s General Data Protection Regulation (GDPR), governments globally have begun to pass more rigorous data protection laws, including the state of California in the US (California Consumer Privacy Act or CCPA) and, more recently, China (Personal Information Protection Law or PIPL). Outside of government regulation, consumers have grown increasingly wary of how companies collect, store, and use their data.
Additionally, as data has come under heavier scrutiny in light of issues with machine learning model output (see above with Trustworthy AI), data provenance and bias have also become major sources of concern for data scientists and ML practitioners.
To meet this demand, companies and researchers have increasingly responded with tools and frameworks to improve data intelligence and governance and to examine data provenance. Datasheets for Datasets, Data Statements for Natural Language Processing, and IBM’s AI FactSheets 360 give data scientists frameworks to document and examine their data before and during the model creation process. Software for data governance and intelligence, such as Watson Knowledge Catalog and Collibra, has gained increasing use and adoption in light of the stricter regulations that many companies must now follow.
NLP Beyond Sentiment Analysis
Sentiment analysis has long stood as a major use case for natural language processing, and a recent IBM Data Science Community Survey showed that it is still one of the top NLP applications community members are interested in. In the same survey, community members identified conversational AI as the language technology area with the most interest. As chatbots have become ubiquitous throughout many industries and smart home assistant actions or skills have multiplied, conversational AI has become a far more in-demand skill for a data scientist or ML specialist to have.
In the 2021 NLP Industry Survey released by Gradient Flow, the most common use cases for NLP identified by professionals in the industry were named entity recognition (NER) and document classification. NER, in particular, is indicative of a mature natural language technology practice at an organization, and is linked to the growth of entity linking and knowledge graphs in industry. Additionally, large language models and the tools and APIs built around them have grown in importance with the release of GPT-3 and Jurassic-1.
Both reports indicate that the most common industries outside of the technology sector for NLP applications are healthcare and finance, similar to the use of knowledge graphs.
Additionally, 53% of the respondents used at least one of the following Python libraries: Hugging Face, spaCy, Natural Language Toolkit (NLTK), Gensim, or Flair.
Where Data Science Is G(r)o(w)ing
When I started in data science, R was on the ascent, and you could get a data science role commanding $150,000 if you knew scikit-learn, Pandas, and NumPy, much as anyone with even a basic understanding of HTML could get a web development job in the late 1990s. While data science in 2014 included roles like data journalism and statistical data science, roles like data engineer were relatively new, and the terms “MLOps” and “DataOps” did not yet exist.
In the coming years, data science is going to be focused even less on code. Larger organizations will see increased specialization of data scientists within a pipeline, while smaller organizations will rely heavily on no-code platforms to lessen the work of a lone-hero data scientist. While finance and healthcare are currently the largest industries in many aspects of data science, rising climate change challenges will demand more automated solutions and long-range modeling to mitigate harsher climate conditions globally.
Is data scientist still the sexiest job of the 21st century? It is certainly becoming a more ubiquitous role, one that a growing number of organizations have. With the relaunch of our blog, we hope to create a space with highly curated content where everyone from aspiring data scientists to those with a mature career can learn new skills and find inspiration.
Interested in learning more about the IBM Data Science Community? Join here and join online discussions, get special invites to Data Science events, and read even more content on the community blog.