Source — © Marvel Studios — Scene from the movie Iron Man 3 where Jarvis the AI carries out analysis tasks while Tony Stark asks questions and connects the dots.

Data Analysis in the Age of AI

AI can supercharge data-driven decision-making if users cultivate a unique blend of data and AI skills.

Yusuf Saber

--

Recent breakthroughs in artificial intelligence can potentially transform the field of data analysis and, by extension, the entire data-driven decision-making process. However, the effective and safe use of AI in this context calls for a specific blend of conceptual and technical data skills and proficiency in interacting with AI systems.

If data analysts and business users adopt a positive, level-headed mindset and commit to acquiring and continually refining a unique set of data and AI skills, they can leverage AI to help their organizations extract more insights from data, faster and with higher quality. The result is quicker, better-informed decision-making and improved product quality and operational efficiency.

In this article, we will present how data analysis for decision-making may be reimagined to utilize AI effectively and safely to unlock outsized value in the short to medium term. Central to our discussion is how the roles of data analysts and business users will evolve and what the requisite skills will be. We will conclude with a speculative vision for the medium to long term. Let’s get to it!

Image by Author — An example of a user entering a question into an AI-powered data analysis tool and immediately getting some visualizations and a summary.

Recent demonstrations by leading tech companies have painted a picture in which business users simply type their questions into an AI-powered data analysis tool and immediately get back well-narrated, reliable, and actionable insights that perfectly address the needs of their decision-making.

Whether intentionally or not, those demonstrations have given rise to two false perceptions:

  • The first is the belief that AI-powered data analysis tools will soon render data analysts redundant. This perception has induced despair and defeatism among some aspiring and junior data analysts, who fear their role is becoming obsolete.
  • The second is the impression that anyone can effectively and safely use an AI-powered data analysis tool to extract the necessary insights from data without needing any special data analysis knowledge or skills.

These impressions are not just incorrect, they can severely limit the value an organization can obtain from AI-powered data analysis tools. The reality is: While AI-powered data analysis tools have the potential to bring massive value, trained human oversight and understanding will remain indispensable for formulating questions, managing analysis choices, interpreting results, and ensuring that AI systems are being used ethically and responsibly. It is crucial, therefore, to adopt a pragmatic vision for how to leverage AI in the data-driven decision-making process. This begins with cultivating the right mindset.

1. Mindset

1.1 Positive

When faced with adversity, it’s those who refuse to mentally surrender, and instead choose to believe in themselves and their ability to find a way, who have the greatest chances of surmounting challenges. On the other hand, those who go into a situation with a defeated attitude make defeat all but the guaranteed outcome.

Source — “No man is defeated without until he has first been defeated within” — Franklin D. Roosevelt, US president during WWII

The threat of AI taking over one’s job is no different. We may not know exactly how the future will unfold, but we know that those who will thrive are those who do what it takes to identify and learn the new skills needed to create value in the age of AI. That starts with the choice not to be internally defeated and to believe in oneself. It’s sad but true that some will be rendered redundant over the next few years, not so much by AI as by their choice to give up on themselves.

1.2 Level Headed

We would be well advised to recognize our tendency to fall prey to hyped promises of a “free lunch”. We are all too eager to believe that a technology or product will effortlessly solve a complex problem without requiring effort, learning, or process change on our part. We find it hard to resist the allure of achieving results that are disproportionately greater than the investment made.

A case in point is Big Data (referring to the availability of vast volumes of data and the advent of technology to extract value from it). There is no denying that Big Data has been exceptionally valuable. Its impact is seen in many fields and applications: It supports advanced diagnostic tools in healthcare, powers real-time, traffic-aware navigation systems, enables personalized product recommendations in online retail, enhances fraud detection mechanisms in banking and finance, and enables today’s advanced AI systems.

However, early on, expectations around Big Data were so overblown that many anticipated it would revolutionize our approach to problem-solving to the point of rendering the scientific method itself obsolete. The then editor-in-chief of Wired famously declared in 2008 that the “scientific method is dead”, suggesting that the sheer volume of data could answer all our questions without the need for hypotheses, models, or experiments.

Organizations have since recognized, albeit after wasting considerable time and resources, that this vision for Big Data was indeed overblown. Large datasets, while offering vast possibilities, do not inherently lead to better or more accurate conclusions. They can even mislead if biases, errors, or poor-quality data are present. Far from making the scientific method obsolete, Big Data has underscored that data volume does not replace the need for careful analysis, understanding causative relationships, and ensuring data quality.

Source — Gartner Hype Cycle — Note how new technologies are often met with inflated expectations.

Likewise, when it comes to utilizing AI in data analysis, we must be mindful of our tendency to fall for the illusion of a “free lunch”. It’s vital to maintain realistic expectations of the effort, learning, and process change needed to utilize AI effectively and safely for data analysis, grounded in an accurate understanding of the strengths and limitations of AI systems. In particular, extracting relevant and reliable insights will require a lot more than buying a new tool and writing an arbitrarily worded question in a text box. It will require active collaboration between AI and human data analysts and business users who have invested in themselves to acquire the requisite data and AI skills.

2. Limitations

While AI systems hold immense potential, it’s important to note two limitations with the current state-of-the-art AI systems that have a direct bearing on their utility in data analysis: Reliability and Context Awareness.

2.1 Reliability

The current AI paradigm (Large Language Models, or LLMs), while immensely useful, is not yet reliable enough for us to forgo constantly validating its outputs. This can be easily experienced firsthand by working with any of the commercially available AI systems. GPT-4, the most powerful AI system today by a large margin and a phenomenal gift to humanity, still misses the mark not infrequently and is prone to making false claims, often referred to as hallucinations.

In an insightful interview, OpenAI’s Co-Founder and Chief Scientist Ilya Sutskever said (paraphrased): “If I had to pinpoint a reason why Large Language Models (LLMs) might not reach their full economic value, I would cite reliability. Despite our best efforts, if ensuring reliability proves more challenging than anticipated, it could significantly limit the economic value these systems can generate”.

Source — OpenAI’s Co-Founder and Chief Scientist Ilya Sutskever: “If I had to pinpoint a reason why Large Language Models (LLMs) might not reach their full economic value, I would cite reliability.”

Researchers have identified multiple scenarios where current LLMs are prone to be unreliable. Let’s look at the four most relevant to data analysis.

(1) Their grasp of common sense is often inadequate, which causes them to flounder on some tasks, generating text that is plausible but not grounded in reality.

While GPT-4 shows incredible understanding and logical reasoning skills, its grasp of common sense is terribly poor in some situations. In her TED talk, AI researcher Yejin Choi shows an example where, when asked the simple common-sense question “Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws, and broken glass?”, GPT-4 responds “Yes”.

Source — GPT-4 is unable to exhibit basic common sense in this situation.

(2) They don’t often know what they don’t know and, therefore, are prone to making false claims with certainty in situations where there are gaps in their knowledge.

A recent highly publicized case (here) underscores the tendency of LLMs to make false claims with confidence when there are gaps in their knowledge. In this instance, GPT-4 fabricated several legal cases, presenting them as precedents. These were subsequently submitted in court by an unsuspecting lawyer, leading to discovery and reprimand. The lawyer said: “I continued to be duped by ChatGPT. It’s embarrassing”. This incident led to several judges calling for a prohibition on the use of AI in legal proceedings unless its output is meticulously scrutinized by legal professionals.

(3) They seem to care more about “coherence” than “truth,” so to justify previously generated hallucinations, they can make incorrect claims that they “know” are wrong.

In a recent publication titled “How Language Model Hallucinations Can Snowball” (here), researchers show how LLMs can make claims they “know” are false in order to justify previously made false claims. In one of the examples, they show that when asked whether certain numbers are prime, the AI responds incorrectly. While this is certainly an issue, the bigger issue is that, to justify the previously generated false claim, the model outputs additional false claims that it can separately (when asked in a new session) recognize as incorrect.

Source — GPT-4 mistakenly claims that 9677 is not prime. When asked right after if 9677 is divisible by 13, it says Yes to justify its previous false claim. However, when we ask it the same question in a separate chat, it responds correctly, “No”.
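The specific claim in the example above is easy to verify with a few lines of plain Python, independent of any AI system:

```python
def is_prime(n: int) -> bool:
    """Check primality by trial division up to sqrt(n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

print(is_prime(9677))   # True: 9677 is in fact prime
print(9677 % 13 == 0)   # False: 9677 is not divisible by 13
```

This is exactly the kind of cheap, deterministic check a data analyst should reach for before trusting an AI-generated claim.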

(4) They often struggle with performing non-sequential tasks i.e., tasks that involve revisiting initial assumptions based on later findings.

The way LLMs work confines them to sequential decision-making processes. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. See here.

Now let’s contrast these known patterns of the unreliability of LLMs with the nature of data analysis.

  • Data analysis is an inherently nonlinear process, with insights gained later in the process often requiring a revisiting of initial assumptions.
  • Data analysis draws its validity from its rigor, not its results. Unlike tasks such as web development, where we can test the resulting functionality, there rarely is a simple way to verify the correctness of data analysis results. If we knew the exact answer, we wouldn’t have asked the question. In data analysis, often the only guarantee that we have a valid result is a meticulous approach to design and rigorous application of best practices at every step of the process.

Putting these points together — AI’s reliability issues and the non-linear, rigor-based nature of data analysis — means that (1) there is a critical need for a data analyst to manage the analysis and assess the outputs generated by AI and consequently, (2) there is special knowledge needed by the data analyst to be qualified to carry out that role effectively.

2.2 Context Awareness

Extracting valuable insights from data can only be carried out on a foundation of deep context awareness. A good data analyst holds in mind a considerable amount of context that they draw on to make sound decisions throughout the data analysis process. For instance:

  • The purpose of the analysis.
  • The potential actions that can be taken based on the results of the analysis, their feasibility, and implications.
  • The short and long-term objectives of the organization.
  • An understanding of business operations and product mechanics.
  • Deep familiarity with the data itself and its nuance, including; what it represents, how it was collected, and its limitations.
  • The perspectives, needs, and levels of understanding of different stakeholders.
  • External aspects such as competition, regulation, compliance, and culture.

Today’s AI systems, particularly Large Language Models, struggle with capturing, maintaining, and applying context.

  • The amount of context that can be provided to the model, referred to as the context window, is limited.
  • The most important contextual information is often gleaned from conversations with stakeholders or on-ground observations, which are forms of data that AI can’t directly acquire effectively.
  • LLMs are bound by the specific text input they are provided for each task. Without explicit sharing of crucial contextual information, the model remains unaware, limiting its effectiveness in the task at hand.
  • Context isn’t deeply woven into the model’s knowledge structure; instead, it is fed as a separate textual input with each prompt (which also makes it quite costly at $0.06 per 1k tokens; the cost can exceed $1 per prompt for large amounts of context, see here).
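The cost of feeding context with every prompt adds up quickly. A back-of-the-envelope calculation, assuming a fully packed 32k-token context window at the rate quoted above:

```python
rate_per_1k = 0.06       # USD per 1,000 tokens (rate quoted above)
context_tokens = 32_000  # assumption: a fully packed 32k context window

cost = context_tokens / 1000 * rate_per_1k
print(f"${cost:.2f} per prompt")  # → $1.92 per prompt
```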

Effectively acquiring and applying the complex contextual awareness required for autonomous data analysis remains beyond what AI can do. This limits the utility of current AI systems to carrying out individual data analysis tasks, while the specification of the tasks, the interpretation of results, and the management of the overall analysis are handled by the data analyst. This is reflected in a recent statement by OpenAI CEO Sam Altman: “AI would be good at tasks but not jobs”.

Source — OpenAI CEO Sam Altman at the US Congress in May 2023: “AI would be good at tasks but not jobs”

3. Skills

Given the nature of data analysis and the strengths and limitations of AI systems, what skills do data analysts and business users need to have to utilize AI most effectively in the data-driven decision-making process? Let’s find out.

3.1 Data Skills

(1) Question Formulation

The data analyst needs to have deep context awareness, business acumen, and data skills to formulate the appropriate data analysis questions for a particular business decision situation. A well-formulated question (1) yields the insight necessary to enable sound decision-making and (2) can be answered with available data.

Image by Author — For instance, say an online retailer is looking to decide on an effective marketing campaign. An analyst may ask: “What is the average order value per month for each of the past 12 months?”. A more experienced analyst would ask: “How has the monthly average order value varied for each distinct customer cohort (grouped by their initial quarter of purchase) over the past 12 months?”. The experienced analyst recognizes that customer behavior, particularly the extent of their reliance on the retailer, typically changes over time, and therefore looking at the overall average order value per month is not a good idea since it combines diverse “cohorts” of users, which likely obscures important information. A better approach is to look at each cohort separately over time, which reveals that within each cohort, the average order value increases over time. The overall downward trend may be attributed to more recent, larger cohorts, which typically have smaller initial order values.

AI can help the data analyst sharpen the question by offering variations on the original question drawn from its vast knowledge base but it wouldn’t know which is more appropriate for the case at hand. The choice relies heavily on the human user’s knowledge, skill, and context awareness.
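The cohort view described above can be sketched in a few lines of pandas. The column names and data here are hypothetical, chosen only to illustrate the grouping:

```python
import pandas as pd

# Hypothetical orders data: customer, order date, order value
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": pd.to_datetime(
        ["2022-01-10", "2022-05-03", "2022-04-15", "2022-06-20", "2022-06-01"]),
    "order_value": [50.0, 80.0, 40.0, 55.0, 30.0],
})

# Cohort = the quarter of each customer's first purchase
first_purchase = orders.groupby("customer_id")["order_date"].transform("min")
orders["cohort"] = first_purchase.dt.to_period("Q").astype(str)
orders["month"] = orders["order_date"].dt.to_period("M").astype(str)

# Average order value per cohort per month,
# instead of a single overall monthly average
aov = orders.groupby(["cohort", "month"])["order_value"].mean()
print(aov)
```

The resulting per-cohort series is what lets the analyst see that each cohort’s average order value rises over time even when the blended monthly average falls.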

(2) Analysis Design

A capable data analyst recognizes that there are many ways to go about answering a data question, and that some are far more effective than others: the choice of structure and granularity of a data summary, the choice of metrics, the choice of comparisons to show, and the choice of data visualization, to name a few. Capable analysts, therefore, start by conceptualizing the output that, if obtained from the data, would conclusively answer the question, before starting to write code. Accordingly, the data analyst needs to have the requisite knowledge of the different design choices, as well as the skill and context awareness to identify the right ones for the situation at hand.

Take data visualization for instance; while AI can write the code that produces a data visualization saving a good data analyst a lot of time, the data analyst still needs to know what the visualization options are, when to use each, and how to interpret them in context.

Source — In a recent demonstration of the ChatGPT Code Interpreter Plugin (a very useful tool), the prompt was: “Generate a radial bar plot of US life expectancy”, and GPT-4 managed to produce it. Even though AI can produce the visualization, the data analyst still needs to know what a radial bar plot is, when to use it, and how to interpret it. Here is how to interpret this chart: each circle represents a life expectancy (40y, 50y, etc.), and each radius (line from center to outside) represents a year (1998, 1999, etc.). The chart is comprised of cells, each representing the intersection of a life expectancy (circle) and a year (radius). To decide whether to fill a cell, we ask: was the median life expectancy in that year less than or equal to the value represented by the circle?

AI can provide valuable assistance to the analyst when making design choices. It can generate a range of possible analysis approaches and offer guidance into the applicability of each. It is important to note that the quality of the design options produced by AI depends on how effectively the analyst can steer it. This requires an understanding of which factors are critical in making design choices, and the ability to communicate these to the AI. Thus, the analyst needs to have the knowledge, skill, and context awareness to (1) guide the AI into producing good analysis approaches and (2) to consider the entirety of the situation and identify the optimal design choices.

(3) Analysis Management

There are many possible actions to take on a dataset — including cleaning, visualization, summarization, modeling, and more — and the possible actions only increase during the course of an analysis project, as taking one action uncovers many new leads to follow. Effective analysts are great at planning and managing the analysis, which protects them from losing time going down unproductive paths at the expense of more worthwhile ones.

While AI makes execution more efficient, it’s still simply not feasible to pursue every possible lead. The analyst must effectively manage the analysis by deciding which leads are worth following, in what order, and to what depth to balance between the quality of results obtained and the speed of decision-making.

(4) Execution Assessment

Current AI systems can perform well-defined data analysis tasks with comparable quality to a good data analyst and significantly more efficiently. Example tasks include checking for correlations between particular variables, creating a specific visualization of selected variables, or fitting a model to a particular set of variables.

It is important to note that a very large number of low-level decisions are made during the course of executing a data analysis task. Common examples include how to bin (partition) a continuous variable, how to handle missing values or duplicate records, and which data points qualify as outliers and whether to exclude them, among many more. These decisions have the potential to alter the output entirely. They are seldom one-size-fits-all but instead rely heavily on the context in which they are made.

Image by Author — In this example, because of the way the system works, orders with zero discount have a discount value of “NA”. Since in some tools, e.g. Python Pandas, missing values (“NAs”) are excluded from calculations by default, computing the average discount yields a misleadingly high value.
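The pitfall in the example above can be reproduced in a few lines of pandas, with hypothetical data in which NA stands in for zero-discount orders:

```python
import numpy as np
import pandas as pd

# Hypothetical orders: NA means the order had zero discount
discounts = pd.Series([0.10, np.nan, 0.20, np.nan])

# pandas skips NAs by default, so this averages only discounted orders
print(discounts.mean())            # ≈ 0.15 (misleading)

# Treating NA as "no discount" gives the true average across all orders
print(discounts.fillna(0).mean())  # ≈ 0.075 (correct under this business rule)
```

Which of the two numbers is “right” depends entirely on what NA means in this system, which is exactly the kind of context the analyst must bring.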

Therefore, the role of a data analyst goes beyond assigning the task. They must also review and evaluate the work of the AI, verifying that appropriate decisions were made at each step. This requires a depth of skill, knowledge, and context awareness. Without these, the user may not know about some of these low-level decisions, let alone be qualified to assess the decisions made. Two points are worth highlighting here:

  • While AI will write most of the code, the data analyst needs to have a sufficiently solid grasp on the code and how it works to recognize whether the right low-level decisions were made.
  • The data analyst needs to have an understanding of the nuances of the data, including what it represents, how it was collected, and its limitations.

(5) Interpreting Results

While AI may readily provide an interpretation of the results, it’s important to note that the same set of results can be interpreted in many ways. While many of these interpretations may seem plausible, only a few will be valid in a given context.

It takes technical knowledge, skill, data familiarity, context awareness, and domain depth to:

  1. Identify the correct interpretation of results and rule out other plausible but incorrect interpretations.
  2. Understand the AI’s interpretation. Sure, we can ask the AI to “explain it to me like I am 12,” but this could inadvertently strip away critical nuances, potentially compromising decision-making.
  3. Share appropriate input with AI to maximize the chances of obtaining relevant interpretations.
Image by Author — On inspecting a dataset of many companies, an analysis finds a correlation between company spending on employee training and sales performance. While it might be that training leads to better performance, the causal link could be reversed: companies performing well may have more resources for employee training. Alternatively, within the same company, effective salespeople might both drive high performance and be more likely to seek out training.

The above said, AI can be especially valuable during the interpretation of results. Great analysts are able to come up with multiple plausible explanations and then proceed to methodically rule them out to get to the most likely interpretation. This is very difficult since many analysts can fall into the trap of tunnel vision, finding it challenging to think beyond the first explanation that comes to mind. AI can help by producing multiple alternative interpretations that the analyst may not have thought of. By enriching the pool of possible interpretations, AI can significantly boost the overall value of the analysis.

(6) Navigating Bias

We are prone to many mental biases and cognitive illusions, which can, and often do, hinder our ability to leverage data effectively in decision-making. It takes special knowledge, skill, and discipline to steer clear of these biases. For instance, many of us, often unconsciously, seek to find ways to confirm our position, which is a well-documented bias referred to as Confirmation Bias. Without the necessary knowledge, context awareness, and discipline, it is usually possible to draw results that confirm one’s preconceived position from the data. A common form of that is to stop the analysis when the evidence gathered so far confirms the belief they would like to be true.

Source — In this comic, scientists find no significant link between jelly beans and acne, but then proceed to test 20 different colors of jelly beans for causing acne. The relationship with green jelly beans is found to be significant at a 95% confidence level. However, when testing 20 hypotheses, there is a good chance (approximately 64%) that at least one of them will appear significant by chance alone, even if there is no real effect. This is a common pitfall in data analysis known as the ‘multiple comparisons problem’, and it illustrates the importance of knowledge and discipline in steering clear of incorrect results.
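The “approximately 64%” figure follows directly from the math. A quick check, assuming 20 independent tests at the 5% significance level:

```python
n_tests = 20
alpha = 0.05  # per-test false-positive rate at 95% confidence

# Probability that at least one of the 20 null hypotheses appears
# significant purely by chance
p_any_false_positive = 1 - (1 - alpha) ** n_tests
print(round(p_any_false_positive, 3))  # → 0.642
```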

The risk of a biased interpretation of results can often be magnified with AI because of the AI’s limited context awareness and the tendency of current AI systems to appease and conform to the views of the user. This means that without the necessary knowledge, context awareness, and discipline, it is quite easy for someone using an AI-powered data analysis tool to fall for confirmation bias and have an articulate narrative supporting their skewed interpretation of the data.

(7) Making Recommendations

Recommending appropriate actions based on analysis results is an intricate task that requires a solid understanding of the results themselves as well as a deep awareness of the possible actions, their feasibility, and their ethical, cultural, and regulatory implications. Given the limitations of AI in fully grasping context, cultural nuances, and ethical considerations, this task demands the judgment of a human analyst.

Moreover, the responsibility and accountability for analysis results and any subsequent recommendations cannot be outsourced to an AI system. The final decision-making responsibility and the ethical and legal accountability for the actions taken based on an analysis lie with human operators.

(8) Communicating

Communicating analysis results and recommendations is as critical as the analysis process itself. It involves transforming complex analysis results into easily digestible information that can be used by decision-makers, who might not possess the technical expertise to understand raw analytical outputs. This requires the ability to convey intricate concepts in an accessible and convincing way and to handle questions and objections effectively. It also demands a keen understanding of the audience, including their knowledge level, expectations, and potential biases.

AI can generate visual representations and summaries of analysis results, even “translating” them into plain language. It can also adapt the communication style based on the intended audience. However, AI can lack the subtlety and flexibility of human communication, especially when dealing with complex ideas and emotions. Moreover, AI is not yet able to fully understand the cultural, political, and emotional context that can greatly influence how messages are received. Furthermore, stakeholders often prefer to interact with a human with whom they can have a conversation, asking questions about the results, the approach, the implications, and the recommendations. As such, while AI can be an aid in the communication process, it is the human analyst who must take the lead in conveying the analysis results and recommendations in a persuasive, contextually appropriate, and responsible manner.

3.2 AI Skills

While their natural language interface makes AI systems fairly easy to use, optimizing the value derived from these systems is far from trivial. State-of-the-art research indicates that certain interaction and prompt strategies can yield significantly higher quality outputs.

For instance, the strategy of “chain of thought prompting” has proven to be highly effective. In recent research (here and here), it was found that AI performance on certain tasks increased by 10 to 20 percent when the model was guided through a step-by-step process, as in “Let’s work this out in a step-by-step way to ensure we have the right answer”. This underscores the importance of appropriate training for data analysts and business users in AI skills.
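A minimal sketch of what such a prompt wrapper might look like; the step-by-step phrasing is the one quoted above, while the function name and example question are our own:

```python
def chain_of_thought(question: str) -> str:
    """Wrap a question with a step-by-step instruction, a simple
    form of chain-of-thought prompting."""
    return (
        f"{question}\n\n"
        "Let's work this out in a step-by-step way "
        "to ensure we have the right answer."
    )

prompt = chain_of_thought(
    "Is the month-over-month drop in average order value significant?")
print(prompt)
```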

Source — Self-reflection strategies boost GPT-4’s performance on certain tasks. This particular study involved a coding task.

Another strategy that is particularly relevant for data analysis is to break the work involved in every task into a design phase followed by an execution phase. The design phase involves refining the question, designing the output (that would answer the question), and planning the steps of the data analysis procedure. All of this is done in dialog with the AI, with the data analyst setting objectives, reviewing, and giving feedback at each step. Once the design is deemed sound, the AI is instructed to write and execute the code, with the human analyst reviewing and validating the results. In contrast, posing a single-shot question to the AI, where only the final output is returned, opens us up to uncertainty and error.

Without understanding the AI’s process (its assumptions, design, and potential semantic code errors), verifying the trustworthiness of the results becomes very difficult. In other words, utilizing AI effectively and safely in data analysis is not merely about writing a prompt but also involves understanding and navigating the process the AI uses to reach its conclusions.
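The design-then-execute pattern described above can be sketched as a simple loop. Here `ask_ai` is a hypothetical stand-in for a call to any real LLM API, and the prompts are illustrative only:

```python
def ask_ai(prompt: str) -> str:
    """Hypothetical stand-in for a call to a real LLM API."""
    return f"[AI response to: {prompt}]"

def analyze(question: str, design_feedback=()) -> str:
    # Phase 1: design — refine the question and plan the analysis in
    # dialog, with the analyst reviewing and giving feedback on each draft
    plan = ask_ai(f"Propose an analysis plan for: {question}")
    for feedback in design_feedback:
        plan = ask_ai(f"Revise this plan: {plan}. Feedback: {feedback}")

    # Phase 2: execution — only once the design is approved does the AI
    # write and run code; the analyst then validates the results
    return ask_ai(f"Write and run the code for the approved plan: {plan}")

result = analyze(
    "Why did average order value drop last quarter?",
    design_feedback=["Break results down by customer cohort"],
)
print(result)
```

The point of the structure, not the particular prompts, is what matters: the analyst gates the transition from design to execution rather than accepting a single-shot answer.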

The value derived from AI systems is determined as much by the proficiency of the user as by the system’s capabilities. A user familiar with the AI’s capabilities, limitations, and effective interaction strategies can unlock much more of its potential.

To summarize, for AI-powered data analysis tools to truly enhance data-driven decision-making, organizations must create a working relationship where data analysts and AI systems seamlessly complement each other throughout the analysis process. AI helps in sharpening the question, enhancing the design, enriching the interpretation, and executing tasks efficiently, in particular, writing and running code against the organization’s data. On the other hand, human data analysts play the leading role in formulating questions, managing the analysis, assessing the AI’s execution, interpreting results, and communicating with stakeholders. They also bear the responsibility for ensuring that AI systems are used ethically and responsibly.

Without data analysts and business users who are equipped with the necessary skills, organizations cannot realize the potential benefits of AI in data-driven decision-making. This could result in unwarranted frustration with AI systems or, even worse, misleading analysis results. Therefore, as we harness the capabilities of AI, it’s vital to invest in developing and nurturing these crucial skills, which have long been the essence of effective and responsible data analysis and decision-making.

4. Learning

4.1 Business Users

There are massive benefits to empowering business users to handle some of the data needs of their decision-making, including:

  1. It empowers numerous individuals across the organization to answer questions and formulate hypotheses with data.
  2. There are unparalleled benefits to depth and understanding when the curious, context-aware business or product leader is able to engage directly with the data.
  3. It will free up a significant amount of the data team’s time to tackle strategic high-value projects.

Let’s pause for a moment and consider the significance of direct interaction with data by decision-makers. Drawn-out cycles of inquiry, with days between questions and responses, could stifle the development of valuable insights. It is like trying to play a tune on a piano but hearing the note hours after pressing the key. However, when decision-makers can directly interact with data, they create a positive loop of understanding. An answer to one question quickly leads to a deeper, more targeted question. This fast-paced cycle of understanding is difficult to achieve when decision-makers have to wait for others to provide answers over several days.

Traditionally, many business users were not able to answer their own questions with data for two reasons:

  1. Business users typically lack the requisite data literacy and technical skills to carry out simple ad-hoc data tasks, such as writing a simple data query. No-code self-serve data analysis has been widely promoted as a solution, but due to various issues, such as poor data model quality, that promise is rarely realized in many organizations.
  2. Perhaps more critically, when business users, and sometimes inexperienced data analysts, carry out data analysis, they are likely to draw inaccurate conclusions due to a lack of the necessary conceptual knowledge and trained judgment. Typical pitfalls include succumbing to confirmation bias, erroneously identifying randomness as patterns, confusing correlation with causation, and making semantic execution errors.

Now, let’s examine how the advent of AI impacts these skill requirements and empowers business users:

  1. Because AI can handle technical tasks well, the qualification bar for some technical skills (coding in particular) is being lowered. The user still needs a basic grasp of the code and how it works so they can verify that it is doing the right thing. That said, learning to recognize what code does is significantly easier than learning to write code from scratch.
  2. Data analysis design skills such as defining problems, formulating questions, designing solutions, and interpreting results are still very much needed. In particular, a sufficient grasp of concepts such as confirmation bias, randomness, and causality and their implications on how we formulate questions and interpret results is particularly important.
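To make point 1 concrete, here is a sketch (with hypothetical data and field names) of the kind of AI-generated snippet a business user should be able to read and sanity-check, even if they could not have written it themselves:

```python
# A sketch of the kind of AI-generated snippet a business user should be able
# to read and verify. The data and field names are hypothetical.
from collections import defaultdict

orders = [
    {"region": "EU", "amount": 120.0, "status": "completed"},
    {"region": "EU", "amount": 80.0,  "status": "refunded"},
    {"region": "US", "amount": 200.0, "status": "completed"},
]

# Things the user should check before trusting the number:
# 1. Are refunded orders excluded? (the status filter below)
# 2. Is "amount" gross or net of tax? (a question for the data owner)
revenue = defaultdict(float)
for order in orders:
    if order["status"] == "completed":  # check: refunds are excluded here
        revenue[order["region"]] += order["amount"]

print(dict(revenue))  # {'EU': 120.0, 'US': 200.0}
```

Spotting that the `status` filter exists (or is missing) is a far lighter skill than writing the aggregation from scratch, which is the point of the lowered bar.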

Traditionally, data literacy courses available to business users were either too high-level, leaving the business user largely unequipped to work directly with data, or too tool-focused (e.g., SQL or Python), overwhelming the business user with low-level details that are rarely relevant and can be disengaging. In addition, both types of courses generally fail to cover the essential conceptual data skills we outlined earlier.

There is a need for an education system that fosters a pragmatic understanding of the role of data in decision-making and equips business users with the right mix of conceptual and technical data skills as well as proficiency in working with AI systems to be able to utilize AI-powered data analysis tools effectively and safely to boost their decision-making.

Note: We are advocating that business users be empowered to work with data for measurement and hypothesis generation. However, hypothesis testing (via experimentation or causal inference) should remain with data analysts trained in practical statistics.

4.2 Data Analysts

Data analysts who have the knowledge, skills, and business acumen necessary for formulating questions, managing analysis choices, assessing the AI’s execution, interpreting results, and ensuring that AI systems are used ethically and responsibly are essential to utilizing AI effectively and safely in data analysis, and will, consequently, bring outsized value to their organizations.

For many years, expectations for levels of seniority in data analysis were as follows: Junior data analysts have the coding ability and theoretical statistics knowledge to execute data analysis tasks or take on small projects. Defining problems, formulating questions, designing solutions, interpreting results, and handling communication generally fell to senior data analysts, typically with upward of 3 years of work experience. See here.

If we contrast the conventional path of skill development in data analysis with the skills required for effective collaboration with AI systems, it is clear that they are almost entirely senior data analyst skills. What a junior data analyst is typically qualified to do — execute solutions designed by a senior for problems formulated and structured by a senior — is precisely what AI can do well today.

It has been the case for a long time that junior data analysts are rarely able to create value for the organization on their own. The “deal” has been that juniors augment seniors’ capacity by taking on execution while learning from seniors the intricacies of formulating questions, designing solutions, interpreting results, and communicating effectively to eventually become seniors. With the advent of AI, this cycle is getting disrupted. In a large number of situations, augmenting the senior data analyst with AI will be more efficient for the organization than adding a junior to the team. Consequently, roles that can be filled with today’s junior data analyst profile will gradually become hard to come by.

While it is possible for a data analyst to secure a position by showing basic knowledge of tools like SQL and Tableau without having the “senior” skills we covered earlier, this may not be the case for long. The bar is being raised, and some of today’s “senior” data analyst skills will soon become the new “junior” skills. Data analysts need to dedicate time and energy to fast-track building their conceptual and technical data skills, as well as proficiency in using AI systems.

I think the reason so many junior data analysts today are under-skilled is a natural by-product of the education system’s focus. Most data analysis or data science programs focus on tools and scientific theory but rarely teach students how to formulate problems, design solutions, interpret results, and communicate effectively. For instance, many courses teach students how to use SQL and Tableau to create data visualizations but rarely teach them how to identify the right visualization for a particular business situation. Another example: most statistics classes teach students how to perform a statistical test, such as a t-test, to determine whether there is a statistically significant difference between the treatment and control groups. However, these classes seldom instruct students on how to formulate the hypotheses — both null and alternative — in the context of complex, real-world business situations.
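To illustrate the t-test point, here is a minimal sketch (with made-up numbers) that separates the part courses usually teach, computing the statistic, from the part they rarely do: stating, in business terms, what the null and alternative hypotheses actually are:

```python
# Minimal sketch of a two-sample (Welch's) t-test on hypothetical data.
# The part courses usually skip is stating the hypotheses in business terms:
#   H0: the new checkout flow does not change average order value.
#   H1: the new checkout flow changes average order value.
from math import sqrt
from statistics import mean, variance

control = [10, 12, 11, 13, 14]    # hypothetical order values, old flow
treatment = [14, 16, 15, 17, 18]  # hypothetical order values, new flow

def welch_t(a, b):
    """Welch's t statistic: difference in means scaled by its standard error."""
    return (mean(b) - mean(a)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

t = welch_t(control, treatment)
print(f"t = {t:.2f}")  # compare against a t-distribution to obtain a p-value
```

The mechanical step is a few lines; deciding that "average order value" is the right outcome, and that these two groups are a fair comparison, is the judgment the text argues is rarely taught.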

While there is no substitute for hands-on experience and apprenticeship for building deep understanding and acquiring pattern recognition for problems and solutions, the data education system can be rethought to do a much better job of preparing data analysts for the AI age. It needs to focus less on tools and more on the knowledge and thinking that go into problem formulation, solution design, interpretation, and communication in real business settings, as well as on developing proficiency in working with AI systems.

Helping data analysts and business users acquire the conceptual data skills outlined above is much easier said than done. It is far easier to teach and test for SQL competence than for the ability to interpret analysis results in context. While challenging, it is feasible and necessary. It will require a comprehensive transformation, including revised content, innovative learning approaches, and the clever use of AI as an educational tool.

The problem of empowering data analysts and business users with the skills they need to realize value from data in the age of AI is one I am deeply passionate about, and I hope that the work my team and I are doing with optima.io will make a positive contribution.

5. Near Term

5.1 Before

Let’s first describe the pre-AI state of data analysis at a typical company where data-driven decision-making is in relatively good shape:

There is a lot of valuable work that never gets done.

At many, if not most, organizations, a significant volume of data work, which could create substantial value, never gets done. In my experience leading a competent data team of over 100 people at a prominent tech company, we were barely able to tap into perhaps 10% of the work that could potentially generate value for our organization. We had to ruthlessly prioritize and limit the scope to make the most of the team’s bandwidth, at the cost of dropping high-value projects and cutting corners. A good way to describe it is as a constant state of compromise.

Here’s a more detailed look:

  • Many high-value projects never get done due to limited bandwidth. The data team has to constantly negotiate priorities with business users, essentially deciding which business decisions get support with data analysis or experimentation, inevitably leaving many business decisions unsupported.
  • Corners are cut as projects need to be wrapped up quickly to move on to the next thing. For instance: data model changes are made without documentation, making future work with the data more challenging; data pipeline changes may be deployed without the requisite automated tests, delaying the response to potential issues; data analysis and experimentation projects are wrapped up without a post-analysis write-up, making it hard to benefit from the learnings later; and inadequate tracking code is shipped with new features, leaving gaps in understanding user interactions, among many other examples.

Minimal and potentially damaging contributions from business users:

As discussed earlier, at many organizations, business users are not able to answer their own questions with data due to a lack of the requisite data literacy and technical skills, and when some do run analyses, they are prone to drawing inaccurate conclusions due to a lack of the necessary conceptual knowledge and trained judgment.

Many junior data analysts working on routine or well-defined tasks

In many organizations, the data team is predominantly composed of junior members. Their skill set frequently makes them dependent on senior members for problem definition, solution design, implementation review, and communication of results. Therefore, their work is generally limited to performing routine tasks (such as running ad-hoc queries or creating simple reports), or tasks that have been well-defined by a senior. They are rarely able to autonomously take on higher-value projects because those are typically more ambiguous or complex.

A few senior data analysts who are simultaneously stretched thin and underutilized

Most data teams have a relatively small group of senior members who form the core of the team, the pillars on which the team stands. While they have the skills and potential to tackle the most impactful problems and deliver outsized value to the organization, they are perpetually stretched thin with firefighting, routine management, and filling the gaps left by junior team members.

5.2 After

Now let’s look at what the situation may look like at an organization where:

  1. AI is properly integrated in the data analysis workflow.
  2. Data analysts and business users have the requisite skills described earlier, specifically, conceptual and technical data skills as well as proficiency in interacting with AI systems.

Business users can effectively and safely handle a substantial proportion of the data needs of their decision-making.

Equipped with AI and the requisite skills, business users can effectively and safely handle a substantial proportion of the data needs of their decision-making, with minimal dependence on data analysts. This will improve the quality of decision-making while freeing data teams to focus on strategic high-value projects.

The efficiency and quality of work produced by data teams will be significantly improved.

  1. By stepping up to focus on question formulation, analysis management, execution assessment, and interpretation of results, while delegating the actual execution of well-defined analysis tasks to AI, analysts can explore many more hypotheses in greater depth, yielding better insights, and still conclude an analysis in less time than previously possible.
  2. The efficiency boost enabled by AI, combined with the bandwidth freed by transitioning basic decision-support data tasks to the newly empowered business users, means that data teams can undertake many high-value projects that would have previously never been done.
  3. Tasks that were often dropped in the team’s haste to move on to the next project, such as data model documentation, post-analysis write-ups, and data test writing, are time-consuming but straightforward and can thus mostly be done by AI, with data engineers and data analysts reviewing and suggesting changes where needed. These improvements, among others, will boost system robustness and data reliability, as well as further increase the team’s overall productivity and quality of work experience.
  4. Senior members can spend less time on mundane management and more time innovating and leading. This includes taking on the most challenging problems, mentoring the team, and setting standards for AI use to ensure its effective and safe utilization.

The outcome is a significant expansion of an organization’s capacity to extract value from data, consequently driving a substantial enhancement in the quality of the organization’s decisions.

But wouldn’t an increase in productivity mean that companies would need fewer data analysts? This is highly unlikely, at least in the short to medium term. It’s more likely that the productivity and quality gains enabled by AI will raise expectations of data teams, especially given that there is a very large amount of valuable work that was previously left undone. Once a company leverages AI effectively to make significantly more data-driven, hence better, decisions, its performance will improve, compelling its competitors to do the same to stay relevant. The bar will thus be raised for all. In this way, any job loss is more likely to result from failure to acquire the necessary data and AI skills than from the advent of AI itself.

6. Beyond

6.1 Gradual Enhancement

Active research will gradually mitigate some of LLMs’ limitations. In particular:

  • Reliability: LLMs will gradually get more reliable. In addition to continuous improvement through the feedback of millions of users, two currently promising research directions are: (1) endowing AI with access to tools it can use to validate its assertions, such as the CRITIC system proposed here; (2) adding layers on top of LLMs that endow AI with some non-linear reasoning abilities, such as the Tree of Thoughts system proposed here.
  • Context: LLMs’ context awareness will continue to improve. Promising directions include expanding context windows and novel means of acquiring context, such as easily ingesting chats, emails, and meeting transcripts related to a project or topic. Especially notable is that AI will get better at asking revealing questions to actively acquire the necessary context.

We can also expect progress in mitigating practical obstacles. In particular:

  • Tooling: Deep integration of AI into data analysis tools and workflows will make it easier to use and capable of handling more tasks. Researchers (here) estimate that 80% of the tasks in roles like data analyst could be carried out in 50% of the time or less if AI is well integrated into the tooling and practitioners adopt and use those tools effectively.
  • Privacy: Data privacy concerns will be resolved. Cloud providers will offer private instances of LLMs that can work directly with the enterprise’s data. In addition, lighter models (like Orca and Falcon) will be possible to self-host and maintain privately.

As AI systems continue to become more reliable, they will be able to contribute more effectively and in more ways to the data analysis process, especially to activities such as data cleaning, performance reporting, and hypothesis generation. This will further improve the efficiency and quality of the results of data analysis. It may also allow a reduction of the level of expertise required by human users to ensure that data analysis is performed effectively and safely. That said, the fundamental structure of work will likely remain similar to what we described above, with skilled human data analysts and business users working in tandem with AI to realize massive value for the organization.

6.2 Superhuman Intelligence

AI’s cognitive capabilities will continue to advance. Either the current LLM paradigm gets its issues solved, or a new paradigm is discovered in the near term that keeps the progress toward superhuman intelligence going unabated. We may then reach a point where AI systems are able to create ever more intelligent AI systems, leading to an intelligence explosion that quickly leaves human intelligence so far behind that involving humans in almost any decision-making process becomes detrimental to the quality of the decision. At that point, it will not just be data analysis being taken over by AI; for purely efficiency reasons, the entire decision-making process will likely be handed over to AI systems, with human involvement reduced to a high-level supervisory role.

Source — In a nuclear reaction, one nuclear event triggers others, causing an escalating, self-perpetuating cycle that releases massive energy. Likewise, an intelligence explosion begins with an AI system smart enough to design an even smarter one. This new AI then develops an even more advanced system, and the cycle continues, leading to an “intelligence explosion”.

Some AI thought leaders believe that superhuman intelligence will be a reality in under ten years. For instance, in a recent publication titled “Governance of Superintelligence” here, OpenAI’s leaders say: “Given the picture, as we see it now, it’s conceivable that within the next ten years, AI systems will exceed expert skill level in most domains”. On the other hand, other AI thought leaders think that it may take decades to reach superhuman intelligence here.

While we may well arrive at this point within the next 10 or 20 years, we would be well advised to consider two possibilities:

  1. The challenges facing AI progress may prove harder than we think.
  2. We may choose to limit the extent to which decision-making processes can be handed over to AI.

Let’s explore these briefly.

Lord Kelvin, the leading physics authority of the time, stated in a lecture given in 1900 that “there is nothing new to be discovered in physics now. All that remains is more and more precise measurement”. This statement was made shortly before the birth of quantum mechanics and Einstein’s theory of relativity, two groundbreaking fields that revolutionized our understanding of the universe and, if anything, showed us that we are barely scratching the surface of what can be discovered in physics.

The field of AI itself had periods, referred to as “AI winters,” when enthusiasm (and funding) was drastically reduced. The first of these periods occurred in the mid-1970s, due to the failure of early neural networks to live up to their hype and the lack of progress in achieving the ambitious goals initially set for AI. A second AI winter occurred in the late 1980s and early 1990s for similar reasons.

Technological progress has a knack for defying expectations.

Turning to the second possibility. Ethical considerations and regulatory constraints could potentially limit the extent to which decision-making processes can be handed over to AI. This is a highly debated issue, and many experts argue that there should always be a ‘human in the loop’ to ensure alignment with human goals as well as accountable and ethical decision-making.

There is precedent for humans making responsible, ethical decisions in the face of business and competitive pressures. While human cloning could have been a massive business, we made the responsible decision to ban it. Likewise, nations have collectively agreed to prohibit chemical and biological warfare.

Finally, there are compelling arguments for the prediction that even if AI eventually subsumes today’s data analysis tasks, there will emerge evolved roles that offer us similar, if not better, levels of satisfaction. See here, for instance.

The superhuman intelligence scenario is a massive topic and certainly one that can’t be done justice in this article. For a balanced treatment, I recommend the book “Life 3.0”. It is an enlightening account of the possible impact of superhuman AI on the future of life on Earth and beyond. The book explores an array of societal implications, strategies to enhance the likelihood of favorable outcomes, and potential trajectories for humanity and technology. The author, Max Tegmark, is one of the deepest thinkers on the topic. He is a professor at MIT and a founder of the Future of Life Institute, which was one of the first voices calling for taking AI safety and alignment seriously.
