AI for software development: can we actually measure its impact?

Luis Mizutani
8 min read · Apr 22, 2024
White android levitating in a meditation posture, attached by cables to a solid grey sphere.
Credits: Aideal Hwa via Unsplash

We constantly see posts on social media, blog posts, and articles in recognized publications talking about AI's potential for disruption, and how AI will not only transform our society but also drive productivity gains across several industries and economic activities.

Some critics of this view claim that, since ChatGPT's launch, Sam Altman and other OpenAI executives have never been able to explain concretely how AI will help companies make more money or why this technology matters so much.

We cannot deny that AI is the buzzword of the moment. However, I feel there is just too much noise, and perhaps a lot of speculation, about current AI's potential to drive higher productivity, enhanced work performance, and ultimately higher value for businesses. There seems to be a fine line between what is tested knowledge today and what might be an overexcited reaction to everything being shared.

Navigating the complexity: can AI enhance team performance/productivity?

The debate around AI gains is also a hot topic in software development, given the natural link between this industry and AI technology. Some articles state that software development work can be sped up by 75%; others point to faster feature releases or a reduced number of bugs, while still others cite completely different numbers. A third group (which I identify with) debates the validity of such numbers. But which view is the right one?

If you are looking for a straightforward answer, I am afraid you will be disappointed. The expected answer is: it depends. Software development is a complex activity that relies heavily on cognitive work. As highlighted by Erik Brynjolfsson (the Jerry Yang and Akiko Yamazaki Professor at Stanford) in his article "Can AI actually increase human productivity?", productivity gains can come in two forms:

  • more output produced from the same work: usually an industrial labor activity that may benefit from task automation. This form of gain is usually easier to measure.
  • accelerated innovation flow, from a new idea to its application: usually a cognitive labor activity that may use AI tools to speed up research and improve idea generation and analysis capabilities. This second type of gain is far more subjective and complex to measure.

Parsing Controversy: AI’s Role in Code Generation

Research done by McKinsey claims that software engineers can develop code 35% to 45% faster, refactor 20% to 30% faster, and produce documentation 45% to 50% faster. But the same research points out that realizing such benefits is highly dependent on factors such as the nature of the activity being performed (e.g., repetitive versus creative) and on the skill set and experience of the engineer using the AI tools. In some cases, productivity was negatively impacted.

These numbers look promising, but the reality is that, when it comes to software development, measuring the productivity gains from adopting AI tools is a topic open to a long debate with no clear answer. This holds even if we treat software development simply as coding (which is a very limited view of it).

The mere use of AI for coding is no guarantee of higher productivity. To make this statement a little more tangible, I will use an analogy drawn by Neal Ford and discussed by AI experts Mike Mason and Birgitta Böckeler in the episode "AI-assisted coding: Experiences and perspectives" of the ThoughtWorks Technology Podcast. Neal compares the use of AI tools for coding to the introduction of spreadsheets for accounting work back in the 1980s. For accountants, spreadsheets represented a huge productivity boost: they made it easier to store and retrieve data, reduced human error through formulas, scaled work faster, and provided an easily shareable format. AI, however, has fundamental differences when it comes to coding. For example, inputs are not structured, and in most instances developers do not know exactly how the AI works underneath its abstraction to generate answers. That does not eliminate the need for developers to evaluate the AI's output to ensure it is robust, secure, and so on. Depending on the case, this can even harm productivity rather than improve it.

This reinforces the view that coding is only a fraction of software development activity and cannot be seen in isolation from other cognitively intense tasks that involve creativity, analysis, and other complex cognitive capabilities.

Beyond Coding: The Holistic View of Team Performance

Measuring the performance of a software development team is in itself a controversial topic. Now introduce an AI layer on top of that discussion and we have an even more difficult task ahead.

One may claim that the current state of software development resembles an industrial labor activity, and that agile frameworks and methodologies have streamlined the development process, allowing teams to produce code and releases with high frequency, like an industrial production line. If such assumptions were true, we could measure a software development team's productivity through things such as team throughput, cycle time, lead time, and the DORA metrics. AI gains should then be reflected in those metrics. However, the reality is more complex than that.
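To make the flow metrics mentioned above a bit more concrete, here is a minimal sketch of how cycle time and throughput are typically computed; the work-item timestamps are entirely made up for illustration:

```python
from datetime import datetime
from statistics import median

# Hypothetical work items: (started, finished) timestamps -- illustrative data only.
items = [
    (datetime(2024, 4, 1), datetime(2024, 4, 4)),
    (datetime(2024, 4, 2), datetime(2024, 4, 9)),
    (datetime(2024, 4, 5), datetime(2024, 4, 7)),
]

# Cycle time: elapsed days from work started to work finished, per item.
cycle_times = [(done - start).days for start, done in items]

# Throughput: number of items finished within a given window.
window_start, window_end = datetime(2024, 4, 1), datetime(2024, 4, 10)
throughput = sum(1 for _, done in items if window_start <= done < window_end)

print(f"Median cycle time: {median(cycle_times)} days")  # 3 days
print(f"Throughput: {throughput} items")                 # 3 items
```

The point is not the arithmetic, which is trivial, but what it hides: a better median after adopting an AI tool tells you nothing, on its own, about why it improved or whether the team's outcomes improved with it.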

Even if those metrics show a better outlook after adopting AI tools, it is still arguable whether the overall team performance and the outcomes produced by its work will be affected the same way. Team performance is a wider concept that usually speaks to the quality of the work done by the team, both in terms of its processes and practices and in terms of the outcomes produced. High performance is usually the result of multiple factors at play: process efficiency, high collaboration, psychological safety, a risk-taking culture, a focus on learning, effective internal and external communication, and the right mix of skills, to name a few. And the use of AI won't necessarily affect all these factors positively.

Unlocking AI’s Potential: Understanding Optimal Applications

BCG published an article arguing that gains from adopting GenAI tools are highly dependent on a set of variables, including the type of task, the skills of the person using them, and the way the AI tools are employed. Knowing when and how AI is best used is still not trivial, and wrong usage can destroy value rather than create it.

Their study results indicate that current GenAI capabilities are a true fit for creative tasks, because it seems easier for LLMs to use vast amounts of data to come up with creative, novel, and useful ideas. On the other hand, the models do not cope so well when they need to weigh nuanced quantitative and qualitative information to problem-solve or answer complex questions.

Another disturbing finding is that the technology's relatively uniform output can reduce a group's diversity of thought.

Perhaps the most paradoxical result of this study is that we trust AI for things it is not so good at, and we mistrust the technology in the areas where it can benefit us the most. As stated in the article:

People seem to mistrust the technology in areas where it can contribute massive value and to trust it too much in areas where the technology isn’t competent.

Rethinking Metrics and Approaches

The use of numeric metrics that can be translated into charts or tables may look safer and more tangible. Our rational and reductionist minds tend to look for this type of evidence; however, metrics in isolation may fail to deal with complexity. For instance, I've seen many failed attempts to build a unified view of several teams using a single system of metrics. These often fail because teams may have different workflows, and because, while we can generate metrics to understand a team's process, it is really hard to measure cross-team collaboration.

I personally advocate for the use of quantitative metrics, but in combination with a qualitative assessment, which is usually done through conversations at the team level, both in group and individual settings. That aligns with one of the core (and sometimes forgotten) values of the agile manifesto: individuals and interactions over processes and tools.
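What might pairing the two kinds of signal look like in practice? A hypothetical sketch, not a prescribed method; the metric names, survey question, and thresholds are all invented for illustration:

```python
# Hypothetical sketch: pairing a flow metric with a qualitative team signal.
# Neither number is meaningful alone; together they prompt a conversation.

team_snapshot = {
    "median_cycle_time_days": 4,   # quantitative: taken from the delivery pipeline
    "codebase_ease_rating": 2.1,   # qualitative: 1-5 survey, "how easy is it to navigate the codebase?"
}

def needs_conversation(snapshot, cycle_threshold=5, rating_threshold=3.0):
    """Flag a team check-in when either signal looks off.

    Thresholds are illustrative, not industry standards.
    """
    return (snapshot["median_cycle_time_days"] > cycle_threshold
            or snapshot["codebase_ease_rating"] < rating_threshold)

print(needs_conversation(team_snapshot))  # True: low ease rating despite healthy cycle time
```

Note that the output of this sketch is not a verdict on the team; it is a trigger for exactly the kind of conversation the qualitative assessment is built on.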

This is also corroborated by the excellent article written by Abi Noda and Tim Cochran on Martin Fowler's blog, in which the authors advocate for qualitative metrics based on conversations with the team's engineers. Regular conversations allow for a better understanding of things that are intangible or hard to measure, such as "how easy is it to navigate the codebase" or "what is the current state of the system versus the ideal one (aka tech debt)".

But, what does this have to do with AI? A lot.

AI works really well at summarizing or making sense of huge amounts of quantitative data, but it still fails to deal with subjective and nuanced qualitative information. The technology is still unable to make good judgement calls, and a judgement call is exactly what we make when we use a qualitative metric to classify something as good or bad, efficient or inefficient, functional or dysfunctional.

In conclusion, the adoption of AI won't guarantee higher performance or productivity if you cannot measure it. But measuring it wrong is equally misleading, if not worse. There is strong evidence that we are far from capturing the impact of AI by simply looking at numbers, and organizations should put effort into developing systems and ways of working that leverage AI without forgetting that, at the end of the day, people are still the central element of great software development.

References

Böckeler, B., Mason, M., Ford, N., & Chandrasekaran, P. (2023) AI-assisted coding: Experiences and perspectives. ThoughtWorks Technology Podcast [Online]

Brynjolfsson, E. (2023) Can AI actually increase human productivity? World Economic Forum [Online]

Cuomo, J. (2023) Quantifying the Productivity Gains of Generative AI for Developers. Medium [Online]

Lyman, I. (2020) Can developer productivity be measured? Stack Overflow Blog [Online]

Noda, A., & Cochran, T. (2024) Measuring Developer Productivity via Humans. MartinFowler.com [Online]

Sharma, A. (2024) How To Measure Developer Productivity In The Age Of AI? Forbes Innovation [Online]

Stryker, C., (2024) 9 ways developer productivity gets a boost from generative AI, IBM Blog [Online]


Luis Mizutani

I am Luis Mizutani. I have over 10 years of experience empowering product teams to build a strong culture, perform, and innovate.