AI in Software Development

It's impossible to ignore the potential of large language models for software development. After all, software development is based on languages, and the act of creating software is an act of translation - from idea to running system, from specification to implementation, from API to unit test - and LLMs shine at translation tasks.

The purpose of this article is to sketch out how AI assistants are already supporting typical tasks through the whole value stream of software development, and what may be coming over the next few years.

I'm loosely grouping tasks and use cases by phases of the software development lifecycle - Ideation and Analysis, Architecture and Design, Development - and additionally including a "planning and management" category for related use cases that don't fit any other category. For each use case, I'll attempt an assessment of the potential for LLMs (current/future share of value creation) and list some existing tools / vendors for that use case - though trying to compile a complete list of solutions would be a fool's errand, considering how fast the field is moving.

Possible timeline for the arrival of AI support for software development use cases

The assessment of potential is based on these criteria:

1. Possibility for working inside current limitations of language models (such as size of the context window and speed of responses)

2. Need for correctness and precision of output

3. The use case's overall share of value creation in software development

Autocompletion of code in the IDE is already in a state of maturity, with users reporting productivity and motivation gains from using tools like GitHub Copilot. Since Copilot will be mentioned several times in this article, it's worth noting that other coding assistants - e.g. Replit, CodeWhisperer, CodeComplete, StarCoder, Google Colab with Codey - have similar capabilities. One reason the code completion use case is so successful is that it fits well within current LLMs' limitations: generated code doesn't need to be 100% correct since a human is always in the loop for quality control, responses are small, and the context window can be limited to a few lines of surrounding code. It's not obvious how that approach will scale to higher-level use cases, such as generating entire microservices or applications.
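
To make the interaction concrete, here is a sketch of the comment-driven completion workflow these tools support. The comment and signature are what a developer would type; the body is illustrative of the kind of completion such tools produce, not actual Copilot output.

```python
from collections import Counter

# Developer types the comment and signature; the assistant fills in the body.

# Return the n most common words in a text, ignoring case.
def most_common_words(text, n):
    words = text.lower().split()
    return [word for word, _ in Counter(words).most_common(n)]
```

Because the completion is only a few lines long and the developer reviews it immediately, an occasional wrong suggestion costs little - which is exactly why this use case fits current model limitations so well.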

In general, the potential for AI support at the moment seems higher in use cases that require creativity but don't require absolute precision. Those use cases are typically found in the earlier phases of the software delivery lifecycle (ideation, requirements analysis, design). That doesn't rule out the use of LLMs for implementation and testing of code, it just means that humans (or additional language models) are needed for quality control of all results.

One big caveat in using LLMs for software development is the quality of results. It depends highly on the quality of inputs, and one reason for the attractiveness of ChatGPT and its siblings is that they can work off of very imprecise prompts and fill in any gaps with assumptions or confabulations. In addition to incomplete or incorrect functionality, the code output of LLMs has also been found to have issues with performance, security and robustness. In its current state of maturity, it's advisable to run AI-generated code through a rigorous set of automated tests.

For clarity, this article only discusses using LLMs to build and maintain traditional (non-AI) applications. There is an entire universe of other possibilities that opens when integrating LLMs with software applications, such as wrapping a prompt in an application (for example, on Steamship) or packaging software functionality so that it can be used from within an AI assistant (for example, ChatGPT plugins). Here, I'm also ignoring multi-step planning approaches (like AutoGPT, LangChain) that progressively break higher-level objectives into actionable steps including writing and executing program code (such as ChatGPT with the CodeInterpreter plugin).

Ideation and Analysis

Write business cases

The process of writing a business case (and documenting assumptions) doesn't come naturally to many creative people, and the idea is to use AI as a "living template" to help structure thoughts and provide a skeleton or checklist of areas to cover. I tend to be skeptical about this use case, though; so much of writing business cases requires very specific contextual knowledge (of a company's situation, customers, …), and mixing AI-generated confabulated nonsense with facts and well-weighed assumptions so early in the process seems like a risky proposition.

  • Potential: low
  • Value share: low
  • Role: Product Manager

Generate specifications and feature lists

I have experimented with this use case, with prompts as simple as "let's design an application for …". The language model came up with a solid feature set, including some aspects that I hadn't thought of. Don't expect much creativity; rather a compilation of where the market is at.

  • Potential: medium
  • Value share: low
  • Role: Product Manager

Write terms of service and privacy policies

This use case is squarely in the legal field; essentially, AI would be generating contracts. Everything that is being discussed in the context of language models and legal work applies. Technical people might become overly reliant on prefabricated texts and neglect checking whether they make sense, and the world is certainly not going to become less complex when AI can churn out multi-page contracts by the dozen at no cost or effort. On the other hand, license and service terms in the software industry are highly standardized, so why not base your contractual terms on what's most likely to appear in a huge number of existing terms, rather than rely on the one template your lawyer happened to have available?

  • Potential: low
  • Value share: low
  • Role: Product Manager

Generate UI prototypes

This seems very promising, especially once image-to-code translation (think "rough paper draft to clickable prototype") becomes mainstream. Being able to rapidly explore and test multiple ideas with minimal effort within a given design system seems very appealing from a UX perspective, especially when a (small) UX team is a bottleneck in software development. It could also be extremely useful in customer development, when testing solution ideas (for B2C as well as B2B).

See the "generating entire applications" section below for a list of tools and vendors.

  • Potential: high
  • Value share: low
  • Role: Product Manager, UX designer

Generate and evaluate A/B tests

A/B tests could be an interesting playground for AI, and a variation of the idea to generate UI prototypes - but in A/B tests, you would be testing your variations in production. How many versions of a design would you want to test if you are no longer constrained by the effort needed to sketch out and implement an A/B test? Depending on the number of users of a system (primarily in B2C), it's easy to imagine not just an A/B test but an A/B/C/D/E test, with multiple variations being tested in parallel. With the current state of LLM maturity, however, I'd be worried about nonsense and/or faulty implementations leaking into production if variations are not thoroughly vetted.

  • Potential: medium
  • Value share: low
  • Role: UX designer

Write user stories and acceptance criteria

User stories and acceptance criteria require conciseness and precision, so I'm dubious about using LLMs anywhere in this process. Writing such requirements seems a lot like writing prompts to LLMs; you need to know precisely what you want. I could imagine that checking acceptance criteria for completeness is a valid use case - it's not uncommon to find acceptance criteria for edge cases missing, leading to either assumptions by software developers or unnecessary back-and-forth communication.

User stories and acceptance criteria (whether human-generated or AI-generated) could be a useful intermediate step in multi-step application generation, however; having these spelled out as artifacts makes it easier to go back to an earlier step and explore alternative approaches.

  • Potential: medium
  • Value share: low
  • Role: Product Owner

Architecture and Design

Generate designs (UI)

As with generating UI prototypes, creating designs seems like a great fit for LLMs. There is generally no right or wrong solution to design problems. Instead, user interfaces need to honor conventions so that users will understand them, from button color to labels and icons. If enough visual designs are included in the training set and a style guide can somehow be included in the prompt, I can imagine LLMs very soon playing a large role here - also when translating the design directly into working code instead of CSS/HTML.

  • Potential: high
  • Value share: medium
  • Role: UX designer

Recommend and compare solutions

Selecting technologies and patterns is hard, and fraught with all sorts of interpersonal issues. When someone suggests a certain architecture, that choice can be based on personal experience, career ambitions or curiosity, and getting agreement from other people requires persuasion or institutional power. So why not bring in a neutral party to level the playing field? Based, of course, on the organization's technology guardrails (technology radar, ADRs or similar).

  • Potential: medium
  • Value share: medium
  • Role: Architect, Developer

Create architectural spikes

When implementing new projects or major new functionality, a lot of time can be spent evaluating architectural ideas, such as using certain libraries, APIs or patterns. Some teams plan "spikes" to explore an idea and to be able to estimate how much effort a full implementation will take. Obviously, some of the output from these activities will be thrown away. So why not ask AI to set up a new testbed for a certain idea, or remodel existing code based on it? I expect that the busywork involved with spikes would be drastically reduced, but the truly hard part - evaluating to what extent the spike was successful, and what implications a full implementation would have - remains hard. Not to mention that a key reason for conducting a spike is to learn a new technology and how it "feels" when applied to a problem, and learning is an activity that can't be outsourced.

  • Potential: medium
  • Value share: medium
  • Role: Architect, Developer

Create data models

Just describe the data model in natural language and have an LLM translate it into your ORM's classes or SQL. I consider this a convenience, not a game changer.
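
As a sketch of what this convenience looks like: the DDL below is the kind of output an LLM plausibly produces for the prompt "a customer has a name and email; an order belongs to a customer and has a total amount and a creation date" (the schema is an illustrative assumption, not actual model output), sanity-checked against an in-memory SQLite database.

```python
import sqlite3

# Hypothetical LLM output for the natural-language data model description above.
GENERATED_DDL = """
CREATE TABLE customer (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL UNIQUE
);
CREATE TABLE "order" (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total       REAL NOT NULL,
    created_at  TEXT NOT NULL
);
"""

# Verify the generated schema actually parses and creates the expected tables.
conn = sqlite3.connect(":memory:")
conn.executescript(GENERATED_DDL)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Running the generated DDL against a throwaway database like this is a cheap automated check before trusting the output.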

  • Potential: low
  • Value share: low
  • Role: Developer

Development

Generate entire applications

This is certainly one of the holy grails to be sought (the other one being automated software maintenance), but I don't think we're there yet. This approach will certainly work in narrowly-scoped cases (such as using a predefined application template and describing application features in a standardized way), but the solution space is vast and "lost in translation" errors tend to creep in. When trying to generate new applications in ChatGPT, I usually end up with tutorial-style code lacking error handling, and subtle inconsistencies between source files. Breaking changes in external dependencies are also an issue; apparently the body of knowledge on which LLMs were trained rarely explicitly states the version of components with which a solution was written. Any response needs to be checked and revised with the current generation of models, and unattended software production feels far away.

  • Potential: high
  • Value share: high
  • Role: Developer, UX designer, Product Manager
  • Current implementations: Generating simple (promotional) webapps and websites with chat tools is already possible. Debuild, Pico and MetaGPT (self-service) and Builder.ai (human "product experts" building apps using AI, based on fixed-price proposals) are addressing this emerging market; Builder.ai also has a prototyping tool. We can expect much of the low-code / no-code application builder space to move in this direction, with language/image-based AI driving apps that are hosted on the vendor's platform.

Generate microservices

As a more narrowly scoped version of generating entire applications, this could actually work. Again, the key is probably to constrain the solution space by using a template or elaborate context in the prompt, so that complexity stays manageable. I have found that starting with an API definition and a data model works well and saves substantial busywork.

  • Potential: high
  • Value share: high
  • Role: Developer, UX designer, Product Manager

Generate frontends

As someone who hates dealing with the complexity of web frontend technologies, design issues and device compatibility issues, generating frontends is deeply appealing to me. For this to work, it will be essential to graft a certain visual style and defined technology stack onto the process. My experiments with ChatGPT in this area have tended to spiral out of control at some point, with ChatGPT context and actual source code drifting apart and weird errors that require real frontend expertise to fix.

  • Potential: high
  • Value share: high
  • Role: Developer, UX designer, Product Manager

Generate data migrations

Surely an edge case, but it may be interesting anyway: with entities and attributes generally being named semantically, wouldn't it be possible to map information between two data models and take care of the more common problems when migrating data (converting data types, splitting/joining strings, moving attributes to linked tables, looping over source data, …)? It sounds appealing, but I haven't tried this.
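
To illustrate the class of mapping code involved, here is a minimal sketch of a migration step an LLM might generate: the legacy model stores a combined name and a stringly-typed amount, the target model wants split names and a numeric type. All field names are hypothetical.

```python
# Hypothetical legacy fields: "full_name" (one string), "amount" (string).
# Hypothetical target fields: "first_name", "last_name", "amount" (float).

def migrate_row(legacy):
    # Split the combined name on the first space.
    first, _, last = legacy["full_name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        "amount": float(legacy["amount"]),  # string -> numeric conversion
    }

def migrate(rows):
    # Loop over source data, producing target-model rows.
    return [migrate_row(r) for r in rows]
```

Even this trivial example shows where human review is needed: names with middle parts or missing spaces would need rules the semantic field names alone don't provide.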

  • Potential: low
  • Value share: low
  • Role: Developer

Generate algorithms

This actually works well already in many cases. I have found that GitHub Copilot will automatically write control structures (looping over data, paginating API results, tree traversal, …) with minimal prompting, and the result is often 100% correct. It's surely a different story if you are designing and optimizing larger-scale data structures and algorithms to work in finite memory, benefit from GPU acceleration, etc.
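
An example of the kind of boilerplate control structure such tools reliably produce from a one-line comment (this is an illustrative sketch, not captured Copilot output): iteratively traversing a nested structure and collecting leaf values.

```python
# Collect all leaf values from a nested dict/list structure, iteratively.
def collect_leaves(node):
    leaves = []
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            stack.extend(current.values())
        elif isinstance(current, list):
            stack.extend(current)
        else:
            leaves.append(current)
    return leaves
```

Traversals, pagination loops and similar patterns appear so often in training data that completions for them are frequently correct on the first try - unlike the large-scale algorithm design mentioned above.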

  • Potential: medium
  • Value share: medium
  • Role: Developer

Generate and update API consumers / client libraries

I often see abstractions of external APIs implemented in code files; while it's not hard to write those, maybe there is value in prompting an LLM to "write a client for the given API specification to solve the following business problem"? It would certainly be useful to have authentication, encoding, pagination, error handling and other aspects already taken care of - especially when APIs change. As an example implementation, the Gorilla model (paper / repository) has been fine-tuned on API definitions in order to generate correct API calls from natural-language prompts.
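
A hedged sketch of what such a generated client wrapper might look like, for a hypothetical paginated JSON API - the endpoint shape, the `page` parameter and the `items` / `next_page` response fields are all assumptions, not a real API:

```python
class ApiClient:
    """Sketch of a generated client handling auth headers and pagination."""

    def __init__(self, fetch, token):
        # fetch(url, headers) is injected so the transport (and tests)
        # stay independent of any specific HTTP library.
        self._fetch = fetch
        self._headers = {"Authorization": f"Bearer {token}"}

    def list_all(self, path):
        """Follow pagination and return all items across pages."""
        items, page = [], 1
        while page is not None:
            data = self._fetch(f"{path}?page={page}", self._headers)
            items.extend(data["items"])
            page = data.get("next_page")  # None terminates the loop
        return items
```

The value would lie less in this boilerplate itself and more in regenerating it automatically when the upstream API specification changes.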

  • Potential: medium
  • Value share: medium
  • Role: Developer

Generate unit tests

This topic frequently comes up in discussions with software developers. Writing unit tests can be tedious, and it often happens after the code has already been written (instead of development being truly "test-driven"). Given the large number of unit tests in public repositories, it seems absolutely plausible that tools like GitHub Copilot should be able to automatically write unit tests and catch the more important edge cases, and Copilot already does this with some degree of success (see here or here). We'll also see more dedicated tools for this use case, like qqbot.dev. Automated test-writing should make code more robust and save time, at the cost of more test code to manage when maintaining / refactoring code (but we'll eventually use LLMs for that too, won't we?).
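
As an illustration of the desired outcome (not actual tool output): given a small hand-written function, a test-generation assistant would plausibly propose the happy path plus the edge cases a reviewer wants covered.

```python
# Hand-written function under test.
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Plausibly generated tests: happy path, remainder, empty input, bad argument.
def test_even_split():
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]

def test_remainder():
    assert chunk([1, 2, 3], 2) == [[1, 2], [3]]

def test_empty_input():
    assert chunk([], 3) == []

def test_invalid_size():
    try:
        chunk([1], 0)
        assert False, "expected ValueError"
    except ValueError:
        pass
```

The edge cases (empty input, invalid size) are exactly the ones developers tend to skip when writing tests after the fact.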

  • Potential: high
  • Value share: high
  • Role: Developer

Debug and improve code

ChatGPT fairly often writes faulty code, but is able to find and fix issues in that code when prompted specifically. That hints at the potential of using LLMs to do this at a larger scale (in real-life code bases), though the token limits currently make it hard to actually get this to work. In any case, LLMs will be able to spot common errors and bad practices in code fragments. When a tool like Copilot is continuously scanning newly written code for issues, that should improve not only the overall code quality, but also benefit the developer's skills in avoiding such issues going forward. The jury is still out on whether this will work for codebases with millions of lines.
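
An example of the class of issue such a review pass reliably catches - Python's mutable default argument, a classic bug that LLMs flag well because it is heavily documented in their training data (the before/after here is illustrative, not captured tool output):

```python
# Before: the default list is created once and shared across calls,
# so results leak from one call into the next.
def append_tag_buggy(tag, tags=[]):
    tags.append(tag)
    return tags

# After: the fix an assistant typically suggests - use None as sentinel.
def append_tag_fixed(tag, tags=None):
    if tags is None:
        tags = []
    tags.append(tag)
    return tags
```

Catching this class of bug at typing time, rather than at review time, is where the continuous-scanning scenario described above pays off.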

  • Potential: high
  • Value share: high
  • Role: Developer

Summarize & review conversations

Untold amounts of potentially valuable code changes sit around in branches for too long, waiting for peer reviews to happen and improvements to be made. Can we make it easier for reviewers to understand what is being changed and find weaknesses? And eventually automate the review process itself, or at least the first review before a domain expert does the final review? It seems possible to summarize PRs with their code changes and comments to make them easier for human reviewers (example), and AI could also review the code for functional issues and security risks - even though the "debug and improve code" functionality leads to a much better feedback loop when it's available all the time in the IDE instead of just happening at pull request time.

Summarizing the conversation on a ticket or issue is another, related use case (GitLab; it seems certain this will appear in Jira as well).

  • Potential: medium
  • Value share: medium
  • Role: Developer
  • Implementations: GitHub, GitLab

Generate technical documentation

So far in this article, I've been talking mostly about translating natural language to code. But the opposite direction is valuable as well, going from code to natural language. Copilot does this to some extent, by suggesting inline comments, but that capability could be applied at a larger scale to various documentation tasks, from describing system architectures to APIs. I'm not sure if I would want large amounts of documentation to be AI-generated and maintained separately from the underlying code; keeping those documents up-to-date sounds like it could be more work than it is for human-generated documents, and re-generating documentation from scratch probably also makes another review of all content necessary. Inline documentation, especially for APIs, sounds like a better idea, with AI assisting developers in writing concise and easy-to-understand documentation.

  • Potential: medium
  • Value share: medium
  • Role: Developer, Technical Writer

Explain code

While this article has mostly been about turning text into code, there is value in the reverse direction as well, namely in summarizing and explaining bits of code. The "code-to-text" path is also attractive since it doesn't require the same accuracy that the opposite direction needs, and is thus more easily achievable with the current generation of LLMs.

How much additional productivity can be gained through this capability probably depends on the density of the code and the experience of the developer using the explainer feature. The "explain" feature has also gained some popularity as a learning tool, the idea being that a language model could play the role of a senior developer in explaining solutions.

  • Potential: medium
  • Value share: low
  • Role: Developer
  • Implementations: GitLab, CodeGPT

Summarize code structure

Translating large amounts of source code to natural language sounds like a bad idea at first, but there may be real value here when maintaining (fixing, extending) foreign codebases. By summarizing classes and code files, and then summarizing those summaries again to tease out the function of modules and components, it should be easier for developers to find out where certain functionality is located and how the pieces work together. Writing the summaries in the user's native language could be a big plus.
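
The "summarize, then summarize the summaries" idea can be sketched as a simple two-level pipeline. The `summarize` function below is a stand-in for an LLM call (stubbed as truncation so the control flow is visible and runnable); everything else is hypothetical scaffolding.

```python
def summarize(text, limit=80):
    # Placeholder for an LLM call; a real implementation would prompt a
    # model rather than truncate.
    return text[:limit]

def summarize_codebase(files):
    """files: mapping of path -> source text.
    Returns (per-file summaries, one codebase-level overview)."""
    # Level 1: summarize each file individually (fits in a context window).
    per_file = {path: summarize(src) for path, src in files.items()}
    # Level 2: summarize the concatenated file summaries into an overview.
    overview = summarize("\n".join(
        f"{path}: {summary}" for path, summary in sorted(per_file.items())))
    return per_file, overview
```

The same hierarchical structure is what makes intent-driven scanning plausible: the level-1 prompt could ask specifically about, say, tax-related logic instead of a general summary.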

Generating such summaries as static, general-purpose documentation is similar to generating technical documentation, but it gets more interesting when creating such an overview map of a codebase with a specific intent: imagine having your AI assistant scan the code specifically to find areas that are related to taxes in invoice generation, and finding internal/external dependencies such as tax fields being part of data structures or APIs. I haven't seen tools that do that, though this feature would be a natural fit for GitHub Code Search, for example.

And while we're scanning the codebase with a specific intent, why not make a plan of what needs to be changed? That could help estimate the effort and impact of planned changes and pave the way towards low-effort code maintenance.

  • Potential: high
  • Value share: medium
  • Role: Developer, Technical Writer

Automate software maintenance

This is the other holy grail of AI in software development, in addition to generating applications. Maintaining legacy custom-built applications is onerous for all companies; outdated technologies, limited application lifespan and high complexity make maintenance highly unattractive as a career opportunity, but such applications are often business-critical and therefore command sizeable development budgets. Let's consider the process of adding a feature or fixing a bug in a complex legacy application: from understanding requirements and business value, considering what the application's current expected behavior is and how the feature or fix may change it, working with users to understand how changes will affect them, finding the components to be changed, estimating effort and timeline for delivery of the change, adding tests and refactoring code, coordinating end-user tests and finally deploying to production. Software maintenance is different from creating new applications in that a lot more specific context goes into it - and rarely is that context written down. That makes software maintenance a difficult use case for LLMs, but it looks like we'll eventually get there.

  • Potential: high
  • Value share: high
  • Role: Product Manager, Product Owner, Developer

Planning and Management

For anyone who deals with text-based communication, LLMs will obviously become more and more important. I will focus only on a few use cases related to software development in this category.

Generate and score coding tests

When recruiting software developers, assessing coding skills with some form of coding quiz or interview has become commonplace. Creating your own quizzes can be a lot of work, and standardized test providers can't tailor their tests to your unique technical landscape. Could we use LLMs to create good coding quizzes based on samples of real-life code or requirements, along with a set of good solutions? It certainly sounds useful, though only organizations with significant growth or employee churn will be able to really make use of it.

  • Potential: low
  • Value share: low
  • Role: Engineering Manager, Senior Developer

Estimate level of effort

Estimation is highly inaccurate in software development, but reliable estimates are crucial for prioritizing work and planning releases. Apart from collaborative estimation exercises ("planning poker"), it seems that the best way to increase the reliability of estimates is to break down a task as far as possible into smaller tasks, each of which can be estimated more reliably. But that process involves high effort. Can AI help break down complex tasks into subtasks? ChatGPT can certainly structure tasks on an abstract level, but it's unclear how enough knowledge about the context (codebase, architecture, requirements) can be injected into the model to receive specific tasks and estimates. Also, should estimates be based on a human doing the work, or on AI?

  • Potential: medium
  • Value share: medium
  • Role: Engineering Manager, Senior Developer

Summary

This was a long list, and I expect the list of possible use cases for AI in software development will only grow in the coming months. The possibilities appear nearly limitless, as they usually do at the "peak of inflated expectations" in the hype cycle. Limitations will certainly be discovered - with the quality of work, or the time and cost it takes for AI to deliver results - and some use cases will end up being discarded. Still, it's not unreasonable to expect that the way we make software is about to change fundamentally. These are interesting times.