Brief summary
Volume 30 of the Thoughtworks Technology Radar was published in April 2024. Alongside 105 blips, the edition also featured four themes selected by the team of technologists that puts the Radar together. They were: open-ish source licenses, AI-assisted software development teams, emerging architecture patterns for LLMs and dragging pull requests closer to continuous integration. Each one cuts across the technologies and techniques included on the Radar and highlights a key issue or challenge for software developers — and other technologists — working today.
In this episode of the Technology Podcast, Birgitta Böckeler and Erik Dörnenberg join Neal Ford and Ken Mugrage to discuss the themes for Technology Radar Vol.30. They explain what they mean, why they were picked and what their implications are for the wider industry.
Explore volume 30 of the Technology Radar.
Episode transcript
Ken Mugrage: Hello, everybody. Welcome to the Thoughtworks Technology Podcast. My name is Ken Mugrage. I am one of your regular hosts. Today, we're going to be talking about Radar Themes from Volume 30. I’m joined today by my co-host, Neal. Why don't you introduce yourself?
Neal Ford: Hello, everyone. My name is Neal Ford, another one of your regular hosts, and I'm half in the host sofa and half on the guest sofa today, since we're talking about Radar themes.
Ken: Erik, can you introduce yourself, please?
Erik Dörnenberg: Hi, my name is Erik. I've been a long-time Technology Radar contributor at Thoughtworks. I'm now the CTO in Europe.
Ken: Thanks. Birgitta?
Birgitta Böckeler: Yes. Hi, I'm Birgitta Böckeler. I'm also based in Germany. I'm a Technical Principal at Thoughtworks and, same as Neal and Erik, I'm also in the group that curates the Thoughtworks Technology Radar.
Ken: Great. Thanks, everybody, for joining us. Neal, if you wouldn't mind, what's a Radar theme? Can you give us an overview of what they are?
Neal: Indeed, I can. The Thoughtworks Technology Radar, which we produce twice a year, is curated from projects. They feed information into this group of people that meet face-to-face twice a year, and we assess all of those, what we call blips, and eventually build a Radar of about 100 of those things. During that week, though, we pay attention to what topics seem to keep coming up over and over again, and we call those things themes. At the end of the week, we as a group get together and decide what the themes are for this particular Radar. There are between three and five themes.
These are things that cut across all of the individual blips in our Radar, and it's the only thing that's actually created organically by the group of people because we're mostly a curation exercise from blips that have been nominated from projects. So every Radar has a few themes, and today, we thought we would talk a bit about the themes on Volume 30, which is just out: the four themes, where they came from, a little insight into why they are themes, and a few of the blips that supported them, et cetera. That's the overall mandate of our podcast today.
Ken: Great. Thank you. The first theme: Emerging architecture patterns for LLMs. A pretty easy thing, right!? Birgitta, do you want to start? What is it? How did you guys end up talking about this?
Birgitta: Yes. Maybe not surprisingly, two of the four themes are about AI, and in particular about generative AI, right? I think I did a calculation at the end. I think it was 32% of our 100-something blips that are closely related to generative AI, let's say. This first theme, Emerging Architecture Patterns for LLMs, LLMs standing for large language models, is about architecture patterns when you're building applications that are backed by large language models, where you're taking advantage of large language models in some form.
The most common example is maybe some kind of chatbot, let's say a customer service chatbot, or an e-commerce store search chatbot, or something like that. I think we should start with maybe the most common pattern that we're mentioning, already mentioned in the last volume of the Radar, which is retrieval-augmented generation, or RAG for short. This is the thing that I just recently told a group of our developers: if there's one thing technically that, let's say, a run-of-the-mill application developer, a full stack developer, as we sometimes say, should look at and understand, it's RAG, it's retrieval-augmented generation.
It's basically this idea of infusing your prompts with your local context, local knowledge: your organizational knowledge, your domain knowledge, information from your shop, or whatever other data you have. Instead of going and fine-tuning a foundation model with all of your information, you're actually not doing that.
You're just using a foundation model out of the box and you're orchestrating your prompts with your additional data. It's basically like a search problem. The user asks a question or gives some instructions. You use that user input to search your data for relevant information and then infuse that into the prompts for the large language model.
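The retrieve-then-prompt flow described here can be sketched in a few lines. This is a minimal illustration, not a production design: the `DOCUMENTS` list, the word-overlap ranking in `retrieve` and the prompt wording in `build_prompt` are all hypothetical stand-ins, and a real system would use a search index or embedding model and then send the prompt to a model.

```python
import re
from typing import List, Set

# Hypothetical local knowledge; a real system would query a search index.
DOCUMENTS = [
    "Orders ship within two business days from our Hamburg warehouse.",
    "Returns are accepted within 30 days with the original receipt.",
    "Gift cards never expire and can be used on any product.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, documents: List[str], top_k: int = 1) -> str:
    """Infuse the retrieved context into the prompt for the language model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


prompt = build_prompt("How many days do I have for returns?", DOCUMENTS)
```

The foundation model itself is untouched; only the prompt changes, which is what distinguishes this approach from fine-tuning.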
Ken: Then you actually had the blip for fine-tuning. Rush to fine-tuning was on hold, I believe. Can you go a little more into that? Because like you just mentioned, people are often choosing between fine-tuning and RAG.
Birgitta: Yes. We called the blip, like you said, rushing to fine-tuning or rush to fine-tuning, and we put it on hold, which is a ring for saying proceed with caution. We're not saying fine-tuning is not a good idea. It always depends on your use case. The reason why we wanted to call this out is that we get this question a lot. It's often one of the first questions clients ask us about generative AI, especially when they haven't looked into it so much yet: "Oh, how can I fine-tune a model with my particular code base or with my particular data?" The answer is, not always but very often, that fine-tuning is not the best solution for that, because fine-tuning actually often doesn't help with teaching a large language model new facts. It's more about teaching it new patterns or teaching it new tones. But if you actually want to infuse facts, then retrieval-augmented generation is usually a much better approach and also a lot cheaper.
Neal: Then a little bit of insider knowledge here. Birgitta and I were at a face-to-face and she actually showed me some code, so sorry for the audience that can't see that. We were talking about Langfuse and other frameworks or tools or libraries out there, and you were showing the advantage of going direct. Can you talk a little bit about that? How are you testing your prompts? How are you monitoring what's going on? What are the technologies there or the blips there?
Birgitta: Yes, that's also part of this theme of emerging architecture patterns: while it's very easy to create a little demo with a large language model, when you actually want to productionize it, it comes with a whole tail of all of the things that we always have to do in software delivery. One of those is observability and testing. How are you monitoring your costs? How are you monitoring performance? How are you doing the trade-off between latency and always giving fresh answers? All of those things.
Langfuse is one of the things that we put on the Radar as a tool to help you put some observability on your prompts. Then when it comes to testing, that's a whole other area because large language models don't give you deterministic answers. You also cannot say in advance, this is the expectation I have of what I want back, because it's very fuzzy. That's a very interesting space but also very challenging. We discussed that a lot, and one blip maybe to call out in that area is not so much related to testing, but to guardrails.
We blipped a tool called NeMo Guardrails that is from NVIDIA, I think, that helps you put guardrails around both the prompts that your users are sending to a large language model, to filter out the things that you don't want users to ask in your particular application context, and the responses from the large language model, to filter out certain things that you don't want to show back to your users, either for ethical reasons or security reasons, or because that's not relevant in your application.
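The two-sided filtering idea can be illustrated with a small sketch. This is not the NeMo Guardrails API; the `BLOCKED_TOPICS` list, the email pattern, and the `check_input`, `filter_output` and `guarded_call` helpers are hypothetical names chosen for illustration, and `fake_model` stands in for a real model call.

```python
import re
from typing import Callable

# Hypothetical rules for one application context.
BLOCKED_TOPICS = ("password", "credit card")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_input(user_message: str) -> bool:
    """Input rail: reject prompts that touch topics we do not want asked."""
    lowered = user_message.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)


def filter_output(model_response: str) -> str:
    """Output rail: redact email addresses before showing the response."""
    return EMAIL_PATTERN.sub("[redacted]", model_response)


def guarded_call(user_message: str, model: Callable[[str], str]) -> str:
    """Wrap a model call with an input check and an output filter."""
    if not check_input(user_message):
        return "Sorry, I can't help with that request."
    return filter_output(model(user_message))


def fake_model(message: str) -> str:
    """Stand-in for a real large language model call."""
    return "Please contact jane.doe@example.com for details."
```

Keeping both rails outside the model call means they can be tested deterministically, even though the model's own answers cannot.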
Neal: Things like PII, things you shouldn't be able to put together. Another fascinating pattern that came out of our discussion, which came out of South America, I think, is along the same lines: using an expensive LLM to validate the output of a less expensive LLM. I have lots of access to a cheap LLM, but I don't want to spend a lot of money on the more expensive one, so I can selectively test the output to see how good the quality is. I thought that was a fascinating pattern. Toward this question of, okay, it's magic, but is the magic actually working? Erik, what about figuring out if it actually works or not? We talked some about that as well, didn't we?
Erik: Yes, along the lines, Neal, of what you said, and what Birgitta also talked about, the non-determinism and so on, what we've seen in practical applications of LLMs is really, does it work? Does it do what we, as the people who commissioned and wrote the software, actually wanted it to do? Especially if you think about chatbots, are they going around in loops? Are they having an entertaining conversation with the person on the other end, with the human, or is it actually helping?
Birgitta: Or put another way, is it getting the user to do what we want the user to do, right?
Erik: That is the other way to say it. From a company perspective, for example, if you're in e-commerce, is the bot actually helping the customer, or is it making the customer find products that suit and put them in the shopping basket, for example? Those are exactly the areas of application where we've seen these patterns, where we're now analyzing on a higher level what the state is, almost in the sense of a state machine, definitely in the form of a graph.
What is the conversation, which path is the conversation taking, and is it progressing towards the goal? Similar to what Neal said, you can again use an LLM, not to answer, but to assess what state we are in in the conversation, just to say, I am in this state, to find out: is it going in the right direction?
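A rough sketch of tracking a conversation as a path through states might look like the following. The state names in `STATE_ORDER`, the `keyword_classifier` function (a stand-in for the LLM that labels each message) and the three-message stall rule are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical goal-directed states for a shop chatbot, in order of progress.
STATE_ORDER: Dict[str, int] = {
    "greeting": 0,
    "exploring_products": 1,
    "product_selected": 2,
    "added_to_basket": 3,
}
GOAL_STATE = "added_to_basket"


def keyword_classifier(message: str) -> str:
    """Stand-in for an LLM call that labels the state of the conversation."""
    text = message.lower()
    if "basket" in text:
        return "added_to_basket"
    if "i'll take" in text:
        return "product_selected"
    if "looking for" in text:
        return "exploring_products"
    return "greeting"


def track_progress(messages: List[str], classify: Callable[[str], str]) -> dict:
    """Record the path through the states and flag conversations going in loops."""
    path = [classify(message) for message in messages]
    furthest = max((STATE_ORDER[state] for state in path), default=0)
    # Assumed rule: three identical non-goal states in a row count as a loop.
    stalled = (
        len(path) >= 3
        and len(set(path[-3:])) == 1
        and path[-1] != GOAL_STATE
    )
    return {"path": path, "furthest": furthest, "stalled": stalled}
```

The classifier only labels the state; it never answers the user, which keeps the assessment separate from the conversation itself.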
Neal: It's a great example of a pattern that's emerging in the LLM space. The last one of our patterns that we talked about was putting them on the edge, in small devices.
Erik: Yes, that's yet another one. We didn't put on the Radar the fact that these AI PC certifications are coming out now, but Intel, Microsoft, and I guess a few others have now started saying there's a certain class of PCs, mostly laptops at first, that would be considered AI PCs. The key thing that differentiates them from others is the fact that they have hardware built into the laptop that actually allows them to run AI models. We have seen this for quite a while in Apple hardware, both on the iPhone and in the MacBooks: these GPUs, Tensor processors or NPUs, I think as they are called on the iPhone, things that can run the model.
I guess we speculated about the reasons behind it. There's a lot of idle speculation, but it is clear really that even the companies that are running the large language models are now seeing benefits of actually not only running them in the cloud, but also on the end-user devices. For me as a European, I like the idea of latency reduction and especially of more privacy when it runs on the device, but there can also be very clear positive commercial implications for the companies that are otherwise operating the models in the cloud. I've recently seen mentions of something called SLM, small language models.
We didn't put that on the Radar, but we put a couple of technologies on there. One is Ollama, the tool that allows you to run these large language models or medium-sized language models, any language model that is small enough to fit on your local hardware, to run on your local hardware. We can also see this when we look at how Google is positioning their language models. When they rebranded them to Gemini, they basically-- not basically, they created three different versions of the model. One of them, Gemini Nano, is specifically designed to run on mobile phones.
They still do require a fair amount of resources. This is why it's only running on specific Pixel phones with large RAM. As far as I remember, you lose four gigabytes of RAM immediately for that large language model or that language model to be running on your phone. It is definitely a trend we are seeing for a number of different reasons that all point in the direction that these models shouldn't only be deployed in the cloud.
Neal: I was going to say we're going to start paying a memory tax for large language models on devices. Now, suddenly, resources are going to plummet.
Erik: Memory is not that expensive. If you normally have a phone with 8 gigabytes of RAM, maybe the default will then be 12 gigabytes. I don't think this will really hinder adoption so much. As I said, there are enough reasons for pursuing this avenue, at least for a while, to see how it goes.
Ken: You're talking about LLMs on the edge, but also we talked a little bit about RAG and fine-tuning and so forth. Does that present any limitations if I'm running the model on the edge? Can I still do the same type of retrieval filtering, if you will?
Erik: I would say we haven't seen that yet. What we've seen so far is that you take a pre-trained large language model or language model and simply run that on the edge. As a technologist, I would say there's nothing stopping us from running something like RAG on the device with the right pieces of software available. A lot of the RAG currently is going into larger and larger scales. We talked about the peak of vector databases as well.
We've listed a lot of vector databases because they're often used in a RAG architecture. We've talked about that on the Radar, that they're now looking at other search and retrieval mechanisms. These are not normally built to run on something like a laptop or on a mobile phone, but from an architectural perspective, there's nothing that would stop you from implementing that pattern on a lower-powered device, such as a laptop or a mobile phone.
Birgitta: A simple similarity search with a vector database, which is currently a very common pattern for how people get started with RAG, is totally possible to run in memory with not a lot of power behind it. That's absolutely possible, also with reasonable results. There are definitely ways to make it a lot better if you put proper search technology behind it. We also put Elasticsearch Relevance Engine on the Radar to draw attention to the fact that this is a search problem. This is not just about vector databases; there are all kinds of other search approaches that we can use. But it is possible to do this in a very simple way in memory with useful results.
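To make the in-memory point concrete, here is a minimal sketch of similarity search using only the Python standard library. The `embed` function is a toy word-count embedding assumed for illustration (a real system would call an embedding model), and `InMemoryIndex` is a hypothetical class, not a real vector database client.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Keeps every vector in a plain list and scans all of them per query."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, top_k: int = 2) -> List[Tuple[float, str]]:
        query_vector = embed(query)
        scored = [(cosine(query_vector, vector), text) for text, vector in self._items]
        return sorted(scored, reverse=True)[:top_k]


index = InMemoryIndex()
for description in (
    "red running shoes for trail",
    "blue winter jacket",
    "waterproof trail running jacket",
):
    index.add(description)

results = index.search("trail running shoes")
```

A linear scan like this is fine for a few thousand items on a laptop or phone; dedicated search technology becomes worthwhile as the data and the quality requirements grow.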
Neal: Great. Thanks. Moving on to the next theme, I know we could talk about LLMs for quite a bit longer. Something that's near and dear to our hearts, of course, is software development. One of the other things was AI-assisted software development teams. I don't know who wants to go, but can we talk a little bit about that one?
Birgitta: Yes, I think you'll have to listen to me a little bit more because my role in Thoughtworks right now is actually what we call the global lead for AI-assisted software delivery. I've been immersed in this space for the past eight or nine months. I'll start with coding assistants maybe, because that's definitely something that we talked a lot about and we have quite a few blips related to that on this volume of the Radar. GitHub Copilot is the first one to talk about. I think it's the third time that we have it on a Radar volume. We keep updating the text because there's news and there are new features.
It's currently definitely the most popular product in this space and many of our clients are using it. This time in the update of the blip, we particularly talked about new features: you can now use GitHub Copilot chat to also ask questions about your code base, which wasn't the case six months ago when we released the last Radar. Also things like new interface changes that further improve the developer experience. We also put on a bunch of other coding assistant-related tools, for example, an open source tool called aider that is quite popular among Thoughtworkers for little experiments.
You can currently only connect it to OpenAI or Azure OpenAI services with your own key. Then you run it like in your terminal and it's the only tool I've seen so far, there's lots out there, but the only tool I've seen so far that changes multiple files at a time for you, which is also something that GitHub Copilot doesn't do at the moment. I can, for example, tell it to, I don't know, add identifiers to my React component and then immediately use those identifiers in my test that I have open at the same time. That's maybe something that stands out with that one. We also did Codium AI, which is specializing in test generation.
We talk a bit about text-to-SQL with large language models, which is a big area of activity at the moment. We're mentioning two tools in that blip that help you do that: turn natural language into SQL queries. Then finally, we also blipped an open source IDE extension called Continue, or Continue dev. Some of these names are quite hard to Google, but it's called Continue. The thing that stands out about that one is that you can plug it into a bunch of different model services. You have this IDE extension that gives you inline auto-complete like GitHub Copilot would as well.
It also has a chat functionality, explains code for you, and generates tests for you, but you can plug it into your own hosted model or other model services than the common ones. Most interestingly, you can also use locally running models, for example, running in a tool called Ollama that we also blipped. You can run a model locally and plug that in so that your code snippets actually don't leave your machine.
Different models also sometimes require different types of prompts. It remains to be seen how this tool fares when you just switch from one model to the next. It sounds so easy, but it might not always work quality-wise. It's a first start, though, for also trying this locally on your machine. I've done it for a few things and it wasn't so bad.
Neal: I'm not sure if I'm the only one that got a little bit of a shiver down my spine when you talked about it changing multiple files at once. We've talked a lot over the last year and more about knowing what good looks like when you're dealing with generative AI in general. What are the consequences to quality of all this?
Erik: Yes, the consequences to quality do come up in almost every discussion around those tools and have since we first covered these tools on the Tech Radar three volumes ago. I've personally been quite interested in analyzing the quality of software for over a decade now. It is still a bit unclear, I have to say. One thing that we do understand, though, is that it does amplify what is in the code base. If you have a code base that is, on the whole, architected really well, then it can probably help new developers and also existing developers continue the code base in that form and keep it well factored.
Whereas on the other hand, if you have a code base that is maybe not that great, it might end in a downward spiral and might make the code base worse and worse. I don't think anybody really has done a lot of analysis. There are some studies, and maybe Birgitta can correct me on this one, that come to some preliminary results, but I wouldn't say it is conclusive yet.
It is definitely a topic that we need to keep an eye on. One thing we'll talk about in another theme later is that it will also stress other parts of software delivery. For example, if you are creating code more quickly and if it's easier to generate more code and you do code reviews, then you're stressing the review system more and you might have unintended side effects there that are not directly related to quality, but indirectly related to quality.
Neal: I think one of the seemingly inevitable consequences of this is that code bases are going to get bigger, to Erik's point. Pull requests are going to get bigger. One of the things I'm telling architects and tech leads to brace for is a tidal wave of functioning but terrible code generated by LLMs. It'll work, but it won't have good abstractions. One of the lengthy discussions we got into in our meeting was the observation by one of our colleagues from China that if you are generating entire applications using LLMs, it's actually easier to generate big balls of mud.
They don't understand abstractions like model view controller and design patterns as well as just big giant masses of code. Then the problem you have there is what happens when it gets so big that even the AI can't understand it? There's no way humans can understand it. There's definitely a push and pull here. It may be that the developers' roles in the near term shift to refactoring generated code even more than writing code, because so much is going to be generated by these systems and tools like this.
Erik: To relay the discussion a little bit that we were having at the Doppler meeting, there is also the counter-argument. If you have a code base that has a very clear design, then writing the code will follow that design automatically rather than accidentally straying away from it. Also, again, and I can at least personally attest to it, we sometimes talk about the red-green-refactor cycle.
There can be another step in that cycle: ask Copilot or another tool like that, can you improve that code? Again, from personal experience, I can say oftentimes there are suggestions. I would not take the code as it is presented to me, but it does give me ideas of how to further improve the code. Again, as we discussed at the meeting and are discussing now, it can go both ways really.
Birgitta: Then definitely as to quality. One of the most frequently asked questions right now is like, how do I measure the benefits of this? I think that's actually almost like a no-brainer. These tools, they cost a fraction of what your team costs you. The feedback from developers right now is overwhelmingly, this is definitely helping me. In terms of like, it does provide value already today. I think it's much more important to monitor the risks to make sure that it doesn't become a net negative. Things like, is your code base growing faster?
Then the other thing that's happening is that the large language models will never be perfect at generating the perfect code, because they also see a lot of bad code out on the internet in their training data. There is a lot of potential in combining them with other technologies that do understand code, can analyze code and know what good code looks like, and that can then take the suggestions of the large language model and try to warn you about things or make them better.
Just today I was actually asking Copilot to generate some code for me. It was giving me a little warning at the bottom about the thing that it had just suggested to me and telling me about a security vulnerability in there. That's an example of like putting in all of these little safety nets to make it safer for the developer and further reduce the review load of the developer for these suggestions.
Erik: Yes, it's also about discoverability. The APIs, the SDKs are getting larger and larger. I can certainly remember several cases where I thought this is working, I can move on. Then I thought, okay, let's see and ask, can I make it simpler? It turned out there was a better API in a newer version of the API that I wasn't even aware of that then simplified the code that I wrote. One thing maybe because we have Neal here on the call, one thing that I think is going to help us in that regard are fitness functions.
Neal: The category of fitness functions we talk about in our book are these internal code quality fitness functions, even simple things like cyclomatic complexity. I think these are going to start becoming very handy to assess generated code to see, is it of good quality? If you're going to throw it away in six months, who cares? If you're building a foundation for something that needs to be around for a long time, I think some of these automated fitness functions are going to be a great way to help assess the tidal wave of code that's coming at you.
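As one possible shape for such a fitness function, the sketch below approximates cyclomatic complexity for Python functions with the standard `ast` module and reports functions above a threshold. The counting rules in `cyclomatic_complexity`, the `complexity_violations` helper and the sample threshold are simplifying assumptions, not a definition taken from the book or from any particular tool.

```python
import ast
from typing import Dict

# Node types counted as one extra decision point each (a simplification).
BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def complexity_violations(source: str, threshold: int = 10) -> Dict[str, int]:
    """Fitness function: map each function above the threshold to its score."""
    violations: Dict[str, int] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = cyclomatic_complexity(node)
            if score > threshold:
                violations[node.name] = score
    return violations


SAMPLE = '''
def simple(x):
    return x + 1

def tangled(x, y):
    if x > 0 and y > 0:
        for i in range(x):
            if i % 2 == 0:
                y += i
    elif x < 0:
        while y > 0:
            y -= 1
    return y
'''

report = complexity_violations(SAMPLE, threshold=5)
```

Run as part of a build, a check like this can fail the pipeline when generated code crosses the agreed threshold, whoever or whatever wrote it.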
Ken: We've talked a lot about code, but as we talked, we mentioned a lot that code's only one thing that a team does, right? What else beyond coding, I guess, Birgitta?
Birgitta: Yes. We put a blip on the Radar in the techniques quadrant that we called LLM augmented team because we are currently looking into this with a few clients beyond coding, what other software delivery tasks can large language models in particular help with? What we're basically looking into is like a team assistant that can make the team as a whole better, not just like an individual coder or like an individual on the team. The key here is certainly like what everybody's talking about in this space, which is like knowledge.
How can you use large language models to amplify knowledge on the team and like as a new mechanism to share knowledge on the team? What we're doing there is basically using prepared prompts to codify good practices that we want, it's yet another version of making it easy for people to do the right thing. Let's say they have some-- a team hasn't done threat modeling before. It's scary.
They procrastinate. If there's a tool that helps you actually start applying it to your situation, as opposed to an article with a checklist and theoretical examples of how you should do it, then you can make it easier for them to do this practice in a good way, to learn it, and to also share experiences from other people in the organization who are not always available to you to talk to. You can have them infuse their knowledge into a team assistant that is backed by a large language model to make it more accessible to people.
Ken: Great. Shifting gears a little bit, just to prove that you all didn't only talk about AI in this Radar. Something that's near and dear to our hearts at Thoughtworks, as those that have followed Thoughtworks for a while know, is engineering practices. One of our favorites is continuous integration. There are lots of ways of doing code. I'm pretty active in an organization around DevOps stuff. People are like, "Oh, yes, I'm doing continuous integration. I just have to do this before I push or whatever." It's like, that's not CI, please. We can be dogmatic about it. One of the themes was dragging PRs closer to proper CI. A little bit loaded because we are still saying proper CI in there.
Neal: Dragging.
Ken: Yes, and dragging. Yes. We might have a little bias in the title, but hey, we're nothing if not biased and frankly, we're proud of it most of the time.
Birgitta: Neal comes up with these titles, by the way. Didn't you say, Neal, that you were surprised that this one made it through?
Neal: I was actually surprised because this one is so opinionated. Normally, Rebecca tones down some of the opinion in some of these, but I think this one either got under her radar or she agreed with it. There's a definite bias here towards the idea of continuous integration. In fact, we bring up in the blip that our chief scientist, Martin Fowler, just updated his definition on his blog wiki of continuous integration to incorporate some of the modern practices in it. This is one of those topics where a lot of these pull request tools come up a lot on our Radar.
There's a little bit of a tendency for us to swat them away because we are biased toward the super fast feedback loops that you get with continuous integration. There are many legitimate situations where teams are doing pull requests, either because of remote work, like an open source project, or even within organizations that have bought into the idea of pull requests institutionally. We're actually impressed with some of the tools that are starting to come around about this. That's really what this thing was about: an improving ecosystem if you are, in fact, forced to do pull requests instead of what we call proper CI.
Erik: I think it was also going through as a theme because the group that writes the Technology Radar feels quite strongly about the topic. We have over the years come back to that theme many times, even without the context of pull requests, even earlier than that, because as you already said, there is continuous integration and continuous integration. In one of the blips we called it out as CI theater, this idea that you're doing continuous integration if you have a CI server running, especially if you have it running on a branch. This even led to Martin coming up with this notion of a continuous integration certification.
This is a bit of a tongue-in-cheek article because of course there is no certification. Our industry loves certification. He came up with this mock certification. The essence of it though is still very valuable. That is what we mean here by proper CI, this idea. There's only three criteria. Does every developer commit at least daily to a shared mainline? Does every commit trigger an automated build and test? If it fails, how long does it take? Does it take less than 10 minutes for somebody at least to take ownership and fix this?
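The three criteria as Erik states them here could be checked against team data roughly as follows. This is only a sketch: the `Commit` record, the function names, the `"main"` branch name and the shape of the input data are hypothetical, and the 10-minute limit is taken from the wording above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass
class Commit:
    """Hypothetical record of one commit as a build server might expose it."""
    author: str
    day: date
    branch: str
    triggered_build: bool


def commits_daily_to_mainline(
    commits: List[Commit],
    developers: Set[str],
    working_days: Iterable[date],
    mainline: str = "main",
) -> bool:
    """Criterion 1: every developer commits to the shared mainline every day."""
    seen = {(c.author, c.day) for c in commits if c.branch == mainline}
    days = list(working_days)
    return all((dev, day) in seen for dev in developers for day in days)


def every_commit_builds(commits: List[Commit]) -> bool:
    """Criterion 2: every commit triggers an automated build and test."""
    return all(c.triggered_build for c in commits)


def fixed_fast_enough(minutes_to_fix: List[float], limit: float = 10.0) -> bool:
    """Criterion 3: each broken build is owned and fixed within the limit."""
    return all(minutes < limit for minutes in minutes_to_fix)
```

Commits that only land on a feature branch do not count towards the first criterion, which is the distinction the theme title is pointing at.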
That is what we meant, but the frustration, what we've witnessed, is that there's so much semantic diffusion that CI begins to mean something else. That's I guess where we felt compelled to stick the proper in there. On the other hand, there's also this idea of pull requests. Here I can speak from personal experience. I've been maintaining a reasonably successful open-source project over the years, and I love pull requests. This open source project is a test framework, a mock objects framework for Objective-C.
It's widely used in the industry. This would not have been possible without pull requests. Here, I'm in a very different situation. I've done a lot of, and I mostly do, enterprise software development in teams that are closely knit, that trust each other, that spend 40 hours a week building that software. The open source side was very different. I got contributions from people who were far away, who I couldn't necessarily trust. I had no reason to trust them. They were suggesting interesting changes, but not following, never mind the formatting style, not even providing unit tests or anything.
That was a very different way. In that context, and I believe, I don't know this for sure, but I believe that pull requests were invented for that reason, for these open source projects where somebody could say, "I've taken your open source software, and here's a bit of change. Please review it. If it takes a day or a week, that's okay." There's a very different relationship. What we are now seeing is that we put pull requests as a synonym for peer review.
This also has morphed in the same way as CI had morphed away from its original meaning to something else. Pull requests are now used as peer reviews. We were like, no, we like the original idea of pull requests. We like the original idea of CI. It's just what is happening with them at the moment that is wrong. That, I guess, got the more opinionated title through.
Birgitta: Yes. It was like a past blip. I think that there are multiple ways to do peer review, and it's not only pull requests. When I was preparing for this discussion, I was reminded of a quote by Kent Beck that's also in one of Martin Fowler's articles on continuous integration and branching. I don't remember which one it was, but it really stuck with me because Kent Beck in that quote is saying that, with branches and pull requests, we give individuals the illusion of frozen time and the illusion that they're not working with other people who are working on code at the same time.
It optimizes the experience for the individual, but that's an illusion. Eventually, we have to pay the price for it. He's talking about different patterns. I guess our blips are also different versions of what Kent Beck calls alternatives for paying the piper. Eventually, we have to pay for the complexity. It's inherent complexity that we have multiple developers working on the same code base. That's how I think about some of the blips that we discussed: as alternatives for paying the piper, for dealing with the complexity of code reviews, of pull requests, and dealing with the bottlenecks that they often represent.
Erik: Maybe there's an inherent bias because we have had so many good experiences with continuous integration, in the sense of continually checking into the same main line, that we are seeing this trajectory: for many of those tools that we discussed, and the ones that we actually put on the Radar, for example, GitHub merge queue, the trajectory that they're ultimately aiming for is continuous integration. Again, they're just working around. They start from this vantage point of pull requests that were made for open-source software and for this very different setup.
They're trying. It looks like the intention is right. It is just continuously battling with the wrong tool to get it to where you could be. I remember in one of the discussions in the Radar meeting, the topic of stacked diffs came up. There was an article in The Pragmatic Engineer, I think it was, that talked about stacked diffs. You have to remember, many of us are jet lagged when we have these meetings. I remember listening to this. At some point, I said, "Why don't they just do CI?" They're describing what they should be doing, in our opinion; they're so close, but just not there. I guess that emotionality is what is in that theme.
Birgitta: Maybe it should be clear, with the tools that we discussed, we're not saying, oh, everybody should do trunk-based development all of the time. We know that pull requests are reality. There are also some environments where, for whatever reasons, you cannot get rid of them. There are tools like the one we blipped, GitHub merge queue; we discussed stacked diffs; we discussed a tool called gitStream by LinearB. I think it's good that there are tools that are trying to alleviate those pains, but sometimes we get a little frustrated because we think all of this complexity could be solved in a different way as well.
Neal: I was describing the technical term of yak shaving to a non-technologist the other day, the problem that degenerates into another problem and another one, and you get four or five yaks down and you realize there's no way the effort I'm putting into this is worth the original problem I'm solving. These tools to me feel like sharper razors for yak shaving because they're forcing you to do a bunch of ceremonial stuff that you shouldn't be having to do anyway, but since you're having to do it, then we'll give you sharper razors or better shears for shaving yaks. You'd be better off just avoiding the yaks entirely, but if you have to deal with them, then we'll make it more efficient to deal with them.
Ken: I'm curious, and I think this is probably for Birgitta: you mentioned PRs with AI-assisted software earlier and how they might be getting larger and they might have changed a file that we didn't anticipate or what have you. Does that change our take on this at all? Do we want PRs in those cases where we have these larger changes?
Birgitta: It's definitely interesting. I've definitely seen a lot of crossing paths of pull requests in this space of AI-assisted software delivery in the past few months. One of them, Erik already alluded to that, our review processes and pull request review processes in most organizations are already a bottleneck. If you take a coding assistant and increase the throughput of how much code developers are producing, then you potentially put even more pressure on this pull request process. Which puts more pressure on you to look at tools that help you triage your pull requests or like somehow improve that process.
One of the things that we like to reduce pressure on that process is pair programming because it has the review just in time. It's like yet another reason for organizations to look at their pull request process and their continuous integration to alleviate that pressure. Then the second area where these two topics cross paths is indeed that there are some first observations that pull request sizes are getting bigger when you're using a coding assistant because it's easier to create new code with a coding assistant than to change existing code. Also, you're faster at producing this.
Somehow pull requests are getting bigger, which means that the amount of code your reviewer has to review is getting bigger. There's a higher chance that they're getting sloppy because it's just too much. Also, of course, it means your change set size when you deploy to production gets bigger, which increases the risk that something goes wrong or that when the deployment fails, it takes you longer to figure out what went wrong because the change was bigger.
Erik: Adding to what Birgitta said, there are a couple of effects, I think, that will also get amplified in this scenario. A common problem that we see in scenarios where there are peer reviews, especially when teams are under pressure, is that the person writing the code, because they are under pressure to write something quickly, will think, "Oh, it doesn't matter if it's not perfectly right or I'm not 100% sure, but it's okay. Somebody will review it," so they can reduce the time they spend and externalize that cost, if you will, the time cost, to the reviewer.
Then on the other hand, you have the reviewers who are also there, they're also developers in the same team. They're also under pressure and they say, oh, I know the person who wrote this code. They generally write good code. I don't have to review it so much to reduce the review time. Here, the danger is quite obvious, I think. If you're using a tool, a Gen AI tool to write the code, then you might be even more tempted to think, okay, what the AI wrote, you have a quick look at it and think that looks okay, but it doesn't matter.
You feel you're protected because somebody else will review it, which will further increase the burden on the reviewer. Then, even worse, there's another effect. A question we are hearing almost everywhere is, "Can you prove that AI-assisted coding makes the teams more productive?" People will know that they're being measured and their productivity is being monitored. If you're introducing one of those tools, you already, as a developer, feel under pressure to somehow make it go faster, which will further increase those effects. We definitely have no conclusive evidence of this, and maybe there will never be evidence, but it's definitely something that I'm expecting to see in a number of different scenarios.
Ken: I'm going to throw out a little bit of a controversial question because I heard a couple of words used there in both of your answers. There's a bit of a callback to the AI-assisted software development team. What is the relationship between a developer and an AI tool? Are they pairing with that tool?
Birgitta: I would say, colloquially, yes, they're absolutely pairing with that tool, but I wouldn't say that is what pair programming is as a practice. Pair programming, at least in how I would define it, and I wrote a long article about it on Martin Fowler's website, is a practice to make the team overall better, to improve knowledge exchange on the team, to improve collaboration, to improve situational awareness on the team, all of those things.
You cannot do that by talking to an AI because then when you're out tomorrow, your colleague still doesn't know what's going on. You might not have heard during a pairing session yesterday about this domain context that your teammate actually heard about two days before that. It's all of this tacit knowledge that goes around that the AI doesn't have.
Ken: Yes, thank you. I knew that would be your answer. I cheated. I read your article, but I wanted to get that out there. Moving on, it's funny, as we record this, I'm actually at the Open Source Summit from the Linux Foundation, and this is a topic that's come up a lot, open-ish source licenses. I guess, Neal, do you want to give us an intro to this one a little bit?
Neal: Yes, this is probably the one where we have the least valuable insight or call to action. It's more just an observation that we made. As I said, these themes come from a preponderance of things that come up on blips. We noticed two interesting things. The license space in the software world has been really stable for a long time. There are a number of well-known open-source licenses with certain rights and restrictions, et cetera, and it's a well-known, settled space. We've noticed two things happening in that world recently.
One of them is this idea of tools that become really popular and then suddenly shift to a commercial license model and strand a lot of ecosystem support, sometimes relatively abruptly, given the amount of reach they have. The other was licenses, particularly for AI tools, that promise you some things, but then there are hidden license conditions, like you need to add your own ChatGPT key in here to make it work. There's a little bit of sneakiness around licenses and capabilities. We're used to this idea of licenses moving to a commercial model; what stands out is the abruptness of it and a little bit of lawyerly underhandedness in some of the licenses associated with some of the AI tools.
Erik: One thing I definitely remember is that we do sometimes look at the tools in more detail during those discussions, and one thing genuinely surprised me. I don't want to name it, and I actually did forget the name of the tool, but it violated this understanding that a GitHub repository comes with one license. GitHub also creates that understanding because at the very top level, not only in the directories, in the source code files, but also pulled out from the source code files in the UI of GitHub, it says what license this repository is under.
In that tool, though, it turned out that, I think it was two or three directory hierarchy levels down, there was another license file that governed only the two files in that directory, and that made it almost impossible to use it in a commercial context. That, for example, was something that was new to me, to be honest with you. That was part of the understanding of that theme, that this landscape really is shifting, and things that seemed to be so cast in stone that GitHub pulled them into the UI can no longer be taken for granted. Obviously, it never said you have to have the same license for all files in a directory or in a repository, but it was a common understanding, and that we are seeing change as well.
Ken: Okay, well, so thank you, everybody, for listening, and thank you to the guests. I appreciate it. Do go to Thoughtworks.com and check out the latest version of Radar and give us your feedback. Thanks a lot.
Neal: Thanks, everyone.
