
AI-powered code generation: A deep dive into GitHub Copilot

Part 2

In the previous article, I shared two experiments I ran to discover how to work with Copilot, trying two approaches and weighing their pros and cons. In this second blog post, I'll try Copilot in an existing code base, implementing a simple change you might find in any project. I will then summarize my findings and draw a conclusion from the three experiments.

 

Experiment three: A real life use-case

 

Let’s try Copilot in an existing code base. For this we will use the MaEVe open-source project I was working on. The project implements a charge station management system for electric vehicles; the system deals with events flowing to and from the charge stations. The task at hand was to change the transaction store: refactor the code into a new package structure and replace the Redis implementation with a Firestore one. This work led to several commits.

 

This experiment represents a real life use case because:

  1. There is an existing code base

  2. The task at hand is realistic in any project (change a dependency and adapt the code)

  3. It’s simple enough to be completed in a few commits

  4. The produced code has very good test coverage

 

One thing to note is that the exercise was not carried out in the most efficient fashion; it was optimized to exercise Copilot. The previous implementation had extracted the interactions with the Redis API into a few private functions, so the smart move would have been to simply change those. To increase the amount of interaction with Copilot, I decided to start the implementation from scratch.

 

Requirements:

  • Use the new package hierarchy

  • Keep the in-memory implementation

  • Replace Redis implementation with Firestore, correctly implementing the five functions:

    • CreateTransaction, FindTransaction, UpdateTransaction, EndTransaction and Transactions (a rough sketch of the target interface follows this list)

  • Change call sites to adapt

  • Eventually remove the Redis implementation
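To make the target concrete, here is a rough sketch of what the new transaction store could look like. The type, field and parameter choices (including passing a context.Context, the extra parameter mentioned below) are my reconstruction from the article, not the actual MaEVe code.

package store

import "context"

// Transaction is an illustrative placeholder; the real MaEVe struct
// carries more fields than shown here.
type Transaction struct {
	ChargeStationID string
	TransactionID   string
	Ended           bool
}

// TransactionStore lists the five operations every implementation
// (in-memory and Firestore) has to provide. The exact signatures are
// assumptions based on the article.
type TransactionStore interface {
	CreateTransaction(ctx context.Context, transaction *Transaction) error
	FindTransaction(ctx context.Context, chargeStationID, transactionID string) (*Transaction, error)
	UpdateTransaction(ctx context.Context, transaction *Transaction) error
	EndTransaction(ctx context.Context, chargeStationID, transactionID string) error
	Transactions(ctx context.Context) ([]*Transaction, error)
}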

 

Moving the in-memory implementation to the new structure was just a matter of copying the file and its tests to the right place and adding an extra parameter to the methods. No need for Copilot there.

 

Giving context to Copilot

 

I was not familiar with Firestore, so I did not know how to start the implementation. The code base already had two Firestore-backed stores (for charge stations and tokens). I opened those files, plus the existing Redis implementation and its test, so that the old implementation and examples of the expected code would serve as context for Copilot. This matters because Copilot inspects only open files, not the whole project.

 

Experimenting with TDD

 

From there, I tried a test-first approach to implement the CreateTransaction method. Copilot generated the test easily, as it was mostly a matter of replicating an existing test from the Redis implementation. A manual step was necessary to instantiate the new implementation in the test: because I was doing TDD, Copilot could not infer the syntax of something that did not yet exist in my code. I also had to fix some erroneous data (e.g. expected values). I then started implementing the function itself, prompting Copilot one line at a time. With the context, it was able to correctly call the Firestore API (thanks to the opened files) and handle error cases properly. It was also able to generate the whole implementation of the function, though, as in the previous example, some data was wrong and needed manual editing.
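To illustrate the test-first step, here is the spirit of that first test. It reuses the Transaction type and interface sketched earlier; newTestStore and the field values are hypothetical stand-ins, not the project's actual test code.

package store

import (
	"context"
	"testing"
)

// Sketch of the test written before the Firestore implementation existed.
// newTestStore is a hypothetical helper that would wire up a Firestore
// client (for example against the local emulator) and return the store.
func TestCreateTransaction(t *testing.T) {
	ctx := context.Background()
	s := newTestStore(t)

	want := &Transaction{ChargeStationID: "cs001", TransactionID: "tx001"}
	if err := s.CreateTransaction(ctx, want); err != nil {
		t.Fatalf("CreateTransaction: %v", err)
	}

	got, err := s.FindTransaction(ctx, want.ChargeStationID, want.TransactionID)
	if err != nil {
		t.Fatalf("FindTransaction: %v", err)
	}
	if got == nil || got.TransactionID != want.TransactionID {
		t.Errorf("got %+v, want %+v", got, want)
	}
}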

 

The first test failed for a reason I did not understand at first. It came down to me not knowing Firestore and Copilot suggesting slightly erroneous code. The key generated by Copilot had the following format: Transaction/<charge station id>/<transaction id>. This is an invalid key, as / should only separate a collection name from a document id. I changed the second / to a -, which fixed the issue.
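For background, a Firestore document path must alternate collection and document segments, so a path with two slashes under a single collection is not a valid document reference. A hedged sketch of the fix, assuming a collection named Transaction and the official Go client; the helper name is mine, not the project's:

package store

import "cloud.google.com/go/firestore"

// transactionRef builds the document reference for a transaction.
// The collection name and key layout are reconstructed from the article.
func transactionRef(client *firestore.Client, chargeStationID, transactionID string) *firestore.DocumentRef {
	// Copilot's suggestion, "Transaction/" + chargeStationID + "/" + transactionID,
	// has three segments, which Firestore reads as collection/document/collection
	// rather than a document path. Folding both IDs into the document id fixes it.
	return client.Doc("Transaction/" + chargeStationID + "-" + transactionID)
}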

 

From there it went pretty smoothly. With the previous context and the first function as additional context, Copilot was able to correctly generate tests and implementations for FindTransaction, UpdateTransaction and EndTransaction. With the usual amount of manual editing of generated data, of course!

 

Implementing new functionality

 

For the last method, the behavior required from Firestore was new, something both existing implementations lacked. This time I used comments inside the function's body to prompt Copilot: first to get all the document references for transactions, then to transform those into Transaction structs. This was enough for Copilot to generate a proper implementation, and the unit tests confirmed everything was working as expected. Thanks, TDD!
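The comment-driven prompts and the resulting implementation looked roughly like the sketch below. The Store struct, collection name and error handling are my reconstruction (Transaction is the placeholder type sketched earlier); the in-function comments mirror the prompts.

package store

import (
	"context"

	"cloud.google.com/go/firestore"
	"google.golang.org/api/iterator"
)

// Store is an illustrative Firestore-backed transaction store.
type Store struct {
	client *firestore.Client
}

// Transactions lists every stored transaction.
func (s *Store) Transactions(ctx context.Context) ([]*Transaction, error) {
	// get all the document references for transactions
	iter := s.client.Collection("Transaction").Documents(ctx)
	defer iter.Stop()

	// transform those to Transaction structs
	var transactions []*Transaction
	for {
		doc, err := iter.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			return nil, err
		}
		var t Transaction
		if err := doc.DataTo(&t); err != nil {
			return nil, err
		}
		transactions = append(transactions, &t)
	}
	return transactions, nil
}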

 

Changing the call sites

 

This step was quite tedious. It mostly involved changing a package name in many places. Copilot showed no value here, as the autocompletion in the IDE was sufficient. 

 

As I was changing the endpoint that returns all transactions, Copilot also suggested the correct parameters. The new method required an extra parameter of type context.Context as its first argument. When I changed the HTTP handler, Copilot figured out that the context I needed could be taken from the HTTP request rather than created in place. I had not thought of that and was pleasantly surprised to see Copilot suggest it.
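The suggestion amounted to something like this hypothetical handler (the routing and store wiring in the real project differ; TransactionStore is the interface sketched earlier):

package api

import (
	"encoding/json"
	"net/http"
)

// Handler wires the HTTP layer to the transaction store.
type Handler struct {
	store TransactionStore
}

// listTransactions returns all transactions. Copilot suggested taking the
// context from the incoming request instead of creating one in place.
func (h *Handler) listTransactions(w http.ResponseWriter, r *http.Request) {
	transactions, err := h.store.Transactions(r.Context())
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(transactions)
}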

 

Conclusion

 

Another enjoyable time with Copilot!

 

In this experiment, Copilot was able to speed up my development. It is great at generating almost-correct code from a previous implementation, and at generating correct code when the new file already contains correct code. This is true for both tests and implementations. The weak point, again, is data, which required manual fixing in both production and test code.

 

Opening relevant files was a great help. All the suggested code was relevant to the task and often very close to correctness.

 

Continuing with the go-with-the-flow approach of typing comments or code and seeing what was suggested worked well. Because the context was rich, Copilot was able to suggest complete functions. But there is a flip side: this doesn't sit well with the TDD philosophy. The generated code often did more than what was necessary to pass the one new test. There is probably room for improvement in my workflow, but it was definitely not straightforward to do TDD by the book.

 

Key take-aways:

  • Context is key!

  • TDD with Copilot requires a lot of effort and discipline

  • Zero-shot prompting is enough for refactoring tasks

  • Always double check the data suggested

 

My two cents

 

“Your AI pair programmer”

GitHub.com

 

“Nope!”

Bruno Belarte

 

GitHub claims Copilot is “Your AI pair programmer”. The truth is that we are still a long way from AI pairing. In essence, pair programming is about two humans interacting to solve a problem while avoiding waste (e.g. context switching, handovers, delays); writing code is a by-product of this activity. Copilot, meanwhile, is only about writing code. There is no discussion, nor any concept of problem solving, in the generated code. Copilot can only provide statistically likely code blocks.

 

From my experiments, Copilot is great at generating boilerplate code (e.g. instantiating a simple server, filtering a list, adding an endpoint). But prompting Copilot is an art. The one thing to remember is “context is key”. Whatever information you can provide (e.g. opened files, comments, variable names) will drastically change how efficiently Copilot generates code. To improve the experience, one must learn to communicate with the tool.

 

Copilot certainly has the potential to increase your coding speed, that is, the number of lines you can produce in a minute. Prompting is equivalent, in a way, to Googling your question; because you stay within the editor and can accept a suggestion simply by pressing tab, you can produce code faster. Using context can dramatically speed up repetitive tasks. But what matters here is the conditional: it can improve your speed, but at what risk? You are going faster, whether in the right direction or not. Some boldly claim GenAI will make developers 100 times faster. First, I believe this is a huge exaggeration. Second, producing code faster does not mean solving problems faster.

 

When generating code step by step, I found it very easy to review the code on the fly and accept or reject it. When generating entire functions (e.g. more than 20 lines), I needed to take a breath and thoroughly review the suggestion before making a decision. Is the error handling correct? Does the generated data make sense? Is it covering all cases? Is it doing more than it should? Is the style correct? This is where TDD can really help. If you have a rigorous test base, it’s easier to just run the tests. But if you also used Copilot for the tests, you have to be sure about them. TDD with Copilot requires a lot of discipline, as it is easy to generate more than necessary for the next step.

 

As mentioned in experiments two and three, review fatigue will set in when using Copilot. This can lead to misuse, such as over-trusting the tool, or the opposite: ignoring it altogether.

 

In general, trying to generate whole functions from scratch, from comments only, is tricky: the comments alone rarely carry enough context. It can work quite well for classic functions (e.g. sort this list, filter that map), but anything more domain-specific is difficult. Outside of that, I feel that typing code as usual and accepting small bits at a time works best. How to best use Copilot? I believe spending time writing comments and thinking about how to prompt efficiently is actually slower than going with the flow: start typing, then accept part or all of the suggestion. That worked quite well with my ways of working.

 

One flaw I noticed is that the generated code is sometimes out of date: Copilot can produce code that uses deprecated functions or an older version of a library. Both happened during my experiments. For example, in experiments one and two the generated web server imported echo instead of echo/v4, the latest major version. I only noticed it after a while and changed the import manually. Thankfully that was all that was necessary, but there could have been breaking changes between the two versions that would have forced me to rewrite part of the code.
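Concretely, the fix was only the import path (plus the matching go.mod requirement). A minimal sketch; the server setup below is illustrative, not the code Copilot actually produced:

package main

// Copilot suggested the outdated import path:
//   import "github.com/labstack/echo"
// The current major version lives under /v4, so the import had to be:
import "github.com/labstack/echo/v4"

func main() {
	e := echo.New()
	// Start a minimal server; echo.New and e.Start exist in both versions,
	// only the import path (and go.mod requirement) changed.
	e.Logger.Fatal(e.Start(":8080"))
}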

 

One thing worth mentioning is that Copilot acts as a black box. Some of the code in the IDE is sent to the cloud, and suggestions appear after a small delay. That delay is part of the flow and needs to be considered: if you rely too much on the tool, you might spend a significant amount of cumulative time waiting for suggestions. You also cannot control what happens in the cloud; the exact same experiment could yield completely different results if run a few weeks apart. Has the model been re-trained with new data? Has the algorithm changed?

 

One thing worth remembering is that Copilot (and GenAI in general) is only one tool among others. Overuse or overconfidence might lead to wasted time. This is what happened to me during experiment three. I purposefully made the decision to rely heavily on Copilot for my refactoring and ended up spending more time on the task. There are many situations where Copilot might help. But there are also many situations where another tool might do a better job, starting with the IDE.

 

Experienced developers familiar with GenAI programming will be able to increase their speed, provided they know how to prompt efficiently and have the capacity to carefully review the generated code. It is way too easy to press tab and generate dozens of lines at a time.


I mentioned in the introduction that I ignored Copilot Chat for these experiments. I believe it could be an interesting tool that improves the process of communicating with Copilot, and I will experiment with it in the future. Stay tuned for more articles!

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
