Brief summary
Database branching has, for a long time, been a troublesome piece in the modern developer workflow puzzle: a good idea in principle but in practice a slow and often expensive challenge. Get it right and you can accelerate productivity and remove bottlenecks; get it wrong and you're potentially creating all sorts of trouble for yourself, from privacy risks to additional complexity.
However, things are changing. Thanks to the emergence of new platforms such as Neon, Supabase and Databricks Lakebase, branching a database can become as familiar to developers as managing code branches and multiple environments with, say, Git and Terraform.
On this episode of the Technology Podcast, host Ken Mugrage is joined by his Thoughtworks colleague Cam Casher and Databricks' Kevin Hartman to discuss the work Thoughtworks and Databricks have been doing together on Lakebase. They discuss the platform, their experience using it with Spotify's Backstage and the opportunities database branching can offer software engineering teams in an increasingly AI-assisted and agentic world.
Read Cam and Kevin's recent series exploring Databricks Lakebase.
Ken Mugrage: Hello everybody, and welcome to another edition of the Thoughtworks Technology Podcast. My name is Ken Mugrage. I'm one of your regular hosts. I'm thrilled today to be joined by both a ThoughtWorker and an external guest. I'll start with Kevin Hartman from Databricks. Please go ahead and introduce yourself.
Kevin Hartman: Thank you, Ken. Thanks for having me. Kevin Hartman, working at Databricks as a partner solutions leader. Before going to Databricks and really fleshing out my career in data and AI, I was a software developer, craftsperson. I have a long career in experiencing the transition from waterfall to agile, lived through those moments. I'm really excited to be talking with you today about how those moments could now be revisited, with some really interesting changes in the way technology has come forward.
Ken Mugrage: Cam Casher.
Cam Casher: Hey, Ken. Good to be here today. Yes, I'm Cam Casher. I am coming up on my seven-year Thoughtworks anniversary in a couple of weeks. I've spent most of my time split about evenly between software engineering and data engineering. The topic today touches on both in a certain way. It's pretty cool to see the cross-function of it as we get into this cool Databricks tool we've been exploring.
Ken Mugrage: Cool. Speaking of the topic today, Kevin and Cam have worked together on a blog post and a couple of talks called Eliminating the Barrier Between Analytic and Operational Data Environments. As a long-time continuous delivery person, it really struck a nerve, and we were like, "Hey, we need to get them on here." If we could, Kevin, I guess let's start with you. A summary of what's the problem space you're trying to solve here and what are you trying to do for people?
Kevin Hartman: Yes, that's a great question. The problem space is addressing the monolith in the room that was always rigid and never really very much adaptable in the way that Git innovated the way we treat code. In that you really couldn't branch a database before to mirror what you do in a GitOps sort of environment. There have been workarounds. We've addressed them in ways that are both creative and also somewhat expensive in some ways. It's now, I think, a really great opportunity to revisit some of those practices with what we have now in latest innovations to change how we treat the database and really align it even more tightly with the dev lifecycle.
Ken Mugrage: Even in continuous delivery, we talk about how you're going to constantly be pushing to production and what have you. The truth was, especially when it came to the database, it was filing the ticket with that group or what have you. I remember visiting a client when they were like, "Oh, these people want to push to production really fast, but we need to test the changes and secure it." How does this make that different? How does this help alleviate that pain, what you're doing here?
Kevin Hartman: Instead of thinking about this as multiple environments in which I manage that and replicate it and set up the pipelines to make sure I had some level of quality data in which to test against, I could change that and say, "Nope, I don't actually need to do that. I don't have to provision a separate environment and keep those things in sync. I can actually just fork directly from my database." Treat your database like your main; call prod main, if you're doing the Git analogy. When I cut a branch, I could do lots of interesting things on that.
One is I could first align that to my feature development. I cut a branch from Git, I cut a branch from my database, and I can write real live integration tests as part of my dev loop. TDD becomes a lot more real to the practitioner in that I don't have to put mocks in place. I can actually create real BDD-style behavior testing in my TDD testing. I can also think about incorporating some of those steps and starting to shift the Agile QA into a true pair programming sort of model again. I don't know how many of us still do pair programming. I know that in some organizations, it's controversial.
If we think about what the intent is, and that is to produce and create quality code artifacts upfront so you don't have those risks downstream, this really allows that pattern to come to fruition.
Ken Mugrage: Cam, as a developer, how does this change your behavior when you're thinking about code and with databases and the backends and so forth?
Cam Casher: Yes, it's a good question. When I started learning more about Lakebase, it touched on two points in my mind. One was, what constraints is this touching on and allowing to now happen from a development standpoint? Also, what are the new unlocks in a way that you might not have even considered when you're able to branch your database, essentially? Having been on a lot of development teams where we've had to align on whether it's trunk-based development or having a development branch, a staging branch, and production, it brings that conversation to database development, which hasn't really been a consideration in as long as I've been working in tech.
From my perspective, it just unlocks this whole new way of thinking. That all played into this proof of concept that I tried to build off of and create to learn more about Lakebase, where I took a tool that I had already worked with in the past called Backstage from Spotify, which is an internal developer portal. I tried to basically create a use case for how Lakebase could optimize it in whatever way. One of the really interesting ideas was how traditionally OLTP and OLAP had been different storage layouts, different compute profiles, things like that.
It really allowed you to bridge the gap and wire them together when you can branch and point these two different storage and database layers together and get certain insights that you wouldn't have thought otherwise. Even touching on a FinOps perspective of how much are you spending on having different databases and branches spun up. It seems like it is a bit of a paradigm shift in a way. I'd love to see this really working with a client here soon from a consultant standpoint as a Thoughtworker. Really interested to bring in some of this information and have some real-life use cases to speak to as well.
Ken Mugrage: One of the things that you talk about in the article is this concept of developing against fiction. You take that further to say the DORA tax. What does that mean in this context?
Cam Casher: That's a great question. Working against fiction is really this idea of working against mock data for your database testing. When you're working with mock data, you might not necessarily have the correct production shape or the correct values to test off of. You're ultimately testing against fictional pieces of data. When you introduce a tool like Lakebase, branching lets you swap mocks for live, production-shaped branches where you can have the same shape, the same constraints, and the same data semantics.
When you tie this back into the DORA metrics, you could consider something like change failure rate is dropping because we're able to use real data testing to catch what mocks might miss. You have this idea where you don't necessarily need to develop against fiction, which is just a fancy way to phrase that.
Ken Mugrage: Agents are writing code now too, right? There's some speed there. We've always just pretended that we're agile. There's a quote in there where you talked about we stopped talking about the part that was still waterfall because the database stuff was still waterfall, but we're pretending everything's agile. Now, especially, we have agents cranking out code at hyperspeed, faster than we can review it, et cetera. Does this help there at all? What's your thoughts on that?
Kevin Hartman: Absolutely. If you think about what you do and don't want agents to do is you don't necessarily want them to make a change into production data directly. It's one of the things. You also need to put other guardrails around what they do. Frankly, agents need to follow and leverage the same practices that we do. Unfortunately, those practices are built around that whole concept of creating a ticket for a database change or something like that. Well, the agent's not going to wait around for a ticket. That's just so anti-agile already that now we're pushing that semantic onto an agent now to repeat that.
It's even more important now for an agent because we need to actually adapt and have the ability for agents to do their work, to progress, to do experimentation, to discover what the right pathway is. You're not going to discover the right pathway unless you actually can test something against what is at least as close to live as possible. The branching paradigm of your database gives the agent the flexibility to do that. It could be destructive. As soon as I create a branch from a database, I could do whatever I want with it. I can apply my migration scripts to it, or I can change, alter, whatever I want because it's something I can take and throw away later.
That applies for the human in the loop as well, or just the human doing it. Getting back to the agent question, I think, unfortunately, if you don't have the prescription down as to what to do and how I should process this, how I should take the branch cut from both places, and do all these things, an agent is just going to do something that it knows how to do as quickly as possible. It probably is not ideal. An agent needs guardrails, a framework, per se, to operate against. This is something else that we've been working on as well, and that is an app dev toolkit, a kit that has those semantics baked in so that when I do the workflows and orchestrations in a coding loop, I have something to follow.
It's especially important for an agent to have this information to truly leverage it.
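The throwaway-branch idea Kevin describes can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyBranch`, `fork`, and `drop_email_column` are hypothetical names invented for illustration and are not the Lakebase API, and a real platform would use copy-on-write storage rather than copying rows as this sketch does.

```python
import copy


class ToyBranch:
    """Hypothetical in-memory stand-in for a database branch (not the Lakebase API)."""

    def __init__(self, name, rows=None):
        self.name = name
        # Deep copy keeps the sketch short; real branching shares storage copy-on-write.
        self.rows = copy.deepcopy(rows) if rows else {}

    def fork(self, name):
        """Cut a new branch that starts with the same rows as this one."""
        return ToyBranch(name, self.rows)


def drop_email_column(branch):
    """Hypothetical destructive migration: remove the email column from every row."""
    for row in branch.rows.values():
        row.pop("email", None)


main = ToyBranch("main", {
    1: {"name": "Ada", "email": "ada@example.com"},
    2: {"name": "Lin", "email": "lin@example.com"},
})

# An agent (or a human) cuts a branch, experiments destructively, then throws it away.
feature = main.fork("feature/drop-email")
drop_email_column(feature)

print(sorted(feature.rows[1]))  # columns left on the branch
print(sorted(main.rows[1]))     # main still has every column
```

The point of the design is isolation: whatever the migration does on `feature`, `main` is untouched, so the branch can simply be discarded if the experiment fails.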
Ken Mugrage: Then I have to address, for any of our long-time listeners, what for me is an elephant in the room, and probably not for others. If you look at Thoughtworks' sensible defaults, as we call them, in fact, I did a talk on this. I had to look it up. It was 12 years ago. It was titled "Branches: A Long Four-Letter Word." We've tried to avoid long-lived branches, but those were feature branches, which were different. As I understand it, what this is doing is giving the development team choices.
Can you talk a little bit about that? Does this fly in the face of main-based development? Do they work together? Is this an alternative? When you say branch, do you mean something that just lives forever, or what do you mean?
Kevin Hartman: No, personally, it's more ephemeral. I don't branch forever and keep branching. I need to do a code merge in the same vein. I need to throw away that branch and do a migration back to main or back to prod. Features, if I'm pulling a feature, you can think about this in terms of how we've operated around scrum, two-week sprint cycles, release cut maybe at the end of a couple of sprints. I accumulate all these things. This pattern could follow. It certainly does move up any risks that you might encounter upstream, so you discover them quickly when we have integration tests upfront.
With that pattern and the way that we operate, we can also-- organizations do Kanban too. It's sometimes harder for organizations to do Kanban. This actually makes it easier for more organizations to adopt Kanban than ever before. Because I'm actually doing things with more immediacy. I'm also not waiting for a test cycle downstream when now I've lost context on what that thing was. Now a QA developer is asking me to go and fix something like, "What do I do again?" All those things are now live in the moment. I can really facilitate instant releases if I so choose.
It's really up to the architect and the patterns that are in play at the organization to really decide. This really opens up that flexibility.
Ken Mugrage: It occurs to me that we've been talking for 10 or so minutes about the problem we're trying to solve without really describing the solution. What is it that Lakebase is doing here that's different that enables all of this?
Cam Casher: I guess, yes, going back to the POC that I did. I wanted to basically look into all these claims that I was seeing around Lakebase and see if I could replicate them, get any sort of metric or data point on it, and basically prove it out like a proof of concept. I was able to test out branching, see how quick it was, see how much lead time it could realistically drop for teams, and determine that it is really applicable and can happen immediately. Like I was saying earlier, just unlock a lot of efficiencies and optimizations for teams.
Another thing that I determined was that it's not only just branching: point-in-time recovery is another option here, which is essentially just branching at a given time, where normally you just branch at the exact moment, or now, in other words. What point-in-time branching offers is the ability to really let an engineer debug properly, go back to a specific point in time, and figure out what the shape of the data was at that time and what to look into specifically. When you think about having all these different branches, I think it's important to note that you don't have to worry about orphan branch costs or having an endless number of branches live at any given time because there is a time to live of 30 days.
There is scale to zero built in. From a cost perspective, that isn't really an issue. During the POC, I was able to test out this idea of utilizing FinOps to get a sense of costs of things you might not normally be used to. Going back to Backstage, Spotify's internal developer portal, you're able to see certain costs associated with different services you might have living in the portal and get to see that all in a unified way if you're going to dive into the Databricks platform and use something like Unity Catalog as well.
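As a rough illustration of what Cam describes, branching "at a given time" can be thought of as rebuilding state from a history of writes up to that moment, and the 30-day time to live as a simple age check. The sketch below assumes a toy append-only log; `ToyHistory`, `state_at`, and `is_expired` are hypothetical names, not Lakebase functions, and the real storage mechanics are not described in the conversation.

```python
from datetime import datetime, timedelta, timezone

BRANCH_TTL = timedelta(days=30)  # the 30-day time to live mentioned in the conversation


class ToyHistory:
    """Hypothetical append-only write log, assumed to be appended in time order."""

    def __init__(self):
        self._log = []  # (timestamp, key, value) tuples

    def write(self, at, key, value):
        self._log.append((at, key, value))

    def state_at(self, at):
        """Rebuild the rows as they looked at time `at` (inclusive)."""
        state = {}
        for ts, key, value in self._log:
            if ts <= at:
                state[key] = value
        return state


def is_expired(created_at, now, ttl=BRANCH_TTL):
    """True once a branch has lived for at least its time to live."""
    return now - created_at >= ttl


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
history = ToyHistory()
history.write(t0, "order-42", "paid")
history.write(t0 + timedelta(hours=1), "order-42", "refunded")
history.write(t0 + timedelta(hours=2), "order-42", "corrupted")

# Branch "as of" 90 minutes in, just before the bad write landed.
debug_branch = history.state_at(t0 + timedelta(minutes=90))
print(debug_branch)
print(is_expired(t0, t0 + timedelta(days=31)))
```

Branching "now" is just the special case where `at` is the current time.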
Ken Mugrage: Are these product features or ways of working or a mixture? Because GitOps is a way of using Git. It's not a thing. Where does this fall down?
Kevin Hartman: That's a really good question. There's the technology landscape and the surface areas of what the platform supports. We've been talking about branching as the concept. As Cam mentioned, each branch, if you're not using it, it scales to zero. I could leave them and not do anything with them, and then there's zero cost. If I go back to it and decide I want to pick it up again, then yes, then that cost will start to accrue because everything is cloud-based. I can prune those as I need. That's a technology capability.
Another technology capability is the ability to govern my database in a way that's uniform and set up policy and permissions and masking at the top level, and then inherit those policies as I go down in each branch. If I am a developer, and the question, I don't know if it had come up yet, but if I am developing and I say, "Oh, I can cut a branch against prod, does that mean I should actually be seeing everything on prod? Does that make sense?" Not usually. What do you do there? You set up policy. Certain elements, certain columns actually need to be masked.
Maybe it's fine for a service principal to see more when you're running CI tests. That could actually be a different permission level that is not available to you personally as a developer, but it is something that you could do in CI. In CI, as part of my process, I'm doing CI with another fresh cut of my database branch. All these technologies are here. What does that mean now? How do I take that and say, the technology is here and now what do I do with it? This is the actually really interesting intersect, and this is why I'm so passionate about this topic in general is because I've had that experience for 30 years before getting into data and AI, and I do know what those pain points look like.
I've been through Agile, coming from Waterfall, TDD, even XP. What's missing is the methodology adaptation to support this. You're talking to practitioners such as yourself and suggesting, "Hey, the technology's here, we need a new methodology shift as well to go with that." It's a paired approach. You can lead the horse to water, but you need to be able to drink and then leverage it in a way that doesn't make you thirsty next time you leave the pool.
Ken Mugrage: Yes, people are always the hard part. There's a number of patterns, and some of them we've already touched on, but there's a number of patterns that you're trying to enable or dissuade depending on which one. If you don't mind, just either one of you go into it a little bit. We already talked about this a little bit, but eradicating the mock burden, what does that mean? I live in mocks.
Cam Casher: Yes, I can touch on that. Back to this POC, one of the things I was able to claim was that the result was 20% to 30% of integration test code is mock infrastructure. That's technically 20% to 30% that you could reduce if you were to just use branching off production and running your tests without mocks in this idea. This goes back to that original question you asked about developing from fiction. It's really just an optimization, and allowing for development time better spent in a lot of ways.
Kevin Hartman: Just to add to that, too, it's like we're not eliminating all mocks, but the ones that interact with your database from your REST service on down, those are the ones. That's the opportunity right there. Because you have to stub certain things out because you may not have access to live services in other parts of your application landscape, that's going to happen. You do have to service those in different ways. What we're saying here is that one of those services that you had to pretend actually doesn't need to pretend anymore.
That's the one area. The other areas probably still have to be, unless you have been granted test APIs. There's a parallel spin-up from all your SaaS providers, and they give you an alternative version, a sandbox that you can play against, and some do. That you can ping out and destroy it or whatever because it's not the real thing. That is a luxury sometimes, but now we actually have this luxury for the database.
Ken Mugrage: Cool. I would say the continuous delivery person in me says, "Please don't do that with your external services. The latency is not what you want in your pipeline." That's it.
Kevin Hartman: That's true. Here you have a private link so that it's direct. The latency is sub-second.
Ken Mugrage: Then I think we've already covered this one, but pattern two is environments at Git speed, just the ability to git checkout -b. Is it that simple?
Cam Casher: It is that simple.
Ken Mugrage: This one then, these words in this order scare me. Destructive testing without asking permission. What are you all saying?
Kevin Hartman: It is. It's like on your branch, not on main, not even on staging if you have an interim sort of environment, but on your feature branch, the one you just forked, it's yours. Do what you want with it. Destructive testing is encouraged because why not? You're exploring new edge cases that you may not have considered before because you were not afforded that luxury. You had to share that environment, a dev stage environment. If you're running embedded, sure, you can do full destructive tests. Embedded, it's a proxy. It's maybe a closer fiction. It's still not nonfiction.
Cam Casher: Yes. I'll just add too that you could talk about this in the same vein as just having a place for experimentation. That's obviously very important from a development team. When you even tie this back to DORA metrics as well, you might have failure modes that are typically uncovered from production. When you're able to have this destructive testing, which can be free and easy to use via branching, then you might be able to have change failure rates drop because then you can uncover things in a test environment that you typically wouldn't find out until it's live in production.
Ken Mugrage: Then the fourth pattern, again, we've probably already covered this one, but we're talking about the same primitives serving agents. Tell me about the Lakebase App Dev Kit. How does that play here?
Kevin Hartman: That is an interesting one. I'll take this one, Cam. It's pretty much fresh. You're hearing it first here. Initially, when Cam and I started working together, I had developed a source control management plugin that worked with Visual Studio Code. It actually marries the concept of your database branch with your Git branch and also orchestration. What happens when I do that branch, and now I want to do a PR, and how I want now to do a fresh cut when I'm going to go through CI/CD, all of those operations are embedded through Git hooks, in a way, within the IDE plugin.
We could put a link to that later. What I discovered is that, well, here's the methodology and the way that you can use this. It's opinionated, but it follows development best practices. How do we take that and externalize that so that agents or, say, someone that doesn't use an IDE, can obey that contract and those primitives without the IDE plugin? There's a kit. It's a Lakebase App Dev Kit, brand new. It was born from the plugin first so that agents can follow along with this paradigm shift as well. In fact, the plugin now has been readapted after the extraction.
The plugin now leverages the same toolkit as a library. We'll share the link on the App Dev Kit. Lots of great stuff around orchestrating workflows for your source control management, but also release management. Then one that I'm adding very soon would be to follow the TDD workflow pattern and have all of the code creation part of it be part of the test-driven contract that we've all had experience with. This really plays well because your quality, just like everything that Kent Beck and others have said, the quality really goes up when you're able to put those things together.
I have a test project that actually uses the TDD approach, the plugin, paired together with TDD, and it works very well. I'm having a lot fewer problems with my agents in code refactorings. I don't know how many people have gone through and leveraged Claude Code or others. It might do a pretty good job if you have a pretty good spec, a plan from which to operate. As soon as you start doing iteration on that, it quickly runs into and creates this spaghetti mess, this code mess that is unmaintainable. If you were working with an agent and you say, "I want to fix this thing," it doesn't really necessarily understand the SOLID principles or other things.
It'll create new code, parallel code. It's a mess. Having a TDD-style operation on top of that and obeying more of the architecture guidelines, when you pair those things together, you have a much higher-quality code product.
Ken Mugrage: You touched on this a little bit with permissions earlier. Think about governance here because access to production data is a multi-forked thing. Of course, I don't want to bring down production. That's generally considered bad. There's also things like personally identifiable information, things like that we don't want. Whether or not our feature branch or whatever it is actually using it, we simply can't expose it. Is this locked down by default, and you open it up? Is this open, and you lock it down? What's governance look like here?
Kevin Hartman: Let's visit maybe the DBA role in days of past and what we had to do. I was a DBA once, as part of one of the roles I had. It's a job, for sure. For supporting multiple developers, it becomes an overtime job. Think of it this way instead. I still have my reviews for schemas, but now I could do that asynchronously. A DBA can do a PR review just like everyone else can, and see what database changes are going to be recommended or suggested, and have an opportunity to say, "No, I don't want that. That's actually going to break against the overall design model I have in mind." Fine. That could be a rejection of the DDL. That's a role for the DBA.
The other role is really setting up the permissions model for different roles within my database environment. Instead of having multiple databases that I provisioned, I do that at the top level, and then the branches inherit those policies. If I am a developer, I have a certain set of permissions, and I can have attribute-based permissions, which also include things like masking by default for certain elements, certain tables and columns. When I'm looking at that data or operating with it, that's my policy, and I can't see that data.
If I do have to do something in a field where I don't have permissions to actually see, well, yes, I don't have permissions to see it, but a service principal could when I deploy to the CI/CD environment, and then they can run through the tests that you couldn't. That's how you establish the new paradigm and really change the responsibilities into becoming more of an architect on everything rather than a gatekeeper, and that's a big shift for DBAs.
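The policy inheritance Kevin outlines, with masking defined once at the top level and a CI service principal seeing more than a developer, might look something like this toy model. Everything here is an assumption for illustration: the role names, the `TOP_LEVEL_POLICY` structure, and the `ToyBranch` class are hypothetical and do not reflect how Lakebase or Unity Catalog actually express policies.

```python
MASK = "****"

# Hypothetical top-level policy: role -> columns that role must not see in clear text.
TOP_LEVEL_POLICY = {
    "developer": {"email", "ssn"},
    "ci_service_principal": set(),
}


class ToyBranch:
    """Hypothetical branch that inherits its policy instead of redefining it."""

    def __init__(self, name, rows, policy):
        self.name = name
        self.rows = [dict(row) for row in rows]
        self.policy = policy  # shared reference: set once at the top, inherited below

    def fork(self, name):
        return ToyBranch(name, self.rows, self.policy)

    def read(self, role):
        """Return rows as `role` would see them; unknown roles see everything masked."""
        hidden = self.policy.get(role)
        visible = []
        for row in self.rows:
            if hidden is None:
                visible.append({col: MASK for col in row})
            else:
                visible.append(
                    {col: (MASK if col in hidden else val) for col, val in row.items()}
                )
        return visible


prod = ToyBranch(
    "main",
    [{"id": 1, "email": "ada@example.com", "ssn": "123-45-6789"}],
    TOP_LEVEL_POLICY,
)
feature = prod.fork("feature/new-report")

print(feature.read("developer"))             # email and ssn masked on the branch too
print(feature.read("ci_service_principal"))  # CI sees the real values
```

Masking everything for an unrecognised role is a deliberate choice in this sketch: it makes "locked down by default" the failure mode rather than accidental exposure.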
Ken Mugrage: Then the next one, as far as challenges that this claims to solve, and I'm sure your claims are 100% valid, but production support reproducibility, so talk to me about that a little bit, please.
Kevin Hartman: This is actually a really interesting topic for me. I came up in the days where I developed solutions around CQRS and event sourcing and event logs, things like that, so what happens? This becomes even more true, so I never would want to replay all my events in an event log from day zero just to get to the point in time where now I need to discover where the defect occurred. Who wants to do that? Probably no one.
What I can do, if I do have an event sourcing sort of model and I am recording each and every transaction that occurs from every single consumer or whatnot, well, I can actually isolate that and say, "Well, let me go to the point in time just before this event happened, branch the database, and then start applying the event log at that point in time," and then watch what happens. Think about now my troubleshooting time to figure out how to solve for that defect. I've just turned all of that burden of trying to spin up a fresh environment and trying to replay everything from epoch into a very quick turnaround: take just the snapshot I needed, apply some event log transactions, and get to the defect. I can actually record that event and whatever I had replayed, and put that into my test bed so it never happens again.
Cam Casher: Yes, and honestly, to work off what Kevin is saying, realistically, it's just a time save. You are not having to reproduce an error you might've seen in prod and try and replicate something that's like the bug, or that's something that's like the production environment. You can actually branch in and replicate it exactly the same way. You're saving time, and you're not having to play a guessing game, essentially, for support engineering.
Kevin Hartman: You can automate that too. If I'm going to make sure I solve the defect, the problem, I can then replay all of the transactions and events and make sure that my defect remediation had worked, and I could do that all in a branch.
Cam Casher: Just to go back to DORA again, this directly relates to mean time to recovery, time to restore to service when we're talking about the time save and how much more quickly a support engineer could diagnose and understand a fix.
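The replay-from-a-branch approach Kevin describes can be sketched with a tiny event-sourced ledger. This is an illustrative toy only: the event shape, `apply_event`, and `first_failure` are hypothetical, and `branch_state` is a hand-written stand-in for what a point-in-time branch taken just before the failing event would contain.

```python
def apply_event(state, event):
    """Apply one hypothetical account event; reject anything that overdraws."""
    account, delta = event["account"], event["delta"]
    new_balance = state.get(account, 0) + delta
    if new_balance < 0:
        raise ValueError(f"negative balance for {account} at event {event['seq']}")
    state[account] = new_balance


def replay(state, events):
    """Apply events in order and return the resulting state."""
    for event in events:
        apply_event(state, event)
    return state


def first_failure(state, events):
    """Replay onto a copy of `state`; return the seq of the first failing event, if any."""
    scratch = dict(state)
    for event in events:
        try:
            apply_event(scratch, event)
        except ValueError:
            return event["seq"]
    return None


events = [
    {"seq": 1, "account": "a", "delta": 100},
    {"seq": 2, "account": "b", "delta": 50},
    {"seq": 3, "account": "a", "delta": -30},
    {"seq": 4, "account": "b", "delta": -20},
    {"seq": 5, "account": "a", "delta": -90},  # the defect: 70 - 90 overdraws "a"
    {"seq": 6, "account": "b", "delta": 10},
]

# Stand-in for a branch cut at the point in time just after event 4.
branch_state = {"a": 70, "b": 30}

# Only the tail of the log is replayed, not everything from the beginning.
tail = [e for e in events if e["seq"] > 4]
print(first_failure(branch_state, tail))
```

Once the failing event is isolated, the same tail can be kept as a regression fixture and replayed against a fresh branch to confirm the fix, which is the automation Kevin mentions.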
Ken Mugrage: There's a lot more to this, and I know you all are putting out a pretty detailed blog series of a few parts and some talks and so forth. I'm going to encourage our listeners to read more about it, but I would like to ask each of you the age-old question, what can our listeners do on Monday morning? What's actionable? I'll give you each a turn. You can go in whatever order you like, but what's your piece of actionable advice that people can do? It has to be a little bit more than just go buy Lakebase, but short of that, what can people do to learn how this might help them?
Cam Casher: In my case, I mentioned the POC I did with Backstage. The primary reason around that was because I didn't really have a super applicable project at the time to test out. My train of thought was, "Well, what projects have I worked on recently? What have I liked working with? What could be relevant and applicable here?" That all pointed to Backstage, which I had liked working with before. Really, the thing to do on Monday or the action item here is just to start experimenting and start recording your metrics and findings.
Make a story about it and see if you like it, and see if you think it's something that could make your life more efficient, save time, or solve problems or constraints you're facing. It really just starts there with experimentation and a POC. If you do have an actual project you're working on for an account or an assignment with work, see if there's a way for you to explore in that sense. Maybe show off to your manager or your peers and show them this cool new thing you've tried out and all the learnings you have.
Kevin Hartman: Yes, I would echo that. Don't buy Lakebase, but try it. That would be my advice: POC it out. Go through it. It's pretty cheap, all things considered, to try something out. Look at what we've been saying around measuring inventory tax. What is it that you can do to say, "Well, what is my pain point in my environment? What is it that I'm waiting on? What happened in the sprint retro? What are the things that I've never really addressed, or actually just couldn't address, and they've been silent?" Ask the question again. "Hey, if we had the opportunity to change something, what could we change? What could we do to try it out?"
Another thing you could do is just, well, go back and look at the number of mocks you currently have and look at that as opportunity, to say, "Oh, my mocks continue to drift as I get further and further along in release, and keeping those up-to-date is, boy, a big pain point for me because now I have to really understand how my data model has shifted and keep those mocks in alignment." Otherwise, I will trust something that doesn't exist. That's the fiction. Count those up, see what the opportunity is. Then I would say, if you have the ability to try something out, try it out.
There's an app dev kit that you can take and deploy locally. You can operate on that headless without an IDE, or you can use the Lakebase plugin in your favorite, I'm saying favorite because there's really only two right now, VS Code or Cursor. Others will be supported, I'm sure, but this plugin was more of a start as a POC itself. I would encourage you to take that approach to treat it as a POC. See what opportunities it affords you.
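For teams that want to "count those up" as Kevin suggests, one rough starting point is to scan test files for mock usage. The sketch below uses Python's standard `ast` module and assumes the tests rely on `unittest.mock`; the `count_mock_uses` helper and the sample test are hypothetical, and a call count is only a crude proxy for how much test code is really mock infrastructure.

```python
import ast

# Helper names from unittest.mock that this sketch treats as "a mock".
MOCK_NAMES = {"Mock", "MagicMock", "AsyncMock", "patch"}


def _callee_parts(func):
    """Split a callee such as mock.patch.object into its name parts."""
    parts = []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return parts


def count_mock_uses(source):
    """Count calls (including decorators) whose callee involves a mock helper."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and MOCK_NAMES.intersection(_callee_parts(node.func)):
            count += 1
    return count


SAMPLE_TEST = '''
from unittest.mock import MagicMock, patch

@patch("app.db.connect")
def test_lookup(connect):
    connect.return_value = MagicMock()
    repo = MagicMock()
    repo.find.return_value = {"id": 1}
    assert repo.find(1) == {"id": 1}
'''

print(count_mock_uses(SAMPLE_TEST))
```

To run it over a real suite, read each test file's text and sum the counts; tracking that number over time gives a simple measure of how many database mocks a branch-based approach might replace.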
Ken Mugrage: Well, I am excited to see where this goes because we were chatting about it, that we hit the 25th anniversary of the Agile Manifesto, and other things that are happening. It was always like, "Yes, we're going to do all this stuff iteratively, except for the database." "We're going to do all this stuff iteratively, except for the infrastructure." I'm really going to be watching closely to see where this goes. I want to thank you both for your time and encourage our listeners to check it out. Thank you.