Mingle is a Rails 2.3 app running on JRuby. In late 2012 we attempted to upgrade it to Rails 3. It’s been a long time since my last post and I wish I had better news to report, but unfortunately our team abandoned the upgrade approach we were taking. It was one of the toughest decisions I’ve had to make, and I want to share my reasoning so that it might help others learn, not just about Rails but about long-lived projects in general.
What we tried
As I wrote, we did some prep work, including moving to bundler and 2.3.x upgrades during normal release cycles. Then, after we released 12.4, we opened a work stream for two pairs during the holiday season to do a lot of the application-wide changes such as converting environment.rb. We’d spend all day checking in code that didn’t make anything green but rather allowed a rake task to fail slightly later in its process. It was a bit scary.
By February, we were making slow but steady progress, whittling down unit test failures. The work stream was down to one pair due to other demands.
By March, after pausing the upgrade work to put out a small release, we put a pair back on R3. They spent about a week merging in the previous month of trunk development and getting us back to where we were before pausing.
In April, we noticed that the rate at which we were clearing unit test failures meant we were bordering on zero net progress. We put three pairs on the task for a short burst and then left two dedicated pairs to keep going. After a month of this focused effort, the 8,000 unit tests were green and we started into the 3,000 browser acceptance tests.
At this point, we estimated we could have stopped the line, used all four of our pairs, and spent three months to be able to merge to master and put out a release. However, this would have been a bold strategy and, given my realizations about customer-facing value, it was not the right thing to do. I held a dev huddle (with donuts because it’s easier to take bad news on a full stomach) and announced I was halting the upgrade.
So what about the principle of fail fast? Why did it take us so long to realize we were taking the wrong approach?
The scope for the upgrade was staggeringly large. To put it in perspective, we spent two pair weeks going from 2.3.13 to 2.3.18. In comparison, we had spent something like 8 pair months on the Rails 3.0 branch when we pulled the plug, and we estimated that around 6 more pair months remained. The worst part was that the scope was so uncertain: you needed to fight your way through a lot even to see it. It felt much like climbing a mountain peak, collapsing exhausted but content, and then realizing another even larger peak looms above. For example, when we finally got the browser test build to run, I had no idea if there would be 3,000 failures or 30 failures. When it wasn’t 30, that was the undeniable signal to STOP.
When you build a Rails app, your app is a Rails app through and through. There are no layers that are not tightly coupled to the framework. And because of the popular Ruby practices of meta-programming, alias method chaining, and good ol’ fashioned monkey patchin’, your code tends to become ridiculously coupled to the framework. It means an incremental upgrade is nearly impossible to even conceive of. Therefore we made a branch. And because of scope, this branch was long-lived. And you all know the pitfalls there.
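To make the alias-method-chaining point concrete, here is a minimal sketch in plain Ruby. The class names, the method names, and the `apply_footer_patch` helper are all made up for illustration; they are not real Rails internals or Mingle code. It only shows how a patch that wraps a framework method by name stops applying once the framework renames that method.

```ruby
# Hypothetical stand-in for a framework class in the old version.
class OldRenderer
  def render(template)
    "<p>#{template}</p>"
  end
end

# Hypothetical stand-in for the same class after an internal refactoring:
# the method the app patched no longer exists under that name.
class NewRenderer
  def render_to_body(template)
    "<p>#{template}</p>"
  end
end

# App-side monkey patch written as a hand-rolled alias method chain.
# Returns :patched on success, or the NameError raised while patching.
def apply_footer_patch(klass)
  klass.class_eval do
    # Keep a handle on the framework's original implementation...
    alias_method :render_without_footer, :render
    # ...then replace it with a wrapper that calls the original.
    define_method(:render) do |template|
      render_without_footer(template) + "<footer/>"
    end
  end
  :patched
rescue NameError => e
  e
end

OLD_RESULT = apply_footer_patch(OldRenderer)
NEW_RESULT = apply_footer_patch(NewRenderer)

puts OldRenderer.new.render("hi") # prints <p>hi</p><footer/>
puts NEW_RESULT.class             # prints NameError
```

In this toy case the failure is loud and immediate. In a real app the same kind of patch may be spread over many files, which is part of why the breakage is hard to size up front.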
Some of the changes from the branch could have been ported back to master but many could not, and you’d need to be working on the branch to discover almost all of them. An example of one that could be ported back: we had to rename some attributes called “changes” because the name now conflicted with a framework internal. An example of one we could not port back was the change to using routing middleware, which meant you could no longer capture routing exceptions. One big mistake was not merging anything mergeable back to master at the time we made the change.
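The “changes” conflict can be sketched without Rails at all. The following is a hedged illustration, assuming a tiny hypothetical base class (`TinyRecord`) that owns a `changes` method for dirty tracking; the subclass names and the `change_summary` attribute are also invented and are not the names we actually used.

```ruby
# Hypothetical stand-in for a framework base class with dirty tracking.
class TinyRecord
  def initialize
    @original = {} # first-seen value for each attribute that was written
    @current = {}  # latest value for each attribute
  end

  def write(name, value)
    @original[name] = @current[name] unless @original.key?(name)
    @current[name] = value
  end

  def read(name)
    @current[name]
  end

  # Framework-owned method: maps attribute name to [old, new].
  def changes
    @original.each_with_object({}) do |(name, old), diff|
      diff[name] = [old, @current[name]] unless old == @current[name]
    end
  end
end

# Before: an app attribute called "changes" silently shadows the
# framework's method of the same name.
class ConflictingRevision < TinyRecord
  def changes
    read(:changes)
  end
end

# After: the attribute is renamed, so the framework method is visible again.
class RenamedRevision < TinyRecord
  def change_summary
    read(:change_summary)
  end
end

before = ConflictingRevision.new
before.write(:changes, "fixed typo")
p before.changes # the app's string, not the framework's hash

after = RenamedRevision.new
after.write(:change_summary, "fixed typo")
p after.changes
```

A rename like this is independent of the framework version, which is what makes it the kind of change that can go back to master right away.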
Backwards compatibility too limited
Beyond the behaviors flagged by the flood of deprecation warnings in 2.3.x, a giant swath of behaviors just stopped working entirely in 3.0. We maintained a running tally of unit test breakages and at one point something like 20% of the unit tests were broken. I’m sure the Rails team tested the vanilla use cases, typical apps, etc., but it wasn’t like going from Java 1.4 to 1.5. I’m not arguing it should or even could be that rigorous, but something closer to that would’ve made the upgrade more feasible. Due to massive internal refactoring, methods and classes would simply be gone and it wasn’t clear what had replaced them. API Dock pages would just show them as missing after 2.3.x.
We didn’t realize the Rails 1.x debt we had
It took me a long time to realize that the 1.x → 2.x upgrade had been done fairly quickly and that no attempt had been made to switch to the many best practice patterns that the community had learned along the way. And given lean development practices, perhaps such an attempt would’ve been called futureproofing at the time. In any case, we were really upgrading an app whose codebase had a healthy dose of 1.0 style code in it. This ballooned the scope unexpectedly because we ended up having to make changes that went directly from the 1.x way to the 3.x way. Unsurprisingly, there weren’t Stack Overflow questions about anyone doing this.
Not considering alternative ways to achieve the goal
Another mistake, and one I feel I especially made, was not challenging whether there were other ways to achieve the real goal. Would it have been good enough for most of the app to be powered by Rails 3.2? Mingle has a monolithic codebase and, in retrospect, we should have interpreted the scope of the upgrade as a smell that the real problem was that the app was too big*. A saner approach would have been to carve out a piece of the app to upgrade and see how that went.
*It’s tough to size an app, but just for comparison’s sake the app is ~100K lines of Ruby, 450 model classes, and 95 controllers.
Was our experience typical?
Personally, I wonder how many Rails projects have had a lifespan of 6+ years and have endured multiple major version upgrades. If many have, they haven’t blogged widely about it. I have some assumptions about why this might be so:
- Most software products and/or the companies behind them die young
- Those that don’t, the Successful Household Names, can afford to rewrite and re-architect vast portions of their app as they scale. In fact, from popularized examples, successful products always rewrite vast portions of their app as they scale.
- Even if a product lives long enough to undergo a major Rails upgrade (1.x→2.x, 2.x→3.x), it won’t have to survive a second one, for the first reason above
If you buy these assumptions, it follows that there have not been a lot of big apps that have upgraded from 2.x to 3.x without a major rewrite and a correspondingly-sized team to do it.
A long-term ThoughtWorks consulting project is currently undergoing a Rails 3.0 upgrade similar in scope to ours, but the last I heard it was also taking the approach of a long-lived branch of the entire codebase (and encountering similar issues). When that project has some lessons to share, I’ll try to link to them from this blog.
So what now?
When we attempt the upgrade again, I think our approach will be quite different. A logistical change would be for us to immediately port all possible changes from the Rails 3.0 branch back to master. However, I believe that porting early on its own would not allow us to succeed. A more significant, and in my opinion requisite, change in how we do the upgrade is to make sure we’re upgrading a much smaller app. This means throwing away the whole idea of One Big App. As we’ve moved to a SaaS delivery model, we’ve already started to test the waters around splitting out services.
No matter which approach(es) we decide to try next time, we need to be prepared to invest significant effort for a long time. However, if our product survives long enough to have to confront those challenges, then we’ll already have been successful.
Wait! But what about security updates?
Having the huge flood of security patches and gloom-and-doom warnings halfway through the upgrade actually strengthened our resolve to get to 3.0 and then 3.1 and then 3.2, lest we be stuck on an unsupported legacy version full of Swiss Cheese remote code execution holes. However, with the six pair months of effort we banked by stopping, we can potentially port many of the fixes back ourselves, and that’s what we’re forced to do now.
This blog was originally posted to the Mingle team blog at getmingle.io.