13 September 2016
Mingle is a Rails 2.3 app that, until now, ran on JRuby 1.7.25 in Ruby 1.8 compatibility mode. Along with a geographic transition of the development team, Mingle went through a Ruby version change. It all started in April 2016 and has now finally come to a happy end. We want to share the adventures we had along the way.
Active development of Ruby 1.8 stopped long ago, and security patches officially ended on July 31, 2014. This makes fixing bugs painful.
There is ample evidence of performance improvements in Ruby 1.9, which made this choice a no-brainer. Our personal take is that much of this comes from the language spending less time guessing what the developer wants and sticking to specified behaviour (strict lambda argument-count checking, Object#to_s behaviour, String no longer being Enumerable, Symbol behaviour, etc.).
So these improvements, as well as our long term goal of moving to the latest and greatest of both Ruby and Rails, set us off on our journey of upgrades.
There were two approaches that we could take towards the latest-and-greatest of Ruby on Rails:
We went with the second option as:
We have a formidable set of unit tests (~7030), functional tests (~2430) and acceptance tests (~2450). This gave us enough confidence to go ahead with the upgrade, knowing we would get fast feedback if something broke.
This is how we proceeded:
The initial problems we faced were:
Another complication was that we use test-loadbalancer (TLB) to run the entire test suite in a reasonable amount of time (~15 minutes for the unit tests and ~35 minutes for the acceptance tests). Each TLB agent has its own job and run log in our CI pipeline, and each run of the suite produces a new, random split of tests across the agents based on how long each test took in previous runs. Running the entire suite from a developer machine was not a simple task; some of the unit tests also interact with the database, to make sure we work with both of the databases we support. One of the key challenges was seeing how many tests had been fixed and which ones had newly started failing. Clicking through each agent's logs was tedious and slowed down the search for the next test to fix.
The first helper script we wrote force-cleaned the bundle setup on our precommit CI server. We needed it because we hit gem load errors in the early stages of the upgrade, as we kept changing the versions of various gems to suit the Ruby runtime.
Next, we wrote a script to fetch and collate the unit test logs from our CI pipeline. This helped us identify groups of tests that failed for the same reason, so we could apply the same change to all the relevant places in the code and fix those tests in one go. The script was later generalized to fetch and collate the logs of any job that TLB split and ran in parallel.
A popular question in our daily stand-ups was ‘Are we there yet?’. We wrote a script that parsed the relevant CI job logs to show how many tests passed on the 1.9 runtime. This helped us track our progress and keep our stakeholders informed. Progress during the initial weeks was slow, especially once we started forced pair rotation, but it picked up as the team understood the problems and started to identify patterns that sped us up. Towards the end of the upgrade, changing one or two lines of code fixed a lot of failing tests.
The upgrade pair spent a lot of time comparing code execution on 1.8 and 1.9, especially for functional and acceptance tests. So we wrote a script that could run tests against one or both runtimes, based on command-line flags, which helped us track down the offending line of code faster. The script also automated starting and shutting down the Selenium server needed by our acceptance tests (which run on Selenium RC), further reducing the number of steps needed to run tests. It can also run multiple test files in the same folder at once, which was not so easy to do with Test::Unit.
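The collation idea can be sketched in a few lines. This is a minimal illustration, not Mingle's actual script: it assumes the agent logs have already been downloaded as strings and use Test::Unit's classic "N) Failure:" / "N) Error:" report layout, and the method names and grouping key (the first line of each failure message) are hypothetical choices.

```ruby
# Illustrative sketch: collate failures from several TLB agent logs.
# ASSUMPTION: logs are already fetched and use Test::Unit's classic layout:
#   "  1) Error:\ntest_name(TestClass):\nFirst line of the message"

# Captures the test name, the test class and the first line of the message.
FAILURE_PATTERN = /^\s*\d+\) (?:Failure|Error):\n(\w+)\(([\w:]+)\).*\n(.+)$/

# Returns { first message line => ["TestClass#test_name", ...] }, so tests
# that fail for the same reason end up in the same bucket.
def collate_failures(agent_logs)
  groups = Hash.new { |hash, key| hash[key] = [] }
  agent_logs.each do |log|
    log.scan(FAILURE_PATTERN) do |test_name, test_class, message|
      groups[message.strip] << "#{test_class}##{test_name}"
    end
  end
  groups
end

# Prints the most common failure reasons first.
def report(groups)
  groups.sort_by { |_message, tests| -tests.size }.each do |message, tests|
    puts "#{tests.size} test(s): #{message}"
    tests.each { |name| puts "    #{name}" }
  end
end
```

Sorting by bucket size puts the failure reason that affects the most tests at the top, which is usually the best candidate for a single fix.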
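A progress counter along those lines can be sketched by summing the Test::Unit summary lines (“N tests, N assertions, N failures, N errors”) across all agent logs. Progress and progress_from_logs are hypothetical names, and treating errors the same as failures is an assumption of this sketch.

```ruby
# Illustrative sketch: sum Test::Unit summary lines from every agent log.
# ASSUMPTION: each log contains lines like
#   "7030 tests, 21000 assertions, 12 failures, 3 errors"
SUMMARY_PATTERN = /(\d+) tests, \d+ assertions, (\d+) failures, (\d+) errors/

Progress = Struct.new(:total, :failed) do
  def passed
    total - failed
  end

  # Percentage of tests passing, rounded to one decimal place.
  def percent_passing
    total.zero? ? 0.0 : (100.0 * passed / total).round(1)
  end
end

def progress_from_logs(agent_logs)
  totals = Progress.new(0, 0)
  agent_logs.each do |log|
    log.scan(SUMMARY_PATTERN) do |tests, failures, errors|
      totals.total += tests.to_i
      totals.failed += failures.to_i + errors.to_i
    end
  end
  totals
end
```

Printing progress.passed, progress.total and progress.percent_passing after each CI run is enough for a one-line stand-up update.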
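The core of such a runner might look like the sketch below. It assumes JRuby 1.7, whose --1.8 and --1.9 switches select the compatibility mode; the --runtime option, the method names and the *_test.rb naming convention are hypothetical, and the Selenium start/stop step is left out.

```ruby
require "optparse"

# Illustrative sketch: run the same test files under one or both JRuby 1.7
# compatibility modes (JRuby's --1.8 / --1.9 switches).
RUNTIMES = %w[1.8 1.9].freeze

# Returns [options, remaining_paths]; --runtime accepts 1.8, 1.9 or both.
def parse_options(argv)
  options = { runtimes: RUNTIMES }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: dual_run [--runtime 1.8|1.9|both] FILE_OR_DIR..."
    opts.on("--runtime VERSION", "1.8, 1.9 or both (default: both)") do |version|
      options[:runtimes] =
        case version
        when "both" then RUNTIMES
        when *RUNTIMES then [version]
        else raise OptionParser::InvalidArgument, version
        end
    end
  end
  [options, parser.parse(argv)]
end

# Directories expand to every *_test.rb file inside them, so a whole
# folder can be run in one go.
def expand_test_files(paths)
  paths.flat_map do |path|
    File.directory?(path) ? Dir.glob(File.join(path, "*_test.rb")).sort : [path]
  end
end

# One command per runtime. ARGV is cleared before loading the files so that
# Test::Unit's autorunner does not treat the file names as its own options.
def commands_for(runtimes, files)
  loader = "files = ARGV.dup; ARGV.clear; files.each { |f| load File.expand_path(f) }"
  runtimes.map do |version|
    ["jruby", "--#{version}", "-Itest", "-e", loader, *files]
  end
end

# Runs every command and returns true only if all runtimes passed.
def run_all(commands)
  commands.map { |command| system(*command) }.all?
end
```

Keeping command construction separate from execution makes it easy to print the exact commands, and to rerun a single runtime by hand when the two disagree.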
The one thing that kept our confidence high while making code changes that affected multiple parts of the system was our extensive unit, functional and acceptance test coverage. We could be sure that nothing major was regressing. We were able to parallelise the upgrade work by fixing each type of test (unit, functional and acceptance) separately, and the same tests made it safe to keep developing new features alongside the upgrade.
We always worked on the master branch of our project and never branched out. From previous experience, we knew the merge hell that follows when code changes are made all across an app. We did have a few places in the code where we switched method calls based on the current Ruby version, but we were happy to accept that cost for the advantages it gave us. It also meant we could deploy the same app and switch the runtime when desired, which made testing and deploying to 1.9 environments far simpler.
Another thing that helped us succeed was making sure someone was always working on the upgrade. Even while delivering customer-facing features, at least one pair was working on the upgrade. Soon we were able to assign more pairs and parallelise most of the work.
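A version switch of that kind can be as small as the hypothetical helper below. It assumes RubyGems is available for the version comparison; Gem::Version compares version segments numerically, which avoids the pitfalls of comparing version strings character by character.

```ruby
require "rubygems" # loaded automatically on 1.9+, needed explicitly on 1.8

# Hypothetical flag computed once at boot.
RUBY_19_OR_LATER = Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("1.9")

# On 1.9+ strings carry an encoding, so raw bytes are re-tagged as UTF-8.
# On 1.8 strings are plain byte sequences and have no #force_encoding,
# so the value is returned unchanged.
def as_utf8(text)
  RUBY_19_OR_LATER ? text.dup.force_encoding("UTF-8") : text
end
```

Keeping the check in one constant makes the version-specific branches easy to find and delete once the old runtime is gone.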
So what was our biggest learning from this upgrade? Ruby 1.8 was very forgiving, and therefore it turned out to be indirectly punishing too. Here’s why:
String/text encoding caused a lot of issues. Ruby 1.8 treated every string as a stream of bytes and made little or no inference about its encoding, especially for strings that originated outside the Ruby VM. Ruby 1.9 attaches an encoding to every string and, by default, treats source files (and therefore string literals) as US-ASCII, so we had to explicitly mark strings as UTF-8 for them to be interpreted as such. Adding the magic comment (# encoding: utf-8) at the top of Ruby source files fixed a lot of these errors.
Also, in Ruby 1.8, String#each split the string on the default record separator (“\n”) and iterated over each line. In Ruby 1.9 the same call raises a NoMethodError, because String no longer has Enumerable as an ancestor. This led to a lot of failing tests, and we had to fix the parameters passed to the affected methods.
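Both fixes can be sketched in plain Ruby. The helpers below are illustrative (utf8_text and line_lengths are hypothetical names) and assume that text arriving from outside the VM is meant to be UTF-8.

```ruby
# encoding: utf-8
# The magic comment above is what 1.9 needed for non-ASCII literals in a
# source file; Ruby 2.0+ assumes UTF-8 source by default.

# Re-tags bytes read from outside the VM as UTF-8 and rejects invalid input.
# ASSUMPTION: the external data is meant to be UTF-8.
def utf8_text(raw_bytes)
  text = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "input is not valid UTF-8" unless text.valid_encoding?
  text
end

# Ruby 1.8 code such as text.each { |line| ... } iterated over lines.
# String has no #each on 1.9+, so iterate explicitly with #each_line.
def line_lengths(text)
  text.each_line.map { |line| line.chomp.length }
end
```

Note that length counts characters, not bytes, once the string is tagged as UTF-8, so "Café" has length 4 even though it occupies 5 bytes.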
In Ruby 1.8, Array#to_s joined the string versions of all the objects inside the array. In Ruby 1.9, Array#to_s is an alias for Array#inspect, so the result separates the elements with commas and encloses them in square brackets, which makes it look more like an array. This broke many parameters we passed to Rails methods and led to a few unusual failures.
Most modern browsers use UTF-8 as the default encoding, so parameters containing special characters arrive as multi-byte sequences, which Ruby 1.9 treated as US-ASCII by default. This broke the functionality where query params contained special or accented characters. We had to explicitly convert all query parameters to UTF-8 in a before_filter on ActionController::Base.
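The difference is easy to see in irb. The snippet runs on Ruby 1.9 or later, with the 1.8 result noted in a comment; ids and names are just sample data.

```ruby
ids = [12, 15, 21]
names = ["todo", "done"]

# Ruby 1.8: ids.to_s was "121521" (elements joined, no separator).
# Ruby 1.9+: Array#to_s is an alias of Array#inspect.
ids.to_s          # => "[12, 15, 21]"
names.to_s        # => "[\"todo\", \"done\"]"

# String interpolation calls #to_s, so it changes too.
"id=#{ids}"       # => "id=[12, 15, 21]"

# Code that relied on the old behaviour should call #join explicitly.
ids.join          # => "121521"
names.join(", ")  # => "todo, done"
```

Calling join with an explicit separator also documents the intended output, which the old implicit to_s never did.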
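A hypothetical version of such a filter is sketched below: it walks a nested params structure and re-tags (rather than transcodes) every string as UTF-8. The module name and the wiring shown in the comment are illustrative, not Mingle's code, and the sketch assumes the incoming bytes really are UTF-8 without validating them.

```ruby
# Illustrative sketch. In a Rails 2.3 app it could be wired up roughly as:
#
#   class ApplicationController < ActionController::Base
#     before_filter { |controller| Utf8Params.force!(controller.params) }
#   end
module Utf8Params
  module_function

  # Walks nested hashes/arrays and re-tags each String's bytes as UTF-8 in
  # place. Returns the same (mutated) object for convenience.
  def force!(value)
    case value
    when Hash
      value.each_value { |v| force!(v) }
    when Array
      value.each { |v| force!(v) }
    when String
      value.force_encoding(Encoding::UTF_8) unless value.frozen?
    end
    value
  end
end
```

Mutating the strings in place avoids having to replace the params object on the controller; frozen strings are skipped because force_encoding cannot modify them.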
Ruby 1.8 was far more lenient about the number of arguments passed to a lambda/proc, whereas Ruby 1.9 raises an ArgumentError if a lambda is called with a different number of arguments from the number its definition declares (procs created with Proc.new or proc remain lenient). This caused us quite a bit of trouble, as we were using OpenStruct with procs as members to test anonymous models. We had to change the definitions of the lambdas we used.
The format in which Date/Time objects are printed and parsed changed between the two Ruby versions. This caused test failures, and we had to set the format explicitly wherever a particular one was required. As most projects in Mingle use the default US-style date-time format, we had to change quite a lot of code to parse that existing format.
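The difference, and two ways of fixing a lambda-based stub, fit in a few lines. The snippet runs on Ruby 1.9 or later; the stub lambdas are hypothetical, not Mingle's test helpers.

```ruby
# Ruby 1.9+ lambdas enforce their declared arity; procs made with Proc.new
# or Kernel#proc stay lenient (in 1.8, Kernel#proc itself built a lambda).
strict  = lambda { |card| "card #{card}" }
lenient = proc { |card| "card #{card}" }

lenient.call(7, :extra)   # => "card 7" (extra argument dropped)

begin
  strict.call(7, :extra)
rescue ArgumentError => e
  puts "lambda rejected the call: #{e.message}"
end

# Two ways to fix a stub whose callers pass extra arguments:
exact    = lambda { |card, _options| "card #{card}" } # declare every argument
flexible = lambda { |card, *_rest| "card #{card}" }   # or accept any extras
exact.call(7, :extra)     # => "card 7"
flexible.call(7)          # => "card 7"
flexible.call(7, :a, :b)  # => "card 7"
```

Declaring every argument keeps the stub honest about its callers; the splat form is handier when the same stub is called in several ways.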
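A sketch of making the US-style format explicit, using the standard Date.strptime and Date#strftime; the helper names and the assumed format %m/%d/%Y are illustrative. On Ruby 1.9 and later, Date.parse reads a slash-separated date as day/month/year, so a string like "09/13/2016" fails there (month 13).

```ruby
require "date"

# Hypothetical helpers: state the US-style format explicitly instead of
# relying on Date.parse, whose slash handling changed in 1.9.
US_DATE_FORMAT = "%m/%d/%Y".freeze

def parse_us_date(text)
  Date.strptime(text, US_DATE_FORMAT)
end

def format_us_date(date)
  date.strftime(US_DATE_FORMAT)
end
```

Pairing the parser and the formatter on the same format constant keeps round trips stable regardless of how Date#to_s prints by default.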
In Ruby 1.8, listing the instance methods of a class returned the method names as strings; Ruby 1.9 returns symbols, so a check such as instance_methods.include?("method_name") that used to succeed silently returns false. This led to a peculiar issue: a class was loaded during a database migration and reloaded by Rails after the migrations, which caused a method to be chained with itself, resulting in an infinite loop and a "stack level too deep" error. A better approach is to use Object#respond_to? to check whether a particular method is present.
The YAML parser in Ruby 1.9 changed to Psych from Ruby 1.8's Syck/Yecht. This led to various differences in YAML document support between the two versions of Ruby. Psych conforms strictly to the YAML specification, while Syck is more forgiving and accepts some constructs that the specification does not allow. These changes caused a lot of tests to fail, and we had to change a few things around YAML parsing to get everything working. We also had to port some patches from Rails 3 to fix issues specific to serialization to and deserialization from YAML.
We hope our learnings are useful to anyone attempting to upgrade their application to Ruby 1.9. We look forward to reaping all the performance benefits that Ruby 1.9 promises!
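The snippet below (run on Ruby 1.9 or later, with a hypothetical Card class) shows the symbol change and a guarded alias chain. Module#method_defined? is the class-level counterpart of the respond_to? check: it accepts a String or a Symbol, so the guard behaves the same on both runtimes, and a second load of the class body becomes a no-op instead of a self-referencing alias.

```ruby
class Card
  def save
    "saved"
  end
end

# Ruby 1.8 returned method names as Strings; 1.9+ returns Symbols.
Card.instance_methods(false)           # => [:save]
Card.instance_methods.include?("save") # => false on 1.9+ (true on 1.8)

# Checks that work on both: ask the class or the object directly.
Card.method_defined?(:save)            # => true (also accepts "save")
Card.new.respond_to?(:save)            # => true

# Hypothetical alias chain. Without the guard, loading this body twice would
# alias save_without_audit to save_with_audit, so save would call itself
# until it raised SystemStackError (stack level too deep).
class Card
  def save_with_audit
    "audited " + save_without_audit
  end

  unless method_defined?(:save_without_audit)
    alias_method :save_without_audit, :save
    alias_method :save, :save_with_audit
  end
end
```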
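As an illustration of Psych's strictness (not necessarily one of the cases we hit), the YAML specification reserves the @ indicator, so Psych refuses a plain scalar that starts with it; quoting the value fixes the document. load_yaml is a hypothetical helper.

```ruby
require "yaml"

# Hypothetical helper: returns the parsed document, or nil (with a warning)
# when Psych rejects the input as invalid YAML.
def load_yaml(text)
  YAML.load(text)
rescue Psych::SyntaxError => e
  warn "invalid YAML: #{e.message}"
  nil
end

load_yaml("owner: @admin")   # => nil ("@" cannot start a plain scalar)
load_yaml("owner: '@admin'") # returns a Hash mapping "owner" to "@admin"
```

Rescuing Psych::SyntaxError in one place makes it easy to log exactly which stored documents need fixing before they can be read on the new runtime.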