A Unique Production Issue at JSS

The first Bahmni implementation was done at the Jan Swasthya Sahyog (JSS), a hospital in rural Chhattisgarh in India. The implementation has been ongoing for two years. I’ve recently become a part of this project to take on some of the responsibility for this implementation. In my first week on the project, I faced a unique production issue. I call it unique because, as an Application Developer, I had never encountered an issue like this before, and even more so because of the challenges that made it difficult to solve.

# What happened?

I was in Bangalore to understand the context of the project from the team when we got a call from JSS. They informed us that an electric surge had caused the EMR (Electronic Medical Record) application and the Internet to go down. While the Internet outage at JSS didn’t fall under the ThoughtWorks project purview, the application did. For the benefit of those who aren’t aware, JSS is located at least a day’s travel away from any ThoughtWorks location in India.

# So, what did this mean?

Production being down is of the highest priority/severity for any client that we work for, irrespective of the domain. We take a downtime of even a few minutes very seriously. Depending upon the role that the application plays, its downtime could hurt the client in many ways - ranging from revenue loss to legal issues to reputation damage to loss of productivity. In this case, the application being down meant an increase in the wait time and distress of patients, with more than 50 percent of them suffering from chronic and serious illnesses like tuberculosis (TB), cancer and diabetes.

# Why was this a difficult problem to solve?

We had to get to the root of the problem. We started by speaking to the people who handle the IT systems at JSS. Troubleshooting turned out to be tricky and cumbersome for two major reasons:

  • No Internet. This meant:
    • The option of logging in over VPN to troubleshoot was ruled out
    • Platforms like email, Skype and WhatsApp, which allow faster and better communication through file sharing (mainly images, in addition to text), were ruled out
  • The IT staff at JSS have limited knowledge of the application internals and UNIX-like systems, which meant:
    • All troubleshooting actions had to be explained step by step over the phone
    • All troubleshooting commands had to be dictated letter by letter. For instance, for them to key in "ls /etc/sysconfig/", we had to say “ls space slash etc slash sys tab <look at what is coming up> <select sysconfig> …”
    • The results expected from troubleshooting commands/actions had to be explained to them verbally

These challenges increased the resolution time.

Initially, only Mujir (a senior developer on the project with good knowledge of JSS and Bahmni) was looking into this. But after listening to the challenges he was facing, a couple of us decided to team up with him to find a solution to the problem. It was especially intriguing to me, given that I was going to take on some responsibility for this implementation in a few weeks, working at the site.

# What was done?

After many phone conversations which lasted a few hours, we started to look at hardware failures. We did the following:

  • Replace Network Interface Card (NIC)

We zeroed in on the NIC on the application server box as one of the probable issues. The NIC had probably burned out because of the power surge. Narain, who had some experience handling computer systems, had recently joined JSS and was working with us on the issue. He suggested that we replace the NIC with one of the spares available to see if that fixed the problem. However, this was a server box and none of us had hands-on experience with server hardware modifications. We hadn’t interacted much with Narain before and were unaware of his experience in handling this, so we were conflicted about supporting him remotely. We even started exploring the option of having someone from the project travel to JSS, preferably from the Hyderabad office as it is the closest. But the earliest anyone could reach JSS was the afternoon of the next day.

While the logistics of a team member travelling to JSS were being sorted out, we deliberated and decided to go ahead with remotely supporting Narain in replacing the NIC as the immediate course of action. He seemed quite confident of being able to replace the card. After opening the box, Narain gave us the bad news that the card was integrated with the motherboard and couldn’t be replaced. What’s worse, the phone got disconnected. Even as we quickly started evaluating other options, Narain called back a few minutes later with the good news that he had fitted the card into an external slot that was available.

  • Replace Hard Drive

The next task was to get the box onto the network. We were not able to get this done quickly as we were all fairly new to this exercise, and the limited time that we had was slipping by. So we proposed another solution. We had a master-slave setup at JSS for data replication, with the application server on the master hardware. Both the master and the slave have the same hardware configuration. We decided to swap the hard disk of the master with that of the slave so that the slave hardware could be used as the master.

  • Set Up Network Interface

But this meant that the MAC address mapping would change, and we landed back at the same problem: the network interface wasn’t working. This happens sometimes when troubleshooting. We stayed calm and went through the process again. After re-reading and exploring, we found that we had been missing one of the steps: creating the ifcfg-<network interface> file. By following all the steps correctly, we were able to bring the system up. Phew!
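To make the missing step concrete, here is a minimal sketch of what such an interface file might contain. It assumes a RHEL/CentOS-style layout in which the file lives under /etc/sysconfig/network-scripts/ and uses keys such as DEVICE, HWADDR and IPADDR; that directory, the helper names `render_ifcfg` and `ifcfg_path`, and every address below are illustrative assumptions, not the actual JSS configuration.

```python
def ifcfg_path(device):
    """Return the assumed location of the config file for one interface."""
    # Assumption: RHEL/CentOS-style network-scripts directory.
    return f"/etc/sysconfig/network-scripts/ifcfg-{device}"


def render_ifcfg(device, mac, ip, netmask, gateway):
    """Return the text of a static-IP ifcfg file for one interface."""
    settings = [
        ("DEVICE", device),
        # HWADDR ties the interface name to the MAC address of the card,
        # which is the mapping that changes when the hardware is swapped.
        ("HWADDR", mac.upper()),
        ("BOOTPROTO", "none"),  # static addressing, no DHCP
        ("ONBOOT", "yes"),      # bring the interface up at boot
        ("IPADDR", ip),
        ("NETMASK", netmask),
        ("GATEWAY", gateway),
    ]
    return "".join(f"{key}={value}\n" for key, value in settings)


if __name__ == "__main__":
    # Placeholder values from documentation address ranges, not real ones.
    print(ifcfg_path("eth1"))
    print(render_ifcfg("eth1", "00:00:5e:00:53:01",
                       "192.0.2.10", "255.255.255.0", "192.0.2.1"))
```

The sketch only prints the text. Writing the file and restarting networking would still be done on the server itself, and the exact keys needed may differ by distribution and version.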

It had been almost 7 hours since we started looking into this issue. While we were working on getting the application up, work to fix the Internet was going on simultaneously, and the Internet was up by the time we got the application up. Thankfully, we could log in over the VPN and verify what had been done. The slave box was happily playing the role of master. However, replication was broken as there was no box acting as slave. By this time, all of us, including Narain at JSS, were confident of setting up the network interface, amongst other things. So we quickly fitted the hard disk that we had taken out of the slave box into the master box and set up the network interface and IPs appropriately. This exchanged the roles of master and slave played by the two machines, but everything was back up again. We verified that everything was up and in order before breathing a sigh of relief.

All’s well that ends well and this time, it taught us as well. Here are my learnings from this experience. 

# Learnings

  • Power Surge and Surge Protectors

I learnt about power surges and surge protectors. Here’s what you should know. Power surges are short, fast spikes in the electricity being supplied to a power outlet. Many events can cause power surges, such as lightning strikes, power outages, short circuits, electromagnetic pulses, and large machines on the same power line being turned on or off. When a computer is plugged into an outlet or connected to a router via a cable, it is vulnerable to power surges. Power surges can wreak havoc on computer hardware and can physically fry the network interface card.

Earlier, network cards were built as expansion cards that plugged into a computer bus. But with the increased usage of networks and the Internet, they now come built directly into the motherboard, which makes them smaller, more delicate and more vulnerable to such surges.

While nothing can guarantee absolute protection from a direct or very close lightning strike, computers and components like the NIC can, to some extent, be protected from mild power surges by using surge protectors, which are devices designed to absorb the harmful effects of surges in electrical power. When a power surge happens, there is a chance that the surge protector itself gets destroyed. But it’s always easier and cheaper to replace a surge protector than a NIC.

  • Failover Mechanism

Simply put, a failover means having another redundant system to switch over to when the previously active system fails. We have it on all systems that we build. In this case, however, even though data redundancy and backup were available, system failover was not implemented due to other priorities. But now, with the system having reached enough usage and importance, we’ve learnt the lesson of prioritising failover the hard way.
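As a rough illustration of the idea, and not of anything deployed at JSS, the core of a failover decision can be sketched as picking the first healthy system from a priority-ordered list. The function name `choose_active`, the node names and the health lookup below are all hypothetical.

```python
def choose_active(nodes, is_healthy):
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


if __name__ == "__main__":
    # Hypothetical health status: the primary box is down, the standby is up.
    health = {"master": False, "slave": True}
    active = choose_active(["master", "slave"], health.get)
    print(active)
```

A real failover setup also needs a reliable health check and a way to redirect users to the new active system, which is where most of the effort goes.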

  • Steps to replace NIC card

The steps to replace a NIC and configure the new one are not very difficult and can be found on the internet easily. They might vary slightly, depending on the operating system being used. Even so, it’s good to know them beforehand so that actions are not taken in panic when a failure happens.

  • Remote Support Options

Dealing with support issues over the phone is very difficult. A simple tool like AirDroid comes in handy. Since it was cumbersome to type SMS messages on a phone, we started using AirDroid, which allowed us to send SMS messages that we had typed on our computers.

  • Learning while doing

Last but not least, I got first-hand context of the production system and the challenges surrounding it.

To conclude, my observation has been that the challenges and learnings in resource-constrained and remote environments can be quite intriguing and unique compared to developing enterprise applications. If you are a technologist wanting to solve problems in such environments, then apart from application-specific knowledge, the ability to deal with hardware and network issues, an awareness of frugal solutions like the AirDroid setup we used in this case, and the ability to communicate with users who are not very technically savvy could turn out to be very necessary arrows in your quiver! It always fascinates me to see how doctors and engineers are collaborating on this project to improve health care for the underprivileged.

# Relevant Reading:

A few articles that I found worth a read after dealing with the issue