By: Wayne Rash
Updated: The company says the shutdown that stopped e-mail service to BlackBerry users resulted from a software upgrade that went awry.
BlackBerry maker Research in Motion announced late on April 19 that it had determined the apparent cause of the shutdown that stopped e-mail service to BlackBerry users throughout North America earlier in the week.
According to a statement from the Waterloo, Ontario-based company, the shutdown on April 17 was related to a software upgrade that went awry, followed by a failover process that also didn’t work properly.
The BlackBerry blackout happened when the company introduced a new, noncritical system routine into its database, officials said. The routine, according to RIM, was designed to improve cache optimization but instead caused a series of interaction errors between the databases and the cache.
"After isolating the resulting database problem and unsuccessfully attempting to correct it, RIM began its failover process to a backup system," company officials said in a statement. Officials said that the company had repeatedly tested the failover process successfully, but this time something went wrong.
"The failover process did not fully perform to RIM's expectations in this situation and therefore caused further delay in restoring service and processing the resulting message queue," officials said in the statement.
The company’s statement goes on to say that its analysis continues and that it has identified certain aspects of its testing, monitoring and recovery process that need to be fixed to prevent this from happening again. "RIM apologizes to customers for inconvenience resulting from the service interruption," company officials said in the statement.
Jack Gold, principal analyst at J. Gold Associates in Northborough, Mass., thinks users shouldn’t be too surprised at the outage.
"I can’t fault them too much because this happens to everyone. I’m inclined to cut them a break here because they didn’t do anything that they thought would adversely affect the systems," he said.
"What it sounded like they were doing was putting some code into the system that would make it more efficient."
Gold said it’s clear that RIM’s testing didn’t work. He said it was also clear that the company needed redundancy that actually worked, backed by a reliable failover method.
"They sort of had redundancy. They actually have another NOC running in Europe. The NOC in the UK is set up for Europe. In theory, if the NOC fails, you should be able to flash over to the other NOC. Apparently that didn’t work, either."
Gold also said that RIM made a big mistake in how it communicated with users or, more accurately, failed to communicate.
"They really need to get better at communicating with their end users so they know what’s going on. We can usually deal with it if we know what’s going on. For several hours RIM wasn’t really very forthcoming. You really need to tell your users what you know and what you plan to do about it," Gold said.
Craig Mathias, principal of Farpoint Group, said that RIM needs to review its procedures. "Mission-critical systems with single points of failure are a problem," Mathias said. "It’s very difficult to architect solutions that don’t fail, but it can be done. The military does it."
Mathias noted that RIM’s is not the only e-mail system with such problems. "A lot of people use the BlackBerry as their primary e-mail service, and when it's down it's down," he said. "But that’s the problem with e-mail; it’s loaded with single points of failure."
Mathias said that RIM needs to design a more reliable solution and that the technologies to solve these problems are available. Mathias added that a key rule of business is, "never...