An Untested Plan Is Worse Than No Plan At All

I can only assume that someone at TATA’s London Data Center knew this at one time. However, when the rubber met the road, all they got were skid marks. As you can see from this marketing brochure, the state-of-the-art facility was designed with an uninterruptible AC power supply (UPS) plus redundant generators and utility feed.

But when the power went out and the UPS failed, firms including C4L, ServerCity and Coreix were taken out with it. C4L’s report to its customers said: “We found it very difficult to get a hold of our supplier as it appears they base their entire operations out of this data center. Phones were down and emails simply bounced back.”

Even more interesting is that TATA appears to have been completely unaware of the power outage until a customer contacted them at 4:45 PM to report that the customer’s own monitors were showing the data center’s temperature rising. At 6:55 PM a power engineer arrived to find that the UPS batteries were depleted and that the three generators had failed to start.

In other words, the data center was dead and stayed that way until systems started coming back online at 7:30 PM. TATA finally called its customers at 9:50 PM to let them know that utility power was back but that the facility would remain at risk for another 8 hours, until the UPS batteries were fully charged.

What Did We Learn From This?

I am playing armchair business continuity (BC) planner here, and I could be wrong in my assumptions, but here they are so that you can check your own BC plan for similar issues.

  • It appears that TATA was not monitoring the temperatures (or power for that matter!) in its data center, either of which could have provided an indication that something was wrong. Do you monitor your data center’s vital signs at multiple locations?
  • When the power went out, there was no way to reach TATA because they apparently base all of their operations out of one data center instead of distributing them. If you run mission-critical operations, do you have a backup site? At a minimum, do you have a backup site for your organization’s command and control functions?
  • TATA’s phones apparently don’t work when the power goes out. If you rely on Voice over Internet Protocol (VoIP) phones, they will stop working when the power fails unless the phones and the network gear behind them have backup power. You should have at least one Plain Old Telephone Service (POTS) line and handset in every critical location so that you can make and receive calls when the rubber meets the road.
  • Why didn’t the generators come online? Were they regularly tested? Did they have fuel?
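
To make the monitoring question concrete, here is a minimal Python sketch of a vital-signs check. Everything in it is my own assumption for illustration: the sensor names, the 27 C threshold, and the `Reading` and `find_alerts` names are hypothetical and have nothing to do with how TATA or anyone else actually monitors a facility. The one idea worth borrowing is that a sensor that goes silent is treated as an alert too, because a dead monitoring feed looks exactly like a quiet one.

```python
from dataclasses import dataclass

# Assumed threshold for illustration only; use the limits your equipment requires.
MAX_TEMP_C = 27.0


@dataclass
class Reading:
    sensor: str   # hypothetical sensor name, e.g. "rack-b"
    kind: str     # "temp_c" or "utility_power"
    value: float  # degrees Celsius, or 1/0 for power present/absent


def find_alerts(readings, expected_sensors):
    """Return alert messages for hot spots, lost power, and silent sensors."""
    alerts = []
    seen = set()
    for r in readings:
        seen.add(r.sensor)
        if r.kind == "temp_c" and r.value > MAX_TEMP_C:
            alerts.append(f"{r.sensor}: {r.value:.1f} C is above {MAX_TEMP_C:.1f} C")
        elif r.kind == "utility_power" and r.value == 0:
            alerts.append(f"{r.sensor}: utility power lost")
    # A sensor that did not report at all is also a warning sign.
    for name in sorted(set(expected_sensors) - seen):
        alerts.append(f"{name}: no reading received")
    return alerts


if __name__ == "__main__":
    sample = [
        Reading("rack-a", "temp_c", 24.0),
        Reading("rack-b", "temp_c", 31.5),
        Reading("feed-1", "utility_power", 0),
    ]
    for line in find_alerts(sample, {"rack-a", "rack-b", "rack-c", "feed-1"}):
        print(line)
```

Whatever tool does this for you, it should run somewhere that does not share power or network with the data center it is watching.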

Some Other Thoughts

  • Do you know how much time you have between a power failure and battery depletion?
  • Do you have an emergency shutdown plan for your servers in case the power fails and the generators don’t come online? A controlled shutdown is better than a crash, which could corrupt disk volumes and databases.
  • Do you have a process to follow when the power comes back online? Do you know which systems need to come up before other systems?
  • Do you have a systems priority list to follow if you don’t have enough power to bring the whole data center back online, so that you know which systems need to come up first and which can wait?
  • If you have generators, do you ensure that they are maintained, fueled, and regularly tested with the full data center running off of them?
  • Does your change control process include reevaluating your UPS capacity when new equipment is installed in protected areas?
  • If you host your servers or rely on cloud computing, do you know with certainty how they are being protected?
  • If you are hosting mission-critical operations, have you thought about using multiple service providers?
  • Do you have alternate power supplies for life safety and security systems?
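
The questions above about start-up order and limited power can be written down as a small planning exercise. The Python sketch below is only an illustration under my own assumptions: the system names, wattages, and priorities are made up, and `restart_plan` is a hypothetical helper, not a feature of any real product. It starts a system only when everything it depends on is already up and its power draw still fits the remaining budget, and it uses the standard library’s `graphlib` module (Python 3.9 or later) to keep dependencies in order.

```python
from graphlib import TopologicalSorter


def restart_plan(systems, budget_watts):
    """Return (start_order, deferred) for a limited power budget.

    systems maps name -> {"watts": int, "priority": int, "needs": [names]}.
    Lower priority numbers are more urgent. Every name listed in "needs"
    is assumed to be a key of systems as well.
    """
    sorter = TopologicalSorter({n: s["needs"] for n, s in systems.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    start_order, deferred = [], []
    started = set()
    remaining = budget_watts
    ready = list(sorter.get_ready())
    while ready:
        # Among systems whose dependencies were considered, take the most urgent.
        ready.sort(key=lambda n: (systems[n]["priority"], n))
        name = ready.pop(0)
        deps_up = all(d in started for d in systems[name]["needs"])
        if deps_up and systems[name]["watts"] <= remaining:
            remaining -= systems[name]["watts"]
            started.add(name)
            start_order.append(name)
        else:
            deferred.append(name)  # a dependency is down or power is short
        sorter.done(name)
        ready.extend(sorter.get_ready())
    return start_order, deferred


if __name__ == "__main__":
    # Hypothetical inventory; the numbers are invented for the example.
    inventory = {
        "network": {"watts": 500, "priority": 1, "needs": []},
        "storage": {"watts": 2000, "priority": 1, "needs": ["network"]},
        "database": {"watts": 1500, "priority": 2, "needs": ["storage"]},
        "web": {"watts": 1000, "priority": 3, "needs": ["database"]},
        "batch": {"watts": 3000, "priority": 4, "needs": ["storage"]},
    }
    order, waiting = restart_plan(inventory, budget_watts=5000)
    print("start in this order:", order)
    print("leave off for now:", waiting)
```

This is a greedy sketch, so it works best when a dependency is rated at least as urgent as the systems that need it. The real value is in the inventory itself: if you cannot fill in that table for your own data center, you do not yet have a restart plan.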

Any other thoughts on this topic?
