A Sidekick In The Pants – Part 2

In our last episode, we saw that thousands to tens of thousands of T-Mobile Sidekick customers may have lost all of their data, including contacts, calendars, e-mail, and photos because something failed in the cloud. I said that this is a teachable moment, and I would like to take a bit more of your time to cover the issues in depth. In one of my earliest posts I provided a list of questions that you should be asking your cloud provider. Amongst them were:

  • Does the provider back up your data or is that left to the customer?
  • How many generations of backups are maintained in case you need to recover from a data corruption issue?
  • Are backups protected from theft and damage?
  • Are backups encrypted?

RAID Is Not A Backup

I want to make it very clear that RAID (Redundant Array of Independent — or Inexpensive — Disks) is not a backup and neither is data replication. While both of these technologies ensure that your data is in more than one place, usually in real time to near real time, neither of them can protect you from data corruption. If something corrupts the primary copy of your data, the corruption could be replicated to the backup copy. And a RAID controller gone mad also can corrupt or destroy your data.

Let’s take a closer look at both of these scenarios. Depending on the RAID level configured, the array will keep your data safe if one or more of the disks in the array fail. When you replace the failed disk(s), the array automatically will rebuild itself to protect you from the next failure. However, if more than the maximum allowable number of disks fail, or the array controller fails, data can become corrupted or lost entirely.
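To make the single-disk case concrete, here is a minimal sketch of the XOR parity idea behind single-parity RAID levels such as RAID 5. The function names and the toy three-disk stripe are my own illustrative assumptions, not any vendor's implementation; real controllers stripe data and rotate parity across the disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical stripe: three data "disks" and one parity block.
disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(disks)

# Disk 1 dies: the array can rebuild it from the other two plus parity.
recovered = rebuild_missing([disks[0], disks[2]], parity)
print("rebuilt disk 1 correctly:", recovered == b"BBBB")

# A bad write lands on disk 1 and parity is dutifully updated to match.
disks[1] = b"B?!B"
parity = xor_blocks(disks)

# Disk 1 dies again: the rebuild faithfully restores the corrupted block.
rebuilt_after_bad_write = rebuild_missing([disks[0], disks[2]], parity)
print("rebuild returns the corrupted data:", rebuilt_after_bad_write == b"B?!B")
```

The second half is the point of this post: parity protects whatever was written, good or bad. It guards against a dead disk, not against bad data.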

In 2002, GunBroker.com suffered from a RAID corruption issue which took them down for 40 hours. Apparently the corruption was caused by a technician running a remote status check on the array. When the tech logged out, the main storage processor locked up and the system was unable to fail over to the backup storage processor. It took the vendor 11 hours to bring the array back online and when they did, GunBroker’s owners asked that a full backup be taken before bringing their servers back online. GunBroker took a few hours to restore an older database backup from tape, then used the majority of the remaining downtime to recover as much data as possible from the corrupted database.

But technology failures are not the only cause of data loss or corruption. If a maintenance technician takes down the wrong volume or an operator drops the wrong SQL table, you also can suffer from data loss. A replicate can cover you in the first incident, but in the second case the same table will be dropped on the replicate as well.
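The difference between those two incidents can be shown with a toy model. This is only a sketch under stated assumptions: `ReplicatedStore` and `take_offline_backup` are hypothetical names, the "database" is a dictionary, and replication is modeled as instantly mirroring every operation, which is the behavior that causes the trouble.

```python
import copy

class ReplicatedStore:
    """Toy store that mirrors every operation to a replica in real time."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, table, rows):
        self.primary[table] = list(rows)
        self.replica[table] = list(rows)   # replicated immediately

    def drop(self, table):
        self.primary.pop(table, None)
        self.replica.pop(table, None)      # the mistake is replicated too

def take_offline_backup(store):
    """Point-in-time copy that later operations on the store cannot touch."""
    return copy.deepcopy(store.primary)

store = ReplicatedStore()
store.write("contacts", ["alice", "bob"])
backup = take_offline_backup(store)

# An operator drops the wrong table.
store.drop("contacts")

print("replica still has contacts:", "contacts" in store.replica)
print("offline backup still has contacts:", "contacts" in backup)
```

Losing the primary volume is survivable here because the replica holds the same data. Dropping the table is not, because the replica obediently drops it too; only the offline copy still has it.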

So the moral of these stories is that using a storage array or data replication does not absolve you from taking backups. Without their backup, GunBroker would have been out of business. And note that if a disk controller doesn’t fail entirely but rather corrupts the data that it is writing, the replicate also will be corrupted. That’s when you need to call for an offline backup copy of the data which was taken before your data was corrupted.

In fact, Microsoft said in an emailed statement that the Sidekick recovery process has been “incredibly complex” because it suffered a confluence of errors from a server failure that hurt its main and backup databases supporting Sidekick users. Oops, I guess they didn’t have an offline backup…

Smartphone Backups

Let’s get back to the Sidekick story to see what other lessons we can learn. The Palm Pre also is backed up in the cloud, and Apple offers MobileMe for their iPhone device. A search for local backup solutions for the Pre came up with The Missing Sync. This software runs on Mac and PC and will create a local backup of much (but not all) of the data on your Pre. As an iPhone user, I know that in addition to MobileMe my device also is backed up within iTunes on my local primary Mac. I have MobileMe configured to replicate my Contacts, Calendar, and other information to all of my Macs, and similar capabilities exist for Windows systems.

Because I use Time Machine and Retrospect to make offline copies of all of my critical information, I should be safe if MobileMe goes offline or corrupts my information. My Retrospect backup media goes into a fireproof safe.

Recovery Point Objective

As part of your overall business continuity plan, you need to determine the Recovery Point Objective, or RPO, for each of your business critical systems. RPO defines how much data you can afford to lose if you suffer a failure, typically expressed as a span of time (for example, the last four hours of changes). It usually falls out of your business impact analysis. Your RPO will help you specify what backup methods you need to use: the lower your RPO, the less data you can afford to lose, and the more frequently you need to capture it. I talk about the various options on page 9 of this paper.
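As a rough illustration of how an RPO translates into a backup schedule, here is a small sketch. It assumes the simplest possible model: backups are instantaneous, always succeed, and are the only protection in place, so the worst-case loss equals the time between backups. The function names and the four-hour RPO are made up for the example.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval):
    """Worst case: the failure hits just before the next backup would run."""
    return backup_interval

def meets_rpo(backup_interval, rpo):
    """True if the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval) <= rpo

rpo = timedelta(hours=4)  # hypothetical target from a business impact analysis

nightly_ok = meets_rpo(timedelta(hours=24), rpo)
hourly_ok = meets_rpo(timedelta(hours=1), rpo)

print("nightly backups meet a 4-hour RPO:", nightly_ok)
print("hourly backups meet a 4-hour RPO:", hourly_ok)
```

In practice backups take time to run and sometimes fail, so the real worst case is longer than the interval alone. Treat this as a lower bound on your exposure.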

But again, RAID or data replication will not protect you from data corruption whether caused by a technology or human failure. No matter how low your RPO, you still need to take offline backups as your final backstop. Luckily GunBroker knew this, and you should take it to heart as well.

Conclusion

If you are a business owner or manage a line of business for an organization, you need to be aware of how your data is being treated. This goes double if it’s in the cloud. Whether it’s the internal IT department, an outsourced IT department, or a cloud provider, you need to have a service level agreement that covers protection of data, including security and backup of it. Don’t just believe what your provider tells you, but ask to run data restoration tests to a dummy system to verify that your data actually can be recovered.
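A restoration test can be as simple as restoring to a scratch location and comparing checksums against the source. The sketch below uses only the Python standard library; `verify_restore` and the sample file names are hypothetical, and a real test would also need to account for data that legitimately changed after the backup was taken.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify_restore(source_root, restored_root):
    """Return (missing, changed) file lists; both empty means a clean restore."""
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    missing = sorted(set(src) - set(dst))
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, changed

# Demo with a deliberately bad "restore" to a dummy system.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "contacts.txt").write_text("alice\nbob\n")
    Path(src, "calendar.txt").write_text("standup 09:00\n")
    Path(dst, "contacts.txt").write_text("alice\n")  # truncated on restore
    missing, changed = verify_restore(src, dst)

print("missing from restore:", missing)
print("restored but different:", changed)
```

If either list comes back non-empty, you have learned something about your backups while it is still cheap to learn it.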

If you are an individual, you are responsible for your own data protection. How do you protect information that you don’t want to get out? How do you protect information that you cannot lose? Many companies make low-cost external disks with one-touch backup and one of these may be right for you. If you have a smartphone which is backed up in the cloud, make your own backups just in case.

If you need help, ask for it. There are plenty of certified business continuity professionals who can help your organization, and plenty of self-help forums for individuals using a specific technology. Don’t leave protection of your data to someone else. Remember, the data G-d helps those who help themselves.

Update: 2009-10-14 00:53 GMT

Alleged details on the events leading up to Danger’s loss of customer data are starting to come out of the woodwork, and it all paints a truly embarrassing picture: Microsoft, possibly trying to compensate for disgusted-and-quit and/or laid-off Danger employees, outsources an upgrade of its Sidekick SAN to Hitachi, which — for reasons unknown — fails to make a backup before starting. The upgrade runs into complications, data is lost, and without a backup to recover from, untold thousands of Sidekick users lose their data in an epic way rarely seen in an age of well-defined, well-understood IT strategies.

If you have been trying without success to get your own organizations to take backups, this might be a good time to ask again. It’s also going to be really fun watching the posturing between Microsoft, Hitachi, and T-Mobile over the next few weeks. Grab a root beer and a tub’o’popcorn and watch the fireworks. Let The Games Begin!
