Multiple Data Centers
It’s no wonder that in our always-on, always-connected world, companies have taken to creating multiple data centers and splitting their data processing across them to meet their recovery time (RTO) and recovery point (RPO) objectives. For more information on RTO and RPO, see this paper. Rather than talking about data centers, let’s talk about the databases and applications within them, since this is what we are really protecting. Transactions are processed by a source database and protected by a target database.
There are many ways to configure multiple databases. They can run in parallel, or some can run idle until another one fails. Two databases running in parallel can run in reciprocal mode (each one processing transactions against half of the database and protecting the other half running in the opposite data center) or in active-active mode (each data center is able to process transactions against the entire database and back up the workload in the opposite data center. With the availability of multiple data centers, source and target processing can get much more complicated. Below are some examples of multiple data center topologies. Gold boxes represent the source database, while maroon boxes represent the target database. When the source database fails, the backup database takes over the entire load, although perhaps at a slower pace.
When a transaction is processed that changes the source database, the changes need to be replicated to the target database(s). There are dozens of methods available to replicate data from one database to another. Some are implemented in hardware, some in software, and some are hybrids. We’ll talk about data replication methods in part 2 of this series. To keep things simple, we’re going to limit our discussion of data replication to two data centers.
In a simplex, or active-passive system, all transactions go to one data center and are replicated to the opposite data center. If the source data center goes down for any reason, the target takes over the load. A downside of the Simplex configuration is that since the target is passive, there is no way to know if it can successfully take over the processing load from the source. What if the login passwords expired and the operations staff cannot login to start the system? What if new applications or a new network configuration were installed on the source system, but someone forgot to migrate them to the target? Also, how do you explain to your management why so much processing power seems to be wasted. There are a few ways to ensure that the target is ready to take over when the need arises. You can swap the source and target systems periodically, say, once a month. You also can fire up the target and process dummy transactions to ensure that everything is working.
But a better way to ensure that the target is ready to take over at a moment’s notice is to process transactions on it all of the time.
In a split workload system, the database is spread across the two data centers but normally only half is active in each data center. For example, if we’re processing debit card transactions, we can split the database so that all cards ending with 0-4 are processed in one data center and the cards ending in 5-9 are processed in the opposite data center. A front end system decides where to route each transaction. When a transaction comes in to the correct data center, the cardholder’s checking account record is read, the amount of the debit is subtracted, and the record is written back to the database. Using home banking, the cardholder can see their account balance at any time.
If one data center fails, the opposite data center ‘unlocks’ the target database and makes it active so that transactions can be processed against it. When the failed data center comes back online, the entire database is synchronized with it and it again takes half of the load.
In an active-active system, both data centers can process transactions against the entire database. No matter what digit the card ends with, each datacenter can process the transaction. Since an active-active system can process all transactions at both data centers, why doesn’t everyone use it? The key problem with an active-active system is data collisions. Assume that two people are using debit cards that are attached to the same account. The following example illustrates why collision prevention or detection is important.
- System A – Checking $100
- System B – Checking $100
- System A – Checking $100 – $50 = $50
- System B – Checking $100 + $30 = $130
Although locally consistent, each system has different views of the customer’s balance, and the final state of the databases on both systems depends on which order the data is replicated. If future or cross-user transactions build off of these transactions (for example, if these were stock buy and sell orders), both databases could wildly diverge from each other in a very short time period. The key to active-active data replication is collision detection and prevention–the databases cannot be allowed to diverge. Best practices for active-active replication can be found in a series of books from Paul J. Holenstein, Bruce Holenstein, and Bill Highleyman called Breaking the Availability Barrier.
The title of this post is “An Online Database Copy Is Not A Backup” and that is absolutely true. Any change that is made to the source database should be replicated to the target database. This includes accidental or intentional modification or deletion of data. If someone drops an SQL table, or deletes thousands of records before they resign, the action is replicated to the target in seconds.
In addition to your online data replication, you also must ensure that you have offline backups that can be restored if necessary to recover damage to your database. How often should you take a backup? That depends on your recovery point objective. But before you back up a running database, ensure that whatever method you use can perform online backups of any protected database file and ensure transactional consistency when the database is recovered. I’ll discuss this in a future post as well.
In my next post, I’ll discuss the issues around the various methods that can be used for basic simplex replication.