Achieving Massive Scalability with SQL Server
With each release, Microsoft SQL Server continues to gain in performance and scalability. A little known fact buried under Oracle marketing is that SQL Server’s TPC-C score is now even higher than Oracle’s. However, few other than Microsoft have millions of dollars of hardware needed to generate the 709,220 transactions per minute achieved in the published TPC-C test result. So for the rest of us, with down to earth budgets for database licenses and hardware, how can we achieve scalability for extremely high volume database driven applications?
Database scalability is a long thought over and debated topic, and there are no easy answers. There are several ways to achieve more scalability out of your database, however the architecture you choose will greatly depend on your unique needs.
When it comes to scalability, there are two high level ways to achieve it –- scaling up or scaling out. Scaling up is to handle volume on a single high performance server, while scaling out is to distribute load among many low cost servers. When scaling up you add processors to your server, when scaling out, you add servers to your cluster.
Scaling up is the easiest fix, where throwing money at the problem will solve it for most, as long as you are willing to scale up your budget. Ultimately, there are levels of volume which even a 32 processor Windows Data Center Server can’t satisfy, and as XML/SOAP Web services usage grows, this kind of super-enterprise scalability will push the most powerful servers to their limits and demand alternate ways of achieving scalability. The following techniques focus on scaling out instead of up.
Microsoft Cluster Services
Before discussing any of the real approaches to scaling out with SQL Server, let’s bring light to a common misunderstanding. Microsoft Cluster Service (MSCS) does not do anything to help performance or scalability. Despite its misleading name, it does not enable you to scale out or distribute traffic. Instead, if one server fails, MSCS will fail over to a backup server, which will resume processing –- eventually. In addition to not adding any benefit in performance or scalability, MSCS can take an embarrassing 1 to 5 minutes to transfer its resources to the backup server and complete the failover process, during which time the database is down.
Federated Database Design
SQL Server Enterprise Edition comes with a feature called Distributed Partitioned Views (DPV), which enables you to create a federated database, to divide processing of queries between multiple separate database servers. With DPV, the database is partitioned into many member tables, storing each of those tables on separate database servers, and then creating a partitioned view over the set of member tables. DPV can be a useful way to gain scalability, as it is built into SQL Server, and client applications do not need to be aware of the fact that the database has been federated.
However, federated database design (which DPV is based on) has a few drawbacks which may prevent it from truly solving your massive scalability problem. The first flaw in this design is known as the ‘convoy effect’. Because multiple machines are involved in the processing of one query, the complete results of that query cannot be assembled until the slowest database in the federation has completed processing its portion of the view. This means that if you use several quad processor servers and one dual processor server in your federation, your queries may only execute at the speed of the dual processor. It also means that even if identical hardware is used on all servers in the federation, the server with the most popular tables (highest data access volume) may have more load than other servers in the federation, and slow response time while other servers in the federation have CPU cycles to spare.
The other big issue with federated databases is availability. Because the database’s data is physically split between servers in the federation, if any one database server goes down, the entire database system becomes unavailable, as it is no longer able to access part of the data needed to complete processing of queries. To aid this situation, Microsoft Clustering Services can be used on top of DPV, however this is largely cost prohibitive. Both DPV and failover capability via Microsoft Clustering Services are only available with the Enterprise Edition of SQL Server (four times more costly than Standard Edition). A backup database node would be needed for each database server in the federation, doubling the cost of hardware and database licenses, and doing nothing for performance or scalability, as the backup servers do not share in the workload, but sit idle until a failure occurs. That’s not a lot of value for your money.
There are as well, a few development pitfalls to be aware of when considering DPV. The biggest issue is that integrity constraints are not allowed. Integrity constraints can not be implemented because the data is split between separate machines. Writing complex triggers may alleviate some of this problem but adds other problems and additional maintenance of its own.
A clever way to achieve scalability when certain parts of a client or server application only require read access to the database, is to point those applications to a copy of the database on a separate database server. This can divert volume from the primary database, and can be achieved by using continuous transactional replication or log shipping. The power in this approach is that it is relatively easy to configure, and it allows you a quick solution to load balance your database read access.
The obvious drawback of this approach is that it does not use live data. It must be acceptable for clients accessing this database to use slightly delayed data. Depending on the level of volume, the amount of data which has changed, and the number of servers you are replicating to, replication delays may range from as low as 1 minute to 15 minutes or more. This may be suitable in cases where backend server applications perform complex queries or reporting on the database, and do not require up to the minute data. For example, if you are developing a calendar application, modifications to your calendar will be made to the master database. The application which users interact with to view and modify their calendar must access the master database, so that when a new entry is added, the subsequent calendar view will display the newly added entry. However a background process which calculates unused time segments for all calendar users, could access a replicated copy of the database, since this process does not necessarily require real-time data.
Using replicated database copies is not suitable when you need access to live dynamic data. Let’s say the calendar application above has a web service interface, in which multiple applications can interact with. Perhaps a human resources application which can mark planned vacation days on your calendar, and a conference call application, which can mark your calendar with a reminder about a scheduled conference call, both need update access your calendar. When multiple applications need transactional read and write access to the same data store, they must talk to the master database.
If the human resources application were to check for availability on the replicated copy, it might think the time slot is available when in fact a previous update to the data had not made its way to the replicated copy yet. As applications increasingly interoperate, live access to a single massively scalable data source becomes more and more important.