A Practical Path Towards the Ultimate Microsoft SQL Server Scalability and Availability

Microsoft Cluster Server (MSCS)

Microsoft SQL Server Enterprise Edition supports MSCS. MSCS was designed to provide high availability by running multiple SQL Server instances at the same time, with automatic failover when the primary crashes. This idea would have worked well if MSCS supported multiple replicated datasets, one for each instance. Instead, MSCS supports only a single updatable database image. Since transactions are mostly I/O bound, this limitation means MSCS cannot help with scalability at all. It is ironic that MSCS still needs scheduled backups for dataset availability.

Distributed Partitioned Views (DPV)

Microsoft SQL Server Enterprise Edition also supports DPV. DPV enables the “federated database” design: tables are partitioned across multiple physical SQL Servers, and queries that touch different partitions are automatically directed to the appropriate servers to achieve scalability. For very large databases, this scheme must be used to obtain reasonable response times. It requires significant application recoding if the application did not start with partitioned views in mind. The technology also makes the availability problem dramatically worse than running a single SQL Server, since any server crash can potentially halt the entire cluster. The delivered scalability is limited as well: if some tables attract more queries than others, the servers hosting them will be more heavily loaded than the rest. Using MSCS on top of DPV helps neither scalability nor availability; it merely results in a costly cluster with very high license fees and maintenance complexity.
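To make the design concrete, the sketch below shows what a distributed partitioned view might look like. The server, database and table names are hypothetical; each member server holds only its own key range, enforced by a CHECK constraint on the partitioning column.

    -- On each member server, a local table holds one key range, e.g. on Server1:
    --   CREATE TABLE Customers_1 (
    --     CustomerID  INT PRIMARY KEY CHECK (CustomerID BETWEEN 1 AND 32999),
    --     CompanyName NVARCHAR(40) NOT NULL
    --   )
    -- The view the application actually queries is the UNION ALL of all member
    -- tables, reached through linked-server names:
    CREATE VIEW Customers AS
      SELECT * FROM Server1.CustomerDB.dbo.Customers_1
      UNION ALL
      SELECT * FROM Server2.CustomerDB.dbo.Customers_2
      UNION ALL
      SELECT * FROM Server3.CustomerDB.dbo.Customers_3

Queries against the view are routed to whichever member server owns the requested key range, which is also why repartitioning or adding a member server usually forces schema and application changes.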

Automated Transaction Log Shipping

The simplest way to produce a replicated SQL Server dataset is perhaps log shipping. You will need two running SQL Servers, and you can automate the process so that the backup server is updated every few seconds. Since the backup SQL Server is never in sync with the primary, it cannot be used for real-time updates.
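A minimal sketch of the mechanism, with illustrative database and file names, looks like this:

    -- On the primary, back up the transaction log on a schedule:
    BACKUP LOG SalesDB TO DISK = '\\backupshare\SalesDB_log.trn'

    -- On the standby, restore the log; WITH STANDBY keeps the copy readable
    -- between restores (use WITH NORECOVERY if read access is not needed):
    RESTORE LOG SalesDB
      FROM DISK = '\\backupshare\SalesDB_log.trn'
      WITH STANDBY = 'C:\MSSQL\Backup\SalesDB_undo.dat'

Everything between the backup job, the file copy and the restore job has to be scripted and scheduled, which is what automated log shipping tools do on your behalf.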

In theory, the backup SQL Server’s value is realized only during disasters. Clever uses of the backup would include reporting and data mining applications. The trouble is that while a log is being applied, the database cannot be used at all, which makes scheduling the reporting and data mining work a challenge. Another thorny point in production is that log shipping typically lags behind when updates are frequent; the delay can run from minutes to hours, and when the primary SQL Server crashes, all transactions not yet shipped are lost. In addition, log shipping must be defined separately for each database.

If your SQL Server hosts multiple databases, as at a typical ASP (Application Service Provider), you will have to maintain multiple scripts and make sure each runs smoothly and on time.

Immediate Transaction Replication

Microsoft SQL Server 2000 supports transactional replication based on publisher/subscriber relationships. The system was designed to incrementally propagate transaction data in distributed environments: transactions are first applied to the primary server and then applied to the subscriber (or to a replication server that relays the contents to the target SQL Server). There is also an “immediate update” option designed to minimize the delay. In either case, since the target SQL Server is never in sync with the primary, it cannot be used for real-time load sharing, although it does help with reporting and business intelligence applications. Practice shows that managing the publication queues is tricky because of network and processor speed differences and the large overhead imposed by the replication engine.
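For reference, such a publication is configured through system stored procedures; a rough sketch, using hypothetical publication, article and subscriber names, might look like this:

    -- At the publisher: create a transactional publication that allows
    -- immediate-updating subscribers.
    EXEC sp_addpublication
         @publication     = 'SalesPub',
         @repl_freq       = 'continuous',
         @allow_sync_tran = 'true'

    -- Add a table (article) to the publication.
    EXEC sp_addarticle
         @publication   = 'SalesPub',
         @article       = 'Orders',
         @source_object = 'Orders'

    -- Subscribe a second server; 'sync tran' selects the immediate update option.
    EXEC sp_addsubscription
         @publication    = 'SalesPub',
         @subscriber     = 'SUBSCRIBER1',
         @destination_db = 'SalesDB',
         @update_mode    = 'sync tran'

Every publication, article and subscription is one more piece of configuration to manage, which is part of the administrative overhead mentioned above.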

Disk or Database File Mirroring

Hard disk vendors and backup software vendors also offer some help for SQL Server. Their offerings consist of synchronous and asynchronous replication of database files. Asynchronous database file replication has exactly the same problems as log shipping, except that these solutions often do not need a running SQL instance on the backup side. All synchronous replication products use the well-known two-phase commit protocol (2PC): if either the primary or the backup side cannot commit, the entire transaction must be rolled back. This effectively cuts each transaction’s survivability in half, making the approach practically unusable.
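The commit semantics are easiest to see at the SQL level. The fragment below is not how block-level mirroring products are implemented, but it exhibits the same 2PC behavior using a distributed transaction between the local server and a hypothetical linked server MIRROR1: if either participant cannot commit, both roll back.

    BEGIN DISTRIBUTED TRANSACTION
      UPDATE SalesDB.dbo.Orders
         SET Status = 'Shipped' WHERE OrderID = 42           -- local copy
      UPDATE MIRROR1.SalesDB.dbo.Orders
         SET Status = 'Shipped' WHERE OrderID = 42           -- mirrored copy
    COMMIT TRANSACTION   -- commits only if both participants vote to commit

If MIRROR1 is down, the local update cannot commit either, which is the survivability penalty described above.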

Hardware Mirrors

One ancient method of providing high availability is lock-step duplicated hardware. This concept has produced the most cost-prohibitive designs in the industry. The real trouble is that the money is thrown at the wrong problem: hardware is no longer the weakest link, and has not been for at least the last ten years. These systems still operate on a single updatable dataset, offering no help for scalability. It is ironic that their availability still requires log shipping or regular backups.

Homegrown Transaction Replication

Since hardware costs have come down so much, it is cheap enough to build a Microsoft SQL Server cluster with identical datasets. It appears that all one needs to do is make sure that updates are sent to all SQL Servers while read-only queries are distributed to any of them.

A closer look reveals that DCP will occur if the application supports concurrent updates from multiple clients. This approach is not recommended for anything other than single-client update applications.
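The naive scheme, sketched below with hypothetical linked-server names, simply issues each update twice:

    -- "Dual write": the client (or a wrapper procedure) applies the same
    -- change to both servers in sequence.
    UPDATE ServerA.SalesDB.dbo.Orders SET Status = 'Shipped' WHERE OrderID = 42
    UPDATE ServerB.SalesDB.dbo.Orders SET Status = 'Shipped' WHERE OrderID = 42

With a single updating client this can work, but as soon as two clients update concurrently, their statements can interleave in a different order on each server and the two copies silently diverge.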

Replication Middleware

One emerging technology involves injecting middleware between the data clients and the SQL Server(s). This represents a unique approach to solving the scalability and availability problems. Since the middleware “sees” all concurrent transactions as they pass through it, it has a unique opportunity to perform data synchronization, automatic failover and load balancing. The middleware introduces some overhead, but if the benefits include both scalability and availability, the prize may well be worth the trouble.

To date, there have been two proprietary methodologies:

  • Modified database access API to capture transactions for special synchronization controls (SQLup by Incepto.com (now defunct)); and

  • Proxy-like packet router by embedding the data synchronization controls in the packet transmission mechanisms (ICX-UDS by Parallel Computers Technology Inc.).

Regardless of the data synchronization algorithm, changing the API is a bad idea for practical reasons. Captured transactions must be sent to a middleware server before reaching the SQL Servers, so the middleware server becomes a single point of failure. A proxy-like packet router, by contrast, can be inserted seamlessly into any existing SQL infrastructure, and it can be monitored and switched by network appliances to remove the single point of failure. It is therefore the most promising approach for solving both the scalability and availability problems.

Continues…
