Real Life DBA: Dealing with a SQL Server Cluster Failover

Tuesday, 10:30AM

I was sitting at my desk, waiting to hear back from the Microsoft support tech, when my computer made a noise letting me know I had new e-mail. I checked my mail, and there was a series of messages from our alert software. Our production SQL Server cluster had just failed over again. Double crap!

This time, I didn’t know what was going to happen. If the first node was still “bad”, then a failover from the second node back to the first node might mean that I had a dead ERP system.

I immediately brought up Cluster Administrator to see what was happening. At first look, all of the SQL Server cluster resources were in a pending state. In other words, the failover had started, but it had not finished. I froze as I watched the pending status, hoping it would change to online. After what was one of the longest minutes of my life, the resources changed from pending to online. At least I had one piece of good news: the failover had succeeded, and the SQL Server was back online. But for how long? I still hadn’t found the cause of the initial failover, and I didn’t know how to prevent another one.

Like yesterday, I notified the proper people, and then started looking at the error logs. This time, I found something useful. There was a reference to an application popup message:

Application popup: sqlservr.exe – Application Error : The instruction at “0x4115d3c0” referenced memory at “0x4115d3c0”. The required data was not placed into memory because of an I/O error status of “0xc000026e”.

Although this was not all that useful, it helped more than yesterday. I also reviewed the SQL Server logs, and they contained many messages this time, but none of them pointed a conclusive finger at the cause of the problem. One log that was interesting was the DMP file created when the application error message above was logged. In other words, when the MSSQLSERVER service died, a DMP file was created (one wasn’t created yesterday). What I found interesting was that a particular stored procedure, one written in-house for a report, appeared over 200 times in this DMP file. I didn’t know what this meant, but I made a point of emphasizing it the next time I contacted Microsoft, asking what it meant.

Tuesday, 11:00AM

I called the Microsoft support tech, describing what had just happened and what I had found. He wasn’t of much use, only asking me to send the new log and DMP files to him. After doing this, since I had nothing else to do, I did some more research on the Internet, looking up the various error messages I had found in the various logs.

I didn’t really find anything useful on the Internet, but what I did find got me thinking. For example, the previous week I had made a single change on our production SQL Server. Previously, the server had been set so that it would not run parallel queries. We had turned this feature off about a year and a half ago because there was a known bug that caused problems when parallelism was turned on. Supposedly this bug was fixed in SP3 of SQL Server 7.0. Even though I had installed SP3 on this cluster, I had never turned parallelism back on. I was just being conservative.

So why had I turned it on the previous week? A report developer had come to me and asked me to analyze why a particular query she had written was so slow when it ran on the production server (which had parallelism off) but fast on the development server (which had parallelism on). Well, as you might expect, it was because of the parallelism setting. The difference between the two servers was significant: the query took only 40 seconds to run on the development server, but 13 minutes on the production server (and what was worse, the development server was a smaller machine than the production server).

Because of this report’s poor performance, I decided that it was perhaps time to turn parallelism back on for the production server. The development server had been running with parallelism on for over a year, and we had never seen the kinds of problems we had seen under earlier service packs, so I assumed the bug had been fixed. I also conferred with my boss, and he agreed that turning it on made sense.

There was nothing specific in the logs that pointed the finger at parallelism as the cause of the failover, but to be conservative, I decided to turn parallelism back off on the production server. There was little downside to doing this, other than slowing down some reports, so I made this decision on my own. And besides, this was the only thing that had changed on the server in months, and well, you just never know.
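
For reference, here is roughly what toggling that setting looks like; this is a generic sketch, not the exact script I ran against our server:

    -- 'max degree of parallelism' is an advanced option, so expose it first
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE WITH OVERRIDE
    GO
    -- 1 = never generate parallel plans; 0 = use all available CPUs
    EXEC sp_configure 'max degree of parallelism', 1
    RECONFIGURE WITH OVERRIDE
    GO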

Tuesday, 3:00PM

I called the support tech again, and he still wasn’t of much use. It was at this point that I asked that the support call be escalated to the next level of support, as he wasn’t helping me in a timely fashion. The support tech didn’t have any problem with this; he said he would escalate it and that the person taking over the call would contact me. I soon received an e-mail from the new support tech, who said he needed some time to review the logs and would contact me the next day.

Wednesday, 10:00AM

I called the newly assigned Microsoft support tech and asked him what he had found in the logs. Like me, he saw the over 200 instances of this particular stored procedure in the DMP file. He said that when parallelism is turned on, the Query Optimizer tries to determine whether a query can be broken down into multiple threads so that it can be executed on multiple CPUs at the same time, speeding performance. Based on the DMP file, it appeared that for some reason, when this particular query ran, instead of creating only enough threads to match the number of CPUs in the server (which should have been four), over 200 threads were created, which in turn crashed the server. Obviously, this was a new bug.
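
If you want to check whether a particular query is getting a parallel plan, one quick way is to look at its showplan output for Parallelism operators. A rough sketch follows; the procedure name is only a stand-in, not our actual in-house report procedure:

    SET SHOWPLAN_TEXT ON
    GO
    -- stand-in name for the in-house report procedure
    EXEC usp_report_summary
    GO
    SET SHOWPLAN_TEXT OFF
    GO
    -- a parallel plan will include Parallelism (Gather Streams / Repartition Streams) steps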

To help confirm this, I did some research on the particular report that ran the stored procedure that caused the problem. I found out that this report had run on both Monday and Tuesday at the exact same time the SQL Server had failed over each day. Based on the evidence, it appeared that we had found the cause of the failovers.

The original parallelism bug we were familiar with was not the cause of the problem; it had indeed been fixed in SP3. But turning parallelism back on allowed this new bug to rise to the surface. One question I had was why we hadn’t seen this problem on the development server, which had always had parallelism turned on. Our best guess is that the development server is rarely very busy, while the production server is almost always busy, so the bug must rear its ugly head only when a particular set of circumstances occurs (and we were lucky enough to experience those circumstances two days in a row).

So What Do We Do Now?

We have several options to deal with the bug. One option is to move to SQL Server 2000. Unfortunately, we can’t do that because our ERP package currently doesn’t support SQL Server 2000. It may be 6-12 months before this can happen.

Another option is to turn parallelism back on for the server, but use the MAXDOP query hint to turn it off for this particular query. This, of course, assumes that we have no other queries that might trigger a repeat of the bug we found.
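
For illustration, this is roughly what the hint looks like on a single statement; the query below is made up and is not our actual report:

    SELECT customer_id, SUM(order_total)
    FROM orders
    GROUP BY customer_id
    -- force a serial plan for this statement only, regardless of the server-wide setting
    OPTION (MAXDOP 1)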

We could also work with Microsoft to help isolate the bug so they can potentially fix it. The problem with this option is that our server is a production server, and I can’t use it for experiments.

The last option is the one we chose: turn off parallelism on the production server and live with any poor query performance we get from not using it. We made this choice because we are very conservative, and it is more important to have our system up and running 24/7 than it is to run every query at the highest possible speed.
