I had just gotten out of a short meeting and went to my desk to see what important e-mail messages I might had missed. In my inbox was a series of messages sent from ELM Enterprise Manager 3.0, the software we use to monitor our SQL Server event logs. The message had arrived about 30 minutes earlier, when I was away from my desk.
Instantly, my eyes focused on the name of server that the messages came from. They were from our production SQL Server cluster, the one that runs the database for our ERP package. In other words, the most important SQL Server I manage. It handles over a billion dollars worth of sales every year, and generally has over 100 active connections. Oh crap!
I quickly read through the messages. Right away, I discovered that SQL Server had failed over from the primary cluster node to the secondary cluster node. I then started Cluster Administrator and verified that this was true.
The SQL Server cluster was running Windows 2000 Advanced Server, SP2, and SQL Server 7.0, SP3. The cluster was dedicated to SQL Server, and this was the first time in over seven months, since the cluster was installed, that a failover had occurred.
My next step was to run to our company’s Help Desk staff, who share the same room, and I asked them if they were getting any phone calls. Yes, a whole mess of them. I quickly told them that the SQL Server cluster had failed over and that they should tell users to exit their ERP software session and then re-logon. This software is not smart enough to retry a broken connection, so users are down a short time as they have to restart their software.
Once I informed the Help Desk Staff, I ran, looking for the ERP Manager, letting him know so that he could inform all the appropriate parties. Then I ran back to my desk to find out what happened.
My first step was to verify, from Enterprise Manager, that the SQL Server cluster was indeed up and running. It was. Fortunately, the failover process worked as advertised, and users were now able to get back to work after a 1-2 minute delay.
Since I didn’t know what caused the failover, my first goal was to try and identify its cause, so that I could prevent it from happening again. Depending on the nature of the problem, it was possible that the cluster could try to fail back, and if that happened, I wanted the primary node to work.
The most obvious place to start was the Event Log of the cluster, which is where I went. To my surprise, there were no immediate indications of what caused the problem. In the log I noticed that once the MSSQLSERVER service had failed, that the Cluster Service had tried to restart it, which is what is expected. It tried to restart the service three times, failing each time. After the third try, the Cluster Service initiated a failover, which went smoothly.
I looked at the logs very carefully, but found no obvious reasons why the MSSQLSERVER service had failed. The only thing that looked at all suspicious was that the service failed over when the primary node was in the process of performing a transaction log backup.
Next, I checked the SQL Server logs themselves, hoping to find more clues. There weren’t any. It was as if the MSSQLSERVER service died on its own, with no apparent cause. My stress level went up instantly. This would not be an easy fix.
Since there were no obvious clues, and I needed to find the cause of this problem right now, I decided to call Microsoft Technical Support. We have a company contract, so I didn’t need to get any special permission to make the call.
As you might expect, I got a first-tier support SQL Server specialist on the phone and I described the problem. I don’t really like dealing with first-tier people as they invariably can’t help, and have to bump the support call to the next level support staff.
In any event, the support tech said he would send me an e-mail that would include a utility program that would go out and locate, then compress all of the necessary hardware, software, and log information on my SQL Server cluster. I was not familiar with this particular tool, but when I got it, I ran it and sent the results to the support tech as soon as I could. Then I began waiting. He told me that he would get back to me later in the afternoon, once he had a chance to review the logs.
While I was waiting, I did some research on my own on the Internet, but couldn’t find anything very useful. I hadn’t heard from the Microsoft support tech yet, so I called. He said he wasn’t finished, but he speculated that the failover occurred due to one of these reasons:
- The cluster didn’t have enough RAM.
- The CPU was so busy that it caused the “Is Alive” cluster request to fail.
- The database was corrupted.
After hearing the support tech’s theories, I actually got a little upset. While I don’t know everything there is to know about SQL Server clusters, I knew enough that the likelihood of any of these reasons being the cause of the cluster failover was very small. Why? First, each node in the cluster has 4GB of RAM, and I hade never seen SQL Server even use 3GB of it, ever. Second, the CPU was not all that busy, and if the “Is Alive” request had failed, this would have shown in the logs, and it did not. And third, we had run a DBCC CheckDB on the database in question about 6 hours before the failover occurred, and it had shown no signs of problems. I explained this to the support tech, who told me that he would look at the logs some more and get back with me the next day. Based on the conversation, I really doubted if the tech had spent much time looking at the logs as this point.