Error analysis in Cluster log file | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Hi,
I have been looking at an issue for a while now which generates the error below in the cluster log. We are running SQL 2000 on 2x W2k3 nodes (Active/Passive). There are 3 virtual instances of SQL, namely A, B and C. For a few months we have been getting the error below on all 3 instances, but C is particularly problematic. Generally speaking, the error is accompanied by an Event ID 1069 in the System log, with the text "Cluster resource ‘SQL Server’ in Resource Group ‘SQL A’ failed." (for SQL A read A, B or C as appropriate).
For a few days last week I had 2 instances running on one node and the 3rd on the second node. This configuration did not produce the errors. When I moved the 3rd instance back, the errors began again later that day.
I also noticed today in the Event Viewer Application log that we had 17052 errors. These normally occur when there is a 1069 in the System log, but on two occasions there was no corresponding 1069 error.
If anyone could shed any light on this matter, I would be very grateful.
Cheers,
Steve
000007bc.00001050::2006/11/06-12:14:24.180 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = b; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.
000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] OnlineThread: QP is not online.
000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
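The log entries above follow a regular layout, so they can be tallied by SQLSTATE and native error to see which failure dominates. The following is only a rough sketch: the regular expressions are inferred from the lines quoted in this thread and may not match other cluster log entries, and the function names `parse_line` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the cluster log lines quoted in this thread:
# <process id>.<thread id>::<timestamp> <level> <resource type> <<resource name>>: [<module>] <message>
# This is an assumption; other cluster log entries may look different.
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>\w+) "
    r"(?P<rtype>.+?) <(?P<resource>[^>]+)>: "
    r"\[(?P<module>\w+)\] (?P<message>.*)$"
)

# ODBC details as printed by the printODBCError lines above.
ODBC_RE = re.compile(
    r"sqlstate = (?P<sqlstate>\w+); "
    r"native error = (?P<native>\w+); "
    r"message = (?P<text>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    odbc = ODBC_RE.search(entry["message"])
    if odbc is not None:
        entry.update(odbc.groupdict())
    return entry


def summarize(lines):
    """Count ODBC errors per (sqlstate, native error) pair."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and "sqlstate" in entry:
            counts[(entry["sqlstate"], entry["native"])] += 1
    return counts
```

Feeding the lines of the cluster log to `summarize` gives one counter entry per distinct SQLSTATE and native error combination, which makes it easier to compare instances A, B and C.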
Following on from the above scenario, I had four 1069 events yesterday afternoon, which forced one of our 3 instances of SQL to fail over to the other node. Once again, since that point there have been no further 1069 events reported.
Hello Steve, two questions:
1) Could you post the 1069 & 17052 errors?
2) How is the cluster heartbeat configured? (Maybe that is what the communication link failure is about.)
Hi,
Thanks for the reply. Here are examples of the errors.
1069 – Cluster resource "SQL SERVER" in resource group "SQL A" failed
17052 (each group of messages represents a single 17052 entry):
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
[sqsrvres] printODBCError: sqlstate = 08S01; native error = b; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.
[sqsrvres] OnlineThread: QP is not online.
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
I’ll dig out the heartbeat configuration and post shortly.

As another little point of note, one of the SQL instances failed over to the other node yesterday afternoon (as per my original message) and since then there have been no issues. I’m wondering whether there is an issue with running the 3 instances of SQL together. I have also been doing some research into the Windows configuration. Should the Boot.ini file contain the /PAE or /3GB switch? Each node has 8GB installed.
The error suggests a connection problem. That could be the heartbeat, so I’d be interested in that. If the single node is really stressed with all three instances on it, that could explain why you haven’t had problems with one instance on the other node. But that shouldn’t be the case. What are the port numbers the instances are listening on?
Instance A is on Port 1433
Instance B is on Port 1609
Instance C is on Port 1764
What information are you interested in relating to the Heartbeat configuration?
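With the three port numbers listed above, a quick check from another machine can show whether each instance is at least accepting TCP connections when the errors appear. This is only a sketch: the host names below are placeholders (the thread does not give the virtual server names), and a successful connect only proves that something is listening on the port, not that SQL Server can answer a query.

```python
import socket

# Ports are the ones given in this thread. The host names are hypothetical
# placeholders; substitute the virtual server name of each instance.
INSTANCES = {
    "A": ("sql-a.example", 1433),
    "B": ("sql-b.example", 1609),
    "C": ("sql-c.example", 1764),
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_instances(instances=INSTANCES, probe=port_is_open):
    """Probe every instance and return {instance name: reachable?}."""
    return {name: probe(host, port) for name, (host, port) in instances.items()}
```

Calling `check_instances()` returns a dictionary such as `{"A": True, ...}`. The `probe` parameter is there so the check can be swapped for something stricter, for example a real query over ODBC.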

The binding order. The heartbeat should be last, with no Microsoft client or file & print sharing bound to it.
Following up from what I was saying yesterday, we experienced a 1069 error on Instance C yesterday evening, despite Instance A being on the other node. The error below, which I haven’t seen before, was logged in the event log:
[sqsrvres] checkODBCConnectError: sqlstate = 01000; native error = 304; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionOpen (SECDoClientHandshake()).
Heartbeat NIC 1 and 2 both have Microsoft file and print sharing, as does the "Live Teamed Network". Binding for the "Live Teamed Network" and "Heartbeat Nic 1 + 2" is as follows:
File and Print sharing for MS Windows
TCP/IP
Client for MS Windows
TCP/IP
Also noticed an 11160 event, as below:
The system failed to register pointer (PTR) resource records (RRs) for network adapter with settings:
Adapter Name : {D6A61ABB-8908-4190-AB29-1DD22E7022E6}
Host Name : svr-cor-sql2
Adapter-specific Domain Suffix : corpscc.southampton.local
DNS server list : x.x.x.x, x.x.x.x
Sent update to server : x.x.x.x
IP Address : x.x.x.x (this is the Live Teamed Network)
The reason that the system could not register these RRs was because of a security related problem. The cause of this could be (a) your computer does not have permissions to register and update the specific DNS domain name set for this adapter, or (b) there might have been a problem negotiating valid credentials with the DNS server during the processing of the update request. You can manually retry DNS registration of the network adapter and its settings by typing "ipconfig /registerdns" at the command prompt. If problems still persist, contact your DNS server or network systems administrator.

It seems to me that the heartbeat configuration you describe is wrong according to KB article 258750 (Recommended private "Heartbeat" configuration on a cluster server). I’ve often seen a wrong heartbeat configuration cause cluster problems. Maybe it is also the cause of yours.
Thanks once again for the reply. I was reading that article just a short while ago and I see what you mean.
It also suggests that the heartbeat NIC(s) should not have "Register this connection’s addresses in DNS" checked, which it is on both heartbeat NICs.
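The points raised in this thread about the heartbeat adapters (last in the binding order, no Microsoft client, no file and print sharing, no DNS registration) can be written down as a small checklist. The sketch below is an illustration only: the `NicConfig` structure and the `heartbeat_findings` function are invented for this example, the settings have to be filled in by hand from the adapter properties, and it covers only the rules mentioned here, not everything in KB 258750.

```python
from dataclasses import dataclass


@dataclass
class NicConfig:
    """Hand-entered description of one network connection (hypothetical structure)."""
    name: str
    is_heartbeat: bool
    binding_position: int          # 1 = first in the binding order
    client_for_ms_networks: bool   # "Client for Microsoft Networks" bound?
    file_and_print_sharing: bool   # "File and Printer Sharing" bound?
    registers_in_dns: bool         # DNS registration box checked?


def heartbeat_findings(nics):
    """Return a list of messages for heartbeat NICs that break the rules above."""
    findings = []
    # Position of the last non-heartbeat (public) connection in the binding order.
    last_public = max(
        (n.binding_position for n in nics if not n.is_heartbeat), default=0
    )
    for nic in nics:
        if not nic.is_heartbeat:
            continue
        if nic.binding_position < last_public:
            findings.append(f"{nic.name}: should come after the public network in the binding order")
        if nic.client_for_ms_networks:
            findings.append(f"{nic.name}: Client for Microsoft Networks is bound")
        if nic.file_and_print_sharing:
            findings.append(f"{nic.name}: File and Printer Sharing is bound")
        if nic.registers_in_dns:
            findings.append(f"{nic.name}: DNS registration is enabled")
    return findings
```

An empty list means none of the rules discussed in this thread is broken for the adapters entered. Every message in the list names the heartbeat adapter and the setting to review.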