SQL Server Performance

Error analysis in Cluster log file

Discussion in 'SQL Server Clustering' started by Symphony, Nov 6, 2006.

  1. Symphony New Member

    Hi

    For a while now I have been looking into an issue that generates the errors shown at the end of this post in the cluster log.

    We are running SQL 2000 on 2x W2k3 nodes (Active/Passive). There are 3 virtual instances of SQL, namely A, B and C. For a few months we have been getting the error below (on all 3 instances, but C is particularly problematic). Generally speaking, each appearance of the error produces an Event ID 1069 in the System log along with "Cluster resource 'SQL Server' in Resource Group 'SQL A' failed." (for SQL A read A, B or C as appropriate).

    For a few days last week I had 2 instances running on one node and the 3rd on the second node. This configuration did not produce the errors. When I moved the 3rd instance back, the errors began again later that day.

    I also noticed today in the Event Viewer Application log that we had 17052 errors. These normally occur when there is a 1069 in the System log; however, on two occasions there was no corresponding 1069 error.

    If anyone could shed any light on this matter, I would be very grateful.

    Cheers

    Steve




    000007bc.00001050::2006/11/06-12:14:24.180 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
    000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
    000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = b; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.
    000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] OnlineThread: QP is not online.
    000007bc.00001050::2006/11/06-12:14:24.415 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
    000007bc.00001050::2006/11/06-12:14:24.430 ERR SQL Server <SQL Server (SQLC)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
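
A minimal sketch of how cluster log lines like those above could be tallied to see which resource and which message dominate. It assumes the layout visible in the excerpt (process ID, thread ID, timestamp, level, resource type, resource name in angle brackets, message); the function names `parse_line` and `summarize` are made up for this illustration and are not part of any cluster tooling.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt above:
# pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL ResourceType <ResourceName>: message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<rtype>.*?) <(?P<resource>[^>]*)>: "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

def summarize(lines):
    """Count ERR entries per (resource name, message) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] == "ERR":
            counts[(rec["resource"], rec["message"])] += 1
    return counts
```

Feeding the whole cluster log through `summarize` and printing `counts.most_common()` would show whether instance C really accounts for most of the failures, as suspected above.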
  2. Symphony New Member

    Following on from the above scenario, I had four 1069 events yesterday afternoon which forced one of our 3 instances of SQL to fail over to the other node. Once again, since that point there have been no further 1069 events reported.
  3. catullus Member

    Hello Steve,

    Two questions:
    1) Could you post the 1069 & 17052 errors?
    2) How is the cluster heartbeat configured? (Maybe that is what the communication link failure is about.)
  4. Symphony New Member

    Hi

    Thanks for the reply. Here are examples of the errors

    1069 - Cluster resource "SQL SERVER" in resource group "SQL A" failed

    17052 (each group of messages represents a single 17052 entry)

    [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    [sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = b; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.

    [sqsrvres] OnlineThread: QP is not online.

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure

    I'll dig out the Heartbeat configuration and post shortly
  5. Symphony New Member

    As another little point of note, one of the SQL instances failed over to the other node yesterday afternoon (as per my original message) and since then there have been no issues. I'm wondering whether there is an issue with running the 3 instances of SQL together.

    I was also doing some research into the Windows configuration. Should the Boot.ini file contain the /PAE or /3GB switch? Each node has 8GB installed.
  6. catullus Member

    The error suggests a connection problem. That could be the heartbeat, so I'd be interested in that. If the single node is really stressed with all three instances on it, that could explain why you haven't had problems with one instance on the other node. But that shouldn't be the case. What are the port numbers the instances are listening on?
  7. Symphony New Member

    Instance A is on Port 1433
    Instance B is on Port 1609
    Instance C is on Port 1764
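
A minimal sketch of a reachability check for the three ports listed above, using a plain TCP connect. The port numbers come from this post; the host name passed in is whatever virtual server name each instance uses, which is not given in the thread, so any value used here is a placeholder. A successful connect only shows that something is listening on the port. It does not prove the query processor is healthy.

```python
import socket

# Ports as quoted in the thread.
INSTANCE_PORTS = {"A": 1433, "B": 1609, "C": 1764}

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_instances(host, ports=INSTANCE_PORTS):
    """Map each instance name to whether its port accepted a connection."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Running `check_instances("your-virtual-server-name")` in a loop while the 1069 events occur could help show whether the listener itself becomes unreachable at those moments.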


    What information are you interested in relating to the Heartbeat configuration?
  8. catullus Member

    The binding order. The heartbeat adapter should be last in the binding order, and it should have neither the Microsoft client nor file & print sharing bound to it.
  9. Symphony New Member

    Following up from what I was saying yesterday, we experienced a 1069 error on Instance C yesterday evening, despite Instance A being on the other node.

    The below error was logged in the event log, which I haven't seen before.

    [sqsrvres] checkODBCConnectError: sqlstate = 01000; native error = 304; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionOpen (SECDoClientHandshake()).


    Heartbeat NIC 1 and 2 both have Microsoft file and print sharing, as does the "Live Teamed Network". Binding for the "Live Teamed Network" and "Heartbeat NIC 1 + 2" is as follows:

    File and Print sharing for MS Windows
    TCP/IP
    Client for MS Windows
    TCP/IP

    I also noticed an 11160 event, as below:

    The system failed to register pointer (PTR) resource records (RRs) for network adapter
    with settings:

    Adapter Name : {D6A61ABB-8908-4190-AB29-1DD22E7022E6}
    Host Name : svr-cor-sql2
    Adapter-specific Domain Suffix : corpscc.southampton.local
    DNS server list :
    x.x.x.x, x.x.x.x
    Sent update to server : x.x.x.x
    IP Address : x.x.x.x (this is the Live Teamed Network)

    The reason that the system could not register these RRs was because of a security related problem. The cause of this could be (a) your computer does not have permissions to register and update the specific DNS domain name set for this adapter, or (b) there might have been a problem negotiating valid credentials with the DNS server during the processing of the update request.

    You can manually retry DNS registration of the network adapter and its settings by typing "ipconfig /registerdns" at the command prompt. If problems still persist, contact your DNS server or network systems administrator.
  10. catullus Member

    It seems to me that the heartbeat configuration you describe is wrong according to KB article 258750 (Recommended private "Heartbeat" configuration on a cluster server). I've often seen a wrong heartbeat configuration cause cluster problems. Maybe it is also the cause of yours.
  11. Symphony New Member

    Thanks once again for the reply. I was reading that article just a short while ago and I see what you mean.
  12. Symphony New Member

    It also suggests that the Heartbeat NIC(s) should not have "Register this connection's addresses in DNS" checked, which it is on both Heartbeat NICs.
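
The heartbeat points raised in this thread (heartbeat last in the binding order, no Microsoft client or file and print sharing on it, no DNS registration) can be sketched as a small checklist function. This is only an illustration: the adapter dictionaries and their keys (`name`, `is_heartbeat`, `components`, `register_in_dns`) are a hypothetical hand-written description of each node's settings, not something read from Windows, and the rules are limited to what the posts above mention.

```python
# Components the thread says should not be bound to a heartbeat adapter.
DISALLOWED_ON_HEARTBEAT = {
    "Client for Microsoft Networks",
    "File and Printer Sharing for Microsoft Networks",
}

def heartbeat_issues(adapters):
    """Check adapters (listed in binding order, first = highest priority).

    Each adapter is a dict with the hypothetical keys
    'name', 'is_heartbeat', 'components' and 'register_in_dns'.
    Returns a list of human-readable findings.
    """
    issues = []
    seen_heartbeat = False
    for adapter in adapters:
        name = adapter["name"]
        if adapter["is_heartbeat"]:
            seen_heartbeat = True
            bound = DISALLOWED_ON_HEARTBEAT & set(adapter["components"])
            for component in sorted(bound):
                issues.append(name + ": unbind " + component)
            if adapter["register_in_dns"]:
                issues.append(name + ": DNS registration is enabled")
        elif seen_heartbeat:
            # A public adapter appearing after a heartbeat adapter means
            # the heartbeat is not last in the binding order.
            issues.append(name + ": bound after a heartbeat adapter")
    return issues
```

An empty list means none of the points from the thread were tripped; it does not mean the configuration fully matches the KB article.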
