SQL Server Performance

SQL 2005 Cluster with SSIS job failures - different SQL software location

Discussion in 'SQL Server 2005 Clustering' started by chrisvdm@directchannel.co.za, Mar 23, 2010.

  1. I have a problem with my 'upgraded' SQL2005 Enterprise cluster, in the sense that the two nodes have different configurations, even though the versions are the same on both the nodes. I need some suggestions how to resolve this, by either fixing the current problem, or to reinstall the SQL cluster from scratch.
    Just a little bit of background information :
    I recently upgraded our SQL2005 Standard Cluster (Active/Passive Nodes - in perfect condition and operational) to SQL 2005 Enterprise. This was done by logging onto the Cluster Server (with Node1 as the Active Node) as the cluster administrator, and doing the upgrade as follows :
    start /wait setup.exe ADDLOCAL=SQL_Engine InstanceName=MSSQLSERVER UPGRADE=SQL_Engine SKUUPGRADE=1 /qb
    I then failed over the cluster onto the second Node2, and repeated the whole installation again as described above.
    When I rebooted the Cluster Nodes, and startup the Cluster Services, the SQL Cluster failed to come online, on any of the two nodes. The Server Cluster was 100% operational, but the SQL Cluster failed to startup. I managed to find a solution on the Web, and did the following :
    I copied the two files SQSRVRES.dll and SQAGTRES.dll to both the SQL Node1 and SQL Node2 directory c:windowssystem32
    This resolved the problem, and the SQL Cluster did come up with only some of the services online, while the main Cluster services were still down.
    I managed to trace this problem, and had to failover the cluster to each Node1 and Node2 respectively, and then add each Node as an owner to each of the relevant SQL Cluster services. For some or other reason during the upgrade process was the Nodes removed as "owners" of those services.
    When I restarted the Cluster, the SQL Cluster came up for Node1, and testing the SQL Cluster failover to Node2 also proved successful.
    I also compared the SQL versions on each node with the command : SELECT @@version first on Node1 with failover, then on Node2 with failover.
    The version reported on both, are : SQL2005 v9.00.1399.06 (Intel x86) Oct14 2005 00:33:37 build 3790 SP2
    At this point I still had to do the SP3 upgrades on both nodes, but decided to let the Cluster run for a day or two, and then check services, jobs, etc for errors. I then found that the daily backup job, the index rebuild job, and the database statistics job failed with SSIS errors, as below :
    At the time, the SQL Cluster was running on Node1.
    =====
    03/19/2010 08:16:39,,Error,[364] The Messenger service has not been started - NetSend notifications will not be sent
    03/19/2010 08:16:39,,Error,[125] Subsystem 'SSIS' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'ANALYSISCOMMAND' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'ANALYSISQUERY' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'QueueReader' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'Merge' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'Distribution' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'LogReader' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'Snapshot' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'CmdExec' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Error,[125] Subsystem 'ActiveScripting' could not be loaded (reason: The specified module could not be found)
    03/19/2010 08:16:39,,Information,[432] There are 11 subsystems in the subsystems cache
    =====
    When I failed over the cluster onto Node2, and ran the same jobs (Backup, Index Rebuild, Statistics, etc) it went through fine.
    I then managed to find some documentation to trace the fault, and did this on both the nodes :
    =====
    command :select * from msdb.dbo.syssubsystems
    SQLCLUSTER with NODE1 as the main cluster
    subsystem_idsubsystemdescription_idsubsystem_dllagent_exestart_entry_pointevent_entry_pointstop_entry_pointmax_workwer_threads
    1TSQL14556[Internal][Internal][Internal][Internal][Internal]80
    2ActiveScripting14555C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLATXSS90.DLLNULLActiveScriptStartActiveScriptEventActiveScriptStop40
    3CmdExec14550C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLCMDSS90.DLLNULLCmdExecStartCmdEventCmdExecStop40
    4Snapshot14551C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMSNAPSHOT.EXEReplStartReplEventReplStop400
    5LogReader14552C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMlogread.exeReplStartReplEventReplStop100
    6Distribution14553C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMDISTRIB.EXEReplStartReplEventReplStop400
    7Merge14554C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMREPLMERG.EXEReplStartReplEventReplStop400
    8QueueReader14581C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnsqlrepss90.dllC:program FilesMicrosoft SQL Server90COMqrdrsvc.exeReplStartReplEventReplStop400
    9ANALYSISQUERY14513C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLOLAPSS90.DLLNULLOlapStartOlapQueryEventOlapStop400
    10ANALYSISCOMMAND14514C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLOLAPSS90.DLLNULLOlapStartOlapCommandEventOlapStop400
    11SSIS14538C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLDTSSS90.DLLC:program FilesMicrosoft SQL Server90DTSBinnDTExec.exeDtsStartDtsEventDtsStop400
    SQLCLUSTER with NODE2 as the main cluster
    subsystem_idsubsystemdescription_idsubsystem_dllagent_exestart_entry_pointevent_entry_pointstop_entry_pointmax_workwer_threads
    1TSQL14556[Internal][Internal][Internal][Internal][Internal]80
    2ActiveScripting14555C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLATXSS90.DLLNULLActiveScriptStartActiveScriptEventActiveScriptStop40
    3CmdExec14550C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLCMDSS90.DLLNULLCmdExecStartCmdEventCmdExecStop40
    4Snapshot14551C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMSNAPSHOT.EXEReplStartReplEventReplStop400
    5LogReader14552C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMlogread.exeReplStartReplEventReplStop100
    6Distribution14553C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMDISTRIB.EXEReplStartReplEventReplStop400
    7Merge14554C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLREPSS90.DLLC:program FilesMicrosoft SQL Server90COMREPLMERG.EXEReplStartReplEventReplStop400
    8QueueReader14581C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnsqlrepss90.dllC:program FilesMicrosoft SQL Server90COMqrdrsvc.exeReplStartReplEventReplStop400
    9ANALYSISQUERY14513C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLOLAPSS90.DLLNULLOlapStartOlapQueryEventOlapStop400
    10ANALYSISCOMMAND14514C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLOLAPSS90.DLLNULLOlapStartOlapCommandEventOlapStop400
    11SSIS14538C:program FilesMicrosoft SQL ServerMSSQL.2MSSQLinnSQLDTSSS90.DLLC:program FilesMicrosoft SQL Server90DTSBinnDTExec.exeDtsStartDtsEventDtsStop400
    =========For both nodes the results are the same, but when I compare the actual directories on Node1 and Node2, I found discrepancies.
    ### On Node1, I found the following subdirectories :
    C:Microsoft SQL ServerMSSQL.1Binn ......... minimal files
    C:Microsoft SQL ServerMSSQL.2Binn ......... SQL Reporting Services, etc
    C:Microsoft SQL ServerMSSQL.3Binn ......... all the required SQL files
    C:Microsoft SQL ServerMSSQL.3Install ....... all the required SQL files
    ### On Node2, I found the following subdirectories :
    C:Microsoft SQL ServerMSSQL.1Binn ......... minimal files
    C:Microsoft SQL ServerMSSQL.2Binn ......... all the required SQL files
    C:Microsoft SQL ServerMSSQL.2Install ....... all the required SQL files
    No MSSQL.3 subdirectory
    Obviously, what this means is that Node1 will always have a problem due to the location of the software.
    =======
    Question :
    1. Do I just 'FIX' the problem on NODE1 by copying all the SQL software files from ..MSSQL.3*.* into directory ..MSSQL.2 ?
    ( Could also rename the directories to "MSSQL.2" and "MSSQL.2_OLD" respectively ? )
    or
    2. Should I rather REMOVE SQL2005 completely from both the nodes (including the registry entries ...... where?)
    and REINSTALL SQL2005 Enterprise from scratch on the CLUSTER as per standard installation ?? The Windows Cluster is 100% OK.
    ==================
    I would appreciate some valid suggestions.
    The latter seem to be a better option to avoid any other problems creeping out of the 'Upgrade woodwork' so to speak, but I need to be sure.
    Regards
    Chris
  2. satya Moderator

    Welcome to the forums.
    Do you have a MS support contract?
    I sense a big problem here that may class as bug too, coming to the first error "Subsystem 'SSIS' could not be loaded " it is a generic problem occured due to the changes of SQL Server installation directory gone missing/differs from old instance. Within the RESOURCE DB SQL manages the table of pointers to SSIS dll for MSDB and you need to clear and repopulate with existing locations, see this KBA once it is done you have to restart SQLAgent for new changes.
  3. With regards to the Microsoft Support contract - No. But I can get the support from a third party.
    I can do the re-population of the table pointers, but I don't think it is necessary as the pointer entries in the table is the same for BOTH, but the ACTUAL software files are sitting in different locations on each Node.
    On Node2, it is in the place as indicated in the table pointers, and therefore the Node2 Cluster is working fine, but, on Node1, the software is sitting in a different location, (..MSSQL.3), which is not on Node 2.
    How do I resolve this issue and be sure there won't be any other issues popping up ?
    The only CLEAN alternative I can see is to Reinstall from scratch.
    Regards
    Chris

Share This Page