SQL Server Performance

Learn from my failed 2005 upgrade

Discussion in 'SQL Server 2005 General DBA Questions' started by merrillaldrich, Feb 19, 2007.

  1. merrillaldrich New Member

    All - at risk of embarassing myself publicly, I am going to post the detail about a failed upgrade of a production server from 2000 to 2005. It is my hope that I can help some other person prevent going through the same experience.

    In summary: If you have an application running atop SQL Server 2000 AND these things are true:

    1. The application has poor query plan reuse behavior, and a lot of ad-hoc SQL (you can evaluate this with the perfmon counter SQL Compilations / Sec, or by looking at the procedure cache for hit ratio and the number of times query plans typically get reused.

    2. The application typically creates a lot of logins per second (another Perfmon counter)

    then expect a rather dramatic increase in the required CPU capacity if you are considering upgrading from 2000 to 2005. If you don't have the headroom to handle a large increase, then load test VERY carefully.

    (Details to follow)
  2. merrillaldrich New Member

    The following is a report I created summarizing the reasons our upgrade failed. It's been anonymized to protect the players involved; when reading this, XYZ Corp is my employer, Initech is the vendor that created the software we use. For reference, this is all referencing an HP server having 2 Intel Xeon 3ghz processors, with two external RAID 10 arrays of 8 drives each and 24 GB RAM. The database(s) total about 150GB of information, and serve about 500-1000 users each day via a web app:

    “Initech” / SQL Server 2005 Post Mortem

    Prepared 2/19/2007 by Merrill Aldrich

    This document summarizes the reasons why our attempt to move our primary “Initech” database server from SQL Server 2000 to 2005 failed during the week of February 12, 2007. The information contained here is a combination of analysis by Microsoft SQL Server support engineer Mr. ____ and performance data collected here at XYZ Corp though our usual performance monitoring techniques. Detailed backup for these conclusions is available.

    What Happened

    1.XYZ Corp conducted internal testing of “Initech” running on a SQL Server 2005 test server for eight months, from about May of 2006 to January 2007. Those tests were successful within the limitations of what equipment was available: specifically, it was not possible to fully load test the combination of “Initech” and SQL Server 2005 with the same workload that we have on the production server, because there were not funds available to obtain a test server having the same hardware as the production server. Obtaining such a server as a test platform is truly cost-prohibitive. So the “Initech”/SQL Server 2005 combination was logic tested, and it was load tested up to the meaningful capacity of the machine on which it ran, and both tests were successful. Because in the past the production server has been “RAM starved,” and because SQL Server 2005 has some advantages in that area, it was hoped that the production server would have better overall performance after being upgraded. [This turned out to be an error, as the behavior of the “Initech” / SQL Server 2005 combination under a full load of users on the production server presented some surprises that we could not have reproduced on the test server, as we#%92ll see below.]

    2.On 2/10/2007 our main production server “InitechDB” was taken offline for the upgrade. The operating system (Windows Server 2003 Enterprise) was brought up to date with service packs and patches, including Service Pack 1.

    3.An in-place upgrade of SQL Server, from SQL Server 2000 Standard Edition SP 3 to SQL Server 2005 Standard Edition SP 1 was performed.

    4.The “Initech” databases and InitechWeb were successfully brought back online in “compatibility mode 80,” which emulates SQL Server 2000 for the client application, and ran smoothly.

    5.On Monday, 2/12/2007, the normal number of users logged on to the InitechWeb application, beginning with those in the east coast time zone (US). We did see the advantages we had hoped that SQL Server 2005 would provide. The application ran successfully until about 9:30 AM Pacific time, at which point the total user load for all four time zones came online. Unfortunately, at this point, the server#%92s CPU utilization was above 98%, and the application began to slow dramatically. Through the day, we determined that the server would not be able to serve data fast enough to keep the required number of users working, and that it was continuously “pegged.” The source of the bottleneck appeared to be CPU utilization.

    6.A service ticket was opened with Microsoft#%92s SQL Server professional support, which ticket was handled by engineer Mr. ____. Mr. ____ assisted us by running a data collection tool (pssdiag.exe) that collected trace information and performance statistics on the running server during its peak usage, and analyzed that data exhaustively.

    7.On Tuesday, 2/13/2007, we ran a final one-hour test of the SQL Server 2005 setup, in hopes that some configuration changes might mitigate the CPU problem. In particular, the server was adjusted to a) Turn off a function called “AWE,” which provides the SQL Server process with additional RAM above 3 GB, and b) Turn on a new feature available in SQL Server 2005 called “Forced Query Parameterization,” in an effort to reduce the CPU load caused by SQL Compilations.

    8.During this final test we collected performance data; however, we made the determination that the changes did not provide enough relief, and that the system could not be made to run satisfactorily in the time we had available. We would have to “roll back” the databases to SQL Server 2000 to keep the business functioning.

    9.On the afternoon of 2/13, we took the system offline once again, removed SQL Server 2005 and performed a complete rollback of the SQL Server software and data to the state it was in on the preceding Saturday, before the upgrade. The Windows operating system updates and patches were left in place.

    10.In the days following there was one lingering side effect of the change, which was an interaction between a Windows Server 2003 SP1 security function and the “Initech” InitechWeb application; we were able to work around that with a registry change on the morning of Wednesday, 2/14. (There were other issues, by coincidence, the same week, but all were the result of essentially bad timing, and had to do with “Initech” licensing and authentication, and the setup of new business. All in all, however, the week was what one might call a “Initech Perfect Storm.”)

    Why it Happened

    This was obviously a disastrous situation, but if there is any silver lining, it#%92s that we were able to gather very useful information about why this happened and what “Initech#%92s” development group can expect about the interaction between their application and SQL Server 2005.

    The crux of the problem was an increase in database server CPU utilization per concurrent user on the system: that is, the system worked in principle, but as more and more end-users connected to it and began using the InitechWeb application simultaneously, the demand placed on the CPUs in the database server increased dramatically. Ultimately the CPUs could not keep up, and the result was a very slow end-user experience. Despite the additional data and procedure cache memory that SQL Server 2005 provided, the overall performance was much worse than with 2000.

    After analyzing the measurements taken from the database server, we can trace that CPU activity to two underlying causes: the volume of “SQL Compilations” being performed and the behavior of a Windows/operating system component called lsass.exe. These two factors arose out of an interaction between the “Initech” InitechWeb application and SQL Server 2005 that was only apparent on a system under full production load. That is, it was neither “SQL Server#%92s fault” nor particularly “Initech#%92s fault” but instead an outcome of differences between how the SQL Server engineers imagine that a scalable application is constructed and how “Initech” is actually built. So while neither one technically “broke,” they could not function together at the very demanding performance level required for the number of users we support.

    Specifically, here are the differences:

    1.SQL Server is constructed in such a way that it assumes a large/scalable application (like “Initech” InitechWeb) that supports many users and large sets of data will take advantage of a concept called “Query Plan Re-use.” Essentially this means that the application will “present” database queries to the database server in one of several specific formats that allow the database to compile the query once and then reuse it, without recompiling it, whenever the application requests the same or similar information later. There is a large savings in CPU utilization from this technique, and it makes the database server run faster and support more users. This is a well-known concept in the database world, and has been true for some time – it#%92s not new to SQL Server 2005. The mechanisms to gain this advantage are variously called Stored Procedures, Parameterized (or auto-parameterized) Queries, or Prepared Statements.

    The “Initech” application, unfortunately, does not make good use of this mechanism, with the consequence that nearly every request it presents to the database server – even very similar queries – has to be first compiled and then executed, instead of executed directly from an existing compiled query plan. “Initech#%92s” approach to this problem is basically to assume the database server will be able to handle the load, and they have not taken steps to improve the application in this area. So because “Initech” has a relatively poor query-plan-reuse profile, the database server spends a lot of CPU time compiling practically all incoming requests before running them. This is a scalability issue we#%92ve known about for some time, so it#%92s not really news.

    For example, our existing SQL Server 2000 setup performs around 240 SQL Compilations per second on average, during a busy period, which is a huge number. This number is not related to the size of databases or quantity of data, but instead to the number of users working in InitechWeb at the same time, and to the fact that it does not reuse the existing compiled queries very much,. As another, similar measure, the server is faced with as many as 140,000 “SQL Batches” per minute at peak times, and if all of them require compilation and cannot be reused, it is extremely busy compiling the code for each one before it can execute the work required.

    2.Following on 1, SQL Server 2005 includes deliberate/designed enhancements that, unfortunately, exacerbate the interaction with “Initech” InitechWeb. Specifically, the SQL Server 2005 engine is “more careful” in the SQL compilation process to find the very most optimal query plan for each incoming query – again, partly on the assumption that the resulting plan is likely to be reused, and therefore that it makes sense to expend the resources to make it better in quality. The underpinning of this is “better (but more intensive) compilation > faster execution.” If the application did reuse the plans, this would indeed be a good tradeoff. However, the specifics of how “Initech” works forces the server to mainly discard the plans and compile new ones from scratch. So the database server is compiling better quality query plans, which would probably execute its “actual” work faster, but is not allowed to use them more than once, in many cases. The result is increased CPU activity, where the application cannot take advantage of the results that would otherwise be realized from that increase.

    3.Finally, on the compilation issue, there is a “limiter” or “governor” built into SQL Server 2005 that may be contributing to the problem: the SQL Server team, again using the same set of assumptions about best practice for application design, assumes that a reasonable load of simultaneous SQL compilations is expected, but that a huge number constitutes a problem. They have built in a resource limiter that will kick in if the server is faced with an astronomical load of simultaneous requests for compilation, which shows as a lock on SQL Server#%92s internal processes called “RESOURCE_SEMAPHORE_QUERY_COMPILE.” During our failed upgrade, we saw this lock often, and one theory I have is that the normal/expected load of SQL Compilation generated by InitechWeb, under heavy load, might trigger this limiter in SQL Server 2005, because it exceeds what would reasonably be expected. It is not entirely certain, however, whether the presence of that lock type in our specific situation was the limiter or just a side-effect of the pegged CPUs.

    4.We also tested, in our one-hour, last-ditch effort to save this upgrade, a SQL Server configuration called “Forced Parameterization.” This is a server-side feature that essentially sets the SQL Server 2005 engine to deliberately ignore the format of the SQL queries coming in from the application insofar as it will force as many of them as possible to be converted into reusable query plans. In our test this seemed to help the figures for query plan reuse somewhat, but did not bring the overall server performance back to an acceptable level. It remains to be seen in detail why that is, but it is true that whole classes of queries generated by “Initech” are not eligible even for this form of forced parameterization, due to the “Initech” mechanism around “Proprietary_xxx” views used for the enforcement of row-level security, and various details of the structure of queries.

    5.Lastly, unrelated to the inner workings of SQL Server detailed above, we had an issue with the Windows service lsass.exe that appears when using the “Initech”/SQL Server 2005 combination under heavy load, which does not appear with SQL Server 2000. This is, again, a result of an interaction between the two products that only appears with a large number of users. Here I have had to make some educated guesses, because I cannot find a really authoritative explanation. The facts are these:

    a.Under SQL Server 2000, the lsass.exe service is not busy.
    b.Under SQL Server 2005, this service uses CPU resource in proportion to the number of logins per second, peaking at about 12-15% of total CPU utilization on the server. This represents a dramatic increase.
    c.The “Initech” InitechWeb application generates a LOT of login traffic between the database server and the web servers: about 100 logins per second, on average, for a busy period.

    What I think is happening is this: from SQL Server 2000 to 2005, I believe the login process added a layer for SSL encryption that was not there before, and is an effort to better secure the database server. In the larger scheme of things, this is a welcome enhancement. However, this encryption layer probably adds a small amount of overhead for each login to the server. For an application that uses a reasonable number of logins over time, the additional overhead would be trivial. However, if “Initech” InitechWeb causes in excess of 100 logins per second, it#%92s possible that the overhead for the SSL encryption for that frequency of logins actually adds up to a sizeable CPU load on the database server, which then competes with the SQL Server process itself for CPU time.

    What can be done

    There is no immediate fix for these issues, because the underlying causes are deeply and architecturally embedded in both systems (“Initech” and SQL Server). However, over the long term it seems imperative that “Initech” address these issues:

    1.Query plan reuse: “Initech” has to move their application to some system allowing query plan reuse at the server, whether that be though prepared statements, stored procedures, or some other mechanism. This will entail some revision of existing code. Failing that, the application will scale less well on SQL Server 2005 than it does running on SQL Server 2000, resulting in additional hardware expense/requirements. The new “Forced Parameterization” feature in SQL Server 2005 should be tested carefully as a workaround, but in my view is not really a solution to the underlying problem. XYZ Corp does not have the resources to adequately test that functionality with real workloads.

    2.The interaction of logins with lsass.exe: “Initech” will need to examine how the InitechWeb web application authenticates against the database server, determine why we see in excess of 100 logins per second, and whether that is the cause of the additional CPU load generated by the lsass.exe service when using SQL Server 2005. It may be necessary to redesign how InitechWeb connects to the database server, to make more efficient use of persistent and/or pooled connections.

  3. merrillaldrich New Member

  4. joechang New Member

    I am heavily inclined to blame Microsoft for this

    prior to release,
    I told MS about the partial map of SQL 2005 costs relative to SQL 2000 that I had generated,

    it would have been a fairly simple matter given the resources MS has for the SQL 2005 product launch to build more complete map

    then use PSSDIAG & Read80Trace to characterize each users applications
    from the list of stored procedures & distinct SQL
    generate the 2000 & 2005 execution plans
    noting significant items of concern

    Instead, like all marketing people, they emphaize the possible positive benefits,
    not mentioning any negative consequences
    even though most negative consequences are easily correctable if one knew how and where to look

    shows you how much you can trust marketing propanganda to stick with you thick or thin
  5. Luis Martin Moderator

    Should we put this post in some other place?. Sticky may be?.


    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All in Love is Fair
    Stevie Wonder


    All postings are provided “AS IS” with no warranties for accuracy.



  6. alzdba Member

    Thank your for this valuable feedback.[^]
    Certainly this may urge companies to invest in a QA environment matching production.
    This emphasizes the need for stresstesting in QA !
    I think when a company has to rollback 2 or 3 day of activity, things will actualy be calculated in hard $$ that may easily exceed the QA-costs. Not only by having to redo some activity, but also keeping "missed opportunities" in mind because of timeouts, downtime, to slow webpages,....

  7. satya Moderator

    Indeed this is a valuable feedback and a FAQ for every DBA who are going in the path of SQL 2005 upgrade, I'm posting this post in news section for more visibility.

    Satya SKJ
    Microsoft SQL Server MVP
    Writer, Contributing Editor & Moderator
    http://www.SQL-Server-Performance.Com
    This posting is provided AS IS with no rights for the sake of knowledge sharing. The greatest discovery of my generation is that a human being can alter his life by altering his attitudes of mind.
  8. merrillaldrich New Member

    Thanks, guys. That makes me feel a little better, at least <img src='/community/emoticons/emotion-1.gif' alt=':)' />
  9. MohammedU New Member

  10. satya Moderator

    Merrill
    Do you have anything to addup on the above post, as I have taken most of the points to SQL dev. team in this case.

    Satya SKJ
    Microsoft SQL Server MVP
    Writer, Contributing Editor & Moderator
    http://www.SQL-Server-Performance.Com
    This posting is provided AS IS with no rights for the sake of knowledge sharing. The greatest discovery of my generation is that a human being can alter his life by altering his attitudes of mind.
  11. merrillaldrich New Member

    Hi Satya - thank you for showing it to them. The main questions I have are just these two:

    1. Is the limited number of SQL Compilations/sec we saw due to a limiter, or to running out of CPU resource? I would like to know because it would determine whether a more powerful server might actually succeed at this workload, or if a more powerful server would just hit the same limiter as the server we have. The vendor is not planning to change the architecture of their application to reduce the amount of ad-hoc SQL, so it might follow that we won't ever be able to upgrade(!)

    2. Is my theory right about the lsass service?

    Cheers,

    Merrill
  12. satya Moderator

    1. I will get back on that later, as still under investigations.

    2. LSASS (as you may aware) is the service required for Active Directory that runs under this process and includes the authentication and replication engines for Windows 2000 and 2003 domain controllers.

    Check the port options for this service and by defaut between 1024 and 65536 unless you are using a tunneling protocol to encapsulate the traffic. One of the KBA refers:

    quote: An encapsulated solution might consist of a VPN gateway located behind a filtering router using Layer 2 Tunneling Protocol (L2TP) together with IPsec. In this encapsulated scenario, you must allow IPsec Encapsulating Security Protocol (ESP) (IP protocol 50), IPsec Network Address Translator Traversal NAT-T (UDP port 4500), and IPsec Internet Security Association and Key Management Protocol (ISAKMP) (UDP port 500) through the router as opposed to opening all the ports and protocols listed below.
    Refer to KBAhttp://support.microsoft.com/kb/224196/ too for information.

    Satya SKJ
    Microsoft SQL Server MVP
    Writer, Contributing Editor & Moderator
    http://www.SQL-Server-Performance.Com
    This posting is provided AS IS with no rights for the sake of knowledge sharing. The greatest discovery of my generation is that a human being can alter his life by altering his attitudes of mind.
  13. merrillaldrich New Member

    We don't use active directory, and the machine isn't a domain controller. What's the link then with SQL Server?
  14. satya Moderator

    THen how come LSASS service is using CPU time (maxing), if possible try using PRocess Explorer from Sysinternals website to see what SVCHOST is doing.

    Satya SKJ
    Microsoft SQL Server MVP
    Writer, Contributing Editor & Moderator
    http://www.SQL-Server-Performance.Com
    This posting is provided AS IS with no rights for the sake of knowledge sharing. The greatest discovery of my generation is that a human being can alter his life by altering his attitudes of mind.
  15. WingSzeto Member

    We have a 1 CPU (3.0 Gbytes) dual core and 4 Gb of Ram. The db size is about 22 Gb. We have four web servers and one database server. The user connection is pre-configured through the web servers and the database server has about 120 connections at peak time. Currently our system shows 24~50 complication per sec on average. Indivdual incident could go as high as 260 but not often. The logins per sec is less than 0.5 per sec on average. In about a month, I will be upgrading from SQL 2K sp4 to SQL 2k5 sp2, is there any concern based on our current configuation? We do use sp's and some parameterized queries but not enough to call ourselves totally well design yet.

    wingman


    quote:Originally posted by merrillaldrich

    The following is a report I created summarizing the reasons our upgrade failed. It's been anonymized to protect the players involved; when reading this, XYZ Corp is my employer, Initech is the vendor that created the software we use. For reference, this is all referencing an HP server having 2 Intel Xeon 3ghz processors, with two external RAID 10 arrays of 8 drives each and 24 GB RAM. The database(s) total about 150GB of information, and serve about 500-1000 users each day via a web app:

    “Initech” / SQL Server 2005 Post Mortem

    Prepared 2/19/2007 by Merrill Aldrich

    This document summarizes the reasons why our attempt to move our primary “Initech” database server from SQL Server 2000 to 2005 failed during the week of February 12, 2007. The information contained here is a combination of analysis by Microsoft SQL Server support engineer Mr. ____ and performance data collected here at XYZ Corp though our usual performance monitoring techniques. Detailed backup for these conclusions is available.

    What Happened

    1.XYZ Corp conducted internal testing of “Initech” running on a SQL Server 2005 test server for eight months, from about May of 2006 to January 2007. Those tests were successful within the limitations of what equipment was available: specifically, it was not possible to fully load test the combination of “Initech” and SQL Server 2005 with the same workload that we have on the production server, because there were not funds available to obtain a test server having the same hardware as the production server. Obtaining such a server as a test platform is truly cost-prohibitive. So the “Initech”/SQL Server 2005 combination was logic tested, and it was load tested up to the meaningful capacity of the machine on which it ran, and both tests were successful. Because in the past the production server has been “RAM starved,” and because SQL Server 2005 has some advantages in that area, it was hoped that the production server would have better overall performance after being upgraded. [This turned out to be an error, as the behavior of the “Initech” / SQL Server 2005 combination under a full load of users on the production server presented some surprises that we could not have reproduced on the test server, as we#%92ll see below.]

    2.On 2/10/2007 our main production server “InitechDB” was taken offline for the upgrade. The operating system (Windows Server 2003 Enterprise) was brought up to date with service packs and patches, including Service Pack 1.

    3.An in-place upgrade of SQL Server, from SQL Server 2000 Standard Edition SP 3 to SQL Server 2005 Standard Edition SP 1 was performed.

    4.The “Initech” databases and InitechWeb were successfully brought back online in “compatibility mode 80,” which emulates SQL Server 2000 for the client application, and ran smoothly.

    5.On Monday, 2/12/2007, the normal number of users logged on to the InitechWeb application, beginning with those in the east coast time zone (US). We did see the advantages we had hoped that SQL Server 2005 would provide. The application ran successfully until about 9:30 AM Pacific time, at which point the total user load for all four time zones came online. Unfortunately, at this point, the server#%92s CPU utilization was above 98%, and the application began to slow dramatically. Through the day, we determined that the server would not be able to serve data fast enough to keep the required number of users working, and that it was continuously “pegged.” The source of the bottleneck appeared to be CPU utilization.

    6.A service ticket was opened with Microsoft#%92s SQL Server professional support, which ticket was handled by engineer Mr. ____. Mr. ____ assisted us by running a data collection tool (pssdiag.exe) that collected trace information and performance statistics on the running server during its peak usage, and analyzed that data exhaustively.

    7.On Tuesday, 2/13/2007, we ran a final one-hour test of the SQL Server 2005 setup, in hopes that some configuration changes might mitigate the CPU problem. In particular, the server was adjusted to a) Turn off a function called “AWE,” which provides the SQL Server process with additional RAM above 3 GB, and b) Turn on a new feature available in SQL Server 2005 called “Forced Query Parameterization,” in an effort to reduce the CPU load caused by SQL Compilations.

    8.During this final test we collected performance data; however, we made the determination that the changes did not provide enough relief, and that the system could not be made to run satisfactorily in the time we had available. We would have to “roll back” the databases to SQL Server 2000 to keep the business functioning.

    9.On the afternoon of 2/13, we took the system offline once again, removed SQL Server 2005 and performed a complete rollback of the SQL Server software and data to the state it was in on the preceding Saturday, before the upgrade. The Windows operating system updates and patches were left in place.

    10.In the days following there was one lingering side effect of the change, which was an interaction between a Windows Server 2003 SP1 security function and the “Initech” InitechWeb application; we were able to work around that with a registry change on the morning of Wednesday, 2/14. (There were other issues, by coincidence, the same week, but all were the result of essentially bad timing, and had to do with “Initech” licensing and authentication, and the setup of new business. All in all, however, the week was what one might call a “Initech Perfect Storm.”)

    Why it Happened

    This was obviously a disastrous situation, but if there is any silver lining, it#%92s that we were able to gather very useful information about why this happened and what “Initech#%92s” development group can expect about the interaction between their application and SQL Server 2005.

    The crux of the problem was an increase in database server CPU utilization per concurrent user on the system: that is, the system worked in principle, but as more and more end-users connected to it and began using the InitechWeb application simultaneously, the demand placed on the CPUs in the database server increased dramatically. Ultimately the CPUs could not keep up, and the result was a very slow end-user experience. Despite the additional data and procedure cache memory that SQL Server 2005 provided, the overall performance was much worse than with 2000.

    After analyzing the measurements taken from the database server, we can trace that CPU activity to two underlying causes: the volume of “SQL Compilations” being performed and the behavior of a Windows/operating system component called lsass.exe. These two factors arose out of an interaction between the “Initech” InitechWeb application and SQL Server 2005 that was only apparent on a system under full production load. That is, it was neither “SQL Server#%92s fault” nor particularly “Initech#%92s fault” but instead an outcome of differences between how the SQL Server engineers imagine that a scalable application is constructed and how “Initech” is actually built. So while neither one technically “broke,” they could not function together at the very demanding performance level required for the number of users we support.

    Specifically, here are the differences:

    1.SQL Server is constructed in such a way that it assumes a large/scalable application (like “Initech” InitechWeb) that supports many users and large sets of data will take advantage of a concept called “Query Plan Re-use.” Essentially this means that the application will “present” database queries to the database server in one of several specific formats that allow the database to compile the query once and then reuse it, without recompiling it, whenever the application requests the same or similar information later. There is a large savings in CPU utilization from this technique, and it makes the database server run faster and support more users. This is a well-known concept in the database world, and has been true for some time – it#%92s not new to SQL Server 2005. The mechanisms to gain this advantage are variously called Stored Procedures, Parameterized (or auto-parameterized) Queries, or Prepared Statements.

    The “Initech” application, unfortunately, does not make good use of this mechanism, with the consequence that nearly every request it presents to the database server – even very similar queries – has to be first compiled and then executed, instead of executed directly from an existing compiled query plan. “Initech#%92s” approach to this problem is basically to assume the database server will be able to handle the load, and they have not taken steps to improve the application in this area. So because “Initech” has a relatively poor query-plan-reuse profile, the database server spends a lot of CPU time compiling practically all incoming requests before running them. This is a scalability issue we#%92ve known about for some time, so it#%92s not really news.

    For example, our existing SQL Server 2000 setup performs around 240 SQL Compilations per second on average, during a busy period, which is a huge number. This number is not related to the size of databases or quantity of data, but instead to the number of users working in InitechWeb at the same time, and to the fact that it does not reuse the existing compiled queries very much,. As another, similar measure, the server is faced with as many as 140,000 “SQL Batches” per minute at peak times, and if all of them require compilation and cannot be reused, it is extremely busy compiling the code for each one before it can execute the work required.

    2.Following on 1, SQL Server 2005 includes deliberate/designed enhancements that, unfortunately, exacerbate the interaction with “Initech” InitechWeb. Specifically, the SQL Server 2005 engine is “more careful” in the SQL compilation process to find the very most optimal query plan for each incoming query – again, partly on the assumption that the resulting plan is likely to be reused, and therefore that it makes sense to expend the resources to make it better in quality. The underpinning of this is “better (but more intensive) compilation > faster execution.” If the application did reuse the plans, this would indeed be a good tradeoff. However, the specifics of how “Initech” works forces the server to mainly discard the plans and compile new ones from scratch. So the database server is compiling better quality query plans, which would probably execute its “actual” work faster, but is not allowed to use them more than once, in many cases. The result is increased CPU activity, where the application cannot take advantage of the results that would otherwise be realized from that increase.

    3.Finally, on the compilation issue, there is a “limiter” or “governor” built into SQL Server 2005 that may be contributing to the problem: the SQL Server team, again using the same set of assumptions about best practice for application design, assumes that a reasonable load of simultaneous SQL compilations is expected, but that a huge number constitutes a problem. They have built in a resource limiter that will kick in if the server is faced with an astronomical load of simultaneous requests for compilation, which shows as a lock on SQL Server#%92s internal processes called “RESOURCE_SEMAPHORE_QUERY_COMPILE.” During our failed upgrade, we saw this lock often, and one theory I have is that the normal/expected load of SQL Compilation generated by InitechWeb, under heavy load, might trigger this limiter in SQL Server 2005, because it exceeds what would reasonably be expected. It is not entirely certain, however, whether the presence of that lock type in our specific situation was the limiter or just a side-effect of the pegged CPUs.

    4.We also tested, in our one-hour, last-ditch effort to save this upgrade, a SQL Server configuration called “Forced Parameterization.” This is a server-side feature that essentially sets the SQL Server 2005 engine to deliberately ignore the format of the SQL queries coming in from the application insofar as it will force as many of them as possible to be converted into reusable query plans. In our test this seemed to help the figures for query plan reuse somewhat, but did not bring the overall server performance back to an acceptable level. It remains to be seen in detail why that is, but it is true that whole classes of queries generated by “Initech” are not eligible even for this form of forced parameterization, due to the “Initech” mechanism around “Proprietary_xxx” views used for the enforcement of row-level security, and various details of the structure of queries.

    5.Lastly, unrelated to the inner workings of SQL Server detailed above, we had an issue with the Windows service lsass.exe that appears when using the “Initech”/SQL Server 2005 combination under heavy load, which does not appear with SQL Server 2000. This is, again, a result of an interaction between the two products that only appears with a large number of users. Here I have had to make some educated guesses, because I cannot find a really authoritative explanation. The facts are these:

    a.Under SQL Server 2000, the lsass.exe service is not busy.
    b.Under SQL Server 2005, this service uses CPU resource in proportion to the number of logins per second, peaking at about 12-15% of total CPU utilization on the server. This represents a dramatic increase.
    c.The “Initech” InitechWeb application generates a LOT of login traffic between the database server and the web servers: about 100 logins per second, on average, for a busy period.

    What I think is happening is this: from SQL Server 2000 to 2005, I believe the login process added a layer for SSL encryption that was not there before, and is an effort to better secure the database server. In the larger scheme of things, this is a welcome enhancement. However, this encryption layer probably adds a small amount of overhead for each login to the server. For an application that uses a reasonable number of logins over time, the additional overhead would be trivial. However, if “Initech” InitechWeb causes in excess of 100 logins per second, it#%92s possible that the overhead for the SSL encryption for that frequency of logins actually adds up to a sizeable CPU load on the database server, which then competes with the SQL Server process itself for CPU time.

    What can be done

    There is no immediate fix for these issues, because the underlying causes are deeply and architecturally embedded in both systems (“Initech” and SQL Server). However, over the long term it seems imperative that “Initech” address these issues:

    1.Query plan reuse: “Initech” has to move their application to some system allowing query plan reuse at the server, whether that be though prepared statements, stored procedures, or some other mechanism. This will entail some revision of existing code. Failing that, the application will scale less well on SQL Server 2005 than it does running on SQL Server 2000, resulting in additional hardware expense/requirements. The new “Forced Parameterization” feature in SQL Server 2005 should be tested carefully as a workaround, but in my view is not really a solution to the underlying problem. XYZ Corp does not have the resources to adequately test that functionality with real workloads.

    2.The interaction of logins with lsass.exe: “Initech” will need to examine how the InitechWeb web application authenticates against the database server, determine why we see in excess of 100 logins per second, and whether that is the cause of the additional CPU load generated by the lsass.exe service when using SQL Server 2005. It may be necessary to redesign how InitechWeb connects to the database server, to make more efficient use of persistent and/or pooled connections.


  16. merrillaldrich New Member

    wingman - at first glance it looks like you're in a much better position than I was; for comparison, our system demands an average 250 compilations per second vs. your 24-50 and 100 logins per second to your 0.5; it was running on a two socket/single core 3 ghz Xeon setup at the time I created this post.

    How much headroom do you have for CPU utilization? What's the average now?
  17. WingSzeto Member

    I have one more question. How do you roll back from SQL 2005 to SQL 2000, particularly the user databases? Based on your description, you have gone live for at least one day and then decided to roll back to SQL 2000. How do you bring the user database back from SQL 2005 to SQL 2000? What tool did you use to do that?

    wingman

    quote:Originally posted by merrillaldrich

    The following is a report I created summarizing the reasons our upgrade failed. It's been anonymized to protect the players involved; when reading this, XYZ Corp is my employer, Initech is the vendor that created the software we use. For reference, this is all referencing an HP server having 2 Intel Xeon 3ghz processors, with two external RAID 10 arrays of 8 drives each and 24 GB RAM. The database(s) total about 150GB of information, and serve about 500-1000 users each day via a web app:

    “Initech” / SQL Server 2005 Post Mortem

    Prepared 2/19/2007 by Merrill Aldrich

    This document summarizes the reasons why our attempt to move our primary “Initech” database server from SQL Server 2000 to 2005 failed during the week of February 12, 2007. The information contained here is a combination of analysis by Microsoft SQL Server support engineer Mr. ____ and performance data collected here at XYZ Corp though our usual performance monitoring techniques. Detailed backup for these conclusions is available.

    What Happened

    1.XYZ Corp conducted internal testing of “Initech” running on a SQL Server 2005 test server for eight months, from about May of 2006 to January 2007. Those tests were successful within the limitations of what equipment was available: specifically, it was not possible to fully load test the combination of “Initech” and SQL Server 2005 with the same workload that we have on the production server, because there were not funds available to obtain a test server having the same hardware as the production server. Obtaining such a server as a test platform is truly cost-prohibitive. So the “Initech”/SQL Server 2005 combination was logic tested, and it was load tested up to the meaningful capacity of the machine on which it ran, and both tests were successful. Because in the past the production server has been “RAM starved,” and because SQL Server 2005 has some advantages in that area, it was hoped that the production server would have better overall performance after being upgraded. [This turned out to be an error, as the behavior of the “Initech” / SQL Server 2005 combination under a full load of users on the production server presented some surprises that we could not have reproduced on the test server, as we#%92ll see below.]

    2.On 2/10/2007 our main production server “InitechDB” was taken offline for the upgrade. The operating system (Windows Server 2003 Enterprise) was brought up to date with service packs and patches, including Service Pack 1.

    3.An in-place upgrade of SQL Server, from SQL Server 2000 Standard Edition SP 3 to SQL Server 2005 Standard Edition SP 1 was performed.

    4.The “Initech” databases and InitechWeb were successfully brought back online in “compatibility mode 80,” which emulates SQL Server 2000 for the client application, and ran smoothly.

    5.On Monday, 2/12/2007, the normal number of users logged on to the InitechWeb application, beginning with those in the east coast time zone (US). We did see the advantages we had hoped that SQL Server 2005 would provide. The application ran successfully until about 9:30 AM Pacific time, at which point the total user load for all four time zones came online. Unfortunately, at this point, the server#%92s CPU utilization was above 98%, and the application began to slow dramatically. Through the day, we determined that the server would not be able to serve data fast enough to keep the required number of users working, and that it was continuously “pegged.” The source of the bottleneck appeared to be CPU utilization.

    6.A service ticket was opened with Microsoft#%92s SQL Server professional support, which ticket was handled by engineer Mr. ____. Mr. ____ assisted us by running a data collection tool (pssdiag.exe) that collected trace information and performance statistics on the running server during its peak usage, and analyzed that data exhaustively.

    7.On Tuesday, 2/13/2007, we ran a final one-hour test of the SQL Server 2005 setup, in hopes that some configuration changes might mitigate the CPU problem. In particular, the server was adjusted to a) Turn off a function called “AWE,” which provides the SQL Server process with additional RAM above 3 GB, and b) Turn on a new feature available in SQL Server 2005 called “Forced Query Parameterization,” in an effort to reduce the CPU load caused by SQL Compilations.

    8.During this final test we collected performance data; however, we made the determination that the changes did not provide enough relief, and that the system could not be made to run satisfactorily in the time we had available. We would have to “roll back” the databases to SQL Server 2000 to keep the business functioning.

    9.On the afternoon of 2/13, we took the system offline once again, removed SQL Server 2005 and performed a complete rollback of the SQL Server software and data to the state it was in on the preceding Saturday, before the upgrade. The Windows operating system updates and patches were left in place.

    10.In the days following there was one lingering side effect of the change, which was an interaction between a Windows Server 2003 SP1 security function and the “Initech” InitechWeb application; we were able to work around that with a registry change on the morning of Wednesday, 2/14. (There were other issues, by coincidence, the same week, but all were the result of essentially bad timing, and had to do with “Initech” licensing and authentication, and the setup of new business. All in all, however, the week was what one might call a “Initech Perfect Storm.”)

    Why it Happened

    This was obviously a disastrous situation, but if there is any silver lining, it#%92s that we were able to gather very useful information about why this happened and what “Initech#%92s” development group can expect about the interaction between their application and SQL Server 2005.

    The crux of the problem was an increase in database server CPU utilization per concurrent user on the system: that is, the system worked in principle, but as more and more end-users connected to it and began using the InitechWeb application simultaneously, the demand placed on the CPUs in the database server increased dramatically. Ultimately the CPUs could not keep up, and the result was a very slow end-user experience. Despite the additional data and procedure cache memory that SQL Server 2005 provided, the overall performance was much worse than with 2000.

    After analyzing the measurements taken from the database server, we can trace that CPU activity to two underlying causes: the volume of “SQL Compilations” being performed and the behavior of a Windows/operating system component called lsass.exe. These two factors arose out of an interaction between the “Initech” InitechWeb application and SQL Server 2005 that was only apparent on a system under full production load. That is, it was neither “SQL Server#%92s fault” nor particularly “Initech#%92s fault” but instead an outcome of differences between how the SQL Server engineers imagine that a scalable application is constructed and how “Initech” is actually built. So while neither one technically “broke,” they could not function together at the very demanding performance level required for the number of users we support.

    Specifically, here are the differences:

    1.SQL Server is constructed in such a way that it assumes a large/scalable application (like “Initech” InitechWeb) that supports many users and large sets of data will take advantage of a concept called “Query Plan Re-use.” Essentially this means that the application will “present” database queries to the database server in one of several specific formats that allow the database to compile the query once and then reuse it, without recompiling it, whenever the application requests the same or similar information later. There is a large savings in CPU utilization from this technique, and it makes the database server run faster and support more users. This is a well-known concept in the database world, and has been true for some time – it#%92s not new to SQL Server 2005. The mechanisms to gain this advantage are variously called Stored Procedures, Parameterized (or auto-parameterized) Queries, or Prepared Statements.

    The “Initech” application, unfortunately, does not make good use of this mechanism, with the consequence that nearly every request it presents to the database server – even very similar queries – has to be first compiled and then executed, instead of executed directly from an existing compiled query plan. “Initech#%92s” approach to this problem is basically to assume the database server will be able to handle the load, and they have not taken steps to improve the application in this area. So because “Initech” has a relatively poor query-plan-reuse profile, the database server spends a lot of CPU time compiling practically all incoming requests before running them. This is a scalability issue we#%92ve known about for some time, so it#%92s not really news.

    For example, our existing SQL Server 2000 setup performs around 240 SQL Compilations per second on average, during a busy period, which is a huge number. This number is not related to the size of databases or quantity of data, but instead to the number of users working in InitechWeb at the same time, and to the fact that it does not reuse the existing compiled queries very much,. As another, similar measure, the server is faced with as many as 140,000 “SQL Batches” per minute at peak times, and if all of them require compilation and cannot be reused, it is extremely busy compiling the code for each one before it can execute the work required.

    2.Following on 1, SQL Server 2005 includes deliberate/designed enhancements that, unfortunately, exacerbate the interaction with “Initech” InitechWeb. Specifically, the SQL Server 2005 engine is “more careful” in the SQL compilation process to find the very most optimal query plan for each incoming query – again, partly on the assumption that the resulting plan is likely to be reused, and therefore that it makes sense to expend the resources to make it better in quality. The underpinning of this is “better (but more intensive) compilation > faster execution.” If the application did reuse the plans, this would indeed be a good tradeoff. However, the specifics of how “Initech” works forces the server to mainly discard the plans and compile new ones from scratch. So the database server is compiling better quality query plans, which would probably execute its “actual” work faster, but is not allowed to use them more than once, in many cases. The result is increased CPU activity, where the application cannot take advantage of the results that would otherwise be realized from that increase.

    3.Finally, on the compilation issue, there is a “limiter” or “governor” built into SQL Server 2005 that may be contributing to the problem: the SQL Server team, again using the same set of assumptions about best practice for application design, assumes that a reasonable load of simultaneous SQL compilations is expected, but that a huge number constitutes a problem. They have built in a resource limiter that will kick in if the server is faced with an astronomical load of simultaneous requests for compilation, which shows as a lock on SQL Server#%92s internal processes called “RESOURCE_SEMAPHORE_QUERY_COMPILE.” During our failed upgrade, we saw this lock often, and one theory I have is that the normal/expected load of SQL Compilation generated by InitechWeb, under heavy load, might trigger this limiter in SQL Server 2005, because it exceeds what would reasonably be expected. It is not entirely certain, however, whether the presence of that lock type in our specific situation was the limiter or just a side-effect of the pegged CPUs.

    4.We also tested, in our one-hour, last-ditch effort to save this upgrade, a SQL Server configuration called “Forced Parameterization.” This is a server-side feature that essentially sets the SQL Server 2005 engine to deliberately ignore the format of the SQL queries coming in from the application insofar as it will force as many of them as possible to be converted into reusable query plans. In our test this seemed to help the figures for query plan reuse somewhat, but did not bring the overall server performance back to an acceptable level. It remains to be seen in detail why that is, but it is true that whole classes of queries generated by “Initech” are not eligible even for this form of forced parameterization, due to the “Initech” mechanism around “Proprietary_xxx” views used for the enforcement of row-level security, and various details of the structure of queries.

    5.Lastly, unrelated to the inner workings of SQL Server detailed above, we had an issue with the Windows service lsass.exe that appears when using the “Initech”/SQL Server 2005 combination under heavy load, which does not appear with SQL Server 2000. This is, again, a result of an interaction between the two products that only appears with a large number of users. Here I have had to make some educated guesses, because I cannot find a really authoritative explanation. The facts are these:

    a.Under SQL Server 2000, the lsass.exe service is not busy.
    b.Under SQL Server 2005, this service uses CPU resource in proportion to the number of logins per second, peaking at about 12-15% of total CPU utilization on the server. This represents a dramatic increase.
    c.The “Initech” InitechWeb application generates a LOT of login traffic between the database server and the web servers: about 100 logins per second, on average, for a busy period.

    What I think is happening is this: from SQL Server 2000 to 2005, I believe the login process added a layer for SSL encryption that was not there before, and is an effort to better secure the database server. In the larger scheme of things, this is a welcome enhancement. However, this encryption layer probably adds a small amount of overhead for each login to the server. For an application that uses a reasonable number of logins over time, the additional overhead would be trivial. However, if “Initech” InitechWeb causes in excess of 100 logins per second, it#%92s possible that the overhead for the SSL encryption for that frequency of logins actually adds up to a sizeable CPU load on the database server, which then competes with the SQL Server process itself for CPU time.

    What can be done

    There is no immediate fix for these issues, because the underlying causes are deeply and architecturally embedded in both systems (“Initech” and SQL Server). However, over the long term it seems imperative that “Initech” address these issues:

    1.Query plan reuse: “Initech” has to move their application to some system allowing query plan reuse at the server, whether that be though prepared statements, stored procedures, or some other mechanism. This will entail some revision of existing code. Failing that, the application will scale less well on SQL Server 2005 than it does running on SQL Server 2000, resulting in additional hardware expense/requirements. The new “Forced Parameterization” feature in SQL Server 2005 should be tested carefully as a workaround, but in my view is not really a solution to the underlying problem. XYZ Corp does not have the resources to adequately test that functionality with real workloads.

    2.The interaction of logins with lsass.exe: “Initech” will need to examine how the InitechWeb web application authenticates against the database server, determine why we see in excess of 100 logins per second, and whether that is the cause of the additional CPU load generated by the lsass.exe service when using SQL Server 2005. It may be necessary to redesign how InitechWeb connects to the database server, to make more efficient use of persistent and/or pooled connections.


  18. WingSzeto Member

    On average between 7:00 am and 5:00 am, CPU is about 30%. Sometimes it could hit 60~80% maybe once or twice per hour. I sent you another question 5 five minutes ago, which I am also extremely interested. Please reply asap if possible. Thanks.


    quote:Originally posted by merrillaldrich

    wingman - at first glance it looks like you're in a much better position than I was; for comparison, our system demands an average 250 compilations per second vs. your 24-50 and 100 logins per second to your 0.5; it was running on a two socket/single core 3 ghz Xeon setup at the time I created this post.

    How much headroom do you have for CPU utilization? What's the average now?
  19. satya Moderator

    Check what kind of queries are running during at 30% and 80% CPU spike, take help from profiler and Perfmon to see the system resource usage.

    Satya SKJ
    Microsoft SQL Server MVP
    Writer, Contributing Editor & Moderator
    http://www.SQL-Server-Performance.Com
    This posting is provided AS IS with no rights for the sake of knowledge sharing. The greatest discovery of my generation is that a human being can alter his life by altering his attitudes of mind.
  20. merrillaldrich New Member

    quote:Originally posted by WingSzeto

    I have one more question. How do you roll back from SQL 2005 to SQL 2000, particularly the user databases? Based on your description, you have gone live for at least one day and then decided to roll back to SQL 2000. How do you bring the user database back from SQL 2005 to SQL 2000? What tool did you use to do that?

    wingman


    We could not save the data. Two days of work were lost; we restored a backup from before the upgrade. That is why this issue is so serious.
  21. WingSzeto Member

    You mention about the data lost because of the rollback. I am thinking from the data standpoint, would it be another option to create a new SQL 2K database with all the tables with no data, then either use bcp or DTS to transfer the data back in? It will require some downtime,however and I can also see if it will take time depending on how much data involved. It is either dealing with the data lost or down time for transfer the data. Is the DTS a potential option? I am doing upgrade also and just want to find a contingency plan in case if we are having the same problem.


    quote:Originally posted by merrillaldrich

    The following is a report I created summarizing the reasons our upgrade failed. It's been anonymized to protect the players involved; when reading this, XYZ Corp is my employer, Initech is the vendor that created the software we use. For reference, this is all referencing an HP server having 2 Intel Xeon 3ghz processors, with two external RAID 10 arrays of 8 drives each and 24 GB RAM. The database(s) total about 150GB of information, and serve about 500-1000 users each day via a web app:

    “Initech” / SQL Server 2005 Post Mortem

    Prepared 2/19/2007 by Merrill Aldrich

    This document summarizes the reasons why our attempt to move our primary “Initech” database server from SQL Server 2000 to 2005 failed during the week of February 12, 2007. The information contained here is a combination of analysis by Microsoft SQL Server support engineer Mr. ____ and performance data collected here at XYZ Corp though our usual performance monitoring techniques. Detailed backup for these conclusions is available.

    What Happened

    1.XYZ Corp conducted internal testing of “Initech” running on a SQL Server 2005 test server for eight months, from about May of 2006 to January 2007. Those tests were successful within the limitations of what equipment was available: specifically, it was not possible to fully load test the combination of “Initech” and SQL Server 2005 with the same workload that we have on the production server, because there were not funds available to obtain a test server having the same hardware as the production server. Obtaining such a server as a test platform is truly cost-prohibitive. So the “Initech”/SQL Server 2005 combination was logic tested, and it was load tested up to the meaningful capacity of the machine on which it ran, and both tests were successful. Because in the past the production server has been “RAM starved,” and because SQL Server 2005 has some advantages in that area, it was hoped that the production server would have better overall performance after being upgraded. [This turned out to be an error, as the behavior of the “Initech” / SQL Server 2005 combination under a full load of users on the production server presented some surprises that we could not have reproduced on the test server, as we#%92ll see below.]

    2.On 2/10/2007 our main production server “InitechDB” was taken offline for the upgrade. The operating system (Windows Server 2003 Enterprise) was brought up to date with service packs and patches, including Service Pack 1.

    3.An in-place upgrade of SQL Server, from SQL Server 2000 Standard Edition SP 3 to SQL Server 2005 Standard Edition SP 1 was performed.

    4.The “Initech” databases and InitechWeb were successfully brought back online in “compatibility mode 80,” which emulates SQL Server 2000 for the client application, and ran smoothly.

    5.On Monday, 2/12/2007, the normal number of users logged on to the InitechWeb application, beginning with those in the east coast time zone (US). We did see the advantages we had hoped that SQL Server 2005 would provide. The application ran successfully until about 9:30 AM Pacific time, at which point the total user load for all four time zones came online. Unfortunately, at this point, the server#%92s CPU utilization was above 98%, and the application began to slow dramatically. Through the day, we determined that the server would not be able to serve data fast enough to keep the required number of users working, and that it was continuously “pegged.” The source of the bottleneck appeared to be CPU utilization.

    6.A service ticket was opened with Microsoft#%92s SQL Server professional support, which ticket was handled by engineer Mr. ____. Mr. ____ assisted us by running a data collection tool (pssdiag.exe) that collected trace information and performance statistics on the running server during its peak usage, and analyzed that data exhaustively.

    7.On Tuesday, 2/13/2007, we ran a final one-hour test of the SQL Server 2005 setup, in hopes that some configuration changes might mitigate the CPU problem. In particular, the server was adjusted to a) Turn off a function called “AWE,” which provides the SQL Server process with additional RAM above 3 GB, and b) Turn on a new feature available in SQL Server 2005 called “Forced Query Parameterization,” in an effort to reduce the CPU load caused by SQL Compilations.

    8.During this final test we collected performance data; however, we made the determination that the changes did not provide enough relief, and that the system could not be made to run satisfactorily in the time we had available. We would have to “roll back” the databases to SQL Server 2000 to keep the business functioning.

    9.On the afternoon of 2/13, we took the system offline once again, removed SQL Server 2005 and performed a complete rollback of the SQL Server software and data to the state it was in on the preceding Saturday, before the upgrade. The Windows operating system updates and patches were left in place.

    10.In the days following there was one lingering side effect of the change, which was an interaction between a Windows Server 2003 SP1 security function and the “Initech” InitechWeb application; we were able to work around that with a registry change on the morning of Wednesday, 2/14. (There were other issues, by coincidence, the same week, but all were the result of essentially bad timing, and had to do with “Initech” licensing and authentication, and the setup of new business. All in all, however, the week was what one might call a “Initech Perfect Storm.”)

    Why it Happened

    This was obviously a disastrous situation, but if there is any silver lining, it#%92s that we were able to gather very useful information about why this happened and what “Initech#%92s” development group can expect about the interaction between their application and SQL Server 2005.

    The crux of the problem was an increase in database server CPU utilization per concurrent user on the system: that is, the system worked in principle, but as more and more end-users connected to it and began using the InitechWeb application simultaneously, the demand placed on the CPUs in the database server increased dramatically. Ultimately the CPUs could not keep up, and the result was a very slow end-user experience. Despite the additional data and procedure cache memory that SQL Server 2005 provided, the overall performance was much worse than with 2000.

    After analyzing the measurements taken from the database server, we can trace that CPU activity to two underlying causes: the volume of “SQL Compilations” being performed and the behavior of a Windows/operating system component called lsass.exe. These two factors arose out of an interaction between the “Initech” InitechWeb application and SQL Server 2005 that was only apparent on a system under full production load. That is, it was neither “SQL Server#%92s fault” nor particularly “Initech#%92s fault” but instead an outcome of differences between how the SQL Server engineers imagine that a scalable application is constructed and how “Initech” is actually built. So while neither one technically “broke,” they could not function together at the very demanding performance level required for the number of users we support.

    Specifically, here are the differences:

    1.SQL Server is constructed in such a way that it assumes a large/scalable application (like “Initech” InitechWeb) that supports many users and large sets of data will take advantage of a concept called “Query Plan Re-use.” Essentially this means that the application will “present” database queries to the database server in one of several specific formats that allow the database to compile the query once and then reuse it, without recompiling it, whenever the application requests the same or similar information later. There is a large savings in CPU utilization from this technique, and it makes the database server run faster and support more users. This is a well-known concept in the database world, and has been true for some time – it#%92s not new to SQL Server 2005. The mechanisms to gain this advantage are variously called Stored Procedures, Parameterized (or auto-parameterized) Queries, or Prepared Statements.

    The “Initech” application, unfortunately, does not make good use of this mechanism, with the consequence that nearly every request it presents to the database server – even very similar queries – has to be first compiled and then executed, instead of executed directly from an existing compiled query plan. “Initech#%92s” approach to this problem is basically to assume the database server will be able to handle the load, and they have not taken steps to improve the application in this area. So because “Initech” has a relatively poor query-plan-reuse profile, the database server spends a lot of CPU time compiling practically all incoming requests before running them. This is a scalability issue we#%92ve known about for some time, so it#%92s not really news.

    For example, our existing SQL Server 2000 setup performs around 240 SQL Compilations per second on average, during a busy period, which is a huge number. This number is not related to the size of databases or quantity of data, but instead to the number of users working in InitechWeb at the same time, and to the fact that it does not reuse the existing compiled queries very much,. As another, similar measure, the server is faced with as many as 140,000 “SQL Batches” per minute at peak times, and if all of them require compilation and cannot be reused, it is extremely busy compiling the code for each one before it can execute the work required.

    2.Following on 1, SQL Server 2005 includes deliberate/designed enhancements that, unfortunately, exacerbate the interaction with “Initech” InitechWeb. Specifically, the SQL Server 2005 engine is “more careful” in the SQL compilation process to find the very most optimal query plan for each incoming query – again, partly on the assumption that the resulting plan is likely to be reused, and therefore that it makes sense to expend the resources to make it better in quality. The underpinning of this is “better (but more intensive) compilation > faster execution.” If the application did reuse the plans, this would indeed be a good tradeoff. However, the specifics of how “Initech” works forces the server to mainly discard the plans and compile new ones from scratch. So the database server is compiling better quality query plans, which would probably execute its “actual” work faster, but is not allowed to use them more than once, in many cases. The result is increased CPU activity, where the application cannot take advantage of the results that would otherwise be realized from that increase.

    3.Finally, on the compilation issue, there is a “limiter” or “governor” built into SQL Server 2005 that may be contributing to the problem: the SQL Server team, again using the same set of assumptions about best practice for application design, assumes that a reasonable load of simultaneous SQL compilations is expected, but that a huge number constitutes a problem. They have built in a resource limiter that will kick in if the server is faced with an astronomical load of simultaneous requests for compilation, which shows as a lock on SQL Server#%92s internal processes called “RESOURCE_SEMAPHORE_QUERY_COMPILE.” During our failed upgrade, we saw this lock often, and one theory I have is that the normal/expected load of SQL Compilation generated by InitechWeb, under heavy load, might trigger this limiter in SQL Server 2005, because it exceeds what would reasonably be expected. It is not entirely certain, however, whether the presence of that lock type in our specific situation was the limiter or just a side-effect of the pegged CPUs.

    4.We also tested, in our one-hour, last-ditch effort to save this upgrade, a SQL Server configuration called “Forced Parameterization.” This is a server-side feature that essentially sets the SQL Server 2005 engine to deliberately ignore the format of the SQL queries coming in from the application insofar as it will force as many of them as possible to be converted into reusable query plans. In our test this seemed to help the figures for query plan reuse somewhat, but did not bring the overall server performance back to an acceptable level. It remains to be seen in detail why that is, but it is true that whole classes of queries generated by “Initech” are not eligible even for this form of forced parameterization, due to the “Initech” mechanism around “Proprietary_xxx” views used for the enforcement of row-level security, and various details of the structure of queries.

    5.Lastly, unrelated to the inner workings of SQL Server detailed above, we had an issue with the Windows service lsass.exe that appears when using the “Initech”/SQL Server 2005 combination under heavy load, which does not appear with SQL Server 2000. This is, again, a result of an interaction between the two products that only appears with a large number of users. Here I have had to make some educated guesses, because I cannot find a really authoritative explanation. The facts are these:

    a.Under SQL Server 2000, the lsass.exe service is not busy.
    b.Under SQL Server 2005, this service uses CPU resource in proportion to the number of logins per second, peaking at about 12-15% of total CPU utilization on the server. This represents a dramatic increase.
    c.The “Initech” InitechWeb application generates a LOT of login traffic between the database server and the web servers: about 100 logins per second, on average, for a busy period.

    What I think is happening is this: from SQL Server 2000 to 2005, I believe the login process added a layer for SSL encryption that was not there before, and is an effort to better secure the database server. In the larger scheme of things, this is a welcome enhancement. However, this encryption layer probably adds a small amount of overhead for each login to the server. For an application that uses a reasonable number of logins over time, the additional overhead would be trivial. However, if “Initech” InitechWeb causes in excess of 100 logins per second, it#%92s possible that the overhead for the SSL encryption for that frequency of logins actually adds up to a sizeable CPU load on the database server, which then competes with the SQL Server process itself for CPU time.

    What can be done

    There is no immediate fix for these issues, because the underlying causes are deeply and architecturally embedded in both systems (“Initech” and SQL Server). However, over the long term it seems imperative that “Initech” address these issues:

    1.Query plan reuse: “Initech” has to move their application to some system allowing query plan reuse at the server, whether that be though prepared statements, stored procedures, or some other mechanism. This will entail some revision of existing code. Failing that, the application will scale less well on SQL Server 2005 than it does running on SQL Server 2000, resulting in additional hardware expense/requirements. The new “Forced Parameterization” feature in SQL Server 2005 should be tested carefully as a workaround, but in my view is not really a solution to the underlying problem. XYZ Corp does not have the resources to adequately test that functionality with real workloads.

    2.The interaction of logins with lsass.exe: “Initech” will need to examine how the InitechWeb web application authenticates against the database server, determine why we see in excess of 100 logins per second, and whether that is the cause of the additional CPU load generated by the lsass.exe service when using SQL Server 2005. It may be necessary to redesign how InitechWeb connects to the database server, to make more efficient use of persistent and/or pooled connections.


  22. satya Moderator

    Contingency plan during the upgrade is only way by the backups, so once you have upgraded the application is sailing on the new version try to fine tune where the performance is lacking.

    We have had similar approach for a half a TB database upgraded from SQL 2K to 2k5 without any issues, before that I have spent most of my time with UPgrade advisor by looking at the counters and tested the application thorougly. Also you do have other tools to stress test the application with the new hardware and software, this way you can eliminate few issues like performance or downgrade of a simple query execution.

    http://www.sql-server-performance.com/jk_sql_server_upgrade.asp - BP on upgrade
    Few blog post on the subject:
    http://sqlserver-qa.net/blogs/bc/ar...y-easy-to-upgrade-from-previous-versions.aspx
    http://sqlserver-qa.net/blogs/tools...-performance-after-an-upgrade-is-so-slow.aspx
    http://sqlserver-qa.net/blogs/perft...ow-after-upgrading-from-sql-2000-to-2005.aspx

    Satya SKJ
    Microsoft SQL Server MVP
    Writer, Contributing Editor & Moderator
    http://www.SQL-Server-Performance.Com
    This posting is provided AS IS with no rights for the sake of knowledge sharing. The greatest discovery of my generation is that a human being can alter his life by altering his attitudes of mind.
  23. merrillaldrich New Member

    Ah, key thing is you say "new hardware and software." I had exactly one server <img src='/community/emoticons/emotion-1.gif' alt=':)' />
  24. georgej New Member

    merrillaldrich;

    You have good exapmle on the trouble situation. Looks like it is needed to perform significant tests before upgrade to SQL 2005 from SQL 2000.
    I will carefully consider my upgrade plan now.

  25. CG2000 New Member

    Hmmm..

    I see that there was an in-place upgrade instead of a side-by-side upgrade path.

    I see a SP upgrade along with a application upgrade on a production box on what seems like a critical app/db, in the same time frame.

    I see that the developer claims to have throughly tested the application.

    After reading the report and having to assign responsibility I would be much harsher on the developer than on Microsoft (honestly I really can't see how Microsoft was specifically to blame in any of what you reported). The developer recommended the upgrade, it sounds like they claim to be competent developers (or you have chosen to use them on that basis), they claim to have thoroughly tested the application/database, then you find out the application has basic design flaws that effect the application performance and results in a failed upgrade.

    Having been in the same type of situation before I would not hesitate to give no end of grief to the developer for a failed upgrade and what looks like the possible loss of a few hundred man hours of work and a possible headache in data migration from those hours it was in use to flip back to the SQL 2000 instance, then roll back to the old app - with problems. There are a lot of half-assed developers and development companies who are MS "gold", "platnium", "..Uranium" partners who have no clue in how to deliver a quality applications with good customer service - but you know they all yap endlessly about how great they are.

    In this case I would be skeptical about their claims of testing, given the errors involved they should have been fairly apparent even on a lower-spec'ed system with end-to-end load testing. In which case they should have never rolled it out. It doen't even sound like they were that involved in trouble shooting when it all hit the fan - calling MS to help run performance testing really shouldn't be the developers first stage in troubleshooting a failed installation, that smells a little of desperation.

    But in a production enviroment you have to expect half-assed developers, you have to plan for them within *your* process as a systems/DBA resource. The developer/vendor wants to upgrade an application or database - have them knock it up on a spare server (and it doesn't have to be a monster server) and do some pre-production testing with a limited group of users (end-to-end testing). I have had developers/vendors swear up and down how reliable their app is and how many simular companies use it in simular enviroments use it without any issues - and they are angry that you are impuning their good name to force testing on a pre-production enviroment - then they set it up and it fails miserably. Of course there is always an excuse on their end and no end of promises and guarentees of how they will fix it - but what does that matter if your critical app would have been down for a day and everyone in the company is barking at "another" IT/IS failure (even your first failure seems to always turn into "another" failure statement).

    When it comes to upgrades have a defined BU plan for switching back to the working production system rapidly. If the situation involves upgrades to O/S and dbs - have system as well as db backup's ready - as well as defined timeframe for service implimentation. If the performance does not fall within acceptable agreed upon performance parameters within a defined timeframe (ie "If the damn application doesn't work at least as well as the current version and you can't get it working properly by noon - I am tossing you bodily out of my server room - and doing a full system restore"). You can't completely eliminate problems but you sure can minimize them.

    I don't want this to come off as some diatribe, it is an informative report, and it shows real world problems - without a fluffy ending like you get say in a MS (marketing?) case senario. Depending on who the forensic analysis was written for maybe some details were obscured. But in this case I don't think blaming MS for their role in this situation would be any more warrented than blaming HP for theirs.

    My 2cents...
  26. merrillaldrich New Member

    CG2000 - I agree with your assessment; the vendor (developers) providing this app is less than stellar, didn't / couldn't provide really complete performance data. They are, thankfully, improving their process somewhat since this happened. I don't really blame MS, because they have been recommending certain practices for a long time when building scalable systems, and this vendor has willfully ignored those recommendations. I like SS 2005 very much, and it was really the vendor's architectural decisions (or lack of) that are to blame.

    I do wish, though, that I'd had more information indicating that apps with heavy ad-hoc SQL might regress in performance with 2005, which was the main reason to post this story.
  27. CG2000 New Member

    Merrillalddrich;<br /><br />The "140,000 “SQL Batches” per minute at peak times" for 100 users per second starts to raise flags about what they are doing on the development side, and how much overhead could be reduced with tighter code (mind you I have no idea how complicated the app is or what it is doing). <br /><br />MS does warn in a number of it's MS Press books about the negative impact of Ad Hoc queries (so does Oracle for that matter) - but it is difficult to judge the imapact in a canned app. And you can't really lecture the developer about using "better" coding when you can't judge between better and worse for the same app in any qualitative way. However you can talk to the developer about benchmarks and performance - when they talk about features, you focus on increased performance and make them understand it is a priority and that it will make and break the continuation of the the product /service contract. If it is big enough contract you can build performance benchmarks into the product/service contract and hold it over their head if things go wrong - or if the developer has a history problems with delivering.<br /><br />Its always a fight though, seems like more and more companies that have decent IT/IS capabilities will sandbox everything before it goes into production. Because enviroments are so complex and internal service agreements are tight (only a certain % of allowed downtime) you can't have an app go down or impact other systems. Some developers get it and some don't, a lot of them are so marketing focused that they push bad products and versions with the idea as long as it is sold and in place they can always fix it later - without regard to the internal cost to the customer.<br /><br />Anyway every implementation in the can is a good learning experience, and you always learn more from the bad ones - which usually result in better strategies for the next time - or a new job at a new company...whatever makes more sense....<br /><br /><img src='/community/emoticons/emotion-5.gif' alt=';-)' /><br /><br />
  28. vich New Member

    Merrill - Thanks for the post. You're quite the writer. I was gripped by you in-depth executive summary.
    It's interesting that SS 2005 SQL compiler is taking so much longer. It surely must have crossed their minds about the possibible ramification - and perhaps provided a switch that could tone it back, or perhaps keep statistics and only do the more-thorough compilation on queries proven to be repeated (so do a level-2 recompile on the 3nd or 4th use).
    I think I would shed some blame on them for not recognizing that your disaster could be foretold, particularly considering no roll-back path other than restore (and therefore loosing data)!
    Further - it occurs to me that Microsoft isn't always forthcoming with suggesting all available solutions. In the heat of the moment, I wonder if that engineer didn't have time to discover some switch like that that may have been designed into it. I might be worth the inquiry.
    It also crossed my mind that, surely, you must be hitting a CPU wall even with just SQL 2000. Even a less intense SQL compiler would cry at 250 compilations per second. I would think that your natural upgrade path would be additional, more modern, CPUs. Also; now that you're aware of the issues, you're in a much better position for recreating the test threasholds in preparation for your next upgrade.
    Finally; I want to ask about the special SQL 2000 compatibility feature you mentioned? I am currently deciding on an upgrade path for our fairly small database. I'm a 1-man-show in for a small manufacturer and I have a serious deadline to complete implementation of an MRP/Inventory module (read: double the DB load), and we're currently approaching "the wall" on our server.
    First, I'm going to upgrade our IBM x235 2 Zeon processor system (2.6GHz, 2 processor, 3GB RAM) to something substantially better (Zeon quad core, 8GB RAM, faster 15KRPM arrays). But; I want to use AWE yet don't want to waste money installing SQL 2000 Enterprise, just to upgrade to SQL 2005 Enterprise next year when I have breathing time. (however; it just occurred to me that if I buy MS Software Assurance, I'll get a free upgrade)
    (rambling, sorry). So; this SQL 2005 compatibility feature sounds perfect! Any further advice on that? Is it known to work? Are you certain it's not related to your issues?
  29. merrillaldrich New Member

    Hi - not sure I completely understand the question, but I'll give it a shot:
    SQL Server 2005 has a "compatibility mode" that you can set "back" to 8.0 (SQL Server 2000). This will NOT change the internal workings of 2005, with the new optimizer and so on -- it is not running as if it were 2000. What it does is make the new version mimic the client interface of the old one, so that the client software connecting to it "imagines" that there is a SQL Server 2000 instance at the other end of the connection, when in fact it is a 2005 instance. Very handy.
    As far as I know, that works well. I have a couple systems running in that mode. I did not see any relationship between that function and my issues with the upgrade described here.


  30. voron999 New Member

    Found this old, but usefull thread..............
    [quote user="merrillaldrich"]
    ........SQL Server 2005 has a "compatibility mode" that you can set "back" to 8.0 (SQL Server 2000). ..........

    [/quote]
    Merill, I do not beleive there exists such a thing as "SQL SERVER compatibility mode". The wording above maybe confusing to some.
    I assume you are talking about the "database compatibility mode". On the same instance you may have databases in compatibility modes of 90 and 80 at once. You can change those back and forth, usually (code permitting).
    Whereas you can not switch the engine itself between different modes. There is nothing to switch.
    Any case, keep up the good work. Good writeup.
    We getting ready to push out in-place upgrade to SQL 2005 on about 40-50 servers. Including one cluster instance. Including live transactional replication across the system. Including many cross-server calls using linked servers. Sounds pretty scary. Fortunatelly, production load is not that high.
    Setting up a test lab at the moment for some practice runs. Hence, I am very interested in any cases of failures and rollbacks. :)
    Cheers.
  31. satya Moderator

    Its a general assumption that when you talk about SQL Server compatibility, it applies to the database level only as you need to use SP_DBCMPTLEVL..... statement.
  32. ginger8990 New Member

    Hi, there:
    I have some issues when doing SQL 2005 upgrade. After I enter windwos authentication mode, the error message comes up --UpgradeAdvisor returned -1
    error message then bring me to MS web site:
    you selected mixed Mode authentication but didn't provide a strong password. I even enter Sa logn 15 charactor password, it still get the same errro.
    We are running server 2003 sp2.
    I did fine on another test server with the same application and settings
    Any idea?

Share This Page