SQL Server Performance

Database Snapshot (EMC SAN)

Discussion in 'Performance Tuning for Hardware Configurations' started by simondm, Apr 4, 2005.

  1. simondm New Member

    This is a little bit specific but I'm really running out of places to turn:<br /><br />I've got an EMC CX-700 SAN which I host several clusters from.<br /><br />The servers are all Win 2003 Ent (no SP) with SQL 2000 (Build 81<img src='/community/emoticons/emotion-11.gif' alt='8)' />. Active/Passive clusters with 2 nodes.<br /><br />Several of the clusters perform snapshots and clones with no problems at all. However one of them hangs everytime between the FREEZEIO statement and the THAWIO statement. When I say hangs, you cannot connect to SQL, but you can ping the server. If you have a sessions and are running a query it "waits" and only completes when the snapshot is finished, <br /><br />We have a "consultant" in from EMC - but he doesn't have a clue and keeps suggesting we talk to MS - but I know MS won't talk to us because this is unsupported - EMC have an agreement with them to use this functionality.<br /><br />No errors are logged and the clone or snapshot appear to complete sucessfully.<br /><br />As it happend this server has very high disk IO. By far the highest off all the clusters. I'm not sure how the FREEZEIO command works or if there is any logging it does.<br /><br />Can anyone suggest anything I can check? Or even if I should turn to MS or leave it with EMC.<br /><br />Thanks
  2. joechang New Member

    this is required for the Snapshot,
    the SQL Server database must be put into a consistent state,
    at given point in time, the database must be made consistent, so for all completed transactions, dirty buffers must be flushed to disk, during which no new transactions can complete.
    i assume only the busy system exhibits the freeze, while snapshots completes quickly on the inactive systems?
    basically, this is not a "free" operation, depending on the time for the consistency op
  3. simondm New Member

    I can understand that the snapshot requires some resources (if not alot), but hanging the system for 3-4 minutes seems rather extreme.

    It was sold to us as a "seemless and instant" backup solution.

    Whilst the other clusters do not have as much disk activity they are also quite busy, they are far from idle - maybe 1500 transcations per second. Whilst the troublesome server is 3000 per second.

  4. joechang New Member

    believe it or not, EMC does have good SQL Server expertise, but it is very well hidden from most EMC front-line people.
    there is suppose to be a very good EMC whitepaper regarding SQL Server performance, but it may not be publicly available, and most probably your consultant has never read it or does not believe it.

    another consideration is the disk loading on your current system.
    i do know that likes to sell mis-configured systems, ie, the big 700 with too few disks.

    what is the # reads & write/sec
    r&w avg queue depth
    avg sec / r & w
    for each LUN on the good & bad SNAPSHOT system

  5. simondm New Member

    Hopefully this posts ok:



    Good Bad
    AvgMaxAvgMax
    Disk Reads/Sec 1122357211
    Disk Writes/Sec 423121790
    Avg Disk Read Queue Length0.082.80.31.2
    Avg Disk Write Queue Length0.0030.30.052.3



    These servers have many LUNs (about 9 each) so I haven't provided stats for all - I've just taken the busiest drive on each machine. These LUN's are RAID 1+0 with caching enabled. Using the SQL StressIO tool from MS they performed incredibly well in our test environment compared to our old servers which used external array's.

    To me the the stats on the server with problems are not too bad - there is no major queue at least.

  6. joechang New Member

    it the disk latency values are very important, Avg Disk sec/Read & Write
    as these values are not dependent on the size of the array.
    but clearly the bad one has more writes, (how many dedicated physical disks for the above LUN?)

    without knowing exactly what your app does, and the exact config of the SAN (# of disks, # of RAID groups, how LUNs are allocated to each server) it does seem that 3-4min is severe.
    ie, someone did not size this correctly

    i would start asking EMC for some one who knows something about this aspect of their product & SQL Server.
    This definitely needs to be resolved by EMC as it is most likely a sizing and configuration issue on the CX700, so unless it there is an error message in the SQL Server log or something like that,

    to be fair, EMC has a very large product line, and each individual consultant/engineeer may only know a few areas, but still, he should admit what he does not know and find the right person.
    You might ask your EMC sales rep to start looking for another technical resource that specializes in this subject,
    I know a couple of people that have worked with this more than i have, but the work for Unisys, not EMC
  7. simondm New Member

    Thanks for your help.

    When you say "ie, someone did not size this correctly" - how do you mean. LUN config or the actual SAN itself?

    As it happens we intentionally went for lots of 36GB drives rather than 73GB or 146GB so that we had maximum IO throughput - IO has always been an issue on this database, and we have done a lot to lower it.

    The main data drive is very near capacity (about 92% of the 100GB) - could this cause a performance decrease when cloning? I'm working on freeing up space over the next few days.

    It's proving very hard to find anyone at EMC who can help, they seem to have very little resources in the UK. The one and only consultant we've been able to get is travelling 200 miles everyday to get here (and we're in London!). Needless to say I agree with you that we need someone from EMC to find the answers - we didn't invest nearly a million pounds for this.

    Thanks again
  8. derrickleggett New Member

    I feel your pain on this one. If it's any comfort to you, the consultants they send us in the USA for SQL Server aren't any better. [<img src='/community/emoticons/emotion-1.gif' alt=':)' />] We had two consultants here for three weeks who weren't even Windows people. We ended up solving half the issues we had because they couldn't figure them out, and weren't able to reach anyone at EMC with answers. <br /><br />I'm assuming you're using ERM. What version are you on? What is the size of your reserve LUN? Do you have multiple databases on the LUN giving you trouble? Did you buy the monitoring tool for the EMC?<br /><br />As a side question, how are you doing the clones? Do you have the process written down in a format you could share with people?<br /><br />MeanOldDBA<br />derrickleggett@hotmail.com<br /><br />When life gives you a lemon, fire the DBA.
  9. simondm New Member

    Thanks Derrick,

    We have two 100GB LUNS the live is RAID 1+0 the clone is RAID5 (cost saving). The version of ERM Manager is 2.2.0.2 - is this what you were after?

    In theory all our software versions should be the latest as EMC have only just installed it. We did purchase the monitoring software - however it's not up and running yet!

    You maybe gathering that this project has been somewhat rushed!

    The server has 4 databases. 2 are trival 200MB databases and share two drives (one data & one logs). The other two databases are large - one is 350GB and the other 180GB. They are on seperate disks except for one file from each (a data file) that is currently sharing a disk due to some unpredicted growth. Thinking about it this could be the cause of our problem??? Wouldn't this mean that SQL would have to freeze IO on both the large databases at once??

    If this could be the problem I'm in trouble because that file is going to be very hard to move - the db is replicated so I can't detach it. This would mean adding a file to the filegroup and forcing the data over - not good news. Adding to which it's 24/7.

    Do you think this maybe the problem?


  10. derrickleggett New Member

    Well, the reason we didn't end up using ERM was because the last version didn't support multiple databases on a LUN. It wasn't able to effectively freeze the IO on both databases effectively. They were supposed to fix this on the most recent version; HOWEVER, it's pretty much brand new functionality for them since it's never worked before. That would definitely be somethine worth checking into.

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  11. simondm New Member

    That's very useful information. Can I ask what you use in it's place? Our intention was to snapshot every 4 hours and clone every night - the clone would then be backed up to tape for our off site backup.

    On the other clusters that are working we do have LUN's that have files from different databases on. However, the IO throughput is far lower on these servers.

    One last thing. What do you make of this:

    http://support.microsoft.com/?kbid=890544

    It would seem MS acknowledge that SQL may appear to hang during a snapshot. Whilst it's obviously not by design it is interesting that they haven't flagged this as a bug - they haven't flagged it as anything in fact!
  12. derrickleggett New Member

    I don't know what to make of it. [<img src='/community/emoticons/emotion-1.gif' alt=':)' />] I just know it didn't work for us. Our databases are fairly high transaction though. Let me know what you find out. I would be very interested in seeing the results. I think you're going to have to work with MS and EMC on this though, as much as that sucks.<br /><br />MeanOldDBA<br />derrickleggett@hotmail.com<br /><br />When life gives you a lemon, fire the DBA.
  13. simondm New Member

    Just a quick post to let you know the outcome of the issues we saw on ou SAN.

    Firstly the time between the FREEZE IO and THAW IO requests has dropped from 3 minutes to between 1 and 15 seconds. This was done eventually by EMC releasing a patch, for the record the patch version is EMC Replication Manager Local Version 2.2.0.3 (Patch 3). This solved our problems with clones.

    We have found a seperate problem for snapshots. It's turns out EMC have a "unwritten rule" about snapshoting disks with more than a 20% data change. Our log drives were seeing 50%-60% data change in the two hour period between snapshots. Thus we have had to stop snapshots on certain databases which is very disappointing. Note that this does not effect clones.

    If anyone knows where I could obtain a copy of this EMC whitepaper on SQL Server & SAN's that joechang mentioned it would be greatly appreciated. Our EMC rep handed me a 2 page sales pdf that says it sings and dances....

    Anyway thanks for your help - at least the cloning issue is solved and I actually have good backups!
  14. joechang New Member

    thanks for providing the info that ultimately resolved the issue.
    SANs are notorious for being tempermental with respect to SW patches, each specific combination really needs to be verified together.

    Why is it necessary to snapshot the logs? I would have thought that a checkpoint should have been issued prior to the snapshot.

    Keep pestering your EMC reps to provide all the SQL Server specific documents, training material and performance reports. Tell them you will keep it confidential. then post it here
  15. derrickleggett New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />Firstly the time between the FREEZE IO and THAW IO requests has dropped from 3 minutes to between 1 and 15 seconds. This was done eventually by EMC releasing a patch, for the record the patch version is EMC Replication Manager Local Version 2.2.0.3 (Patch 3). This solved our problems with clones.<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br /><br />About time they get this one fixed. [<img src='/community/emoticons/emotion-1.gif' alt=':)' />]<br /><br /><blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />We have found a seperate problem for snapshots. It's turns out EMC have a "unwritten rule" about snapshoting disks with more than a 20% data change. Our log drives were seeing 50%-60% data change in the two hour period between snapshots. Thus we have had to stop snapshots on certain databases which is very disappointing. Note that this does not effect clones.<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br /><br />Depending on how you are using the clones, it can affect them greatly. If you are just maintaining a copy of the database, and using clones to periodically "catch up" the cloned image, then it can have drastic implications. Many of the places we were using clones can do a full backup/restore in about the same time it takes us to refresh our clones. That's because such a huge volume of the disk map is changing between the clone synchronizations.<br /><br /><blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />If anyone knows where I could obtain a copy of this EMC whitepaper on SQL Server & SAN's that joechang mentioned it would be greatly appreciated. Our EMC rep handed me a 2 page sales pdf that says it sings and dances....<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br /><br />I'll look for my whitepapers. I keep meaning to postlinks to them somewhere so people can access them.<br /><br />MeanOldDBA<br />derrickleggett@hotmail.com<br /><br />When life gives you a lemon, fire the DBA.

Share This Page