Failover explanation | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Failover explanation

I have a production database with 500 GB of usable information which needs to be online 24x7x365. We never had problems before; I do a full backup every 2 days and transaction log backups every 4 hours to a dedicated backup server (2 GB SATA drives) over a gigabit network.

Yesterday one of the most important databases went SUSPECT. For some reason various index references were lost, and the log could not be used to roll back the transactions. I tried to reopen the database with no luck. I backed up the tail of the log, then went to my backups: FULL restore first, then applied the archived transaction logs up to 2 minutes before the crash (point-in-time recovery). After I applied the last tail of the log using point-in-time recovery, the database went into a LOADING/SUSPECT state.

My question on failover: what would have happened to the second node if this corruption had occurred? Does it mean both my nodes would be down and I would need to recover?

My question on log shipping: what would have happened to the standby node? Would it become corrupt due to the last log applied?

What should I implement in order to avoid the tremendous downtime I had yesterday? It seems that, in theory, neither clustering nor a standby implementation would have saved me. Feedback please.

PS. I will explain my workaround in the DBA General Forum
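For reference, the restore sequence described above looks roughly like this in T-SQL. This is only a sketch: the database name, backup paths and STOPAT time below are placeholders, not the actual ones used.

-- 1. Back up the tail of the log (NO_TRUNCATE works even when the data files are damaged)
BACKUP LOG ProdDB
    TO DISK = '\\backupserver\sql\ProdDB_tail.trn'
    WITH NO_TRUNCATE

-- 2. Restore the last full backup, leaving the database in a restoring state
RESTORE DATABASE ProdDB
    FROM DISK = '\\backupserver\sql\ProdDB_full.bak'
    WITH NORECOVERY, REPLACE

-- 3. Apply each archived log backup in sequence (repeat for every 4-hourly backup)
RESTORE LOG ProdDB
    FROM DISK = '\\backupserver\sql\ProdDB_log_1.trn'
    WITH NORECOVERY

-- 4. Apply the tail of the log up to a point just before the crash, then recover
RESTORE LOG ProdDB
    FROM DISK = '\\backupserver\sql\ProdDB_tail.trn'
    WITH STOPAT = '2006-01-15 10:28:00', RECOVERY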
quote:
My question on failover: what would have happened to the second node if this corruption had occurred? Does it mean both my nodes would be down and I would need to recover?
Nothing would happen, since only one node at a time is the owner of a specific SQL instance and its shared disks in the cluster. The other node is passive when it comes to this instance. A failover would not be triggered by data corruption issues unless the corruption also caused the SQL service to fail. Or, if the corruption was caused by hardware or driver issues with the storage or memory etc., that underlying issue could trigger a failover to the other server. But data corruption in a database does not, by itself, trigger a failover.
quote:
My question on log shipping: what would have happened to the standby node? Would it become corrupt due to the last log applied?
It depends on what caused the corruption. For example, the times I've had corruption issues it has often been related to hardware problems with the storage. When the full and transaction log backups for these databases were restored on a "healthy" system without storage hardware issues, the transactions in the log backups were re-applied on that system and the databases were fine again. So in a sense log shipping to an external system could have helped you here, but it depends on what caused the corruption in the first place. If it was due to a bug in SQL Server with, for example, reindexing, the same thing could happen on the other system as well. In any case, if things like dropping/recreating indexes, DBCC repair without data loss, transferring the data to new tables etc. do not work on the original server, I would always try to restore the database on another system and fix it there.
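As a rough illustration of that order of attack in T-SQL (the database and table names are placeholders):

-- Check the extent of the corruption first
DBCC CHECKDB ('ProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS

-- If the errors are confined to nonclustered indexes, rebuilding them avoids a repair entirely:
-- DBCC DBREINDEX ('dbo.SomeTable')

-- REPAIR_REBUILD repairs what it can without losing data; it needs single-user mode first:
-- ALTER DATABASE ProdDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE
-- DBCC CHECKDB ('ProdDB', REPAIR_REBUILD)
-- ALTER DATABASE ProdDB SET MULTI_USER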
quote:
Nothing would happen, since only one node at a time is the owner of a specific SQL instance and its shared disks in the cluster. The other node is passive when it comes to this instance. A failover would not be triggered by data corruption issues unless the corruption also caused the SQL service to fail. Or, if the corruption was caused by hardware or driver issues with the storage or memory etc., that underlying issue could trigger a failover to the other server. But data corruption in a database does not, by itself, trigger a failover.

If nothing would happen, does that mean the DB would stay SUSPECT and never fail over? Since the cluster shares the same I/O subsystem, I assume the corruption would be reflected on all nodes. Am I correct?
quote:
It depends on what caused the corruption. For example, the times I've had corruption issues it has often been related to hardware problems with the storage. When the full and transaction log backups for these databases were restored on a "healthy" system without storage hardware issues, the transactions in the log backups were re-applied on that system and the databases were fine again. So in a sense log shipping to an external system could have helped you here, but it depends on what caused the corruption in the first place. If it was due to a bug in SQL Server with, for example, reindexing, the same thing could happen on the other system as well. In any case, if things like dropping/recreating indexes, DBCC repair without data loss, transferring the data to new tables etc. do not work on the original server, I would always try to restore the database on another system and fix it there.

The corruption was caused by an internal SQL Server problem (possibly a bug): index entries were lost and could not be found. Why do I assume so? Because there were no I/O errors in the log and the SCSI controller reported healthy disks. I run production on a RAID 0+1 I/O subsystem; in case of I/O errors, the good disk would have returned the correct information, as it has done before. Remember, SQL Server generated a corrupt transaction log backup when I backed up the tail of the log. That information would be required for a standby database anyway in order to proceed, but applying this tail put the DB in LOADING/SUSPECT mode instead of just LOADING or just OPEN. That is why I assume I would have been in trouble regardless of the solution I had. Do you see my point?
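A quick way to confirm which state the database actually ended up in (the database name below is a placeholder):

-- The status bits in sysdatabases are what map to states like SUSPECT and LOADING
SELECT name,
       status,
       DATABASEPROPERTYEX(name, 'Status') AS state
FROM   master.dbo.sysdatabases
WHERE  name = 'ProdDB'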
quote:Originally posted by ozamora
If nothing would happen, does that mean the DB would stay SUSPECT and never fail over? Since the cluster shares the same I/O subsystem, I assume the corruption would be reflected on all nodes. Am I correct?
It would stay suspect and not fail over, unless the error causing the corruption also triggered a failover (like hardware issues). If you fail over manually to another node, it would be the same scenario as if you just stopped and started the SQL service: the DB would still be suspect.
quote:
The corruption was caused by an internal SQL Server problem (possibly a bug): index entries were lost and could not be found. Why do I assume so? Because there were no I/O errors in the log and the SCSI controller reported healthy disks. I run production on a RAID 0+1 I/O subsystem; in case of I/O errors, the good disk would have returned the correct information, as it has done before. Remember, SQL Server generated a corrupt transaction log backup when I backed up the tail of the log. That information would be required for a standby database anyway in order to proceed, but applying this tail put the DB in LOADING/SUSPECT mode instead of just LOADING or just OPEN. That is why I assume I would have been in trouble regardless of the solution I had. Do you see my point?
If you already know this was the case, then no scenario would have helped you. I would still have tested it on another server to rule out buggy storage drivers that could cause issues without reporting an error.
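A rough sketch of that kind of test restore on a second, known-good server; the database name, logical file names and paths below are all placeholders:

-- Restore the same full backup under a new name on the other server
RESTORE DATABASE ProdDB_copy
    FROM DISK = '\\backupserver\sql\ProdDB_full.bak'
    WITH MOVE 'ProdDB_Data' TO 'D:\MSSQL\Data\ProdDB_copy.mdf',
         MOVE 'ProdDB_Log' TO 'E:\MSSQL\Log\ProdDB_copy.ldf',
         NORECOVERY

-- Apply the log backups and the tail, then recover
RESTORE LOG ProdDB_copy
    FROM DISK = '\\backupserver\sql\ProdDB_tail.trn'
    WITH RECOVERY

-- If the restore succeeds and CHECKDB comes back clean here, the corruption most
-- likely originated on the original server's storage or drivers, not in the backups
DBCC CHECKDB ('ProdDB_copy') WITH NO_INFOMSGS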