Compare SQL Server Databases with sp_CompareDB

  • If you work with SQL Server 2000, you might have a problem with collations when comparing data. “Collation” is a concept used to identify code page for Unicode and non-Unicode data, and sorting and comparing rules for both Unicode and non-Unicode data types. Working with SQL Server 2000, you can hit an unexpected and surprising restriction, which you never had working with SQL Server 7.0.

    Suppose your company manufactures software that uses SQL Server 7.0 as its database. You supply your clients with databases containing English-coded data, French-coded data, and German-coded data. Everything perfectly fits into code page 1252, dictionary sort order, case insensitive, accent sensitive. All clients must have SQL Servers with the same default settings: 1252-CI-AS.

    Then you start to work with SQL Server 2000. Now your clients can have installed on their servers: Windows collation Latin1_General_CI_AS (for German and English locales), or Windows collation French_CI_AS (for French Windows locale), or SQL Server collation SQL_Latin1_General_CP1_CI_AS. This means that the databases created in SQL Server 7.0 to be installed both on SQL Server 7.0 and SQL Server 2000 (with conversion while being attached) now won’t work on servers with Latin1_General_CI_AS or French_CI_AS. You will receive this error message:

    Server: Msg 446, Level 16, State 9, Line 1
    Cannot resolve collation conflict for equal to operation.

    You either have to support all collations or force your clients to reinstall SQL Server with the collation you chose. It is not quite clear why French data in French_CI_AS collation and Latin1_General_CI_AS collation are incompatible. All collation properties are the same:

    A. Code page for non-Unicode data: 
    French_CI_AS has 1252
    Latin1_General_CI_AS has 1252
    i.e., the bit patterns for all non-Unicode symbols are absolutely the same.

    B. Sort order for non-Unicode data: the same by collation definition. The order of symbols is exactly the same. If you have any doubt, run an unsophisticated script getting data from a table with two columns, one column in Latin1_General_CI_AS and the second column in French_CI_AS. Use ORDER BY Latin1_General column in one query, French in the second query, and compare code orders. There are no differences.

    C. Code page for Unicode data. According to the SQL Server Books Online, “All Unicode data uses the same Unicode code page. Collations do not control the code page used for Unicode columns, only attributes such as comparison rules and case sensitivity. “

    D. What about comparison rules and case sensitivity? Run:

    SELECT COLLATIONPROPERTY (<CollationName>,’ComparisonStyle’)

    The result is: 196609 for both. It means case insensitive, accent sensitive, kana insensitive, width insensitive.

    Even more curious is why the data in Latin1_General_CI_AS and SQL_Latin1_General_CP1_CI_AS are incompatible. In this case even the collation designator (Latin1_General) is the same for both. Execute:

    SELECT * FROM ::fn_helpcollations () WHERE name = ‘Latin1_General_CI_AS’ OR name = ‘SQL_Latin1_General_CP1_CI_AS’

    to see it.

    However, if you run:

    CREATE TABLE ChkCollate (
    SQL_Latin1_General_Column char(10) COLLATE SQL_Latin1_General_CP1_CI_AS,
    Latin1_General_Column char(10) COLLATE Latin1_General_CI_AS)

    SELECT * FROM ChkCollate WHERE SQL_Latin1_General_Column = Latin1_General_Column

    you’ll see the familiar:

    Server: Msg 446, Level 16, State 9, Line 1
    Cannot resolve collation conflict for equal to operation.

    It’s bizarre that you can enforce collation compatibility working with two databases on different servers if you have run the following for the linked server:

    USE master
    EXEC sp_serveroption ‘LinkedServerName’, ‘collation compatible’, ‘true’

    while you cannot do it working with two databases on the same server.

    Conclusion: The databases you are comparing must have the same collations (the same collation designator is not enough).

  • Performance is obviously a main concern when you are writing a complicated script. I guessed that my  SP wouldn’t be particularly fast, especially if there are tables without unique keys in the database, but I wanted to get the execution time as far from frustrating as possible. Due to performance considerations, I chose not to use this construction checking for nulls:

    <db1.ColumnName> IS NULL AND <db2.ColumnName> IS NULL OR <db1.ColumnName> IS NOT NULL AND <db2.ColumnName> IS NOT NULL AND <db1.ColumnName>=<db2.ColumnName>

    and used instead:

    ISNULL(<db1.ColumnName>,’!#null$’) = ISNULL(<db2.ColumnName>,’!#null$’)

    I did this because the SQL Server optimizer uses a NESTED LOOP JOIN in the first case and a HASH JOIN in the second one. The performance ratio is impressive. For some tables, a NESTED LOOP is 200-500 times slower. You cannot enforce HASH JOIN in a query with a predicate like that in the first case, as SQL Server can use hashing only for equijoins.

    The second construction is the only solution from a performance point of view. However, this construction looks perfectly short and innocent only for string-like data types. We cannot use this construction for other data types. For example, if we use it for the tinyint data type:

    ISNULL(<db1.TinyintColumnName>,0)= ISNULL(<db2.TinyintColumnName>,0) 

    then if there is 0 in one database and a NULL in the another one, the result will be “No difference”, which is absolutely not true. What else can you use if not 0? Tinyint has 256 possible values and any of them could be in the database. The bit data type is even more restrictive, with only two possible values. We must find another solution. The solution for this particular case is to use the CASE statement. To achieve good performance and to get a correct result, we must process almost every data type specifically. What alternatives we have to compare all possible data types and not to forget about null values? We can:

    A. Compare the data “as is” if nulls are not possible for the data type. It can be applied only to timestamp data type.

    B. Use ISNULL with a replacement of the same data type, but using a value that practically will never appear in the data. It’s for char, varchar, nchar, nvarchar data types. The replacement value is ‘!#null$’. Example of the predicate: 

    ISNULL(d1.<CharCol>,’!#null$’)=ISNULL(d2.<CharCol>,’!#null$’)

    Continues…
  • Leave a comment

    Your email address will not be published.