comparing varchar columns | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

comparing varchar columns

I have two tables that contain the same info (for whatever reason). I need to compare particular columns to ensure the data is the same. Now for the tricky part: if the data has different syntax but the same meaning, I don't need it in my result set. Example:

Same meaning, different syntax:
'123 Anywhere St.' vs '123 Anywhere Street'

Different meaning, different syntax:
'123 Anywhere Drive' vs '123 Anywhere Street'

What would be the best way to approach this? I am currently parsing each word and comparing. Is there an easier way? Thanks,
Bob

Hi,
Read about the DIFFERENCE() and SOUNDEX() string functions in BOL; they may help you with the comparison.
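For example, a minimal sketch (DIFFERENCE() scores phonetic similarity from 0 to 4, so results on full address strings can be rough):

select difference('Street', 'St.')        -- 0 = no match .. 4 = strong match
select soundex('Street'), soundex('St.')  -- four-character phonetic codes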
Bob,
You could use the REPLACE function to replace certain words with their alternatives. For example, replace('123 Anywhere St.', 'St.', 'Street') replaces the occurrence of the string 'St.' with 'Street'. Your query might look something like this:
select *
from table1 t1
join table2 t2
  on t1.KeyCol = t2.KeyCol  -- join on whatever key relates the two rows
where replace(t1.Col1, 'St.', 'Street') <> replace(t2.Col1, 'St.', 'Street')
However, there are several caveats which will most likely make such a solution unworkable.

'St.' might actually be a valid part of the string, in which case you wouldn't want to replace it, for example 'St. Mary Street'.

It will also only work if you don't have many strings that you need to replace. Otherwise you will need multiple REPLACE calls and the whole thing will start getting really messy (see the sketch below).

Using the REPLACE function in the WHERE clause will force the optimizer into running a table scan, so if you're performing these queries on large tables you'll lose performance. It might still be quicker than parsing each word, though.

If you have more complex examples than the one you provided, the REPLACE function becomes more and more difficult to use.

I'm not sure if this is an option, but if possible you might want to try to clean up the data beforehand, perhaps by putting clean data in a separate table. Once you have your clean data you can do standard comparisons. Whether that's suitable depends on your particular environment; if the data in these tables is changing constantly, it probably isn't an option.

Other than that there isn't much I can think of. Hope it helps or at least gives you a couple of ideas.
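To give a flavour of how the nesting gets messy, handling just three abbreviations might look like this (a rough sketch; the abbreviation list is made up):

select *
from table1 t1
join table2 t2
  on t1.KeyCol = t2.KeyCol
where replace(replace(replace(t1.Col1, 'St.', 'Street'), 'Dr.', 'Drive'), 'Ave.', 'Avenue')
   <> replace(replace(replace(t2.Col1, 'St.', 'Street'), 'Dr.', 'Drive'), 'Ave.', 'Avenue')

Each new abbreviation adds another layer of nesting on both sides of the comparison.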

Karl Grambow www.sqldbcontrol.com
Thanks to both of you for the quick replies. Karl, that is my issue: the data entry is not consistent. This result set is supposed to identify when there are differences so I can address them with the user. There are two different systems that I am trying to "keep in sync". One has a SQL Server backend; the other uses files. So part of my routine is, when a file is modified, to open it, read certain datapoints within the file, and compare them to the data in the database. My business user doesn't know which data is the "master" until manual research is performed; otherwise I could update one from the other. I didn't think about your example, 'St. Mary', so that throws a whole different bone in the mix. I didn't want this to be a major effort, but due to the nature of how things are done here, it might have to be. I guess I have some "tinkering" to do. Thanks again!
One method I've used that seems to catch a lot of near-duplicate addresses is to match on just the leading characters of the address: '123 Main Street' and '123 Main St.' both start out the same way, and most of the variation in addresses is toward the end of the string. There are obviously exceptions, and it's not perfect, but it does find many. If you have a large data set, I find it useful to also constrain by zip code, e.g. left(addr1, 10) and left(zipcode, 5). There is still manual effort in cleaning up the data. Hopefully your client is eliminating one or the other data set in this process?
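Something along these lines, as a sketch (table and column names are placeholders):

select t1.addr1, t2.addr1
from table1 t1
join table2 t2
  on left(t1.addr1, 10) = left(t2.addr1, 10)
 and left(t1.zipcode, 5) = left(t2.zipcode, 5)
where t1.addr1 <> t2.addr1  -- starts the same but differs later: a candidate for review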
1. You have to standardize the domain so you can implement a matching strategy. Follow Karl's suggestion and use the REPLACE function.
2. Perform a match and store each match with a strength indicator in a table: the = operator has the highest strength, a partial match the second highest, and so on. You can then decide what criteria to use when identifying duplicates (see the sketch below). May the Almighty God bless us all!
www.empoweredinformation.com
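As a sketch of the match-strength idea above (the MatchResults table and all column names are invented for illustration):

-- higher MatchStrength = closer match; tune the tiers to taste
insert into MatchResults (KeyCol, Addr1, Addr2, MatchStrength)
select t1.KeyCol, t1.Col1, t2.Col1,
       case
           when t1.Col1 = t2.Col1 then 3                       -- exact match
           when replace(t1.Col1, 'St.', 'Street')
              = replace(t2.Col1, 'St.', 'Street') then 2       -- match after standardizing
           when left(t1.Col1, 10) = left(t2.Col1, 10) then 1   -- leading characters match
           else 0                                              -- no match
       end
from table1 t1
join table2 t2
  on t1.KeyCol = t2.KeyCol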
I would probably do that kind of data scrubbing at the client rather than the server. It seems better suited for such tasks to me. After that, you might want to follow Cmd. Skywalker's suggestion and fix your modelling bug. :)

Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Already blogged today? http://www.insidesql.de/blogs
I support PASS Deutschland e.V. http://www.sqlpass.de