SQL Server Performance Forum – Threads Archive
Processing/Parsing Large Amounts of DataWe currently have 8 million rows that need to be processed. This used to be run in FoxPro
and didn’t take too long. FoxPro actually did 6 passes using different columns/lengths to represent "similar". I’m running a sample of 450K rows with the solution I came up with and this has taken 20+ hours so far which is obviously WAY too slow. We have some addresses that are missing Zip4s and some that have Zip4s. We cannot use commercial software to get these Zip4s so we want to compare unknown rows to see if they are similar to known rows. For the purposes of simplicity assume that we use the full zipcode and the first 3 characters of the address to compare. For example: 123 MainlyDoesntExist Avenue, 12345, <blank>
123 Main Avenue, 12345, 1234 Would see that both rows have "123" as Address(1,3) and 12345 as Zipcode. So we want to assign 1234 to the <blank> zip4. This is what I came up with: ———– index the table based on Address, Zipcode, Zip4 create a cursor with Address, Zipcode, Zip4 and Order by Address, Zipcode, Zip4
— the ordering will put rows with matching columns next to each other. blank zipcodes
— will be encountered before rows with zipcodes fetch from cursor while fetch if tempaddress info = new address info
if the new address info has a zip4
insert into tablevariableupdate zip4 and rowIDs from tablevariablesame
— since I might have two+ rows that are missing zip4s, I store the information in a
— tablevariable so all can be updated
insert into tablevariablesame rowIDs
— since we’re ordered by the "comparison" if we reach a new address, we don’t have a
— row with a zip4 so we can "start over"
delete from tablevariablesame
store new address info as tempaddress info end cursor update main table information joining with tablevariableupdate clean up cursor ———— 1) Can SQL even do this sort of processing in a decent amount of time?
2) Are there any suggestions for changing the high-level architecture that I have? (One big one is probably to nuke the cursor).
3) What types of performance tuning things should I be looking for to "tweak" this? (I’m assuming I can do some index tweaking, maybe some other things to gain small incremental performance improvements).
Any help or suggestions on how to proceed would be greatly appreciated. Thanks for reading my post! /Rob
Write your process using set logic instead of cursors and it should zip (pardon the pun) right through that <img src=’/community/emoticons/emotion-1.gif’ alt=’‘ /><br /><br />I suggest maybe think about it this way:<br /><br />Don’t worry about the order of the rows or the order of the process or what is "next" to what …<br /><br />Instead, create a single SELECT that will find the pairs of addresses you need:<br /><br />select cols<br />from thetable t1 <br />inner join thetable t2 on ( matching citeria to make pairs of addresses )<br /><br />Once that works, change it into an Update that uses that same logic to set the missing ZIP.
Once I realized I was thinking like a programmer instead of in sets I made the change and things are much faster. Thanks for the knock on the head!
Any time <img src=’/community/emoticons/emotion-1.gif’ alt=’‘ />