SQL Server Performance

Searching for string in Text data

Discussion in 'T-SQL Performance Tuning for Developers' started by benwilson, Nov 28, 2004.

  1. benwilson New Member

    Hi All,

    I am working on a function that searches through data in field of data type text for a string, and returns some characters found after the string.

    When i hard-code the search string or use a char data type the same length as the search string, i am able to find where the search string begins fine using the following code:

    DECLARE @TransactionId AS nvarchar (30),
    @ApplicantType As nvarchar (20),
    @EnquiryDate AS datetime,
    @CreditEnquirer AS char (17)

    SET @TransactionId = 'FN000000046065'
    SET @ApplicantType = 'Primary'
    SET @EnquiryDate = (SELECT TOP 1 EnquiryDate FROM tbl_ApplicantEnquiryCreditFileCreditEnquiry WHERE TransactionId = @TransactionId)
    SET @CreditEnquirer = 'HOMELOANS LTD NSW'


    SELECT TOP 1 CHARINDEX(@CreditEnquirer, Report)
    FROM tbl_ApplicantEnquiry
    WHERE (TransactionId = @TransactionId) AND (ApplicantType = @ApplicantType) AND
    (@EnquiryDate = EnquiryDate)

    THe problem is that the length of the search string varies so @CreditEnquirer can not be a char (17)- it needs to be a varchar, or i need to set the length of the char dynamically to the length of the search string.

    I cant make it a varchar as SQL Server has an error when comparing text to varchar, and i cant seem to set the length of the char dynamically. Does anyone have any suggestions???

    Much appreciated,
    Ben
  2. benwilson New Member

    ok, i have solved my immediate problem...turns out the data in the text field always has spacing after it to make up at least 50 characters, so i can just use a char (50) for the search text...

    but, if anyone knows how to set the length of the char dynamically, i am still keen to know!
  3. vbkenya New Member

    quote:
    turns out the data in the text field always has spacing after it to make up at least 50 characters

    If it is within your control, things may be a little bit easier if you change that TEXT column to varchar(50). Text type columns are actually mean't to hold stuff that would require more than 8K characters.

    You can then have your CreditEnquirer variable as varchar(50) and compare it without pain.



    Nathan H.O.
    Moderator
    SQL-Server-Performance.com
  4. Adriaan New Member

    Perhaps this gives you a clue what you need to do - no need to change data types on a any table:
    DECLARE @Test VARCHAR(50), @Reference CHAR(100)

    SET @Reference = 'kuala lumpur'
    SET @Test = 'lu'

    SELECT CHARINDEX(@Test, @Reference)
    SELECT SUBSTRING(@Reference, CHARINDEX(@Test, @Reference) + LEN(@Test), LEN(@Reference))
  5. FrankKalis Moderator

    quote:Originally posted by Adriaan

    Perhaps this gives you a clue what you need to do - no need to change data types on a any table:
    Just be aware of this behaviour of CHARINDEX.


    create table foobar
    (
    c1 text
    )
    declare @mystring varchar(8000)
    set @mystring = replicate('0', 8000)
    declare @mystring2 varchar(8000)
    set @mystring2 = replicate('1', 8000)

    exec ('insert foobar values(''' + @mystring + @mystring2 + ''')')
    select
    charindex('0',c1)
    , charindex('1', c1)
    , patindex('%1%',c1)
    , datalength(c1)
    from
    foobar
    drop table foobar


    ----------- ----------- ----------- -----------
    1 0 8001 16000

    (1 row(s) affected)
    [:0]

    -----------------------
    --Frank
    http://www.insidesql.de
    -----------------------
  6. Adriaan New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by FrankKalis</i><br /><br /><blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by Adriaan</i><br /><br />Perhaps this gives you a clue what you need to do - no need to change data types on a any table:<hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />Just be aware of this behaviour of CHARINDEX.<br /><pre><br />create table foobar<br />(<br />c1 text<br />)<br />declare @mystring varchar(8000) <br />set @mystring = replicate('0', 8000)<br />declare @mystring2 varchar(8000)<br />set @mystring2 = replicate('1', 8000)<br /><br />exec ('insert foobar values(''' + @mystring + @mystring2 + ''')')<br />select <br />charindex('0',c1)<br />, charindex('1', c1)<br />, patindex('%1%',c1)<br />, datalength(c1) <br />from <br />foobar<br />drop table foobar<br /><br /> <br />----------- ----------- ----------- ----------- <br />1 0 8001 16000<br /><br />(1 row(s) affected)<br /></pre><br />[:0]<br /><br />-----------------------<br />--Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />-----------------------<br /><br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />Three cheers for Frank![<img src='/community/emoticons/emotion-5.gif' alt=';)' />]
  7. FrankKalis Moderator

    Actually I learnt it the hard way [<img src='/community/emoticons/emotion-6.gif' alt=':(' />]<br /><br />-----------------------<br />--Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />-----------------------<br />
  8. Adriaan New Member

    But I guess in the end the suggestion is to use PATINDEX, just in case you're working with TEXT or NTEXT fields?
  9. FrankKalis Moderator

    It actually looks better. However, after I realized that in my case I'm not really interested in finding the first occurance, but rather if the string is contained at all. So I use CONTAINS along with full-textsearches on TEXT columns.

    -----------------------
    --Frank
    http://www.insidesql.de
    -----------------------
  10. Adriaan New Member

    Yes, but Ben will need PATINDEX for his SUBSTRING call.
  11. FrankKalis Moderator

    Oops, forgot it. Sorry about that.

    -----------------------
    --Frank
    http://www.insidesql.de
    -----------------------
  12. benwilson New Member

    Thankyou all for your help!<br /><br />Firstly, the column does need to be text- it is usually more than 8000 characters...<br /><br />I appreciate the tip re. pat index...it will be a good lesson to remember, but not sure if it will work for me as i need to specify where to start the search from and PATINDEX only seems to accept 2 parameters...<br /><br />Here is some more detail on what i am doing. We are working towards getting rid of one of the databases my compay uses, so have added some new tables to our main database to store the important data from the other database...the new tables store the applicants/customers credit file. <br /><br />In order to have some historic data in these tables, i wrote some scripts to copy data from the old database to my new tables. Unfortunately, the old database does not store all the data we need in the new tables, so i now have to retrieve some of the details from a text column. here is an example of the data in the text column (unfortunately the formatting doesnt seem to be maintained here):<br /><br /> <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />DataLink PC 32 3.4 - Baycorp Advantage -------------------------------------- Individual Consumer Value Added Report -------------------------------------- File Number: 504903975 Name: TESTER, TONY First Reported: 28/11/2003 Date of Birth: 05/05/1970 Sex: MALE Drivers Licence: 345678U Consumer Address(es) First Reported -------------------- -------------- 45 TEST RD, TEST, NSW, 2222 28/11/2003 Consumer Employment ------------------- Date: 28/11/2003 Employer: TEST Summary Information ------------------- Defaults: 0 Directorships: 0 Court Judgements: 0 File Notes: 0 Prev. Directships:0 Court Writs: 0 Credit Providers: 0 Disq. Directships:0 Securities: 0 Enquiries: 0 Proprietorships: 0 Employers: 1 Auth. Agent Enqs: 13 Bankruptcy Act: 0 Addresses: 1 Cross References: 0 Bureau Score Result ------------------- Relative Risk: 4.05 times better than Baycorp Advantage average. Scorecard: Value Added Consumer Recommended Action: This application has been scored using the Value Added Consumer bureau scorecard and is compared with the Baycorp Advantage sub-population. The odds of recording an adverse with Credit Baycorp within 12-24 months of a Baycorp Advantage enquiry is 19:1. Baycorp Advantage's scores are based on samples of historical information in Credit Advantage's database. Credit Advantage uses reasonable efforts to ensure that samples are statistically valid, and that proven methods are used to develop scores. However, a score is only additional information on which to base a decision. A score is not a replacement for any other information, or for your decision making policies and procedures. Accordingly, Credit Advantage does not accept liability for any lending decision you make using a score. Consumer Credit Application ------------------------- Date: 21/09/2004 Amount: $20000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20040921670 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 21/09/2004 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20040921710 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 18/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031218260 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 15/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031215040 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 12/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031212460 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 17/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031212660 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 24/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031212550 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 02/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031212160 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 22/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031212620 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 08/12/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031212150 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 28/11/2003 Amount: $10000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031128100 0110 Association Code: PRINCIPAL'S ACCOUNT Date: 11/11/2003 Amount: $15000 Member: FINANCE COMPANY Account: LOAN CONTRACT Reference: 20031128090 0110 Association Code: PRINCIPAL'S ACCOUNT <br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br /><br />One of the new tables stores data on 'Consumer Credit Applications'. There is a field in the table called 'Role' that stores a number that corresponds to the 'Association Code' in the above text...this is one of the fields that was not in the old database, so could not be copied. As the role can not be null, when i copied the data over, i entered UNK in this field. I have now been told that we need this field too, so have to get the association code for a particular application and update the field. As you can see from the data, an applicant can have many enquiries. <br /><br />Fortunately, the formatting of the text is somewhat consistent in terms of the number of characters between each heading (eg Member: and Account<img src='/community/emoticons/emotion-1.gif' alt=':)' /> and the first 3 headings for each enquiry are always Date: , Amount:, and Member: .After much stuffing around, the strategy i decided on was to use a cursor to step through each record in the table where the role is = UNK, and build a search string containing the details for Date, amount and <br />member. I then search through the text for this applicant for this string. <br /><br />Unfortunately, it is possible the same search string will appear prior to the 'Consumer Credit Applications' so i need to start my search from that major heading. Also, there are sometimes more headings before Association Code, so i cant then just grab the text x characters after the search string. Instead, i save 400 charcters into a variable, and then search through these characters for "Association Code" and grab the text after that as the role! (sorry this is so longwinded!)<br /><br />I am now using the following code to do the above:<br /><br /><pre>DECLARE @TransactionId AS nvarchar (30),<br />@ApplicantType As nvarchar (20),<br />@EnquiryDate AS datetime,<br />@CreditEnquiryDate as datetime,<br />@CreditEnquiryDateText as nvarchar (10),<br />@EnquiryAmount as nvarchar (20),<br />@EnquiryAmountTrimmed as char (10),<br />@CreditEnquirer AS char (54),<br />@SearchString as char (151),<br />@Role as nvarchar (400),<br />@UniqueId as bigint<br /><br />DECLARE CreditEnquiryRoleCursor CURSOR FOR<br />SELECT TransactionId, ApplicantType, EnquiryDate, CreditEnquiryDate, EnquiryAmount, CreditEnquirer, UniqueId<br />FROM tbl_ApplicantEnquiryCreditFileCreditEnquiry<br />WHERE Role = 'UNK' AND EnquiryType = 'Credit Application'<br /><br />OPEN CreditEnquiryRoleCursor<br /><br />FETCH NEXT FROM CreditEnquiryRoleCursor into @TransactionId, @ApplicantType, @EnquiryDate, @CreditEnquiryDate, @EnquiryAmount, @CreditEnquirer, @UniqueId<br /><br />WHILE @@FETCH_STATUS = 0<br />BEGIN<br />--Check to see if more than one enquiry was made to the same lender on the same date for the same amount for the applicant. If there was more than<br />--1 enquiry, can not easily find the correct role in the Bureau Data, so ignore<br />IF 1 = (SELECT COUNT (ContainerName) FROM tbl_ApplicantEnquiryCreditFileCreditEnquiry<br />WHERE TransactionID = @TransactionId AND ApplicantType = @ApplicantType AND EnquiryDate = @EnquiryDate AND <br />CreditEnquirer = @CreditEnquirer AND (CreditEnquiryDate = @CreditEnquiryDate) AND (EnquiryAmount = @EnquiryAmount))<br />BEGIN<br />--Convert the enquiry date to text<br />SET @CreditEnquiryDateText = convert(nvarchar(20), @CreditEnquiryDate, 103)<br />--Trim the decimal point and anything after it from the EnquiryAmount<br />SET @EnquiryAmountTrimmed = left(@EnquiryAmount, (Len(@EnquiryAmount)-3))<br />--Build string to search for in the data<br />SET @SearchString = '%Date: ' + @CreditEnquiryDateText + ' Amount: $' + @EnquiryAmountTrimmed + char(10) + 'Member: ' + @CreditEnquirer + '%' <br /><br />--Set the role to be the 400 characters found after the begining of the search string.<br />--SET @Role = 400 characters begining at the search string. Look for search string after the begining of the words 'Consumer Credit Application'<br />SET @Role = (SELECT TOP 1 SUBSTRING(Report, ((SELECT CHARINDEX(@SearchString, Report, (SELECT TOP 1 CHARINDEX('Consumer Credit Application', Report)<br /> FROM tbl_ApplicantEnquiry<br /> WHERE (TransactionId = @TransactionId) AND (ApplicantType = @ApplicantType) AND<br /> (@EnquiryDate = EnquiryDate)))<br />FROM tbl_ApplicantEnquiry<br />WHERE (TransactionId = @TransactionId) AND (ApplicantType = @ApplicantType) AND<br />(@EnquiryDate = EnquiryDate)) + 76) , 400)<br />FROM tbl_ApplicantEnquiry<br />WHERE (TransactionId = @TransactionId) AND (ApplicantType = @ApplicantType) AND<br />(@EnquiryDate = EnquiryDate))<br /><br />--Set role to be the 1st 25 characters found after the text 'Association Code'<br />SET @Role = (SELECT SUBSTRING (@Role, ((SELECT CHARINDEX('Association Code',@Role)) + 22),25))<br /><br />--If the role is one of the allowed values listed, update tbl_APplicantEnquiryCrediFilleCreditEnquiry.Role with the corresponding number<br />IF @Role IN ('PRINCIPAL''S ACCOUNT', 'CO-BORROWER & SPOUSE', 'JOINT', 'GUARANTOR')<br />BEGIN<br /><br />UPDATE tbl_ApplicantEnquiryCreditFileCreditEnquiry<br />SET Role = CASE @Role<br />WHEN 'PRINCIPAL''S ACCOUNT' THEN '1'<br />WHEN 'CO-BORROWER & SPOUSE' THEN '2'<br />WHEN 'JOINT' THEN '3'<br />WHEN 'GUARANTOR' THEN '4'<br /> END<br />WHERE UniqueId = @UniqueId<br /><br />END<br /><br />END<br /><br />FETCH NEXT FROM CreditEnquiryRoleCursor into @TransactionId, @ApplicantType, @EnquiryDate, @CreditEnquiryDate, @EnquiryAmount, @CreditEnquirer, @UniqueId<br /><br />END<br /><br />CLOSE CreditEnquiryRoleCursor<br />DEALLOCATE CreditEnquiryRoleCursor</pre><br /><br />Does anyone have any suggestions on how to improve this- it takes a long time to run(there are over 100000 records the cursor steps through)...took about an hour in my testing environment<br /><br />Thanks again, <br /><br />Ben
  13. benwilson New Member

    ok, i have found a problem with the code...Query analyser wont look past character 8000 in the text column...any suggestions for getting around this?
    Ben
  14. Adriaan New Member

    What you're probably seeing is that query analyser doesn't print the entire length of a text column. Of course T-SQL does see past character 8000, it's just that you shouldn't really use QA as a client application for consulting data.

    Check this option in QA: Tools>Options>Results>Maximum characters per column. Any length between 30 and 8,192 characters is allowed.

    One possible issue with PATINDEX is that it returns an INT value, so you're actually limited to 2,147,483,647 characters - which not without good reason is the maximum length of a TEXT column too.

    As for your query, I suggest you look into aggregate queries, instead of looping through a cursor.
  15. benwilson New Member

    Hi Adriaan,
    I am not trying to view the data in the text field- the reason i said it wasnt looking past the 8000 character point was that my code seems to find all the association codes prior to character 8000, and none of them after...

    You mentioned I shouldn't really use QA as a client application for consulting data- can you recommend an alternative??

    Thanks
    Ben
  16. Adriaan New Member

    You're still using CHARINDEX, which as Frank pointed out is limited to 8,000 characters - so you have to use PATINDEX instead (check BOL for the full details).

    Can't bring myself to fully analyze your query right now, but from what you're saying there can be more than one association code in the text column, right?

    Overall, I get the impression that you're 'unmerging' kind of like 'mail merge' documents. Why isn't the normalized data available?
  17. benwilson New Member

    Ahhhhh....I didnt quite understand the point regarding charindex until the last post from Adriaan...Yes, there are multiple Association codes in each column (hence i cant just search for an occurance of Association Code.

    I guess i am kind of un-merging mail merge type documents...the data comes into our system as a long string which is then interpretted to give the type of text i included in an earlier post. I have done some checking and the string that comes in to our system is available (stored in another database), and due to the format of the data, i will be able to build a different search string using patindex that should work!

    Fingers crossed!

Share This Page