SQL Server Performance Forum – Threads Archive

SQL Server compiles wrong execution plan with udf

hi
I have a problem with a sql including a udf
the sql syntax goes like this
select top 100 document.*
from document inner join folder on document.folderid = folder.system_id
where folder.parent in (1,2,3,4,5,6) and 1= udf_check(document.system_id,folder.system_id) here the udf returns 1 if the row should be returned
The table consists of 500000 rows, and sqlserver now calculates the udf on all rows before applying the where in.
Is there any way i can hint to always calculate the udf last??
(the query returns 3 rows)

Try:
select top 100 document.*
from folder
join document on document.folderid = folder.system_id and 1= udf_check(document.system_id,folder.system_id)
where folder.parent in (1,2,3,4,5,6)
option (force order)

If you filter on the result of a UDF that takes values from a row as a parameter, then ALL rows WILL be processed. Another big mistake is to assume that the TOP 100 clause stops the query after 100 rows – it stops returning data after row #100, after having processed all rows – at least in this case, because of the UDF you’ve included. Now how can you reduce the number of calls to the UDF? That depends on what the UDF actually does with the parameters. If it’s an evaluation that involves external factors that cannot be made available within the base query, then indeed you can only use a UDF. But if it’s a simple evaluation, especially if it is "sargeable", then by all means do the evaluation in the main query! If the number of distinct combinations of the parameters that you feed to the UDF is considerably smaller than the unfiltered number of rows returned by the main query, then you can add an INNER JOIN on a derived table that does a SELECT on the UDF for the values returned by an embedded derived table that returns the distinct combinations of the parameter values, like so: SELECT T1.a, T2.b
FROM T1
INNER JOIN T2 ON T1.id = T2.id
INNER JOIN
(SELECT Embedded.a, Embedded.b, MyUDF(Embedded.a, Embedded.b) AS MyResult
FROM (SELECT DISTINCT T1.a, T2.b FROM T
INNER JOIN T2 ON T1.id = T2.id) AS Embedded) AS UDFResults
ON UDFResults.a = T1.a AND UDFResults.b = T2.b
WHERE UDFResults.MyResult = 1

quote:If you filter on the result of a UDF that takes values from a row as a parameter, then ALL rows WILL be processed. Another big mistake is to assume that the TOP 100 clause stops the query after 100 rows – it stops returning data after row #100, after having processed all rows – at least in this case, because of the UDF you’ve included.

Adriaan, I don’t think that’s correct. I think that would be true if order by clause was involved, otherwise I expect rows processing to stop after 100th row is found.

quote:But if it’s a simple evaluation, especially if it is "sargeable", then by all means do the evaluation in the main query!

This is recommendation I totally agree. If author posts udf code, we may give you recommendation how to avoid it (if possible).

Okay, I’ve spoken a little too soon. I did some tests indicate that the TOP clause can reduce the number of calls to a UDF in the WHERE clause, but only if the WHERE clause evaluates to true for a enough rows as you go along. If you are filtering for isolated exceptions, there is no point in using the TOP clause. I created a temporary table #T with about 1 million rows, and a UDF fnMyTest that simply returns the identity value that was passed on as the parameter. I ran sets of three queries like this:
set statistics io on select *
from #t
where dbo.fnMyTest(#t.MyIdent) = #t.MyIdent select top 10 *
from #t
where dbo.fnMyTest(#t.MyIdent) = #t.MyIdent select top 10 dbo.fnMyTest(#t.MyIdent)
from #t set statistics io off
With the = in the WHERE statement, obviously the TOP clause minimized the logical reads. But with <> in the WHERE clause, the TOP clause made no difference whatsoever. If you would change the UDF to return a different value, the TOP clause might or might not help – it would depend on the probability of a positive evaluation of the WHERE clause.

To correct myself, ordery by cluse would probably make difference if no appropriate index is used.

quote:Okay, I’ve spoken a little too soon. I did some tests indicate that the TOP clause can reduce the number of calls to a UDF in the WHERE clause, but only if the WHERE clause evaluates to true for a enough rows as you go along.

That’s a good point.

thx guys, im afraid there is no way to avoid the udf, the force order did work and putting the udf nested did also work, but…
unfortunatly i have a similar search select top 100 document.*
from document inner join folder on document.folderid = folder.system_id
where folder.parent in (1,2,3,4,5,6,9,10) and 1= udf_check(document.system_id,folder.system_id)
order by document.createddate This search gives approx 10000 hits but only 100 is returned (the caller recieving this needs 100, so i cannot change it)
Now the udf is executed for all hits no matter what, If i remove the order by clause so they are returned as stored, it executes fine
So i was thinking about making a clustered view with createdate as key
Or do you guys have other ideas?

You can move the ORDER BY into a derived table that orders the document table by createddate, which untangles the ORDER BY and TOP clauses: select top 100 document.* from (SELECT document.* FROM document ORDER BY document.createddate) AS document inner join folder on document.folderid = folder.system_id where folder.parent in (1,2,3,4,5,6,9,10) and 1= udf_check(document.system_id,folder.system_id)) This does not guarantee that the results are ordered by createddate, so you need to add this around the above query: SELECT TMP.* FROM (<above query&gt<img src=’/community/emoticons/emotion-5.gif’ alt=’

‘ /> AS TMP ORDER BY TMP.createddate

Try: <pre id="code">select t.* from ( select top 100 document.* from folder join document on document.folderid = folder.system_id and 1= udf_check(document.system_id,folder.system_id) where folder.parent in (1,2,3,4,5,6,9,10) option (force order) ) as t order by t.createddate</pre id="code"> [Edited]Oops, Adriaan you were faster. [Edited again] but it actually doesn’t solve the problem [<img src=’/community/emoticons/emotion-6.gif’ alt=’

‘ />]

The best solution I’ve found so far is:
select top 100 <column list without f_system_id>
from (
select document.*, folder.system_id as f_system_id
from folder
join document on document.folderid = folder.system_id
where folder.parent in (1,2,3,4,5,6,9,10)
) as t
where 1= udf_check(t.system_id,t.f_system_id)
order by t.createddate

Mirko, Not sure why you say it doesn’t solve the problem? The question was to get the TOP 100 from Document ordered by CreatedDate, for the given criteria. We start by ordering the Document table by CreatedDate, then JOIN that with the Folder table, we add the criteria, and finally SELECT TOP 100 from that – which will stop after 100 rows in order of DateCreated have matched the criteria.

Adriaan, there are a few problems with your solution and first one I offered: 1. Your solution will scan all rows in document table, fortunatelly it will not test udf against them.
2. It will test udf against all 1000 hits based on in condition.
3. It will order by date just 100 rows returned, not first 100 matching rows ordered by createdDate.

Mirko, #1 – Good idea: move the criteria for the parent column to the derived table. #2 – Incorrect, it will test at least 100 rows in the outer query, until the point where 100 rows have matched the criteria. #3 – Incorrect, it is processing the rows based on the order of the derived table, which is ordered on the createddate column.

quote:#2 – Incorrect, it will test at least 100 rows in the outer query, until the point where 100 rows have matched the criteria.

You are right, I missread the query, you have actually top clause in inner query (but not the most inner).

quote:#3 – Incorrect, it is processing the rows based on the order of the derived table, which is ordered on the createddate column.

I see your intention, but can you actually have order by in derived table definition? If I remember well you need to put top 100 percent clause to avoid syntax error. However I am not sure about that. Anyway, order is still not guaranteed, it depends on join algorithm too. I would fix it by forcing order, forcing index and loop join. Again, sorry for missreading your query, I replaced the whole query inside the code with TMP … in my mind, but made an mistake doing that.

Yes, you may well be right about the order after the JOIN – and then all we’re doing here is to show the difficulties that a UDF can introduce in a basically simple query. By the way, I was intrigued by your use of option (force order) – and from what I understand from BOL is that this is with regards to the order of the JOINs being applied, not of the rows. Nice try, but no cigar.[<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />]

quote:By the way, I was intrigued by your use of option (force order) – and from what I understand from BOL is that this is with regards to the order of the JOINs being applied, not of the rows. Nice try, but no cigar.

I used force order to make condition on folder table evaluated first and nested loop join done next. That way condition with udf will be applied only on rows that match condition in where clause.

Hm, doesn’t my solution of moving the criteria for the parent column to the derived table do the same?

Using derived table does force join order if this is the question. However your query forces access to document table first and reads all rows there.

Yes, but that is unavoidable, since that is where we need to do the sorting, which needs to be done before the criteria. If the ordering was not an issue, then this wouldn’t be a problem at all. Which reminds me … Pelle: does the CreatedDate column have an index?

My last solution is to first find all rows where in (…) is true. Execution plan is going to be: find all folder rows from the list and then join them with matching document rows. After that rows will be ordered and then udf criteria will be applied until 100th row is found. If "in" list is selective enough much less rows will be processed. Index on CreatedDate will not help my query performance at all. It will help your query performance, only if it is clustred.

hi
thx for all the answers
The createddate does not have an index but is part of an index
I agree it only seems to help if the createddate is clustered otherwise the udf is executed with all 10000 hits
I will try to fiddle with the indexes to see if i can avoid a clustered index

]]>