SQL Server Performance

Tricky Gap Fill SQL Algorithm

Discussion in 'T-SQL Performance Tuning for Developers' started by dhilditch, Jul 16, 2004.

  1. dhilditch New Member

    I have a requirement to insert a continuous stream of data into a SQL Server table. There is a data feed, with the main columns of concern here being the StartTStamp column and the Duration column. The requirement is that if there are any gaps, they should be filled with placeholders. This would be very easy with a cursor, but I don't want to go that route. I have come up with a nested query solution, but now need to add another requirement: if a gap is filled, there is a maximum duration for each gap-fill row. I.e. if the gap is 1100 and the max duration is 600, then 2 gap rows should be created, one with a duration of 600 and one with a duration of 500.

    A gap can be tested for by checking if StartTStamp + duration <> NextStartTStamp. NB there is no identity column in this table.
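    For reference, a minimal repro script for the above (the int column types are an assumption on my part; the data matches the sample further down):

        -- Minimal repro schema (column types are an assumption; the
        -- timestamps are plain integers, as in the sample data below)
        CREATE TABLE faults_datafeed (
            StartTStamp int NOT NULL,
            Duration    int NOT NULL
        )

        INSERT INTO faults_datafeed (StartTStamp, Duration) VALUES (2800, 100)
        INSERT INTO faults_datafeed (StartTStamp, Duration) VALUES (2900, 10)
        INSERT INTO faults_datafeed (StartTStamp, Duration) VALUES (4000, 100)
        -- gap: 2910 to 4000, i.e. a missing duration of 1090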

    I have come up with the following so far - it doesn't handle the max gap size though! And I think there must be a way to make it simpler.



    select gaps.StartTStamp, faults_datafeed.StartTStamp - gaps.StartTStamp duration
    from
    (
        select StartTStamp + duration as StartTStamp
        from faults_datafeed
        where StartTStamp not in
        (
            select a.StartTStamp as NextTStamp
            from faults_datafeed a
            inner join faults_datafeed b
                on a.StartTStamp + a.duration = b.StartTStamp
        )
    ) gaps
    inner join faults_datafeed
        on faults_datafeed.StartTStamp = (select min(StartTStamp)
                                          from faults_datafeed
                                          where StartTStamp > gaps.StartTStamp)
    union
    select StartTStamp, Duration
    from faults_datafeed
    Source Data:

        StrtTS  Duration
        2800    100
        2900    10
        4000    100

    Result using above:

        StrtTS  Duration
        2800    100
        2900    10
        2910    1090
        4000    100
    Is it even possible to cap that duration and have 2 rows for it? i.e. 2910:600, 3510:490

    Dave.
  2. vaxman New Member

    For the max interval solution:

        DECLARE @Maxinterval int
        SET @Maxinterval = 600

        SELECT sd + sd - sd + CASE WHEN ID <= ((nxt - sd) / @Maxinterval)
                   THEN ID * @Maxinterval
                   ELSE (nxt - sd) % @Maxinterval END StartTStamp,
               sd - sd + CASE WHEN ID <= ((nxt - sd) / @Maxinterval)
                   THEN ID * @Maxinterval
                   ELSE (nxt - sd) % @Maxinterval END Duration
        FROM
        (
            SELECT TOP 100 PERCENT StartTStamp, StartTStamp + Duration sd, Duration,
                (SELECT top 1 StartTStamp FROM faults_datafeed WHERE StartTStamp > o.StartTStamp) nxt
            FROM faults_datafeed o
            ORDER BY StartTStamp
        ) x, vnumbers
        WHERE nxt - sd > 0 AND ID > 0
            AND ID <= ((nxt - sd) / @Maxinterval) + 1

        UNION ALL
        SELECT StartTStamp, Duration FROM faults_datafeed

        ORDER BY 1

        StartTStamp Duration
        ----------- -----------
        2800        100
        2900        10
        3400        490
        3510        600
        4000        100

    It requires a table of numbers. If you don't have one you can build one:

        SELECT TOP 8000 ID = IDENTITY(int, 1, 1)
        INTO Numbers
        FROM pubs..authors t1, pubs..authors t2, pubs..authors t3

    or define a view to dynamically provide it:

        CREATE VIEW vNumbers AS
        SELECT (a0.id + a1.id + a2.id) ID FROM
        (select 0 id union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) a0,
        (select 0 id union select 10 union select 20 union select 30 union select 40 union select 50 union select 60 union select 70 union select 80 union select 90) a1,
        (select 0 id union select 100 union select 200 union select 300 union select 400 union select 500 union select 600 union select 700 union select 800 union select 900) a2

    To answer just the first part of the question, the code below will generate a single entry for each missing duration. I'm not sure it's any faster than your solution though. (Edit: I decided to test it and it's about 3 times more efficient than your original code - assuming it works ;))

        SELECT sd, nxt - sd Duration FROM
        (
            SELECT TOP 100 PERCENT StartTStamp, StartTStamp + Duration sd, Duration,
                (SELECT top 1 StartTStamp FROM faults_datafeed WHERE StartTStamp > o.StartTStamp) nxt
            FROM faults_datafeed o
            ORDER BY StartTStamp
        ) x
        WHERE nxt - sd > 0

        UNION ALL

        SELECT StartTStamp, Duration
        FROM faults_datafeed
        ORDER BY 1

    Cheers
  3. dhilditch New Member

    I don't get the same results when I try a performance test using your code (the second part) - admittedly this was with a larger data set as I'm working off a copy of the real data.

    I love your gap filler though! Can I ask, why do you do "sd + sd - sd" rather than just "sd" and likewise, why do you do "sd - sd + CASE..." instead of just CASE? Is there some performance boost gained here that I am unaware of?

    Also, TOP 100 PERCENT...? (I think I remember reading somewhere about this giving a couple of points of speed boost for some reason.)

    Thanks for the ideas!

    Dave.
  4. vaxman New Member

    The sd + sd - sd was just an accident as I was building the output a piece at a time while I was looking at it and forgot to eliminate the redundant operation.

    TOP x PERCENT is required if you want to use ORDER BY in a subselect. I'm not sure if it would have been in StartTStamp order by default. It might have been.
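    To illustrate that point (a sketch, not from the original post): SQL Server rejects an ORDER BY inside a derived table unless TOP is also specified, so the inner queries above need the TOP 100 PERCENT:

        -- Without the TOP clause, the inner ORDER BY is a syntax error:
        -- "The ORDER BY clause is invalid in views, inline functions,
        -- derived tables, and subqueries, unless TOP is also specified."
        SELECT *
        FROM (
            SELECT TOP 100 PERCENT StartTStamp, Duration
            FROM faults_datafeed
            ORDER BY StartTStamp
        ) x

    Worth noting that TOP 100 PERCENT with ORDER BY inside the derived table does not actually guarantee the order of the outer result; only an ORDER BY on the outermost query does that.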

    If you post enough data to demonstrate the problem I'll be glad to take a look at it. Does the second part produce the same results as your original code on your larger dataset?
  5. dhilditch New Member

    I've fixed a couple of bugs in the code you posted for the max duration gap filler. There was one bug where, if the gap was bigger than (or equal to) 2 times the max duration, the duration of that gap was coming out as N * max duration.

    There was also a problem with the StartTStamp coming out multiple times with the same value when I tested it against my data.

    I've also fixed the problem I mentioned before with it producing the duration for the previous timestamp on each row. It's looking really good now. On 55,000 source rows with a few gaps in there, it completes in 2 seconds on my dev box, with a cost of 228 in Query Analyzer.

    Here's the final code:


    -- code below uses the idea of joining onto a numbers table, with plain numbers on one side and the gap divided by the max duration on the other
    -- this means if there is a gap of 1200 seconds then the numbers generated for it are 1 and 2 (actually 0, 1, 2, but 0 is ignored in the where clause)
    -- this drives the creation of the additional rows required - it's then a simple matter to calculate the relevant StartTStamps for these new rows
    -- i.e. StartGapTimeStamp + (ID - 1) * maxduration, e.g. 1000 + 0 * 600, followed by 1000 + 1 * 600, etc.
    -- duration is calculated in a similar way - whenever ID is less than the max ID possible for that group of rows, use the max duration,
    -- else use the modulo remainder.
    -- NB. Common expressions that are repeated - read this list first to make the query easy to read:
    --
    -- (StartNextRealTimeStamp - StartGapTimeStamp) = Gap Duration
    -- ((StartNextRealTimeStamp - StartGapTimeStamp) / @Maxinterval) = number of full max-duration rows in this gap, e.g. a gap of 1250 = 2 rows of 600s duration

    DECLARE @Maxinterval int
    SET @Maxinterval = 600

    SELECT NodeID,
        StartGapTimeStamp + @MaxInterval * (ID - 1) StartGapTimeStamp,
        CASE WHEN ID <= ((StartNextRealTimeStamp - StartGapTimeStamp) / @Maxinterval)
             THEN @Maxinterval
             ELSE (StartNextRealTimeStamp - StartGapTimeStamp) % @Maxinterval
        END Duration_secs
    FROM
    (
        SELECT TOP 100 PERCENT NodeID, ts, ts + Duration_Secs StartGapTimeStamp, Duration_Secs,
            (SELECT top 1 ts FROM faultdata_datafeed WHERE ts > o.ts and NodeID = o.NodeID order by ts) StartNextRealTimeStamp
        FROM faultdata_datafeed o
        -- Following additional line provides speed boost depending on clustered index used - sometimes faster, sometimes slower
        --where ts + Duration_secs <> (SELECT top 1 ts FROM faultdata_datafeed WHERE ts > o.ts and NodeID = o.NodeID order by ts)
        ORDER BY ts
    ) x, vnumbers
    WHERE StartNextRealTimeStamp - StartGapTimeStamp > 0 and ID > 0
        and ID <= ((StartNextRealTimeStamp - StartGapTimeStamp) / @MaxInterval) + 1
    --order by 1

    UNION ALL

    SELECT NodeID, ts, Duration_Secs
    FROM faultdata_datafeed
    ORDER BY 1, 2

    There are a few interesting things - I had to add the ORDER BY clause in the inner query, as there is no guarantee the clustered index will remain. Also, there's no need for the first CASE statement you provided, and the second CASE statement has changed slightly.

    With the ORDER BY clause included and a clustered index on ts followed by NodeID, the ORDER BY slows the query down to 19 seconds. With the same clustered index, removing the ORDER BY produces the correct results (as the data is already ordered), but in 2 seconds.

    With a clustered index on NodeID, ts, performance is 3 seconds, though 2 with the optional line in the inner query. So it's faster with the optional line, yet the cost in the execution plan is almost twice as high! Any ideas why that might be?

    I'd rather not remove the Order By clause as the clustered index could change in future.

    Cheers for all the help - your idea works brilliantly, though I guess currently up to a maximum gap size of 10,000 * 600s = ~70 days. That's fine - there will be bigger problems if this ever becomes an issue! I'm going to try a static table rather than a view for the numbers table.
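    A static table along the lines of vaxman's earlier snippet, sized to that 10,000-row limit, might look like this (a sketch - the Numbers table name, the index, and master..syscolumns as a row source are assumptions, not tested code):

        -- One-off build of a 10,000-row numbers table (sketch)
        SELECT TOP 10000 ID = IDENTITY(int, 1, 1)
        INTO Numbers
        FROM master..syscolumns t1, master..syscolumns t2

        -- Clustered index so the join in the gap filler can seek on ID
        CREATE UNIQUE CLUSTERED INDEX IX_Numbers_ID ON Numbers (ID)

    The cross join of the two system tables is just a cheap way to generate at least 10,000 rows for the IDENTITY function to number.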

    Dave.
