Group By change in grouping column | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Group By change in grouping column

We have a situation where there are a bundle of categories for different aspects of a row. There are 3 category columns to group on. The data also has a timestamp on each row to indicate which time period the row reflects. We run reports against this data and group on the 3 category columns, but rather than bundle all of Category X into one big group, it is also broken down by the period of time it remained as Category X. So the 3 grouping columns indicate the ‘state’ of the row, and when grouping I need to collapse all the consecutive rows sharing a state into one row in the report. e.g. (with just one state/category column):
timestamp  category
10         0
20         0
30         1
40         0
50         1
When grouping, I need to group timestamps 10 and 20 into one group, 30 into another group and 40 into another group – as opposed to having 10, 20 and 40 all in one group. The way we’re doing this just now is to have a kind of sequence number (actually the timestamp is used) as an additional column which remains the same until the category changes. So the above is actually:
timestamp  category  categorystartts
10         0         10
20         0         10
30         1         30
40         0         40
50         1         50
So we group on categorystartts instead – it keeps the groups separate based on the state changing. The problem is, this additional column isn’t really relevant to the data, and it is becoming a pain to maintain since the data can be edited: e.g. if someone edits the category at timestamp 30 (say, back to 0), the categorystartts for timestamps 30 and 40 must be updated to 10. Another way I’ve come up with of doing this (in a dev environment anyway) is to join the table to itself at report time, offset by 1 timestamp, calculate the beginning of each state period on the fly and group on that. Problem is, it’s pretty slow. I’ve searched everywhere on Google, in this forum and in some other places, and racked my brains for another neat solution, but I can’t come up with anything as quick as storing the grouping id in the table – and yet that’s a pain because of the edits. Feasibly, a single edit could result in 1000s of rows being updated to merge the groups. I feel there must be a standard approach to this problem, and I get the feeling the two I’ve described above are it, but maybe not… anyone know of a good way of doing this? Dave.
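For reference, the on-the-fly version looks roughly like this – a sketch against the simplified single-category example above; the events table and column names are just illustrative, and it’s untested:

-- For each row, find the most recent earlier row whose category differs;
-- rows in the same run share this boundary value (NULL for the first run),
-- so the report can group on (category, run_boundary)
SELECT e.timestamp, e.category,
    (SELECT MAX(prev.timestamp)
     FROM events prev
     WHERE prev.timestamp < e.timestamp
       and prev.category <> e.category) AS run_boundary
FROM events e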

Not sure I completely understand the function of category and how it relates to your expected output. If a 1 indicates the start of an event, I would expect 40 to belong to the categorystarts group 30. If I am right, the following query should very quickly provide the correct value for categorystarts if there are indexes on timestamp and category.
select timestamp, category,
    (select top 1 timestamp from events
     where (category = 1 or
            timestamp = (select min(timestamp) from events))
       and timestamp <= o.timestamp
     order by timestamp desc) categorystarts
from events o
I would also expect the first entry in the table to have a category of 1 to indicate the start of all events. If that is true the query would be even simpler:
select timestamp, category,
    (select top 1 timestamp from events
     where category = 1
       and timestamp <= o.timestamp
     order by timestamp desc) categorystarts
from events o

No, the categories can be any number. A category of 0 doesn’t mean anything special in the sense of there being no event. There is always some state the system is in, so 0 actually means ‘Normal Operation’, whereas 1 could be ‘stopped due to x’, 2 = ‘stopped due to y’, 3 = ‘running half speed due to z’, etc. The report needs to group on these states, but not to group everything together in one lump: e.g. if the system runs normally but at two different points in time its state changes to ‘running half speed due to z’, then I want to see those two states separately in my report. GROUP BY in its usual operation will group all of those states together, which is why we came up with a sequence number to help group the groups properly. As stated above, however, this is now causing additional headaches, so I would like to find out if there is a better way of doing this that I’m overlooking. Dave.
Would it not be possible to generate the grouping sequence dynamically in your select, rather than having it stored in the table and having to maintain it? A bit like this…
Note that this code doesn’t relate to your problem or schema; it’s just to demonstrate how to group by a dynamic field. Also, you could put the CASE into a user-defined function to make the SELECT more readable if you wished.
Code:
select
    MIN(DE_ID), MAX(DE_ID), COUNT(DE_ID),
    CASE
        WHEN DE_ID BETWEEN 0 AND 5000 THEN 1
        WHEN DE_ID BETWEEN 5001 AND 9000 THEN 2
        WHEN DE_ID BETWEEN 9001 AND 15000 THEN 3
        ELSE 4
    END
FROM
    Der
GROUP BY
    CASE
        WHEN DE_ID BETWEEN 0 AND 5000 THEN 1
        WHEN DE_ID BETWEEN 5001 AND 9000 THEN 2
        WHEN DE_ID BETWEEN 9001 AND 15000 THEN 3
        ELSE 4
    END
 

I don’t think I understand. The problem I have is that I have a time series, so each row in the database has a timestamp value, and these run consecutively. Based on another column, I want to group on the ‘runs’ of data: if the column I am concerned with remains the same for a period of time, group all those rows into one row, then group the next run into one row, then the next run into another row. Standard GROUP BY doesn’t work, because if two separate runs have the same ‘state’ it will group them together. Imagine the analogy of a football team, which can win, lose or draw. They could have a winning streak of 10 games followed by a losing run of 3 games followed by a winning streak of 5 games. I want my report to show 3 separate groups in this case, not group all the wins into one group. The final problem is that these ‘state’ columns can hold pretty much any value, and there are 3 of them – a new state if any of them change. Dave.
This is what I came up with – I can’t figure out anything to make it faster, but there must be a common solution for grouping runs! The solution below is based on grabbing the timestamp of the most recent earlier row whose category/state differs from the current row; every row in a run gets the same value, so it works as a grouping key.

SELECT NodeID, ts, Duration_Secs, AvailCatID,
    (SELECT top 1 ts FROM faultdata_datafeed
     WHERE ts < currentfault.ts
       and NodeID = currentfault.NodeID
       and (AvailCatID <> currentfault.AvailCatID
            or EventDefnID <> currentfault.EventDefnID
            or DowntimeID <> currentfault.DowntimeID)
     order by ts desc) FaultGroup
FROM faultdata_datafeed currentfault

Anybody got anything that’s better than what’s above? Cheers,
Dave.
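If a newer version of SQL Server is an option, the ‘gaps and islands’ pattern with ROW_NUMBER (SQL Server 2005 and later) expresses this kind of run grouping directly – a sketch against the simplified single-category table from earlier, untested against the real schema:

SELECT category,
    MIN(timestamp) AS run_start,
    MAX(timestamp) AS run_end,
    COUNT(*) AS rows_in_run
FROM (
    -- Rows in the same consecutive run share the same difference between
    -- their overall position and their position within their category
    SELECT timestamp, category,
        ROW_NUMBER() OVER (ORDER BY timestamp)
      - ROW_NUMBER() OVER (PARTITION BY category ORDER BY timestamp) AS grp
    FROM events
) runs
GROUP BY category, grp
ORDER BY run_start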
Depending on how many rows you have, this has the potential of being much faster (it will probably do 20,000 rows/second); a correlated subquery will be extremely slow.

create table faultdata_datafeed (nodeid int identity(1,1), timestamp int, category int, FaultGroup int Null)

insert faultdata_datafeed (timestamp, category) select 10, 0
insert faultdata_datafeed (timestamp, category) select 20, 0
insert faultdata_datafeed (timestamp, category) select 30, 1
insert faultdata_datafeed (timestamp, category) select 40, 0
insert faultdata_datafeed (timestamp, category) select 50, 1

Declare @RunGroup int, @Category int

-- @RunGroup carries the first timestamp of the current run forward;
-- it resets to the row's own timestamp whenever category changes
Update faultdata_datafeed
Set @RunGroup = FaultGroup =
        Case When category <> Coalesce(@Category,-1)
             Then timestamp
             Else Coalesce(@RunGroup,timestamp)
        End,
    @Category = category

SELECT timestamp, category, FaultGroup
FROM faultdata_datafeed

timestamp   category    FaultGroup
----------- ----------- -----------
10          0           10
20          0           10
30          1           30
40          0           40
50          1           50
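The report can then group on the computed column, one row per run – for example (a sketch; the aggregates are just illustrative):

SELECT category,
    FaultGroup AS run_start,
    MAX(timestamp) AS run_end,
    COUNT(*) AS samples
FROM faultdata_datafeed
GROUP BY category, FaultGroup
ORDER BY run_start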

Thanks vaxman – that’s twice you’ve helped me out! I’ll have to modify this slightly, since the NodeID I’m using refers to one of a fixed number of nodes per project that I’m reading data from, so everything has to be broken down by NodeID. So long as I order by NodeID, time_stamp I should be OK. Dave.
No problem. It depends on there being a clustered index that is ordered by your timestamps. If you have that already (or you can make one), you don’t need an identity column. SQL’s behaviour of updating rows in clustered index order is not exactly documented, and I suppose there may come a day when SQL does it differently.
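For the per-node breakdown, something like this should work – a sketch against the column names from the earlier report query, tracking just one of the three state columns for brevity. It is untested, it leans on the same undocumented update-order behaviour, and the index name is made up:

-- Clustered index so the update walks the rows node by node, then by time
CREATE CLUSTERED INDEX ix_node_ts ON faultdata_datafeed (NodeID, ts)

Declare @RunGroup int, @AvailCat int, @NodeID int

Update faultdata_datafeed
Set @RunGroup = FaultGroup =
        Case When AvailCatID <> Coalesce(@AvailCat,-1)
               or NodeID <> Coalesce(@NodeID,-1) -- a new node always starts a new run
             Then ts
             Else Coalesce(@RunGroup,ts)
        End,
    @AvailCat = AvailCatID,
    @NodeID = NodeID
-- EventDefnID and DowntimeID would be tracked the same way with two more variables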