Reduce Aggravating Aggregation: Improve the Performance of History or Status Tables

To avoid changing the application code that creates orderStatus rows, it’s possible to mimic the insert behavior of the original design by adding a trigger to the orderStatusCurrent table. The trigger will, upon insert, move any existing rows for the same orders to the orderStatusHistory table, and then insert the new rows. In this way, the application can continue with the simple model of inserting status rows as needed, but the database engine will automatically separate the latest rows from the older ones.

The trigger code looks like this:

CREATE TRIGGER moveOrderStatusToHistory
ON orderStatusCurrent
INSTEAD OF INSERT
AS
/* Step 1: copy any existing rows for the incoming orders to the history table */
INSERT INTO orderStatusHistory( orderid, statusDate, status )
    SELECT existing.orderid,
        existing.statusDate,
        existing.status
    FROM orderStatusCurrent existing
    INNER JOIN inserted ON existing.orderid = inserted.orderid
/* Step 2: remove those rows from the current-status table */
DELETE existing
    FROM orderStatusCurrent existing
    INNER JOIN inserted ON existing.orderid = inserted.orderid
/* Step 3: insert the new, incoming status rows */
INSERT INTO orderStatusCurrent( orderid, statusDate, status )
    SELECT orderid,
        statusDate,
        status
    FROM inserted
GO

A side note on triggers: whenever writing trigger code, though it can be challenging, make absolutely certain the code is set-based and can handle multiple rows. It's a mistake (and I have seen it in commercial, production code) to assume that a table will only ever be manipulated one row at a time and to write a trigger based on that assumption. Triggers must correctly handle sets of rows.
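To make the pitfall concrete, here is a hedged sketch of the single-row mistake next to its set-based equivalent. The lastStatusDate column on orders is invented purely for this illustration; the point is the shape of the code, not these particular names.

```sql
/* BROKEN: assumes "inserted" contains exactly one row.
   On a multi-row INSERT, @orderid captures one arbitrary
   orderid and the remaining rows are silently ignored. */
CREATE TRIGGER badSingleRowTrigger
ON orderStatusCurrent
AFTER INSERT
AS
DECLARE @orderid INT
SELECT @orderid = orderid FROM inserted
UPDATE orders SET lastStatusDate = GETDATE()  -- hypothetical column
WHERE orderid = @orderid
GO

/* SET-BASED: joins against "inserted", so it behaves
   correctly whether the INSERT affects one row or a million. */
CREATE TRIGGER goodSetBasedTrigger
ON orderStatusCurrent
AFTER INSERT
AS
UPDATE o SET lastStatusDate = GETDATE()  -- hypothetical column
FROM orders o
INNER JOIN inserted i ON o.orderid = i.orderid
GO
```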

The trigger has three steps: first, for the same orders as the rows being inserted, it copies any rows that already exist in the orderStatusCurrent table into the orderStatusHistory table; next, it removes those rows from orderStatusCurrent; finally, it inserts the new, incoming status rows. In effect the new rows replace the old ones, and the old ones get “archived.”

To test the trigger, we can run the same sequence of inserts against the new orderStatusCurrent table, and watch to make sure that a) exactly one row per order remains in the table, and b) the older rows are moved to orderStatusHistory correctly:

/* Test against one order */
SELECT * FROM orderStatusCurrent
INSERT INTO orderStatusCurrent( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Fulfillment' )
SELECT * FROM orderStatusCurrent
INSERT INTO orderStatusCurrent( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Stocking' )
SELECT * FROM orderStatusCurrent
SELECT * FROM orderStatusHistory
INSERT INTO orderStatusCurrent( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Packaging' )
SELECT * FROM orderStatusCurrent
SELECT * FROM orderStatusHistory
INSERT INTO orderStatusCurrent( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Shipping' )
SELECT * FROM orderStatusCurrent
SELECT * FROM orderStatusHistory
DELETE FROM orderStatusCurrent
DELETE FROM orderStatusHistory
GO

If that works as expected, then we can run insert code similar to the first example to populate the status records for all orders with test data:

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
    SELECT orderid, orderdate, 'Fulfillment' FROM orders
INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
    SELECT TOP 800000 orderid, DATEADD( day, 5, orderdate ), 'Stocking'
    FROM orders ORDER BY NEWID()
INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
    SELECT TOP 600000 orders.orderid, DATEADD( day, 5, orderStatusCurrent.statusdate ), 'Packaging'
    FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
        AND orderStatusCurrent.status = 'Stocking'
    ORDER BY NEWID()
INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
    SELECT TOP 400000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Shipping'
    FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
        AND orderStatusCurrent.status = 'Packaging'
    ORDER BY NEWID()
INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
    SELECT TOP 200000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Shipped'
    FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
        AND orderStatusCurrent.status = 'Shipping'
    ORDER BY NEWID()
INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
    SELECT TOP 100000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Received'
    FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
        AND orderStatusCurrent.status = 'Shipped'
    ORDER BY NEWID()

After these inserts, the current status value for each order will be in the target table, and the trigger will have moved the previous status values to the history table. Now we can run some comparisons to see what sort of performance improvement this yields.

First, a comparison of a large query reporting the current status of each order:

/* Select using the separate history table */
SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN orderStatusCurrent os
    ON o.orderID = os.orderID
OPTION( MAXDOP 1 )
/* Select using the original single status table */
SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN (
    SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
    ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
OPTION( MAXDOP 1 )

Note: I am using MAXDOP 1 to keep these test cases single-threaded, for a simpler apples-to-apples comparison. Analyzing them with parallelism is outside the scope of this exercise.
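For anyone who wants to reproduce the cost numbers that follow, the figures cited are SQL Server's estimated subtree cost, visible in the graphical estimated execution plan in Management Studio or, in text form, via SET SHOWPLAN_ALL. A minimal sketch (using the first comparison query; the query does not actually execute while showplan is on):

```sql
/* Return the estimated plan instead of executing the query;
   the TotalSubtreeCost column of the first result row is the
   overall estimated cost discussed in this article. */
SET SHOWPLAN_ALL ON
GO
SELECT o.orderID, os.statusDate, os.status
FROM orders o
INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID
OPTION( MAXDOP 1 )
GO
SET SHOWPLAN_ALL OFF
GO
```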

The optimizer reports that the first query, with the separate history table, has an estimated cost of around 15; the original design shows a cost of 33.4, so splitting the table yields roughly twice the performance.

Another sample query involves fetching all orders having a specific status:

SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN orderStatusCurrent os
    ON o.orderID = os.orderID
WHERE os.status = 'Packaging'
OPTION( MAXDOP 1 )
SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN (
    SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
    ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
WHERE os.status = 'Packaging'
OPTION( MAXDOP 1 )

Here the difference is even larger: the split status tables have a query plan cost of 14, compared to 53 for the original design. Finally, orders for a specific customer:

SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN orderStatusCurrent os
    ON o.orderID = os.orderID
WHERE o.customerID = 12345
OPTION( MAXDOP 1 )
SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN (
    SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
    ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
WHERE o.customerID = 12345
OPTION( MAXDOP 1 )

The cost for the original design is 0.096; with the status table split, that improves to 0.063.

The one case that becomes slightly tricky is gathering all the status values for a particular order. In the original design, this is simply

SELECT o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN orderStatus os
    ON o.orderID = os.orderID
WHERE o.orderID = 12345

But with the new design we are forced to do some sort of union to get both the current and past values, like these two examples:

SELECT o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    st.statusDate,
    st.status
FROM orders o
INNER JOIN ( SELECT * FROM orderStatusCurrent
    UNION ALL
    SELECT * FROM orderStatusHistory ) st
    ON o.orderID = st.orderID
WHERE o.orderID = 12345
SELECT o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN orderStatusCurrent os
    ON o.orderID = os.orderID
WHERE o.orderID = 12345
UNION ALL
SELECT o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN orderStatusHistory os
    ON o.orderID = os.orderID
WHERE o.orderID = 12345

Here, unsurprisingly, the original design costs about 30% less than either new-design query; what did surprise me is that, of the two union approaches, the first (with the derived table) is about 25% faster than the second. The optimizer does a very good job of refactoring the derived table containing the union.

There is some cost to this improvement in select performance, as one would expect: instead of a single insert, a status change for an order now requires two inserts and a delete, to move the old record from one table to the other. If insert speed were the only concern, this change might not be advisable; however, on many systems (including the one I was working with) the select/reporting traffic outweighs the transactional load, often many times over, so this change would probably help overall.

What I took away from this study is this: for scalability, it's obviously not wise to mix rows of different kinds in the same table, but that rule can apply even when the difference is as subtle as current vs. past values that otherwise have exactly the same structure.
