To avoid changing the application code that creates orderStatus rows, it’s possible to mimic the insert behavior of the original design by adding a trigger to the orderStatusCurrent table. The trigger will, upon insert, move any existing rows for the same orders to the orderStatusHistory table, and then insert the new rows. In this way, the application can continue with the simple model of inserting status rows as needed, but the database engine will automatically separate the latest rows from the older ones.
The trigger code looks like this:
CREATE TRIGGER moveOrderStatusToHistory ON orderStatusCurrent INSTEAD OF INSERT AS INSERT INTO orderStatusHistory( orderid, statusDate, status ) SELECT existing.orderid, existing.statusDate, existing.status FROM orderStatusCurrent existing INNER JOIN inserted ON existing.orderid = inserted.orderid DELETE FROM orderStatusCurrent FROM orderStatusCurrent existing INNER JOIN inserted ON existing.orderid = inserted.orderid INSERT INTO orderStatusCurrent( orderid, statusDate, status ) SELECT orderid, statusdate, [status] FROM inserted GO
A side note on triggers: whenever creating trigger code, though it can be challenging, make absolutely certain the code is set based, and can handle multiple rows. It’s a mistake and I have seen it in commercial, production code to assume that a table will only be manipulated one row at a time and create a trigger based on that assumption. Triggers must correctly handle sets of rows.
The trigger has three steps: it first inserts any rows that already exist in the orderStatusCurrent table, for the same orders as the rows being inserted, into the orderStatusHistory table; it then removes those rows from orderStatusCurrent. Finally, it inserts the new, incoming status rows. In effect the new rows replace the old ones, and the old ones get “archived.”
To test the trigger, we can run the same sequence of inserts against the new orderStatusCurrent table, and watch to make sure that a) exactly one row per order remains in the table, and b) the older rows are getting moved to orderStatusHistory correctly:
/* Test against one order */ SELECT * FROM orderStatusCurrent INSERT INTO orderStatusCurrent( orderid, statusDate, status ) VALUES ( 1000, GETDATE(), 'Fulfillment' ) SELECT * FROM orderStatusCurrent INSERT INTO orderStatusCurrent( orderid, statusDate, status ) VALUES ( 1000, GETDATE(), 'Stocking' ) SELECT * FROM orderStatusCurrent SELECT * FROM orderStatusHistory INSERT INTO orderStatusCurrent( orderid, statusDate, status ) VALUES ( 1000, GETDATE(), 'Packaging' ) SELECT * FROM orderStatusCurrent SELECT * FROM orderStatusHistory INSERT INTO orderStatusCurrent( orderid, statusDate, status ) VALUES ( 1000, GETDATE(), 'Shipping' ) SELECT * FROM orderStatusCurrent SELECT * FROM orderStatusHistory DELETE FROM orderStatusCurrent DELETE FROM orderStatusHistory GO
If that works as expected, then we can run similar insert code to the first example, to populate the status records for all orders with test data:
INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] ) SELECT orderid, orderdate, 'Fulfillment' FROM orders INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] ) SELECT TOP 800000 orderid, DATEADD( day, 5, orderdate ), 'Stocking' FROM orders ORDER BY NEWID() INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] ) SELECT TOP 600000 orders.orderid, DATEADD( day, 5, orderStatusCurrent.statusdate ), 'Packaging' FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID AND orderStatusCurrent.status = 'Stocking' ORDER BY NEWID() INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] ) SELECT TOP 400000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Shipping' FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID AND orderStatusCurrent.status = 'Packaging' ORDER BY NEWID() INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] ) SELECT TOP 200000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Shipped' FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID AND orderStatusCurrent.status = 'Shipping' ORDER BY NEWID() INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] ) SELECT TOP 100000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Received' FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID AND orderStatusCurrent.status = 'Shipped' ORDER BY NEWID()
After these inserts, the current status value for each order will be in the target table, and the trigger will have moved the previous status values to the history table. Now we can run some comparisons to see what sort of performance improvement this yields.
First, a comparison of a large query reporting the current status of each order:
/* Select using the separate history table */ SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID OPTION( MAXDOP 1 ) /* Select using the original single status table */ SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN ( SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid ) lastStatusDates ON o.orderID = lastStatusDates.orderid INNER JOIN orderStatus os ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate OPTION( MAXDOP 1 )
Note: I am using MAXDOP 1 to keep these test cases single-threaded, for a simpler apples-to-apples comparison. Analyzing them with parallelism is outside the scope of this exercise.
The optimizer reports that the first query, with the separate history table, has an estimated cost around 15; the original design shows a cost of 33.4, so we are getting something around twice the performance from splitting the table.
Another sample query involves fetching all orders having a specific status:
SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID WHERE os.status = 'Packaging' OPTION( MAXDOP 1 ) SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN ( SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid ) lastStatusDates ON o.orderID = lastStatusDates.orderid INNER JOIN orderStatus os ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate WHERE os.status = 'Packaging' OPTION( MAXDOP 1 )
Here the difference is even more: the split status tables have a query plan cost of 14, compared to 53 for the original design. Finally, orders for a specific customer:
SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID WHERE o.customerID = 12345 OPTION( MAXDOP 1 ) SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN ( SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid ) lastStatusDates ON o.orderID = lastStatusDates.orderid INNER JOIN orderStatus os ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate WHERE o.customerID = 12345 OPTION( MAXDOP 1 )
Cost for the original design 0.096; with the status table split, that improves to 0.063.
The one case that becomes slightly tricky is gathering all the status values for a particular order. In the original design, this is simply
SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN orderStatus os ON o.orderID = os.orderID WHERE o.orderID = 12345
But with the new design we are forced to do some sort of union to get both the current and past values, like these two examples:
SELECT o.orderID, o.customerID, o.orderDate, o.description, st.statusDate, st.status FROM orders o INNER JOIN ( select * from orderStatusCurrent os UNION ALL select * from orderStatusHistory ) st ON o.orderID = st.orderID WHERE o.orderID = 12345 SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID WHERE o.orderID = 12345 UNION ALL SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN orderStatusHistory os ON o.orderID = os.orderID WHERE o.orderID = 12345
Here, unsurprisingly, the original design costs about 30% less, but (this did surprise me) of the other two examples, the first is about 25% faster than the second. The optimizer does a very good job at refactoring the derived table containing the union.
There is some cost for this improvement in select performance, as one would expect. Instead of a single insert, a status change for an order record would require two inserts and a delete, moving the old record from one table to the other. So if the speed of inserts is the only concern, this change might not be advisable; however, on many systems (including the one I was working with) the select/reporting traffic outweighs the transactional load, often many times over, so this change would probably help overall.
What I took away from this study was this: for scalability, it’s obviously not wise to mix rows of different kinds in the same table but that rule can apply even when the difference is as subtle as current vs. past values, that otherwise have exactly the same structure.
]]>