A third-party system that I manage has “history” or “status” tables following a common pattern: in the database, there is some entity (in my case, apartment units) that is labeled with a value indicating the entity’s status, and the value changes over time. Each of these entities has exactly one value representing the current status, but each also has a series of past values and dates or timestamps, tracing the history of the entity status and when it changed. For apartment units, for example, a given unit might be vacant, then reserved for a new resident, then occupied, then occupied but with the resident having given notice, then vacant again, and so on. It’s vital to track the dates when the status value changes, so the vendor uses a history table to store the unit status, and looks up the maximum date when interrogating the database for the current status.
I was facing a performance problem rooted in that design, which got me thinking about a general case. -What follows illustrates the issue using a generic orders/order status example, as opposed to the apartment units in my real case, in an effort to make this more broadly applicable.
Suppose we are faced with the canonical set of “order” records, and as the orders are processed in our company, each one has a status value that advances through a sequence, let’s say from “Fulfillment” to “Stocking,” “Packaging,” “Shipping,” and so on. At any given time each order has only one status, but we need to keep track of when each order had each status and allow the database to record the dates for each.
A highly simplified version might be implemented with tables orders, customers, statuses and orderStatus, like this:
CREATE TABLE orders ( orderID int NOT NULL IDENTITY(1, 1), customerID int NOT NULL, orderDate datetime NOT NULL, description nvarchar(255) NOT NULL ) GO ALTER TABLE orders ADD CONSTRAINT PK_orders PRIMARY KEY (orderID) GO CREATE INDEX I_orders_customerid ON orders (customerID) GO CREATE TABLE customers ( customerID int NOT NULL IDENTITY(1, 1), firstName nvarchar(100) NOT NULL, lastName nvarchar(100) NOT NULL ) GO ALTER TABLE customers ADD CONSTRAINT PK_customers PRIMARY KEY (customerID) GO ALTER TABLE orders ADD CONSTRAINT FK_orders_customers FOREIGN KEY (customerID) REFERENCES customers (customerID) GO CREATE TABLE statuses ( status nvarchar(50) NOT NULL ) GO ALTER TABLE statuses ADD CONSTRAINT PK_statuses PRIMARY KEY (status) GO CREATE TABLE orderStatus ( orderID int NOT NULL, statusDate datetime NOT NULL, status nvarchar(50) NOT NULL ) GO ALTER TABLE orderStatus ADD CONSTRAINT PK_orderStatus PRIMARY KEY (orderID, statusDate) GO ALTER TABLE orderStatus ADD CONSTRAINT FK_orderStatus_orders FOREIGN KEY (orderID) REFERENCES orders (orderID) GO ALTER TABLE orderStatus ADD CONSTRAINT FK_orderStatus_statuses FOREIGN KEY (status) REFERENCES statuses (status) GO
As a test case, I’ve populated the tables with sample data as follows. In another database, I have two tables of random first and last names, created from files the US Census bureau publishes on the web. Using those tables, I created 100,000 customer records in the test tables with random combinations of first and last names:
INSERT INTO customers ( firstName, lastName ) SELECT TOP 100000 first.name, last.name FROM ( SELECT TOP 1000 [name] FROM sampdatafnames ORDER BY NEWID() ) [first] CROSS JOIN ( SELECT TOP 1000 [name] FROM sampdatalnames ORDER BY NEWID() ) [last] ORDER BY NEWID()
I also have a small table of “sample item names,” with which I generated 1,000,000 sample orders, with dates spanning the last year, associated to random customers:
DECLARE @i INT SET @i = 0 BEGIN TRAN WHILE @i < 1000000 BEGIN INSERT INTO orders ( customerID, orderDate, [description] ) SELECT -- Random customer: CAST( 100000 * RAND( CAST( CAST( NEWID() AS BINARY(8) ) AS INT)) AS INT ) customerid, -- Random recent date: DATEADD( day, -365 * RAND( CAST( CAST( NEWID() AS BINARY(8) ) AS INT)), GETDATE() ) orderDate, -- Random order item: ( SELECT TOP 1 name FROM sampdataitems ORDER BY NEWID() ) [description] SET @i = @i + 1 END COMMIT
When using this design, each time the status of an order is changed, the order’s new status is inserted, with a date stamp, into the orderStatus table. So, for example, if each order begins as “Fulfillment,” then a row would be inserted for every order, at its creation, with the status “Fulfillment.” When an order advances to the next status value, a second row is inserted into orderStatus with the new status value and the later date stamp. This process repeats, with the orderStatus record having the maximum date stamp representing the current status of the order.
We can simulate this situation in bulk by populating the orderStatus table with inserts, each having a date offset from the order date, thus:
/* Create initial status records for all the orders */ INSERT INTO orderStatus ( orderid, statusdate, [status] ) SELECT orderid, orderdate, 'Fulfillment' FROM orders /* For a subset of the orders, add the next status, a few days later */ INSERT INTO orderStatus ( orderid, statusdate, [status] ) SELECT TOP 800000 orderid, DATEADD( day, 5, orderdate ), 'Stocking' FROM orders ORDER BY NEWID() /* For a subset of those last orders, add the next status, a few days later */ INSERT INTO orderStatus ( orderid, statusdate, [status] ) SELECT TOP 600000 orders.orderid, DATEADD( day, 5, orderStatus.statusdate ), 'Packaging' FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID AND orderStatus.status = 'Stocking' ORDER BY NEWID() /* Repeat… */ INSERT INTO orderStatus ( orderid, statusdate, [status] ) SELECT TOP 400000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Shipping' FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID AND orderStatus.status = 'Packaging' ORDER BY NEWID() INSERT INTO orderStatus ( orderid, statusdate, [status] ) SELECT TOP 200000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Shipped' FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID AND orderStatus.status = 'Shipping' ORDER BY NEWID() INSERT INTO orderStatus ( orderid, statusdate, [status] ) SELECT TOP 100000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Received' FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID AND orderStatus.status = 'Shipped' ORDER BY NEWID()
When those inserts complete, we have a good set of data simulating this design: many orders, with a variety of status values over a time span.
One key bit of context here is that we most often want to know the current status of an order or set of orders and, while the past values are important, they don’t make up the majority of queries against this data. Many queries are concerned with the current status of orders, or with finding orders that have a particular value as their current status. So the type of thing we typically want has this sort of structure:
/* Typical query to find the latest order status date */ SELECT MAX(statusdate) FROM orderStatus GROUP BY orderid /* or, more specifically: */ SELECT o.orderID, o.customerID, o.orderDate, o.description, os.statusDate, os.status FROM orders o INNER JOIN ( SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid ) lastStatusDates ON o.orderID = lastStatusDates.orderid INNER JOIN orderStatus os ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
Notice the aggregation required.- We have to go into the table and find the max(statusDate) for each order, then fetch the corresponding status value. This requires two trips to the table and two joins for many queries. It works, but it’s not the best performer, and on a busy system all that additional aggregation and join traffic might slow things down.
It’s possible to get much better select performance if we partition out the current status rows (they are what we care about most of the time) from the history rows, using two tables and a simple trigger. In place of the single orderStatus table, I’ll create a pair of tables, orderStatusCurrent and orderStatusHistory. This will give us a very basic form of “poor man’s partitioning,” sans the complexity and cost of the Enterprise feature by that name.
CREATE TABLE orderStatusCurrent ( orderid int NOT NULL, statusDate datetime NOT NULL, status nvarchar(50) NOT NULL ) GO ALTER TABLE orderStatusCurrent ADD CONSTRAINT PK_orderStatusCurrent PRIMARY KEY (orderid) GO ALTER TABLE orderStatusCurrent ADD CONSTRAINT FK_orderStatusCurrent_orders FOREIGN KEY (orderid) REFERENCES orders (orderID) GO ALTER TABLE orderStatusCurrent ADD CONSTRAINT FK_orderStatusCurrent_statuses FOREIGN KEY (status) REFERENCES statuses (status) GO CREATE TABLE orderStatusHistory ( orderID int NOT NULL, statusDate datetime NOT NULL, status nvarchar(50) NOT NULL ) GO ALTER TABLE orderStatusHistory ADD CONSTRAINT PK_orderStatusHistory PRIMARY KEY (orderID, statusDate) GO ALTER TABLE orderStatusHistory ADD CONSTRAINT FK_orderStatusHistory_orders FOREIGN KEY (orderID) REFERENCES orders (orderID) GO ALTER TABLE orderStatusHistory ADD CONSTRAINT FK_orderStatusHistory_statuses FOREIGN KEY (status) REFERENCES statuses (status) GO