Reduce Aggravating Aggregation: Improve the Performance of History or Status Tables

A third-party system that I manage has “history” or “status” tables following a common pattern: in the database, there is some entity (in my case, apartment units) that is labeled with a value indicating the entity’s status, and the value changes over time. Each of these entities has exactly one value representing the current status, but each also has a series of past values and dates or timestamps, tracing the history of the entity status and when it changed. For apartment units, for example, a given unit might be vacant, then reserved for a new resident, then occupied, then occupied but with the resident having given notice, then vacant again, and so on. It’s vital to track the dates when the status value changes, so the vendor uses a history table to store the unit status, and looks up the maximum date when interrogating the database for the current status.

I was facing a performance problem rooted in that design, which got me thinking about the general case. What follows illustrates the issue using a generic orders/order-status example, rather than the apartment units from my real case, in an effort to make it more broadly applicable.

Suppose we are faced with the canonical set of “order” records, and as the orders are processed in our company, each one has a status value that advances through a sequence, let’s say from “Fulfillment” to “Stocking,” “Packaging,” “Shipping,” and so on. At any given time each order has only one status, but we need to keep track of when each order held each status, with the database recording the date of every change.

A highly simplified version might be implemented with tables orders, customers, statuses, and orderStatus, like this:

CREATE TABLE orders (
        orderID        int NOT NULL IDENTITY(1, 1),
        customerID     int NOT NULL,
        orderDate      datetime NOT NULL,
        description    nvarchar(255) NOT NULL
)
GO
ALTER TABLE orders
    ADD CONSTRAINT PK_orders PRIMARY KEY (orderID)
GO
CREATE INDEX I_orders_customerid ON orders (customerID)
GO
CREATE TABLE customers (
        customerID     int NOT NULL IDENTITY(1, 1),
        firstName      nvarchar(100) NOT NULL,
        lastName       nvarchar(100) NOT NULL
)
GO
ALTER TABLE customers
    ADD CONSTRAINT PK_customers PRIMARY KEY (customerID)
GO
ALTER TABLE orders
    ADD CONSTRAINT FK_orders_customers
    FOREIGN KEY (customerID) REFERENCES customers (customerID)
GO
CREATE TABLE statuses (
        status         nvarchar(50) NOT NULL
)
GO
ALTER TABLE statuses
    ADD CONSTRAINT PK_statuses PRIMARY KEY (status)
GO
CREATE TABLE orderStatus (
        orderID        int NOT NULL,
        statusDate     datetime NOT NULL,
        status         nvarchar(50) NOT NULL
)
GO
ALTER TABLE orderStatus
    ADD CONSTRAINT PK_orderStatus PRIMARY KEY (orderID, statusDate)
GO
ALTER TABLE orderStatus
    ADD CONSTRAINT FK_orderStatus_orders
    FOREIGN KEY (orderID) REFERENCES orders (orderID)
GO
ALTER TABLE orderStatus
    ADD CONSTRAINT FK_orderStatus_statuses
    FOREIGN KEY (status) REFERENCES statuses (status)
GO
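One piece of housekeeping before loading data: orderStatus.status carries a foreign key to statuses, so the lookup table has to be seeded first. A minimal seed, using the six status values that appear in this example:

INSERT INTO statuses ( [status] )
VALUES  ( 'Fulfillment' ), ( 'Stocking' ), ( 'Packaging' ),
        ( 'Shipping' ), ( 'Shipped' ), ( 'Received' )
GO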

As a test case, I’ve populated the tables with sample data as follows. In another database, I have two tables of random first and last names, created from files the US Census Bureau publishes on the web. Using those tables, I created 100,000 customer records in the test tables with random combinations of first and last names:

INSERT INTO customers ( firstName, lastName )
    SELECT TOP 100000 first.name, last.name
    FROM ( SELECT TOP 1000 [name] FROM sampdatafnames ORDER BY NEWID() ) [first]
    CROSS JOIN ( SELECT TOP 1000 [name] FROM sampdatalnames ORDER BY NEWID() ) [last]
    ORDER BY NEWID()

I also have a small table of “sample item names,” with which I generated 1,000,000 sample orders, with dates spanning the last year, associated with random customers:

DECLARE @i INT
SET @i = 0
BEGIN TRAN
    WHILE @i < 1000000 BEGIN
        INSERT INTO orders ( customerID, orderDate, [description] )
        SELECT
            -- Random customer (1-100000; without the +1, the CAST can
            -- yield 0, which matches no customer and fails the FK):
            1 + CAST( 100000 * RAND( CAST( CAST( NEWID() AS BINARY(8) ) AS INT)) AS INT ) customerid,
            -- Random recent date:
            DATEADD( day, -365 * RAND( CAST( CAST( NEWID() AS BINARY(8) ) AS INT)), GETDATE() ) orderDate,
            -- Random order item:
            ( SELECT TOP 1 name FROM sampdataitems ORDER BY NEWID() ) [description]
        SET @i = @i + 1
    END
COMMIT

When using this design, each time the status of an order changes, the order’s new status is inserted, with a date stamp, into the orderStatus table. So, for example, if each order begins as “Fulfillment,” then a row is inserted for every order, at its creation, with the status “Fulfillment.” When an order advances to the next status value, a second row is inserted into orderStatus with the new status value and a later date stamp. The process repeats, and the orderStatus row with the maximum date stamp always represents the current status of the order.
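For a single order, that transition is just an insert. A sketch, using an arbitrary example order ID:

/* Advance one order (orderID 42 is only an example value) to its
   next status; the earlier rows remain in place as history */
INSERT INTO orderStatus ( orderID, statusDate, [status] )
VALUES ( 42, GETDATE(), 'Stocking' )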

We can simulate this situation in bulk by populating the orderStatus table with inserts, each having a date offset from the order date, thus:

/* Create initial status records for all the orders */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
    SELECT orderid, orderdate, 'Fulfillment' FROM orders
/* For a subset of the orders, add the next status, a few days later */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
    SELECT TOP 800000 orderid, DATEADD( day, 5, orderdate ), 'Stocking'
    FROM orders ORDER BY NEWID()
/* For a subset of those last orders, add the next status, a few days later */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
    SELECT TOP 600000 orders.orderid, DATEADD( day, 5, orderStatus.statusdate ), 'Packaging'
    FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
        AND orderStatus.status = 'Stocking'
    ORDER BY NEWID()
/* Repeat… */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
    SELECT TOP 400000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Shipping'
    FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
        AND orderStatus.status = 'Packaging'
    ORDER BY NEWID()
INSERT INTO orderStatus ( orderid, statusdate, [status] )
    SELECT TOP 200000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Shipped'
    FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
        AND orderStatus.status = 'Shipping'
    ORDER BY NEWID()
INSERT INTO orderStatus ( orderid, statusdate, [status] )
    SELECT TOP 100000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Received'
    FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
        AND orderStatus.status = 'Shipped'
    ORDER BY NEWID()

When those inserts complete, we have a good set of data simulating this design: many orders, with a variety of status values over a time span.

One key bit of context here is that we most often want to know the current status of an order or set of orders; the past values are important, but they don’t make up the majority of queries against this data. Most queries are concerned with the current status of orders, or with finding orders that have a particular value as their current status. So the type of thing we typically want has this sort of structure:

/* Typical query to find the latest status date for each order */
SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
/* or, more specifically: */
SELECT    o.orderID,
    o.customerID,
    o.orderDate,
    o.description,
    os.statusDate,
    os.status
FROM orders o
INNER JOIN (
    SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
    ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
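The other common request, finding the orders whose current status is some particular value, takes the same shape, with the filter applied after the aggregation; a sketch, using one of the sample status values:

/* Orders whose current status is 'Shipping' */
SELECT o.orderID, o.customerID, os.statusDate
FROM orders o
INNER JOIN (
    SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
    ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
WHERE os.status = 'Shipping'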

Notice the aggregation required. We have to go into the table and find the max(statusDate) for each order, then fetch the corresponding status value. This requires two trips to the table and two joins for many queries. It works, but it’s not the best performer, and on a busy system all that additional aggregation and join traffic might slow things down.

It’s possible to get much better select performance if we partition out the current status rows (they are what we care about most of the time) from the history rows, using two tables and a simple trigger. In place of the single orderStatus table, I’ll create a pair of tables, orderStatusCurrent and orderStatusHistory. This will give us a very basic form of “poor man’s partitioning,” sans the complexity and cost of the Enterprise feature by that name.

CREATE TABLE orderStatusCurrent (
        orderID       int NOT NULL,
        statusDate    datetime NOT NULL,
        status        nvarchar(50) NOT NULL
)
GO
ALTER TABLE orderStatusCurrent
    ADD CONSTRAINT PK_orderStatusCurrent PRIMARY KEY (orderID)
GO
ALTER TABLE orderStatusCurrent
    ADD CONSTRAINT FK_orderStatusCurrent_orders
    FOREIGN KEY (orderID) REFERENCES orders (orderID)
GO
ALTER TABLE orderStatusCurrent
    ADD CONSTRAINT FK_orderStatusCurrent_statuses
    FOREIGN KEY (status) REFERENCES statuses (status)
GO
CREATE TABLE orderStatusHistory (
        orderID       int NOT NULL,
        statusDate    datetime NOT NULL,
        status        nvarchar(50) NOT NULL
)
GO
ALTER TABLE orderStatusHistory
    ADD CONSTRAINT PK_orderStatusHistory PRIMARY KEY (orderID, statusDate)
GO
ALTER TABLE orderStatusHistory
    ADD CONSTRAINT FK_orderStatusHistory_orders
    FOREIGN KEY (orderID) REFERENCES orders (orderID)
GO
ALTER TABLE orderStatusHistory
    ADD CONSTRAINT FK_orderStatusHistory_statuses
    FOREIGN KEY (status) REFERENCES statuses (status)
GO
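The trigger is what keeps the pair in sync. Here is a sketch of one way it might be written (the trigger name and the INSTEAD OF INSERT mechanics are illustrative assumptions, not the only approach): when a new status row arrives, the order’s existing current row is archived to orderStatusHistory and the new row takes its place in orderStatusCurrent.

CREATE TRIGGER tr_orderStatusCurrent_insert
ON orderStatusCurrent
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON

    /* Archive the current rows being superseded */
    INSERT INTO orderStatusHistory ( orderID, statusDate, [status] )
    SELECT c.orderID, c.statusDate, c.status
    FROM orderStatusCurrent c
    INNER JOIN inserted i ON c.orderID = i.orderID

    /* Remove them from the current table */
    DELETE c
    FROM orderStatusCurrent c
    INNER JOIN inserted i ON c.orderID = i.orderID

    /* Install the new current rows; an INSTEAD OF trigger does not
       fire again for inserts it makes against its own table */
    INSERT INTO orderStatusCurrent ( orderID, statusDate, [status] )
    SELECT orderID, statusDate, [status] FROM inserted
END
GO

With the split in place, the typical current-status lookup collapses to a single join, with no aggregation at all:

SELECT o.orderID, o.orderDate, osc.statusDate, osc.status
FROM orders o
INNER JOIN orderStatusCurrent osc ON o.orderID = osc.orderID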

Continues…
