SQL Server Performance

Eliminating Hash Match Aggregate or Sort Distinct on Unique Data

Discussion in 'SQL Server 2005 T-SQL Performance Tuning' started by phlogiston, Jul 16, 2009.

  1. phlogiston New Member

    Hello, Folks,
    We have a main table like this, with a little more than 200 million rows:
    CREATE TABLE dbo.MainTable
    (
    DateHourStamp DATETIME NOT NULL,
    RecordIdentifier NCHAR(17) NOT NULL,
    LatestStatusDateTime DATETIME NULL,
    LatestStatusValue NCHAR(2) NULL,
    ...
    )
    ALTER TABLE dbo.MainTable
    ADD CONSTRAINT PK_MainTable
    PRIMARY KEY CLUSTERED
    (
    DateHourStamp ASC,
    RecordIdentifier ASC
    )
    CREATE UNIQUE NONCLUSTERED INDEX IX_MainTable_RecordIdentifier
    ON dbo.MainTable
    (
    RecordIdentifier ASC
    )
    We have a status table like this, with a little more than 2 million rows:
    CREATE TABLE dbo.StatusTable
    (
    RecordIdentifier NCHAR(17) NOT NULL,
    StatusDateTime DATETIME NOT NULL,
    StatusRank TINYINT NOT NULL,
    StatusValue NCHAR(2) NOT NULL
    )
    ALTER TABLE dbo.StatusTable
    ADD CONSTRAINT PK_StatusTable
    PRIMARY KEY CLUSTERED
    (
    RecordIdentifier ASC,
    StatusDateTime DESC,
    StatusRank DESC
    )
    The main table is populated once an hour with one row per record identifier, and the distribution of record identifiers is pretty even and mostly ever increasing.
    The status table is populated once an hour (but later than the main table) with multiple rows per record identifier, and the distribution of record identifiers could be pretty random.
    We use the following statement to update the main table with the latest status for a record identifier:
    WITH NewStatuses AS
    (
    SELECT
    RecordIdentifier,
    StatusDateTime,
    StatusValue,
    ROW_NUMBER() OVER
    (
    PARTITION BY
    RecordIdentifier
    ORDER BY
    StatusDateTime DESC,
    StatusRank DESC
    )
    AS RowNumber
    FROM
    dbo.StatusTable
    )
    UPDATE
    dbo.MainTable
    SET
    LatestStatusDateTime = NewStatuses.StatusDateTime,
    LatestStatusValue = NewStatuses.StatusValue
    FROM
    dbo.MainTable
    INNER JOIN
    NewStatuses
    ON (MainTable.RecordIdentifier = NewStatuses.RecordIdentifier)
    WHERE
    NewStatuses.RowNumber = 1;
    MainTable and StatusTable have a one-to-many relationship. The ROW_NUMBER() function allows us to filter the StatusTable to make a pseudo one-to-one relationship.
    Here is the problem. Even though we make that pseudo one-to-one relationship, SQL Server still treats it like a one-to-many, and throws a Hash Match Aggregate in the plan to eliminate possible duplicates before the update.
    The number of rows input to the Hash Match Aggregate is always equal to the number of rows output. And because the StatusTable and Hash create a pretty random mess, the actual update to the table causes a ton of random disk access.
    We tried adding OPTION (ORDER GROUP) to the end, but that just changed the Hash Match Aggregate into a Sort Distinct. Granted, the random disk access was reduced. The hash takes about 90 minutes, the sort takes about 45 minutes.
    My question is, how can we eliminate the Hash Match Aggregate or Sort Distinct all together, when we know that the number of rows input to the operator is always equal to the number of rows output? Thanks.
  2. Luis Martin Moderator

    Did you try DTA suggestion?
  3. Adriaan New Member

    I know I skipped a lot of information here, but my guess is that statistics will reflect the first two observations, not the last statement.
    Also, how come you have a primary key for MainTable on (DateHourStamp, RecordIdentifier) when the name of the second column suggests it is the primary key by itself? And since you are joining on the second column, you're probably not helping the db engine take advantage of the index.
    Since you have made this a clustered index, can you confirm that the DateHourStamp column is set only on insert, and never updated?
  4. rohit2900 Member

    [quote user="phlogiston"]
    CREATE TABLE dbo.MainTable
    (
    DateHourStamp DATETIME NOT NULL,
    RecordIdentifier NCHAR(17) NOT NULL,
    LatestStatusDateTime DATETIME NULL,
    LatestStatusValue NCHAR(2) NULL,
    ...
    )
    ALTER TABLE dbo.MainTable
    ADD CONSTRAINT PK_MainTable
    PRIMARY KEY CLUSTERED
    (
    DateHourStamp ASC,
    RecordIdentifier ASC
    )
    CREATE UNIQUE NONCLUSTERED INDEX IX_MainTable_RecordIdentifier
    ON dbo.MainTable
    (
    RecordIdentifier ASC
    )
    We have a status table like this, with a little more than 2 million rows:
    CREATE TABLE dbo.StatusTable
    (
    RecordIdentifier NCHAR(17) NOT NULL,
    StatusDateTime DATETIME NOT NULL,
    StatusRank TINYINT NOT NULL,
    StatusValue NCHAR(2) NOT NULL
    )
    ALTER TABLE dbo.StatusTable
    ADD CONSTRAINT PK_StatusTable
    PRIMARY KEY CLUSTERED
    (
    RecordIdentifier ASC,
    StatusDateTime DESC,
    StatusRank DESC
    )
    [/quote]
    The first thing which I'll suggest is instead of having composite primary key add a identity column make it a primary key clustered and then add the unique indexes. And then I'll suggest you to add new index on NewStatuses.RecordIdentifier apart from your existing indexes (Add composite unique index for the existing primary key combination)
    Tell me if this helps...

Share This Page