Natural vs. Surrogate Keys in SQL Server : Getting the Proper Perspective

I once walked into a bar, and saw two construction workers pounding
each other to a pulp.  The argument was over what was the better tool—a  hammer
or screwdriver.  I feel a similar sensation when I see SQL developers arguing
over whether to use natural or surrogate keys.    Few other arguments in
database design can cause tempers to flare so quickly.  Fans of surrogates
consider anyone who uses a natural key a drooling idiot, whereas natural key
acolytes believe the use of a surrogate warrants burning alive at the stake.    

Which side is right?  Neither.  Both natural and surrogate
keys have their own place, and a developer who doesn’t use both as the
situation demands is shortchanging both himself and his applications.

Definitions

A natural key is simply a column or set of columns in a
table that uniquely identifies each row.   Natural keys are a feature of the data,
not the database, and thus have business meaning on their own.  Quite often a
natural key is more than a single column.  For instance, the natural key for a
table of addresses might be the five columns: street number, street name, city,
state, and zip code.

What is a surrogate key?  Most people will define it as some
variation of “a system-generated value used to uniquely identify a row”.    Unlike
natural keys, surrogates have no business meaning.   In SQL Server, by far the
most common technique for generating surrogate values is the ubiquitous
IDENTITY column.

While the above definition is true, there’s another very
important part to it.    The value of a surrogate must never be exposed to the
outside world.   Users should never be allowed to see the key, under any
conditions.   Display the value of a surrogate key on a report, allow it to be
viewed on a form or even used as a search term – these are all forbidden.   Once
you expose a surrogate key, it immediately begins acquiring business meaning.  

Smart Keys: The Worst of Both Worlds

A smart key is an artificial key with one or more parts that
contain business meaning.  For instance, an employee table where each primary
key begins with the initial of the employee (“JD1001” for John Doe), or a table
of paint products, where the key identifies the can size, color, and paint
type, i.e.  “1G-RED- LATEX”.

Smart keys are a sort of hybrid between natural and
surrogate keys.  They’re seductively attractive to many developers, but should
be avoided like the plague.  They tend to make your design very brittle and
subject to failure as business rules change.

Note: if your data already contains meaningful
product codes or other keys such as those described above, then they are simply
natural keys and the above caveat doesn’t apply.     It’s a smart key only when
the value is constructed by the developer.

Benefits of Natural Keys

A natural key is…well, natural.  Since its values already
exist in the data, using a natural key means you don’t have to add and maintain
a new column.  This also means smaller tables and less storage requirements.  As
more rows fit on a database page, it can sometimes mean greater performance.
(It can also mean less- more on this later) .  However, in practical terms, the
space savings are minor, except for very narrow tables.   Using a surrogate key
usually means an additional index is required, though. 

Generating sequential key values is inherently a serial
process, so using a natural key rather than an IDENTITY column can be a
performance boost for inserts, especially in OLTP environments.

Since natural key values are used as foreign keys in child
tables, it can mean the elimination of joins for queries that require no other
columns from the parent other than the natural key.

One benefit of natural keys often claimed by its proponents is
that they can aid in self-documenting your database schema.   Explicitly naming
each natural key documents what specifically identifies each row in a table,
and joining on natural keys helps to identify the natural relationships between
tables.  This argument based on elegance appeals strongly to those
working in academia; it may or may not have much value for developers working
in the dirt and grime of real production systems.

Benefits of Surrogate Keys

Since surrogate values are controlled by the system, you
never have to worry about duplicate, missing, or changing values.  They’re also
an easy and reliable way to join tables; when writing queries, you never need
worry about remembering which combination of columns is the natural key.  However,
some of these benefits are less compelling than they seem.   I’ll discuss separately
each case for using a surrogate, and whether or not it holds up.

When no natural key on the table exists – Yes.  

If the table has no unique identifier, then you must create one.  Unkeyed tables
are in general a very bad idea.   There are exceptions such as logging or
summary tables in which rows are only inserted, never updated.  Otherwise, if
your table doesn’t have a unique key—create one.

When the natural key can change- Sometimes.  

Immutability is certainly a desirable feature for a key, but it’s by no
means a requirement.   Using this as an excuse is the trademark of a lazy
developer.  Data changes.   If the key changes, update it.  If it’s being used
as a foreign key in another table – update that table too.  Cascading updates
exist for a reason, after all.

Obviously if a key changes very
often, or will be used as a FK for many other tables, there can be performance
or concurrency implications to making it the primary key .  In this case, you
do want to consider a surrogate, and use it if appropriate. 

There’s one particular case where
the stability of a surrogate key actually works against you.  For lookup
tables, particular those containing selection options for given fields, changes
to the lookup value are often not meant to be cascaded into child tables.   For
example, an application may store the “referral source” for new customers or
marketing leads, whether they were generated via an ad in a newspaper or
magazine, a yellow pages entry, word of mouth, etc.   These referral codes can
be very specific, and are normally stored in a lookup table.  Once set, the
code should be preserved historically, even if the original lookup table value
is updated or removed.   This is behavior very difficult to achieve with a
surrogate key, but trivial with a natural key.

When natural key values can be missing or duplicated
– No


This is probably the most misunderstood aspect of the natural vs. surrogate
debate.   Natural keys are unique by definition.   If it isn’t a serious error
for a value to be missing or duplicated, then that value isn’t a natural key to
begin with.  And if it is an error, then you’re almost always
better off trapping that error at the database level,  rather than allowing that
bad data into your DB. 

As example, consider a table of
employees keyed off Social Security Number.   Users are complaining the
database throws errors when they don’t have a SSN or mistakenly enter a
duplicate.   So you replace the SSN PK with a surrogate and smugly conclude
you’ve solved the problem.   But have you?   Now some employees don’t have
SSNs, and the accounting module starts failing when printing tax records … or
worse, collates all the NULL SSN entries together, reporting them as a single
employee.     The search function starts returning the wrong rows because some
employees are sharing the same SSNs, and the new hire in the mailroom gets the
boss’s paycheck, because someone in HR accidentally cut and pasted a SSN.  

In reality, all you’ve done is
short-circuit out the data integrity safeguards in your database, and pass the responsibility
for the problem up to the application level. Bad move.

These sorts of problems exist
because most tables have a uniqueness requirement at the tion
level.    A surrogate key only solves the uniqueness problem  at the database
level, but users (who cannot and should not see the surrogate value) still don’t
have a way to uniquely identify each record.  This also explains why when, even
if you choose to use a surrogate key, you will usually want to also add a
unique constraint on the original natural key, since uniqueness is no longer
being automatically enforced by the PK.

But wait a minute!  What if your
business rules specifically require employees to be input before you have their
SSN data?  Or what if your table holds overseas employees that may not have a
SSN at all?  Does that mean you can’t use a natural key?  Maybe.  One
possibility is to assign your own unique values in these cases.   One system I’ve
seen used randomly generated alphabetic values for temporary SSNs, whereas the
standard numeric value identified a “real” one.   Better yet is to examine your
table for some other column or columns that can be used as a natural key.  Or
maybe you really do want to drop natural keys altogether.  The point here is
not that surrogates should never be used, but simply that if your natural key
isn’t unique, you are going to have problems beyond those that a surrogate will
solve.

When the natural key is very wide, or a composite of multiple
columns – Sometimes


Wide keys make for fat indexes.  Fat indexes
have performance implications.  A very wide key  can hurt performance far more
than the extra space required by a surrogate.   Replacing a composite key with
a surrogate also simplifies your queries, but this should never be a primary
consideration.  It’s poor form to replace the natural key of a two CHAR(2) columns
with am INT IDENTITY, for no other reason than it makes your queries prettier.

One common case where a
multi-column natural key should always be used is the so-called junction table:
a table used to implement a many-many relationship between two other tables. 
Most junction tables have only two columns, each a FK back to a parent table. 
The combination of these two FKs is itself the primary key for the table.  
Adding a surrogate to a table like this is just asking for trouble.

For performance reasons.  

This is the trickiest question of
all.   Replacing a wide key with a narrower value means smaller indexes, and
more values retrieved from every index page read.  This does boost
performance.  However,  you’ll usually retain the index on the natural key (to
enforce uniqueness if nothing else) and that means another index to maintain. 
If your table is very narrow, the additional column for the surrogate can
noticeably impact performance and storage requirements.  Finally, some queries
that may have not required JOINs with a natural  foreign key may now need
them.  For instance, our employee SSN example might have a child table
containing reported hours worked:

Table: ReportedHours

Start_Time

DATETIME

Stop_Time

DATETIME

EmployeeID

(Foreign key to Employee table)

If the EmployeeID  FK is SSN, then
we can retrieve a list of total hours by SSN from this table alone:

SELECT EmployeeID, SUM(DATEDIFF(hr,StopTime,StartTime))

FROM ReportedHours

GROUP BY
EmployeeID

With EmployeeID as a surrogate key,
however, we must JOIN back to the EMPLOYEE table:

SELECT SSN,
SUM(DATEDIFF(hr,StopTime,StartTime))

FROM ReportedHours h

JOIN Employees e ON h.EmployeeID = e.EmployeeID

GROUP BY
e.SSN

Continues…

Pages: 1 2




Array

No comments yet... Be the first to leave a reply!