SQL Server Performance

Surrogate or Composite KEY

Discussion in 'T-SQL Performance Tuning for Developers' started by TheGios, Aug 23, 2004.

  1. TheGios New Member

    Consider the following...

    A JOB describes the processment of a document.
    Each document can exist in two versions: English and French.
    A JOB can have 1 or 2 TASK, each describing the processement of either the English or French version.
    So we have the following:

    A: JOB (JobNum [PK], DocReference, StartDate, EndDate, ...)
    B: TASK (JobNum [PK] [FKa], Version [PK], Priority, ...)

    that is there is an identifying 1:M (where maxium allowed for m is 2) relationship between JOB and TASK; TASK being identified by JobNum and Version (where the domain for Version is {E, F}).

    Each TASK may require a TRANSLATION sub_task.
    Each TASK may require a TYPING sub_task.
    Each TASK may require a DISTRIBUTION sub_task.

    For example, for a given doc, the English TASK requires TRANSLATION and DISTRIBUTION, while the French only DISTRIBUTION.

    That is, there is a 1:1 not-required relationship between TASK and TRANSLATION, TYPING and DISTRIBUTION.
    So we have the following:

    A: JOB (JobNum [PK], DocReference, StartDate, EndDate, ...)
    B: TASK (JobNum [PK] [FKa], Version [PK], Priority, ...)

    C: TRANSLATION (JobNum [PK] [FKb], Version [PK] [FKb], DueDate, ...)
    D: TYPING (JobNum [PK] [FKb], Version [PK] [FKb], DueDate, ...)
    E: DISTRIBUTION (JobNum [PK] [FKb], Version [PK] [FKb], Copies, ...)

    As you can see I am using the PK of TASK as FK and PK for each of the three SUB_TASKs.

    To complicate things, each SUB_TASK has one or more assignments. The assignments for each SUB_TASK records different information from the others.
    So we have...

    A: JOB (JobNum [PK], DocReference, StartDate, EndDate, ...)
    B: TASK (JobNum [PK] [FKa], Version [PK], Priority, ...)

    C: TRANSLATION (JobNum [PK] [FKb], Version [PK] [FKb], DueDate, ...)
    D: TYPING (JobNum [PK] [FKb], Version [PK] [FKb], DueDate, ...)
    E: DISTRIBUTION (JobNum [PK] [FKb], Version [PK] [FKb], Copies, ...)

    F: TRA_ASSIGN (JobNum [PK] [FKc], Version [PK] [FKc], Index [PK], Translator, ...)
    G: TYP_ASSIGN (JobNum [PK] [FKd], Version [PK] [FKd], Index [PK], Typyst, ...)
    H: REP_ASSIGN (JobNum [PK] [FKe], Version [PK] [FKe], Index [PK], Pages, ...)

    that is there is an identifying 1:M relationship between each SUB_TASK and its ASSIGNMENTs, each ASSIGNMENT being identified by the SUB_TASK it belongs to and an Index.

    I wish I could send a pic of the ER diagram...

    Maybe there is another and better way to model this: if so, any suggestion?
    Given this model, should I use for TRANSLATION, TYPING and DISTRIBUTION a surrogate key, instead of using the composite key, like for example:

    C: TRANSLATION (TranslationID [PK], JobNum [FKb], Version [FKb], DueDate, ...)
    D: TYPING (TypingID [PK], JobNum [FKb], Version [FKb], DueDate, ...)
    E: DISTRIBUTION (DistributionID [PK], JobNum [FKb], Version [FKb], Copies, ...)

    This makes me lose the logic behind the relationships between a TASK and its SUB_TASK, but this will "improve" the ASSIGNMENTs tables:

    F: TRA_ASSIGN (TranslationID [PK] [FKc], Index [PK], Translator, ...)
    G: TYP_ASSIGN (TypingID [PK] [FKd], Index [PK], Typyst, ...)
    H: REP_ASSIGN (DistributionID [PK] [FKe], Index [PK], Pages, ...)

    I could even go further using a surrogate key even for TASK, which leads me to the following:

    A: JOB (JobNum [PK], DocReference, StartDate, EndDate, ...)
    B: TASK (TaskID [PK], JobNum [FKa], Version , Priority, ...)

    C: TRANSLATION (TaskID [PK] [FKb], DueDate, ...)
    D: TYPING (TaskID [PK] [FKb], DueDate, ...)
    E: DISTRIBUTION (TaskID [PK] [FKb], Copies, ...)

    F: TRA_ASSIGN (TaskID [PK] [FKc], Index [PK], Translator, ...)
    G: TYP_ASSIGN (TaskID [PK] [FKd], Index [PK], Typyst, ...)
    H: REP_ASSIGN (TaskID [PK] [FKe], Index [PK], Pages, ...)

    I don't really like this second solution, but I'm still not sure about the second solution, the one with the surrogate key only in the SUB_TASks tables.
  2. Adriaan New Member

    You say:
    "there is a 1:1 not-required relationship between TASK and TRANSLATION, TYPING and DISTRIBUTION"

    If there is a really a fixed number of possible subtasks per task, where the subtask is sufficiently described by using a number or a date, then why not just add the subtasks as fields to TASK?

    If the number of different subtasks per task must remain open, then don't treat the different types of subtasks as separate entities - you'll need to add another subtable for each and adapt the application, so just include a subtask type on the subtable and work with that.

    Same for the assignments: if there's just one assignee per subtask, then put them at the subtask level. Otherwise, it won't hurt to include the TaskID + SubTaskId as the foreign key at the assignment level.
  3. TheGios New Member

    TRANSLATION, TYPING and REPRODUCTION are the only 3 possible subtask, but each of them has different information (plus a number of common information).

    C: TRANSLATION (xxx [PK}, DueDate, TotalWordsToTranslate)
    D: TYPING (xxx [PK}, DueDate)
    E: DISTRIBUTION (xxx [PK}, Duedate, NumberOfCopies)

    Keep in mind also that the Duedate for TRANSLATION is different from the Duedate of TYPING and DISTRIBUTION.

    Using one table with a TYPE fields would lead to a lot of NULL values.
    Plus, a TASK may require only TRANSLATION while another TYPING and DISTRIBUTION.
    Even for the same couple of tasks, the ENGLISH doc may require TRANSLATION and DISTRIBUTION while the FRENCH may require TYPING and DISTRIBUTION instead.

    An example:
    the ENGLISH doc requires TRANLATION with Duedate 1/1/2004 and 250 words in total, and requires DISTRIBUTION with due date 3/1/2004 and 12 copies; the FRENCH requires TYPING with Duedate 3/1/2004 , and requires DISTRIBUTION with due date 4/1/2004 and 12 copies.
  4. Chappy New Member

    I would recommend a composite key for this. Id estimate that 99% of the time you are only going to hit the sub task tables, when you are working with (ie JOINing) via a specific job ID, so you want the join to be quick.

    A surrogate key here will serve no purpose because although each subtask will have a unique id, you will then still have to use the fields containing job id whenever you join sub tasks onto the job.



  5. TheGios New Member

    This is exactly my same opinion, but people keep on telling me that a surrogate key is better and that sql-server would execute a fast join anyway... I'm not sure about this...

    Like you say, I will have to join the tables on JobNum and Version
  6. Adriaan New Member

    Okay, last comment from me on this: if you do not split out entities that are in a 1:1 relationship, then you also do not need any foreign key, and you can query that information without joins.

    If you cannot have Nulls because you need constraints on some of those fields, then I can understand your objection. In that case, well-constructed triggers can take care of the constraints, preferably with your application double-checking before committing.
  7. TheGios New Member

    quote:Originally posted by Adriaan

    Okay, last comment from me on this: if you do not split out entities that are in a 1:1 relationship, then you also do not need any foreign key, and you can query that information without joins.

    If you cannot have Nulls because you need constraints on some of those fields, then I can understand your objection. In that case, well-constructed triggers can take care of the constraints, preferably with your application double-checking before committing.

    So yo'd suggest the following:

    A: JOB (JobNum [PK], DocReference, StartDate, EndDate, ...)
    B: TASK (JobNum [PK] [FKa], Version [PK], Priority, TRA_DueDate, TRA_TotWords, TRA_...,
    TYP_DueDate, TYP_TotPages, TYP_...,
    DIS_DueDate, DIS_Copies, DIS_...)

    C: TRA_ASSIGN (JobNum [PK] [FKb], Version [PK] [FKb], Index [PK], Translator, ...)
    D: TYP_ASSIGN (JobNum [PK] [FKb], Version [PK] [FKb], Index [PK], Typyst, ...)
    E: REP_ASSIGN (JobNum [PK] [FKb], Version [PK] [FKb], Index [PK], Pages, ...)


    allowing NULL values for the TASK but avoiding joins.
  8. Adriaan New Member

    Yes, but I'd also combine C, D and E into a single table where you distinguish using an Type field (with options translation, typing, distribution) - for the same reason.

    I do understand the reasoning behind your design, but I got the impression that the data structure was being based more on an idea for a user interface than on normalization.
  9. TheGios New Member

    That's half true: the database will be used by a VB interface but also for reporting.
    That is, the DB will be accessed daily via the user interface (to store infoo about tasks carried on documents) and monthly via a reporting tool to get, for example, tranlation loads or total number of copies in 1 year for all documents belonging to a certain division, etc.

    Anyway, abou combining C, D and E, I do not agree: these three tables are very different one from another.
    Actually, we have even decided to simplify our business process and we can go for the following design


    A: WORKFLOW (DocNumber [PK], DocumentReference, DocumentLanguage, StartDate, EndDate, Priority,
    TRA_DueDate, TRA_TotWords, TRA_...,
    TYP_DueDate, TYP_TotPages, TYP_...,
    DIS_DueDate, DIS_Copies, DIS_...)

    B: TRA_ASSIGN (DocNumber [PK] [FKa], Index [PK], Translator, EstimatedWords, ActualWords, ...)
    C: TYP_ASSIGN (DocNumber [PK] [FKa], Index [PK], Typyst, ActualWords, ActualPages, ...)
    D: REP_ASSIGN (DocNumber [PK] [FKa], Index [PK], Copies, Format, ...)

Share This Page