SQL Server Performance

Getting Sessions From Web Logs

Discussion in 'General Developer Questions' started by tarr11, Jun 20, 2003.

  1. tarr11 New Member


    Hi I'm trying to load my IIS logs into a SQL table, and determine
    user "sessions" from this. Ignoring all the problems with the
    stateless web, etc, here's what the table looks like:

    create table LogFiles
    (
    SESSION_ID char(20),
    DATE_CREATED datetime not null,
    PAGE_NAME char(10) not null,
    QUERY_STRING char(100)
    )


    create table Sessions
    (
    SESSION_ID char(20) not null,
    DATE_STARTED datetime not null,
    DATE_ENDED datetime not null
    )


    my puny brain can't figure out the query here. I suspect it involves
    correlated subqueries and self joins, but I can't make it work.

    I've got to make a list of distinct sessions such that there is no
    gap of greater than x minutes (lets say 30) between two requests from
    the same session (otherwise it should be counted as a new session.

    Ie, if i have
    SessionID=1
    Date = 1/1/2003 10:00:00 AM
    Page = Foo.asp

    SessionID=1
    Date = 1/1/2003 10:25:00 AM
    Page = Bar.asp

    SessionID=1
    Date = 1/1/2003 11:00 AM
    Page = Bar.asp

    this should produce 2 sessions:
    Session ID = 1
    Start 10:00 AM
    End 10:25 AM

    Session ID = 1
    Start 11:00 AM
    End 11:00 AM

    Help much appreciated.

    Thx,
    Doug
  2. gaurav_bindlish New Member

    I am looking forward to solution as well as I have been planning to write this query form a long time.

    Gaurav
  3. rushmada New Member

    Could u please figure it out(In detail) the exact requirement.

    Thanks.





    Rushendra
  4. bambola New Member

    I've come up with something. It seems to be working though I'm sure it can be written better.
    For it to work, you have to add an IDENTITY column to the LogFiles table.


    SET NOCOUNT ON

    CREATE TABLE [LogFiles] (
    [i_ID] [int] IDENTITY (1, 1) NOT NULL ,
    [SESSION_ID] [char] (20) NULL ,
    [DATE_CREATED] [datetime] NOT NULL ,
    [PAGE_NAME] [char] (10) NOT NULL ,
    [QUERY_STRING] [char] (100) NULL
    ) ON [PRIMARY]
    GO

    INSERT INTO LogFiles SELECT1, '2003-01-01 10:00:00', 'Foo.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 10:25:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 11:00:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 11:05:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 12:00:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 12:10:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 13:00:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT1, '2003-01-01 10:15:00', 'Bar.asp',''

    INSERT INTO LogFiles SELECT 2, '2003-01-01 10:00:00', 'Foo.asp',''
    INSERT INTO LogFiles SELECT 2, '2003-01-01 10:25:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT 2, '2003-01-01 10:15:00', 'Bar.asp',''
    INSERT INTO LogFiles SELECT 2, '2003-01-01 12:00:00', 'Bar.asp',''

    DECLARE @t table
    (
    SESSION_ID int
    , I_IDa int
    , DATE_CREATEDa datetime
    , DATE_CREATED datetime
    , I_ID int -- I added this 2 fields just for control.
    , ddiff int
    )

    INSERT INTO @t
    SELECT a.SESSION_ID, a.I_ID, a.DATE_CREATED, b.DATE_CREATED
    , b.I_ID,DATEDIFF(minute, a.DATE_CREATED, b.DATE_CREATED) df
    FROM logfiles a with (nolock)
    INNER JOIN logfiles b
    ON a.I_ID < b.I_ID
    WHERE a.SESSION_ID = b.SESSION_ID
    AND DATEDIFF(minute, a.DATE_CREATED, b.DATE_CREATED) BETWEEN 0 AND 30 -- or any other interval

    SELECT SESSION_ID, DATE_CREATEDa , DATE_CREATED
    FROM @t b
    where EXISTS
    (
    SELECT NULL
    FROM @t a --WITH (NOLOCK)
    WHERE a.SESSION_ID = b.SESSION_ID
    AND a.DATE_CREATEDa = b.DATE_CREATEDa
    GROUP BY i_ida, DATE_CREATEDa
    HAVING b.ddiff = MAX(a.ddiff)
    )

    UNION ALL

    SELECT SESSION_ID, date_created, date_created
    FROM logfiles a WITH (NOLOCK)
    WHERE NOT EXISTS
    (
    SELECT NULL
    FROM @t b
    WHERE a.SESSION_ID = b.SESSION_ID
    AND a.i_id = b.i_id OR a.i_id = b.i_ida
    )
    Bambola.

  5. tarr11 New Member

    Here's a cursor version. (I'm ashamed, so ashamed!)




    set nocount on
    delete from Sessions

    declare curTraffic cursor local fast_forward

    for

    select LF1.SESSION_ID,LF1.DATE_CREATED
    from
    LogFiles LF1

    order by LF1.SESSION_ID,LF1.DATE_CREATED

    declare @Session char(20)
    declare @LastSession char(20)
    declare @DateCreated datetime
    declare @LastDate datetime
    declare @SessionStart datetime
    declare @SessionEnd datetime

    set @LastDate ='1/1/1900'
    set @LastSession = ''

    open curTraffic
    fetch next from curTraffic into @Session,@DateCreated
    set @SessionStart = @DateCreated
    set @LastDate = @DateCreated
    set @LastSession = @Session
    WHILE (@@FETCH_STATUS=0)
    BEGIN


    if (@Session <> @LastSession) or ABS(DATEDIFF(mi,@LastDate,@DateCreated)) > 30
    BEGIN

    insert into Sessions(SESSION_ID,DATE_STARTED,DATE_ENDED)
    values(@LastSession,@SessionStart,@LastDate)

    set @SessionStart=@DateCreated
    END

    set @LastSession = @Session
    set @LastDate = @DateCreated

    fetch next from curTraffic into @Session,@DateCreated

    END

    insert into Sessions(SESSION_ID,DATE_STARTED,DATE_ENDED)
    values(@LastSession,@SessionStart,@LastDate)
  6. bambola New Member

    You should be! [<img src='/community/emoticons/emotion-4.gif' alt=':p' />] [<img src='/community/emoticons/emotion-5.gif' alt=';)' />]<br />But didn't my code (the first and ugly but seems to be working) work for you?<br /><br />Bambola.
  7. bambola New Member

    I think you are missing some rows here. That is of course if I understood you right.1
    This is table logfiles: (you have the insert script above)


    1 2003-01-01 10:00:00.000Foo.asp
    2 2003-01-01 10:00:00.000Foo.asp
    2 2003-01-01 10:15:00.000Bar.asp
    1 2003-01-01 10:15:00.000Bar.asp
    1 2003-01-01 10:25:00.000Bar.asp
    2 2003-01-01 10:25:00.000Bar.asp
    1 2003-01-01 11:00:00.000Bar.asp
    1 2003-01-01 11:05:00.000Bar.asp
    1 2003-01-01 12:00:00.000Bar.asp
    2 2003-01-01 12:00:00.000Bar.asp
    1 2003-01-01 12:10:00.000Bar.asp
    1 2003-01-01 13:00:00.000Bar.asp
    This is result set return by your function (5 rows)


    1 2003-01-01 10:00:00.0002003-01-01 10:25:00.000
    1 2003-01-01 11:00:00.0002003-01-01 11:05:00.000
    1 2003-01-01 12:00:00.0002003-01-01 12:10:00.000
    1 2003-01-01 13:00:00.0002003-01-01 13:00:00.000
    2 2003-01-01 10:00:00.0002003-01-01 10:25:00.000
    This is result set return by my function
    Code:
    
    1     2003-01-01  10:00:00.000   2003-01-01 10:25:00.000
    1     2003-01-01  11:00:00.000   2003-01-01 11:05:00.000
    1     2003-01-01  12:00:00.000   2003-01-01 12:10:00.000
    2     2003-01-01  10:00:00.000   2003-01-01 10:25:00.000
    1     2003-01-01  13:00:00.000   2003-01-01 13:00:00.000
    [B]2     2003-01-01  12:00:00.000   2003-01-01 12:00:00.000[/B]
    
    Try it with the 3 rows you specified in your first post. Your procedure 
    returns only one session and not 2 like you requested. The "orphan" session
    (Session ID = 1 Start 11:00 AM End 11:00 AM) is missing.
    
    Bambola.
    
    
     
  8. bambola New Member

    Ok. This seems simple enough. no table datatype, but still an added IDENTITY to Logfiles.


    SELECT a.SESSION_ID, a.DATE_CREATED, max(b.DATE_CREATED)
    FROM logfiles a WITH (NOLOCK)
    INNER JOIN logfiles b
    ON a.I_ID < b.I_ID
    WHERE a.SESSION_ID = b.SESSION_ID
    AND DATEDIFF(minute, a.DATE_CREATED, b.DATE_CREATED) BETWEEN 0 AND 30
    GROUP BY a.SESSION_ID, a.I_ID, a.DATE_CREATED

    UNION ALL

    SELECT a.SESSION_ID, a.DATE_CREATED, a.DATE_CREATED
    FROM logfiles a
    LEFT OUTER JOIN logfiles b ON b.SESSION_ID = a.SESSION_ID
    AND a.DATE_CREATED < b.DATE_CREATED
    where b.DATE_CREATED IS NULL

    Bambola.

  9. tarr11 New Member

    Haven't tried your solution yet. Identity values! That's just as bad as cursors! <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br /><br />Anyway, here's another requirement. Sessions are typically generated from a cookie. This means that the session_id won't exist on the first request in the logfile. Need to have a second pass that finds all log entries without session ids, and then tries to match them by referer/user-agent/ within a very short time frame.<br /><br />Yuck.
  10. bambola New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by tarr11</i><br /><br />Identity values! That's just as bad as cursors! <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br /><hr height="1" noshade id="quote"></blockquote id="quote"></font id="quote"><br />It is really?<br /><br />Bambola.
  11. tarr11 New Member

    Identities (used this way) are just a substitute for cursors. You are assigning each row a unique arbitrary value. A set based solution would not rely on "rowid" type solutions.<br /><br />But nobody else is piping up with a better solution! <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br /><br />
  12. bambola New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">Identities (used this way) are just a substitute for cursors. You are assigning each row a unique arbitrary value. A set based solution would not rely on "rowid" type solutions.<hr height="1" noshade id="quote"></blockquote id="quote"></font id="quote"><br />Not a bad substitute if it prevents you from using cursors! [<img src='/community/emoticons/emotion-2.gif' alt=':D' />]<br /><br />Seriously, identity and cursors are two totally different things. In this example they both serve to distinguish one row from another, but at a different cost. Which one of them would work better depends on the selectivity of SESSION_ID. I do in general try to avoid cursors when possible, but there are rare cases where they actually work better or the only way to perform a certain task.<br /><br />As to finding a set-based solution. In your example you are refering to the column DATE_CREATED in 2 different way: start_date and end_date. The only way I know to do it without looping over the table (cursor or whatever) is with self join, like you suggested in your first post.<br /><br />If you want to do self join, you need to be able to identify a each row, meaning, have column/s (identity, uniqueidentifier, a combination or column) that will have a unique value. Otherwise how can you distinguish one row from another?<br /><br />With the table structure presented above, there is no possible unique value. Even a combnation of all 4 fields (assuming none of them allows NULL) will not be unique. So without adding identity (for example) I fail to see a way to self join in order to achieve a set-based solution.<br /><br />Prove me wrong - I would always like to learn something new.<br /><br />Bambola.<br /><br />

Share This Page