Hi I'm trying to load my IIS logs into a SQL table, and determine user "sessions" from this. Ignoring all the problems with the stateless web, etc, here's what the table looks like: create table LogFiles ( SESSION_ID char(20), DATE_CREATED datetime not null, PAGE_NAME char(10) not null, QUERY_STRING char(100) ) create table Sessions ( SESSION_ID char(20) not null, DATE_STARTED datetime not null, DATE_ENDED datetime not null ) my puny brain can't figure out the query here. I suspect it involves correlated subqueries and self joins, but I can't make it work. I've got to make a list of distinct sessions such that there is no gap of greater than x minutes (lets say 30) between two requests from the same session (otherwise it should be counted as a new session. Ie, if i have SessionID=1 Date = 1/1/2003 10:00:00 AM Page = Foo.asp SessionID=1 Date = 1/1/2003 10:25:00 AM Page = Bar.asp SessionID=1 Date = 1/1/2003 11:00 AM Page = Bar.asp this should produce 2 sessions: Session ID = 1 Start 10:00 AM End 10:25 AM Session ID = 1 Start 11:00 AM End 11:00 AM Help much appreciated. Thx, Doug
I am looking forward to solution as well as I have been planning to write this query form a long time. Gaurav
I've come up with something. It seems to be working though I'm sure it can be written better. For it to work, you have to add an IDENTITY column to the LogFiles table. SET NOCOUNT ON CREATE TABLE [LogFiles] ( [i_ID] [int] IDENTITY (1, 1) NOT NULL , [SESSION_ID] [char] (20) NULL , [DATE_CREATED] [datetime] NOT NULL , [PAGE_NAME] [char] (10) NOT NULL , [QUERY_STRING] [char] (100) NULL ) ON [PRIMARY] GO INSERT INTO LogFiles SELECT1, '2003-01-01 10:00:00', 'Foo.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 10:25:00', 'Bar.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 11:00:00', 'Bar.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 11:05:00', 'Bar.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 12:00:00', 'Bar.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 12:10:00', 'Bar.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 13:00:00', 'Bar.asp','' INSERT INTO LogFiles SELECT1, '2003-01-01 10:15:00', 'Bar.asp','' INSERT INTO LogFiles SELECT 2, '2003-01-01 10:00:00', 'Foo.asp','' INSERT INTO LogFiles SELECT 2, '2003-01-01 10:25:00', 'Bar.asp','' INSERT INTO LogFiles SELECT 2, '2003-01-01 10:15:00', 'Bar.asp','' INSERT INTO LogFiles SELECT 2, '2003-01-01 12:00:00', 'Bar.asp','' DECLARE @t table ( SESSION_ID int , I_IDa int , DATE_CREATEDa datetime , DATE_CREATED datetime , I_ID int -- I added this 2 fields just for control. , ddiff int ) INSERT INTO @t SELECT a.SESSION_ID, a.I_ID, a.DATE_CREATED, b.DATE_CREATED , b.I_ID,DATEDIFF(minute, a.DATE_CREATED, b.DATE_CREATED) df FROM logfiles a with (nolock) INNER JOIN logfiles b ON a.I_ID < b.I_ID WHERE a.SESSION_ID = b.SESSION_ID AND DATEDIFF(minute, a.DATE_CREATED, b.DATE_CREATED) BETWEEN 0 AND 30 -- or any other interval SELECT SESSION_ID, DATE_CREATEDa , DATE_CREATED FROM @t b where EXISTS ( SELECT NULL FROM @t a --WITH (NOLOCK) WHERE a.SESSION_ID = b.SESSION_ID AND a.DATE_CREATEDa = b.DATE_CREATEDa GROUP BY i_ida, DATE_CREATEDa HAVING b.ddiff = MAX(a.ddiff) ) UNION ALL SELECT SESSION_ID, date_created, date_created FROM logfiles a WITH (NOLOCK) WHERE NOT EXISTS ( SELECT NULL FROM @t b WHERE a.SESSION_ID = b.SESSION_ID AND a.i_id = b.i_id OR a.i_id = b.i_ida ) Bambola.
Here's a cursor version. (I'm ashamed, so ashamed!) set nocount on delete from Sessions declare curTraffic cursor local fast_forward for select LF1.SESSION_ID,LF1.DATE_CREATED from LogFiles LF1 order by LF1.SESSION_ID,LF1.DATE_CREATED declare @Session char(20) declare @LastSession char(20) declare @DateCreated datetime declare @LastDate datetime declare @SessionStart datetime declare @SessionEnd datetime set @LastDate ='1/1/1900' set @LastSession = '' open curTraffic fetch next from curTraffic into @Session,@DateCreated set @SessionStart = @DateCreated set @LastDate = @DateCreated set @LastSession = @Session WHILE (@@FETCH_STATUS=0) BEGIN if (@Session <> @LastSession) or ABS(DATEDIFF(mi,@LastDate,@DateCreated)) > 30 BEGIN insert into Sessions(SESSION_ID,DATE_STARTED,DATE_ENDED) values(@LastSession,@SessionStart,@LastDate) set @SessionStart=@DateCreated END set @LastSession = @Session set @LastDate = @DateCreated fetch next from curTraffic into @Session,@DateCreated END insert into Sessions(SESSION_ID,DATE_STARTED,DATE_ENDED) values(@LastSession,@SessionStart,@LastDate)
You should be! [<img src='/community/emoticons/emotion-4.gif' alt='' />] [<img src='/community/emoticons/emotion-5.gif' alt='' />]<br />But didn't my code (the first and ugly but seems to be working) work for you?<br /><br />Bambola.
I think you are missing some rows here. That is of course if I understood you right.1 This is table logfiles: (you have the insert script above) 1 2003-01-01 10:00:00.000Foo.asp 2 2003-01-01 10:00:00.000Foo.asp 2 2003-01-01 10:15:00.000Bar.asp 1 2003-01-01 10:15:00.000Bar.asp 1 2003-01-01 10:25:00.000Bar.asp 2 2003-01-01 10:25:00.000Bar.asp 1 2003-01-01 11:00:00.000Bar.asp 1 2003-01-01 11:05:00.000Bar.asp 1 2003-01-01 12:00:00.000Bar.asp 2 2003-01-01 12:00:00.000Bar.asp 1 2003-01-01 12:10:00.000Bar.asp 1 2003-01-01 13:00:00.000Bar.asp This is result set return by your function (5 rows) 1 2003-01-01 10:00:00.0002003-01-01 10:25:00.000 1 2003-01-01 11:00:00.0002003-01-01 11:05:00.000 1 2003-01-01 12:00:00.0002003-01-01 12:10:00.000 1 2003-01-01 13:00:00.0002003-01-01 13:00:00.000 2 2003-01-01 10:00:00.0002003-01-01 10:25:00.000 This is result set return by my function Code: 1 2003-01-01 10:00:00.000 2003-01-01 10:25:00.000 1 2003-01-01 11:00:00.000 2003-01-01 11:05:00.000 1 2003-01-01 12:00:00.000 2003-01-01 12:10:00.000 2 2003-01-01 10:00:00.000 2003-01-01 10:25:00.000 1 2003-01-01 13:00:00.000 2003-01-01 13:00:00.000 [B]2 2003-01-01 12:00:00.000 2003-01-01 12:00:00.000[/B] Try it with the 3 rows you specified in your first post. Your procedure returns only one session and not 2 like you requested. The "orphan" session (Session ID = 1 Start 11:00 AM End 11:00 AM) is missing. Bambola.
Ok. This seems simple enough. no table datatype, but still an added IDENTITY to Logfiles. SELECT a.SESSION_ID, a.DATE_CREATED, max(b.DATE_CREATED) FROM logfiles a WITH (NOLOCK) INNER JOIN logfiles b ON a.I_ID < b.I_ID WHERE a.SESSION_ID = b.SESSION_ID AND DATEDIFF(minute, a.DATE_CREATED, b.DATE_CREATED) BETWEEN 0 AND 30 GROUP BY a.SESSION_ID, a.I_ID, a.DATE_CREATED UNION ALL SELECT a.SESSION_ID, a.DATE_CREATED, a.DATE_CREATED FROM logfiles a LEFT OUTER JOIN logfiles b ON b.SESSION_ID = a.SESSION_ID AND a.DATE_CREATED < b.DATE_CREATED where b.DATE_CREATED IS NULL Bambola.
Haven't tried your solution yet. Identity values! That's just as bad as cursors! <img src='/community/emoticons/emotion-1.gif' alt='' /><br /><br />Anyway, here's another requirement. Sessions are typically generated from a cookie. This means that the session_id won't exist on the first request in the logfile. Need to have a second pass that finds all log entries without session ids, and then tries to match them by referer/user-agent/ within a very short time frame.<br /><br />Yuck.
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by tarr11</i><br /><br />Identity values! That's just as bad as cursors! <img src='/community/emoticons/emotion-1.gif' alt='' /><br /><hr height="1" noshade id="quote"></blockquote id="quote"></font id="quote"><br />It is really?<br /><br />Bambola.
Identities (used this way) are just a substitute for cursors. You are assigning each row a unique arbitrary value. A set based solution would not rely on "rowid" type solutions.<br /><br />But nobody else is piping up with a better solution! <img src='/community/emoticons/emotion-1.gif' alt='' /><br /><br />
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">Identities (used this way) are just a substitute for cursors. You are assigning each row a unique arbitrary value. A set based solution would not rely on "rowid" type solutions.<hr height="1" noshade id="quote"></blockquote id="quote"></font id="quote"><br />Not a bad substitute if it prevents you from using cursors! [<img src='/community/emoticons/emotion-2.gif' alt='' />]<br /><br />Seriously, identity and cursors are two totally different things. In this example they both serve to distinguish one row from another, but at a different cost. Which one of them would work better depends on the selectivity of SESSION_ID. I do in general try to avoid cursors when possible, but there are rare cases where they actually work better or the only way to perform a certain task.<br /><br />As to finding a set-based solution. In your example you are refering to the column DATE_CREATED in 2 different way: start_date and end_date. The only way I know to do it without looping over the table (cursor or whatever) is with self join, like you suggested in your first post.<br /><br />If you want to do self join, you need to be able to identify a each row, meaning, have column/s (identity, uniqueidentifier, a combination or column) that will have a unique value. Otherwise how can you distinguish one row from another?<br /><br />With the table structure presented above, there is no possible unique value. Even a combnation of all 4 fields (assuming none of them allows NULL) will not be unique. So without adding identity (for example) I fail to see a way to self join in order to achieve a set-based solution.<br /><br />Prove me wrong - I would always like to learn something new.<br /><br />Bambola.<br /><br />