SharePoint statistics: source processing
In the post http://weblogs.asp.net/soever/archive/2005/05/21/408207.aspx I did some investigations into the information logged by SharePoint in the IIS and STS log files. In this post I describe some decisions I’m going to make on processing these log files, based on the information that became available during those investigations. I’m writing these blog posts while doing the investigations, so if you have any comments on the decisions I make, please let me know!
The goal of this weblog post is to massage the available log data into a format that can easily be processed for importing into the “Stage Area – IN”, a SQL Server (2005) database where we import all source data that will eventually end up in our data warehouse.
STS logs
First of all we need a tool to convert the STS binary log files to a format that can easily be processed. The article Usage Event Logging in Windows SharePoint Services contains the code for a C++ application to do this conversion. I also got my hands on a C# implementation through Steven Kassim, a colleague of mine. He got this code from a newsgroup, but I couldn’t find exactly where it came from, or who wrote it. I’m making some modifications to the code to change the output format (so LogParser can handle it) and to improve its speed. I will publish the code as soon as I’m ready. [Update: tracked down the newsgroup: http://groups.yahoo.com/group/sharepointdiscussions/, and the author: Fred LaForest].
IIS logs
Although the IIS log files are already in a format that can be parsed easily, there are some good reasons to do a preprocessing parse to accomplish the following:
- Handle the problem of the IIS log header reappearing in the log file after each IISRESET
- Filter out log entries we are not interested in:
  - Requests made by service accounts, like the full-text indexer account
  - Requests to assets in team sites resulting in /_vti_bin/ requests
  - Requests for assets we are not interested in, like JavaScript files, CSS stylesheets, images, etc.
- Filter out fields we are not interested in, like (in our case) the client IP address; we base a user’s location on their main location in the company directory (this can also be done in IIS by selecting only the properties we are interested in for our log file!)
IIS supports multiple log formats, and multiple ways to log information. It is possible to do direct ODBC logging to a database, but this approach puts a heavier load on the web servers. The best format IIS can log in is the W3C Extended Log File Format, because in this format it is possible to select exactly the fields we are interested in.
Carefully selecting the properties we are interested in can greatly reduce the amount of data that will be logged.
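As an illustration (a sketch, assuming IIS 6.0 and the field selection used by the import script further on), a W3C Extended log configured this way starts with a header like the following; the #Fields line tells LogParser exactly what to expect:
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2005-01-01 00:00:00
#Fields: date time cs-username s-computername cs-method cs-uri-stem cs-uri-query sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken cs-host cs(User-Agent) cs(Referer)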
For more information on the W3C Extended Log File Format see:
- http://www.w3.org/TR/WD-logfile.html
- http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/676400bc-8969-4aa7-851a-9319490a9bbb.mspx
Processing the log files: the tool
There are many good systems around to process log files. Two log file processors I would really like to mention are:
- Analog, http://www.analog.cx
- LogParser, http://www.logparser.com
I have selected LogParser, because of the following features:
- It can read almost any log file format (handy for the STS log files)
- It might even be possible to feed the binary STS log files directly into LogParser through a custom input format component (still investigating this)
- It supports incremental input parsing through checkpoints, which simplifies incremental importing of log file data into our database (see the sketch after this list)
- It has a powerful query syntax
- It supports a wide range of output formats, including direct output to SQL Server
- There is extensive programmability support available
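A minimal sketch of the checkpoint feature (the file locations are just examples): the IISW3C input format can remember how far it got in each log file through the -iCheckPoint parameter, so a scheduled run only processes new entries.
"C:\Program Files\Log Parser 2.2\logparser.exe" "SELECT COUNT(*) FROM c:\projects\IISlog\*.log" -i:IISW3C -iCheckPoint:c:\projects\IISlog\iislog.lpc
The first run processes all entries; subsequent runs with the same checkpoint file only see the entries that were added since the previous run, which is exactly what we need for periodic imports into the staging database.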
For more information on LogParser see:
- http://www.logparser.com, the forum in particular is very interesting! Gabriele Giuseppini does a great job in answering the questions!
- How Log Parser 2.2 Works, by Gabriele Giuseppini, the author of LogParser: http://www.microsoft.com/technet/community/columns/profwin/pw0505.mspx
For information on using LogParser with SharePoint, where reporting is done directly on the log files, see:
- http://www.wssdemo.com/Pages/stats.aspx?menu=Articles
- http://blog.spsclerics.com/archive/2005/03/07/570.aspx
- http://msmvps.com/obts/archive/2005/04/01/40501.aspx
Back to the IIS logs: what do we need them for?
As stated in the previous post, the STS log records all successful requests to pages and documents within WSS sites. This includes the WSS-site-based SPS features like MySite and Areas. All those requests are logged in the IIS log as well, and they are difficult to correlate due to time differences. It is also questionable whether correlating those log entries is interesting at all; the STS log contains all the information we need… although… I have one issue: the bandwidth consumed by a request. I can’t get the correct value out of the STS log (although it should be in there), while the IIS log does contain the correct values (the sc-bytes and cs-bytes fields). This would be the only reason to do the correlation. I’m still working on this issue (I will post on it later), so let’s assume that problem will be solved.
So what do we need the IIS logs for:
- Pages not found (404 errors); see the example query after this list
- Pages in the /_layouts folder; this is also the location where we store our custom web applications and our custom services
- Unmanaged paths in the SharePoint virtual directory (paths excluded for the SharePoint render-engine “treatment”)
- IIS logs of other web sites, not related to SharePoint, but part of our intranet
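For the “pages not found” case, an ad-hoc LogParser query like the following sketch works directly against the log files (the log file location is just an example; once the data is in the database, the same report can be produced there):
"C:\Program Files\Log Parser 2.2\logparser.exe" "SELECT TOP 20 cs-uri-stem, COUNT(*) AS hits FROM c:\projects\IISlog\*.log WHERE sc-status = 404 GROUP BY cs-uri-stem ORDER BY hits DESC" -i:IISW3C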
Any requests for images, JavaScript files and stylesheet files in the IIS log can be skipped in our case, because those are static files supporting the SharePoint UI and our custom applications. We also filter out requests made by service accounts; we are not interested in those requests.
In the STS log, requests for images are interesting, because those images are user-uploaded documents within the WSS sites. We filter out requests made by service accounts for the STS logs as well.
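A sketch of how such a filter could look as a LogParser WHERE clause (the extension list and the service account name are illustrative; EXTRACT_EXTENSION is a built-in LogParser function):
WHERE (TO_LOWERCASE(EXTRACT_EXTENSION(cs-uri-stem)) NOT IN ('gif'; 'jpg'; 'png'; 'js'; 'css')) AND (TO_LOWERCASE(cs-username) NOT IN ('domain\serviceaccount'))
This fragment can be merged into the WHERE clause of the import script shown further on.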
Moving IIS log files into the database
To move the IIS log files into the database we need a table definition for the IIS logs. I’m currently using the following table definition:
CREATE TABLE [dbo].[IISlog] (
[date] [datetime] NULL,
[time] [datetime] NULL,
[csUsername] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[sComputername] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[csMethod] [varchar](16) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[csUriStem] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[csUriQuery] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[scStatus] [smallint] NULL,
[scSubstatus] [smallint] NULL,
[scWin32Status] [int] NULL,
[scBytes] [int] NULL,
[csBytes] [int] NULL,
[timeTaken] [int] NULL,
[csHost] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[csUserAgent] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[csReferer] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[application] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
And the following LogParser script to move the data from the log files to the database:
"C:\Program Files\Log Parser 2.2\logparser.exe" "SELECT date, time, cs-username, s-computername, cs-method, cs-uri-stem, cs-uri-query, sc-status, sc-substatus, sc-win32-status, sc-bytes, cs-bytes, time-taken, cs-host, cs(User-Agent) as cs-User-Agent, cs(Referer) as cs-Referer, 'SharePointPortal' as application INTO IISlog FROM c:\projects\IISlog\*.log WHERE (cs-username IS NOT NULL) AND (TO_LOWERCASE(cs-username) NOT IN ('domain\serviceaccount'))" -i:IISW3C -o:SQL -server:localhost -database:SharePoint_SA_IN -clearTable:ON
This is the first step, in which I filter out all requests made by the system account used to index the SharePoint content. I have not yet filtered out the requests for WSS sites (we will use the STS log for those) or the unwanted files in the /_layouts/ directory; I’m moving one step at a time. So we now have all log files (collected in the directory c:\projects\IISlog) moved into the database.
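As a quick sanity check on the import (a minimal sketch against the IISlog staging table defined above), a query like the following shows the number of requests and the bytes sent per day and per status code; filtering on scStatus = 404 gives the “pages not found” report directly:
SELECT [date], scStatus, COUNT(*) AS requests, SUM(scBytes) AS bytesSent
FROM dbo.IISlog
GROUP BY [date], scStatus
ORDER BY [date], scStatus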
Moving STS log files into the database
To move the STS log files into the database we need a table definition for the STS logs. I’m currently using the following table definition:
CREATE TABLE [dbo].[STSlog](
[application] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[date] [datetime] NULL,
[time] [datetime] NULL,
[username] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[computername] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[method] [varchar](16) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[siteURL] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[webURL] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[docName] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[bytes] [int] NULL,
[queryString] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[userAgent] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[referer] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[bitFlags] [smallint] NULL,
[status] [smallint] NULL,
[siteGuid] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
And the following script to move the data from the binary log files to the database:
"C:\projects\STSLogParser\STSLogParser.exe" 2005-01-01 "c:\projects\STSlog\2005-01-01\00.log" c:\projects\logparsertmp\stslog.csv
"C:\Program Files\Log Parser 2.2\logparser.exe" "SELECT 'SharePointPortal' as application, TO_DATE(TO_UTCTIME(TO_TIMESTAMP(TO_TIMESTAMP(date, 'yyyy-MM-dd'), TO_TIMESTAMP(time, 'hh:mm:ss')))) AS date, TO_TIME( TO_UTCTIME( TO_TIMESTAMP(TO_TIMESTAMP(date, 'yyyy-MM-dd'), TO_TIMESTAMP(time, 'hh:mm:ss')))), UserName as username, 'SERVERNAME' as computername, 'GET' as method, SiteURL as siteURL, WebURL as webURL, DocName as docName, cBytes as bytes, QueryString as queryString, UserAgent as userAgent, RefURL as referer, TO_INT(bitFlags) as bitFlags, TO_INT(HttpStatus) as status, TO_STRING(SiteGuid) as siteGuid INTO STSlog FROM c:\projects\logparsertmp\stslog.csv WHERE (username IS NOT NULL) AND (TO_LOWERCASE(username) NOT IN (domain\serviceaccount))" -i:CSV -headerRow:ON -o:SQL -server:localhost -database:SharePoint_SA_IN -clearTable:ON
This script currently moves only one day, but you get the drift. As you can see, we also set the date, computername and application for each log entry. Currently these are fixed values; we will move to a dynamic setup later on. The date field is obvious: we want to record the date in the database for each log entry. We need the computername and application fields because we will have multiple servers, and multiple “applications” built on SharePoint, like for example ‘SharePointPortal’, ‘TeamSites’ (intranet) and ‘ExternalTeamSites’ (extranet).
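A sketch of what such a dynamic setup could look like (the folder layout and the query file name are assumptions on my side): a batch file loops over the daily STS log folders, passes the folder name as the date, and imports each converted file. Note that -clearTable must then be OFF, so every run appends to the staging table.
@echo off
rem Assumption: STS logs are collected per day in folders named yyyy-mm-dd, each containing a 00.log file
for /d %%D in (c:\projects\STSlog\*) do (
  "C:\projects\STSLogParser\STSLogParser.exe" %%~nxD "%%D\00.log" c:\projects\logparsertmp\stslog.csv
  "C:\Program Files\Log Parser 2.2\logparser.exe" file:importstslog.sql -i:CSV -headerRow:ON -o:SQL -server:localhost -database:SharePoint_SA_IN -clearTable:OFF
)
Here importstslog.sql would contain the SELECT statement shown above; LogParser can also substitute parameters into such a query file, which would allow the date and computer name to be passed in per run as well.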
The STSLogParser is an application that parses the STS log files from their binary format into a comma-separated ASCII log file. I will post the code for this converter in one of my next posts.