Mining Your HTTP Server Logs for Statistical Gold


Congratulations! You finally have that Web site up and running and are now getting hundreds, if not thousands, of visitors each day, but is your Web site meeting your business objectives? Do you know who is visiting your site and what they’re looking at or, perhaps more importantly, not looking at? Log files created as byproducts of running the HTTP Server for the AS/400 can answer these questions and more. Let me show you how.

Who, Where, What, Etc.

Log files generated by the HTTP Server can tell you who is visiting your site, where they are coming from, what they are looking at, and much more. The server logs every access to a Web page, image, or Common Gateway Interface (CGI) program that resides on your site, so you can study the paths that users take through your site. The logs can tell you what pages people enter your site on, where they came from (e.g., search engines or links from other Web sites), and what page they were on before they left your site. Add to that all the technical statistics, and you have access to tools that can really help your site be the best it can be.

The HTTP Server Extended Log File

With V4R3, IBM introduced the National Center for Supercomputing Applications (NCSA)-compliant “extended” log format. This industry-standard log file format contains data you need to analyze your Web site to discover what is working and what is not. This log file format standard is maintained by the World Wide Web Consortium (www.w3.org/TR/WD-logfile).

Figure 1 shows a single entry from a typical HTTP Server log file. The HTTP Server creates a new log file each day; this is consistent with industry-standard practice found on Apache and other major servers. The 10 fields shown in this report are industry-standard fields captured by all Web servers (on any platform) and can be interpreted and analyzed to present all the previously mentioned types of statistics.
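If you want to see how such a record breaks apart, here is a minimal Python sketch that splits the Figure 1 entry into named fields. Python is my choice for illustration only, and the group names are labels I made up rather than anything the HTTP Server defines. The number after the return code is treated as a byte count, as it is in the common log format.

```python
import re

# Pattern for one extended-format access log record. The group names are
# illustrative labels, not names used by the HTTP Server itself.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<domain>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<browser>[^"]*)")?'
)


def parse_entry(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The Figure 1 record, joined back onto one line.
sample = (
    '63.254.26.147 - - [22/Apr/2000:00:04:38 -0100] '
    '"GET /html/as400_other.htm HTTP/1.1" 200 5932 '
    '"http://www.ignite400.org/html/as400resource.htm" '
    '"Mozilla/4.0 (compatible; MSIE 5."'
)
entry = parse_entry(sample)
print(entry["ip"], entry["url"], entry["status"])
```

The referer and browser groups are optional in the pattern, so a record in the shorter common format should still parse, with those two fields set to None.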

When considering logging and log analysis, you need to decide where to put your logs. The HTTP Server supports storing logs in a QSYS.LIB DDS database file, a QDLS shared folder, or the AS/400 Integrated File System (AS/400 IFS) “root” directory system. I strongly suggest using the AS/400 IFS root directory system. I create a root-level directory called something like WEBSITES to be the common area for all my Web servers and their resources, and, under WEBSITES, I create a subdirectory for each Web server instance that I run on my machine. Say I call one of the server instances PUBLIC. I would create a PUBLIC subdirectory under WEBSITES and then create a LOGS sub-subdirectory to store all my HTTP Server logs for this server instance. Therefore, my logs would be stored in /WEBSITES/PUBLIC/LOGS.

I strongly advise acquiring a commercial log analysis tool to analyze your log files, because using the log by itself limits the business intelligence that you can obtain from it. Log analyzer vendors earn their money by understanding the relationships between records in your log file. If you reviewed the output of a commercial log analyzer, you would be amazed to see the amount of information that it can obtain from these 10-field records. Good log analyzers, however, do more than provide basic services, such as resolving domain names of IP addresses listed in your log. Good log analyzers perform a custom NSLOOKUP (an Internet utility that retrieves information from a Domain Name System [DNS] server about an IP address or a domain name) to obtain additional information about users from the InterNIC. (The InterNIC is the organization that operates and maintains the root name servers storing domain name registration information.) Some log analyzers are free; others vary in price. Some of the very best log analyzers cost less than $300.

If you insist on storing your logs in DDS files, follow the steps under logging in the HTTP Server for AS/400 Webmaster’s Guide V4R3 or the HTTP Server for AS/400 Webmaster’s Guide V4R4. Once you have configured the server and begun logging, you can use any standard programming language or query facility to produce your own reports. Be warned, however, that the value of any report produced with a query facility on the AS/400 is extremely limited.

Reporting Tools

In addition to the extended log file, IBM offers two facilities for monitoring activity on your HTTP Server interactively: the Monitor and Basic Web reporting features. The Monitor facility provides snapshots of basic HTTP Server statistics, whereas the Basic log reporting facility provides an interactive view of the access and error logs. There is a third option called Web Mining, but it is not supported if you choose the extended log file format, which is the format that conforms to the aforementioned standard published by the W3C. Together, the extended log file format and a good log analyzer produce much more information than IBM’s Web Mining report anyway. (I do not cover the built-in Web Mining report here.)

Configuring the HTTP Server for Logging

Be sure that your HTTP ADMIN server is running. To check, open Operations Navigator, select Network, Servers, TCP/IP, then look at the list of servers in the right-hand window. If you see HTTP Administration with the “started” status, it is running. If it shows “stopped,” right-click on the name of the server, and then click on Start. Once the ADMIN server is running, you can access it from a browser by typing www.mycompany.com:2001 in the Location field. To modify the configuration of HTTP servers, you need *IOSYSCONFIG special authority in your AS/400 user profile. After your browser connects to your AS/400, select the IBM HTTP Server for the AS/400 from the list of services displayed and then select Configurations and Administration from the next page. Click Configurations and select a configuration from the drop-down list.

Select Basic, but be sure that the check box labeled Look up host name of requesting clients is not checked. Checking this box causes the server to do a DNS lookup on every user accessing your site and write the user’s domain name to the log file. The cost (in time) of doing DNS lookups is not worth the little bit of data you get by enabling this feature; a good log analyzer does the lookups when you run it and obtains a great deal more information.
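To illustrate why deferring the lookups is cheaper, here is a rough Python sketch of what an analyzer might do at report time: resolve each distinct IP address once and reuse the answer for every other record from that address. The function name and the injectable lookup parameter are my own inventions for this example; I am not describing how any particular product works.

```python
import socket


def resolve_unique(ips, lookup=socket.gethostbyaddr):
    """Map each distinct IP address to a host name, looking each one up once."""
    names = {}
    for ip in ips:
        if ip in names:
            continue  # already resolved (or already failed) for this address
        try:
            # gethostbyaddr returns (hostname, alias_list, address_list)
            names[ip] = lookup(ip)[0]
        except OSError:
            names[ip] = None  # no reverse DNS record, or the lookup failed
    return names
```

A visitor who requests 40 pages and images costs one lookup here instead of 40, and the Web server never waits on DNS while serving a request.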


To update your configuration, click the Apply button. Next, you need to click the Logging item on this menu, followed by Global Log File Settings. Under Global Log File Settings, you must choose how the server will log date and time information and what type of log you wish to create. I strongly suggest using local time, since the bulk of your log file reports will go to business users throughout your organization. The statistics are also more meaningful when presented in local time. If you choose local time, be sure that the QUTCOFFSET system value contains a valid offset in hours for your time zone. For example, my machines are in Los Angeles, which is 8 hours behind UTC (-8) during standard time and 7 hours behind UTC (-7) during daylight saving time. Unfortunately, this value must be manually set whenever you change from standard time to daylight saving time or vice versa. The HTTP Server uses QUTCOFFSET and QTIME to calculate appropriate times for log entries.
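The offset travels with every log entry, so a report can always convert local time back to UTC. Here is a small Python sketch of that arithmetic. The timestamp is a made-up Los Angeles daylight saving time value, not one taken from my logs, and the function name is hypothetical.

```python
from datetime import datetime, timezone

# Layout of the bracketed date and time field in an access log record.
# %b expects English month abbreviations, which is the default locale behavior.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_log_time(text):
    """Parse a log timestamp such as '[22/Apr/2000:00:04:38 -0700]'."""
    return datetime.strptime(text.strip("[]"), LOG_TIME_FORMAT)


local = parse_log_time("[22/Apr/2000:00:04:38 -0700]")
print(local.isoformat())                           # local time with its offset
print(local.astimezone(timezone.utc).isoformat())  # the same instant in UTC
```

If the system value is left at the standard time offset after the clocks change, every entry is stamped an hour off until someone corrects it, which is worth remembering when a daily chart looks shifted.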

Click the Access Log File item in the menu frame and type the fully qualified IFS ROOT file path to the location where you wish to store your log files. I suggest setting the log size to zero to turn off size checking and allow the file to grow as large as needed. If you specify a size and the log file reaches that size during a day’s processing, the server stops logging and you lose data. Use the default format unless you are using virtual named servers and need to create a custom log format that contains the name of the Web server.

The log file maintenance section of the form allows the HTTP Server to delete old log files on the schedule that you specify. If you click Keep logs, the server creates a new log file each day and leaves old files in the access log subdirectory until you do something manually to remove them. I use a 45-day time limit to clean up these files. This allows time for running monthly statistics and automatically keeps my system free of large, disk-consuming files.
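The server handles this cleanup for you, but if you choose Keep logs and want to see which files a 45-day rule would remove, a short Python sketch like the following could list them from a PC that has the log directory mapped. It assumes that the file's last-modified time is a fair stand-in for its age, and the function name is hypothetical. It only reports files; it deletes nothing.

```python
import time
from pathlib import Path


def expired_logs(log_dir, keep_days=45, now=None):
    """Return the files in log_dir last modified more than keep_days ago."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # 86400 seconds in a day
    return sorted(
        path for path in Path(log_dir).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Calling expired_logs with the path of your mapped log directory returns a sorted list of Path objects that you can review before removing anything.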

You can also choose to exclude server IP addresses or host names, user agents (browser types), methods, Multipurpose Internet Mail Extension (MIME) types, or return codes, but I do not use this Excluded URLs setting. If you enter a URL in the box below this option, the server does not log any access from the specified URL. I prefer to log everything in the log file and use my log analyzer to subset the data.

Once you make your choices and click the Apply button to update your configuration, you have completed all the basic configuration steps necessary to create an extended log file and start logging. However, you must stop and start your HTTP Server instance before these changes become effective.

Configuring the HTTP Server Monitor Function

Figure 2 illustrates a typical Monitor report. This view displays basic server activity statistics as of the moment you click the Monitor button on the Server Instances/Work with Server Instances display for the selected server instance. You can also display the number of total bytes transmitted and received and display a list of URLs processed since the server was last started.

Click the System Management link on the menu frame and then click Activity Monitoring. Click the Enable activity monitoring support check box and make sure that a check mark appears in the box. Finally, click the Apply button. The next time you start your server, the Monitor display will be available.

To view the Monitor report, click Server Instances at the top of the menu frame. Click Work with server instances, select the server you wish to monitor in the list box, and then click the Monitor button. (Be sure that you have applied all current V4R4 5769-DG1 [HTTP Server] PTFs. Enhancements as well as fixes are delivered via PTFs. It is important that you keep your PTFs current.)

Tools of the Trade

After you configure your server, it should be collecting valuable raw data containing pure gold. The trick now is to extract information from this raw data. There are many commercial Web analyzers available at prices ranging from nothing to several thousands of dollars. Unfortunately, I have not found any that actually run on the AS/400. For now, you probably want to focus on commercial products that run on a Microsoft Windows-based PC.

A commercial log analyzer is the result of analysis and design by companies focused on understanding Web logs and log data. The data in a single log record is interesting but does not contain the value that you get from analyzing relationships between many records, building summaries, and extracting statistics that don’t seem to exist in the basic data.
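To give a feel for what "relationships between many records" means, here is a deliberately tiny Python sketch that rolls individual records up into two summaries: requests per URL and requests per referring site. The sample records other than the Figure 1 entry are invented, and a real analyzer goes far beyond this.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize(records):
    """Count requests per URL and per referring host.

    Each record is a (url, referer) pair; a referer of "-" means none was sent.
    """
    pages = Counter()
    referrers = Counter()
    for url, referer in records:
        pages[url] += 1
        if referer and referer != "-":
            referrers[urlsplit(referer).netloc] += 1
    return pages, referrers


# Illustrative records; only the first comes from Figure 1.
records = [
    ("/html/as400_other.htm", "http://www.ignite400.org/html/as400resource.htm"),
    ("/index.htm", "-"),
    ("/html/as400_other.htm", "http://search.example.com/?q=as400"),
]
pages, referrers = summarize(records)
print(pages.most_common(1))
print(referrers.most_common())
```

Neither "most requested page" nor "top referring site" appears in any single log record; both exist only once you count across records.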

Figure 1 provides a sample log record. The visitor’s IP address is 63.254.26.147, and, if you had DNS name resolution turned on, you would see HRBGA010-1163.splitrock.net in the domain name field of the log. What does this tell you? Well, if you used a WHOIS lookup, you would find out that this is most likely an ISP named Splitrock Services located in The Woodlands, Texas. Could you write code to do this yourself? Of course you could, but the AS/400 does not have a WHOIS utility. You would have to write your own TCP/IP Sockets-based code to do it.
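For the curious, the sockets code involved is not large. The WHOIS protocol (RFC 3912) amounts to opening a TCP connection to port 43, sending the query followed by a carriage return and line feed, and reading until the server closes the connection. Here is a minimal sketch in Python rather than an AS/400 language. The default server, whois.arin.net, is my assumption about where to ask for a North American address; other registries run their own servers, and the reply format differs from one to the next.

```python
import socket


def whois(query, server="whois.arin.net", port=43, timeout=10.0):
    """Send one WHOIS query (RFC 3912) and return the server's text reply."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling whois("63.254.26.147") returns free-form text that you would still have to pick apart yourself, which is a good part of what a commercial analyzer saves you.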

A list of both free and commercial log analyzer products that run on various platforms (except the AS/400) can be found on the Access Log Analyzers Web Site (www.uu.se/software/analyzers/access-analyzers.html). My personal favorite is the WebTrends Log Analyzer from WebTrends Corporation. This product runs on any Microsoft Windows 9x/NT/2000 PC and can access your AS/400 log files via NetServer or Client Access/400 (V3R2M0). If you are running V4R4, you want to configure your server to produce the EXTENDED log file; releases prior to V4R4 produce the COMMON log files, which do not support reports on referrers, browsers, client operating systems, or search engines.

The WebTrends Log Analyzer reads your log files directly from the IFS directories on your AS/400. You will need to use Client Access/400 (V4R3 or earlier) or configure NetServer (V4R3 or later) to give your PC access to files in the IFS. You can FTP the files to your PC instead, but you will create an administrative nightmare and limit the product’s automation and scheduling capabilities. WebTrends contains a robust set of file selection logic and date-range selection logic for selecting the range of data to be included in a report. My favorite setting is last full month. With this setting, the product figures out which log files to use and contains all the calendar logic for extracting data for the previous month.
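I do not know how WebTrends implements its calendar logic, but the arithmetic behind a "last full month" setting can be sketched in a few lines of Python. The function name is hypothetical, and the example date is arbitrary.

```python
from datetime import date, timedelta


def last_full_month(today):
    """Return (first_day, last_day) of the calendar month before today."""
    first_of_this_month = today.replace(day=1)
    # Stepping back one day from the 1st lands on the last day of the
    # previous month, whatever its length.
    last_day = first_of_this_month - timedelta(days=1)
    return last_day.replace(day=1), last_day


first, last = last_full_month(date(2000, 5, 15))
print(first, last)
```

Because the server writes one log file per day, a range like this maps directly onto a set of daily files. How the file names encode the date is something to check on your own system before automating the selection.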

There are many options within the product to include or exclude domains, IP addresses, and much more. You can either select which of the product’s standard reports to produce or customize the reports to fit your specific needs. (I run the full set of default reports each month.) You can also choose from many output options, ranging from hard copy reports to HTML pages published to a designated directory on your Web server (my favorite). You can even email reports to a distribution list of individuals.

A Sample of Available Statistics

WebTrends produces a robust series of charts and statistics in an attractive HTML format that you can distribute to interested parties in your organization by publishing to your intranet or extranet. Figure 3 is an example of some of these statistics. There are, of course, many more types of information and data you can retrieve, such as general statistics, URLs most visited, top demographic information, and activity levels by visitor.

The general statistics bar chart in Figure 3 provides a quick view of the number of visitors (not page hits) per day and breaks it into United States, International, and Unknown users. WebTrends identifies visitors by resolving IP addresses via NSLOOKUP and then counting unique IP addresses for the day. If it cannot resolve a visitor’s country of origin by using one of several techniques, that visitor is counted as unknown. The presentation of this chart is both a good design and a marketing technique that catches your users’ attention with a great visual and provides a top-down, drilldown approach to analyzing your site. You see a tremendous amount of information at a glance, and your brain processes information about the chart, such as which period of the month was busiest. You may be surprised at the volume of international traffic, and perhaps management will want you to investigate further to discover an untapped market.
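The difference between visitors and hits is easy to show. Here is a minimal Python sketch of the "count unique IP addresses for the day" idea. It is my own simplification and leaves out the country breakdown, and all of the sample records except the first are invented.

```python
from collections import defaultdict


def visitors_per_day(records):
    """Count distinct IP addresses per day.

    Each record is an (ip, timestamp) pair, with the timestamp in the log's
    'dd/Mon/yyyy:hh:mm:ss zone' form; the day is the text before the first colon.
    """
    seen = defaultdict(set)
    for ip, timestamp in records:
        day = timestamp.split(":", 1)[0]
        seen[day].add(ip)
    return {day: len(ips) for day, ips in seen.items()}


# Four hits, but only two visitors on the 22nd and one on the 23rd.
records = [
    ("63.254.26.147", "22/Apr/2000:00:04:38 -0100"),
    ("63.254.26.147", "22/Apr/2000:00:05:10 -0100"),
    ("10.0.0.7", "22/Apr/2000:09:15:00 -0100"),
    ("10.0.0.7", "23/Apr/2000:11:00:00 -0100"),
]
counts = visitors_per_day(records)
print(counts)
```

Counting by IP address is only an approximation: several people behind one proxy look like a single visitor, and one person on a dial-up line may appear under several addresses.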

Go Wherever the Gold Is

WebTrends is, by far, the most popular commercial product. There are other log analyzers out there. For now, though, WebTrends is the best that I have found. If you find something as good or better, drop me a line.

REFERENCES AND RELATED MATERIALS

• Access Log Analyzers: www.uu.se/software/analyzers/access-analyzers.html
• HTTP Server for AS/400 Webmaster’s Guide V4R3 (GC41-5434-03, CD-ROM QB3AEO03)
• HTTP Server for AS/400 Webmaster’s Guide V4R4 (GC41-5434-04, CD-ROM QB3AEO04)

• National Center for Supercomputing Applications Web site: www.ncsa.uiuc.edu
• WebTrends Web site: www.webtrends.com

Typical Access Log Entry -- (wrapped for publication -- normally stored on one line)

63.254.26.147 - - [22/Apr/2000:00:04:38 -0100] "GET /html/as400_other.htm HTTP/1.1" 200 5932
"http://www.ignite400.org/html/as400resource.htm" "Mozilla/4.0 (compatible; MSIE 5."

Field                               Value
IP Address of visitor               63.254.26.147
Visitor domain*                     -
User Id**                           -
Access Date and Time (local)        [22/Apr/2000:00:04:38 -0100]
HTTP Server Method                  "GET
URL requested by visitor            /html/as400_other.htm
HTTP protocol version               HTTP/1.1"
Server return code and bytes sent   200 5932
Referer                             "http://www.ignite400.org/html/as400resource.htm"
Browser                             "Mozilla/4.0 (compatible; MSIE 5."

* only shows up if the server is configured to resolve domain names (NOT RECOMMENDED)
** the user id shows up if the user is accessing an authenticated directory

The hyphen "-" is a placeholder indicating that no data for the field is available.

Figure 1: Typical log file entries like this one show a lot of information but can’t go as deep as a log analyzer can.


Figure 2: A typical Monitor report displays basic server activity statistics.



Figure 3: The general statistics report of the WebTrends Log Analyzer is a good design and handy marketing tool.

