Deriving workload parameters from access logs
This page describes a group of Polygraph tools that simplify matching test
configuration to real proxy access logs. These tools are available starting
with Polygraph version 3.0.
Table of Contents
1. For the impatient
2. Introduction
3. Extracting statistics
4. Extracting content
5. access-filter
6. access-order
7. access2pgl
8. access2cdb
1. For the impatient
% access-filter --profile server squid-access.log > filtered-access.log
% access-order filtered-access.log | sort ... > ordered-access.log
% access2pgl ordered-access.log > workload.pgd
% access-filter --profile content squid-access.log > content-access.log
% access2cdb --cdbs mycdbs/ content-access.log
2. Introduction
Origin server and proxy access logs contain information that can be
used to build Web Polygraph workloads. Access logs can be fed to Polygraph
robots "as-is" for trace replay or preprocessed to extract various
statistics and content to configure Polygraph robots and servers. This
page focuses on the latter approach.
3. Extracting statistics
Typical proxy access logs contain information about request timing,
"busy" user periods, response sizes, response time, response status code,
etc. With some effort, that information can be extracted in a way suitable
for writing Polygraph workloads. Note, however, that an access log alone
is usually insufficient to build a complete workload because access logs
lack information about HTTP connections, cachability, inter-object
relationships, etc. It is sometimes possible and desirable to instrument a
proxy to log more information, and the tools discussed here can be adapted
to extract additional, custom statistics.
Proxies sometimes log details about transactions that Polygraph cannot
accurately reproduce. For example, Polygraph does not yet support FTP
transactions and many HTTP response codes. Also, access log entries are
usually ordered by entry timestamp rather than request timestamp, which
makes them awkward to use for accurate reproduction of request
interarrival times. Finally, it is sometimes desirable to base a Polygraph
workload on a subset of log entries (e.g., log entries originating from
"local" end-users). All these factors lead to a multi-step process for
extracting workload parameters from a raw access log:
1. Make sure the access log is in Squid access log format. Most
proxies can be configured to use that popular format. Alternatively,
existing logs can be converted to Squid access log format; Polygraph
tools do not use many Squid-specific log fields, simplifying the
conversion process. This step is not described here, but a conversion
sketch follows this list. Polygraph tools assume Squid access log
format.
2. Optionally, remove unwanted log entries using Polygraph's
access-filter tool or your own custom filter.
3. Re-order the access log using a combination of Polygraph's
access-order tool and your favorite sort program. This step is
required to get accurate request interarrival information. If you do
not plan on using custom request-interarrival distributions or session
parameters, then you do not have to reorder.
4. Extract statistics from the ordered access log using Polygraph's
access2pgl tool.
5. Use the extracted statistics to write Polygraph workloads. This
step is not described here.
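For illustration, here is a minimal Perl sketch of the conversion
mentioned in the first step, assuming input in Apache "common" log
format. The TCP_MISS result code and the placeholders for fields that
the common format lacks (elapsed time, hierarchy tag, content type)
are assumptions of this sketch, not Polygraph requirements:

#!/usr/bin/perl -w
# common2squid sketch: convert Apache "common" log format entries to
# Squid native access log format; an illustration, not a Polygraph tool
use strict;
use Time::Local qw(timegm);

my %Month;
@Month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0..11;

while (<>) {
    # host ident user [dd/Mon/yyyy:hh:mm:ss +zzzz] "method URI proto" status bytes
    my ($host, $user, $day, $mon, $year, $h, $min, $sec, $zone,
        $method, $uri, $status, $bytes) = m{
            ^(\S+)\s+\S+\s+(\S+)\s+
            \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s+([-+]\d{4})\]\s+
            "(\S+)\s+(\S+)[^"]*"\s+(\d+)\s+(\S+)
        }x or next;

    # convert the logged local time and zone offset to a Unix timestamp
    my $time = timegm($sec, $min, $h, $day, $Month{$mon}, $year);
    $time -= ($1 . ($2 * 3600 + $3 * 60))
        if $zone =~ /^([-+])(\d\d)(\d\d)$/;

    $bytes = 0 if $bytes eq '-';
    # time.msec elapsed client result/status bytes method URL ident peer type
    printf("%d.000 0 %s TCP_MISS/%d %d %s %s %s DIRECT/- -\n",
        $time, $host, $status, $bytes, $method, $uri, $user);
}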
4. Extracting content
Access logs do not contain actual response content, of course. However,
they do contain URLs that can often be used to download content.
Downloaded content can then be fed into Polygraph content databases, to configure simulated
servers to use real content for their responses.
If you are building a content database from access logs, two problems
must be solved. First, one has to make sure that logged URLs are suitable
for re-requesting. While no automated solution is 100% accurate, the
"content" profile of the access-filter tool can be
useful for eliminating many URLs that probably should not be fetched.
Even if you use this filter, please understand that filtered access logs
may still contain entries that should not be requested in some
environments.
The second problem is splitting access log entries into groups based on
logged or guessed content types. In most cases, you want to have separate
content databases for "images", "html", "downloads", etc. because those
common classes of objects have distinct properties and relationships. The
access2cdb tool downloads and stores content in
several content databases, based on content types.
5. access-filter
The access-filter tool reads access log entries and writes
"good" access log entries to the standard output. The goodness criteria
depend on the filter "profile", specified via the required
--profile command line option. Many criteria are arbitrary and
can be inappropriate for your purposes. With some Perl knowledge, it should be easy to
modify the script to use different criteria.
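For example, a stand-alone custom filter might look like the
following sketch. It assumes Squid native log format input; the
acceptance rules below are illustrative and are not the rules used by
the access-filter profiles:

#!/usr/bin/perl -w
# toy custom filter: keep GET transactions with 2xx status codes
use strict;

while (<>) {
    my @f = split;
    next unless @f >= 7;
    my ($result, $method) = ($f[3], $f[5]);  # code/status and method fields
    my ($status) = $result =~ m{/(\d+)$};    # e.g., TCP_MISS/200 -> 200
    next unless defined $status && $status =~ /^2/;
    next unless $method eq 'GET';
    print;   # good entries pass through unmodified
}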
The right profile depends on what statistics or content you plan to
extract from the filtered access log. Sometimes, different filtering rules
should be used to collect different kinds of statistics. For example,
response status code is much less important when measuring request
interarrival times than when measuring response sizes.
At the time of writing, the following profiles and corresponding
goodness criteria are defined:
- country profile (for extracting request interarrival distributions):
  - US-based client IP addresses
- server profile (for extracting most server-side parameters):
  - HTTP protocol
  - 2xx and 3xx status codes
  - GET, POST, and HEAD request methods
- content profile (for building content databases):
  - HTTP protocol
  - 200 status code
  - GET request method
  - no query terms in request-URI
You may want to consult the documentation at the top of the
access-filter script source code for the currently enabled goodness
tests.
Besides good log entries, the access-filter tool prints
statistics related to its filtering choices. The format details are not
documented, but the output is a collection of histograms (number,
percentage) of status codes (SC), URI scheme or protocol (PRT), URI query
terms (URI), request methods (MT), country codes (CC), log entries,
client IP addresses (IP), and reasons for disabling a client IP address
(Bads). These statistics and progress lines are printed to the standard
error stream.
% access-filter --profile server squid-access.log 1> filtered-access.log 2> filter.stats
The filter does not modify good log entries.
6. access-order
Squid logs an entry when the corresponding transaction has been
completed. This means that entries are stored in response completion order
rather than request acceptance order. Fortunately, there is enough
information in the log to reorder entries based on request acceptance
time.
The access-order tool reads access log entries, modifies each
entry so that its first field becomes the request acceptance time, and
writes the modified entry to the standard output. A sort routine can
then be used to sort the output:
% access-order filtered-access.log | sort -t' ' -n +0 > ordered-access.log
The exact sort command options may differ depending on your
environment, but they should tell the command that input fields are
separated by spaces and that entries need to be sorted numerically on
the first field.
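For example, a sort implementation that does not accept the
historical +0 key syntax can be given the equivalent standard -k
option:

% access-order filtered-access.log | sort -t' ' -k 1,1 -n > ordered-access.log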
Sorting should not be needed if you do not plan to use any statistics
related to request arrival time.
7. access2pgl
The access2pgl tool reads an access log file and prints
statistics that can be used for configuring Polygraph. At the time of
writing, the following distributions are computed along with related
parameters:
- Request interarrival times during busy periods
- Duration of a busy session period
- Number of requests per busy session period
- Duration of an idle session period
- Response times
- Response sizes
- Response status codes
- Request methods
- Request header sizes (requires customized Squid log)
- Request body sizes (requires customized Squid log)
User sessions are computed on a per-IP basis. The access2pgl script
measures the delay between sequential requests from the same IP address.
If the delay is longer than one minute (configurable via the
SessionIdleTout constant at the top of the script), then the
delay becomes the idle period and the session for the corresponding IP
ends. The script prints statistics about busy and idle periods as well as
the number of user requests per session, aggregated over all IP
addresses.
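The session-splitting rule can be sketched as follows. This is a
simplified illustration of the logic just described, not the actual
access2pgl code; only the SessionIdleTout constant comes from the
script, and the output format is made up:

#!/usr/bin/perl -w
# illustrative sketch of per-IP session splitting, not access2pgl code
use strict;

my $SessionIdleTout = 60;    # seconds, mirroring the script default
my (%lastSeen, %reqCount);   # per-IP state

while (<>) {
    # ordered Squid-format entry: timestamp is the 1st field, client IP the 3rd
    my @f = split;
    next unless @f >= 3;
    my ($time, $ip) = @f[0, 2];
    if (defined $lastSeen{$ip} &&
        $time - $lastSeen{$ip} > $SessionIdleTout) {
        # the gap counts as an idle period; the busy session ends here
        my $idle = $time - $lastSeen{$ip};
        print "ip=$ip requests=$reqCount{$ip} idle=${idle}sec\n";
        $reqCount{$ip} = 0;
    }
    ++$reqCount{$ip};
    $lastSeen{$ip} = $time;
}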
Depending on the distribution, statistics are dumped either using PGL
selector syntax (for cut-and-pasting into PGL workloads) or using the
tabular distribution syntax (for storing externally and referencing
from PGL files).
% access2pgl ordered-access.log > workload.pgd
An example of computed statistics is available elsewhere.
8. access2cdb
The access2cdb tool reads an access log file and downloads
referenced objects, stuffing them into Polygraph Content Database (.cdb)
files, based on reported or guessed content type. The user specifies the
directory where the files should be created or updated. That directory
must exist.
For each access log entry, access2cdb analyses the content
type based on a hard-coded table of content types and extensions. Filename
extensions are only used if no content type information was logged. You
can modify the @ContentGroups variable inside the script to
change mapping between content types and content database as well as to
add new content types.
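The mapping might be structured along the following lines. This is a
hypothetical sketch; consult the actual @ContentGroups definition in
the script for the real layout and entries:

# hypothetical shape of the content-type-to-database mapping
my @ContentGroups = (
    # cdb file        logged content types            filename extensions
    [ 'image.cdb',    [qw(image/gif image/jpeg)],     [qw(gif jpg jpeg)] ],
    [ 'html.cdb',     [qw(text/html)],                [qw(html htm)]     ],
    [ 'download.cdb', [qw(application/octet-stream)], [qw(exe zip gz)]   ],
);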
Once the content group is selected (based on content type), the
access2cdb tool downloads the corresponding URL using the
wget tool and adds the content to the corresponding content database
in the user-specified directory (the --cdbs option) using the
cdb tool distributed with Polygraph. This process may take a long
time for long logs since fetching real content over the Internet is often
slow and the script fetches one URL at a time.
% access2cdb --cdbs mycontent/ content-access.log
Both the wget and cdb tools must be in your executable
path. You can adjust the script source if you want to use a different
download tool.
Upon completion, access2cdb prints simple statistics
reflecting the popularity of content groups and individual content
types/extensions.