One of the unique features of Web Polygraph is its ability to simulate
realistic Web content. In this context, the word ``content'' stands for the
actual bytes that comprise the body of a Web object (as opposed to generic
properties such as message size distribution or object popularity).
Realistic content simulation can be used for benchmarking various products
or services that depend on or manipulate object contents. For example,
content filtering/blocking proxies and advertisement insertion services should
be tested using realistic content.
The Content Simulation Module (CSM) in Polygraph is configured using the PGL
Content type. By default, Polygraph fills object bodies with semi-random bytes.
This manual shows you how to configure Polygraph to simulate realistic HTML
content.
Configuring CSM is a relatively straightforward process. First, you will
need to prepare a database with HTML content that will be used to populate the
model. Then you will use PGL to specify the model parameters.
3.1 Content database
To create a content database file, use the cdb program (compiled
during "make all").
    usage: src/csm/cdb <database.cdb> <command> [file.html ...]
    commands:
        show - dump db contents to stdout
        add  - absorb file(s) contents
As you can see, cdb can display database contents and add files to
the database. If the database does not exist, the ``add'' operation will
create it. You can add one or more files at a time. By default, the content to
be added is read from standard input. Alternatively, you can specify file
names. The cdb program assumes that all input files are in an HTML-like format. It is your
responsibility to strip off any HTTP headers, if needed.
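
If your input files are saved HTTP responses rather than plain HTML, the
headers can be removed before the files are handed to cdb. The following is a
minimal Python sketch and not part of Polygraph: the function name
strip_http_headers is hypothetical, and the code assumes that each file holds
a single, uncompressed, non-chunked response.

```python
def strip_http_headers(raw: bytes) -> bytes:
    """Return the body of a saved HTTP response, or the input unchanged.

    Assumption: the input holds at most one response, and its body is
    neither chunked nor compressed.
    """
    # Only treat the data as an HTTP message if it starts with a status line.
    if not raw.startswith(b"HTTP/"):
        return raw

    # Headers end at the first empty line; accept CRLF and bare LF endings.
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw.find(sep)
        if pos != -1:
            candidates.append((pos, len(sep)))

    # A status line with no empty line after it means there is no body.
    if not candidates:
        return b""

    # Cut at whichever separator occurs first in the data.
    pos, sep_len = min(candidates)
    return raw[pos + sep_len:]
```

The cleaned bytes can then be written to a file and passed to the ``add''
command, or piped to cdb on standard input.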
At the time of writing, boundaries between input files are not significant. The
following two commands produce the same result:
example> cat file1.html file2.html file3.html | cdb test.cdb add
example> cdb test.cdb add file1.html file2.html file3.html
During a test, Polygraph will use HTML constructs from the database (and
only from the database) to generate HTML pages.
A 1.2 MB (gzipped) content database used for the demo is available.
3.2 PGL parameters
To enable content simulation based on a .cdb database, simply
add the content_db option to your Content specification. For example,
Content SimpleContent = {
    mime = { type = "text/html"; extensions = [ ".html" ]; };
    size = exp(11KB);
    cachable = 80%;
    content_db = "pages.cdb"; // import content templates
    ...
};
The complete PGL configuration that drives the demo is available. To support a
human-driven demo like that, tell the Polygraph server to ignore URLs by
passing the --ign_urls yes command-line option to polysrv. This unusual option
should not be used for real tests, where Polygraph robots, not humans, make
requests.
3.3 Injecting generated content with text
Many applications that analyze HTML content depend on the presence of
well-known keywords. A common example is a content filtering proxy that would
deny access to any page that contains the keyword ``sex''. If you can read
this page, you are not using such a proxy.
Polygraph allows you to inject arbitrary text into generated HTML. The
injections appear at random places between HTML tags, so as not to disturb the
HTML code. The following configuration instructs Polygraph to take injections
from the "inj.tdb" file and infect 30% of the files. A file is
considered ``infected'' if it receives at least one injection. The
inject_gap field specifies the distance between two consecutive
injections within one file.
Content PoisonedContent = {
    ...
    inject_db = "inj.tdb";      // import text to inject
    infect_prob = 30%;          // portion of injected files
    inject_gap = exp(100Byte);  // average distance between injections
};
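
To get a feel for what these settings imply, the following Python sketch
models how often a page would be infected and where injections might land. It
is a simplified illustration and not Polygraph's implementation. The function
plan_injections and its parameters are hypothetical, the offsets ignore HTML
tag boundaries, and forcing at least one injection into an infected page is
an assumption made to match the definition of ``infected'' above.

```python
import random


def plan_injections(page_size, infect_prob=0.30, mean_gap=100.0, rng=None):
    """Return sorted byte offsets where text would be injected into one page.

    Hypothetical model: a page is infected with probability infect_prob,
    and gaps between injections are exponential with mean mean_gap bytes.
    """
    if rng is None:
        rng = random.Random()

    # With probability 1 - infect_prob the page stays clean.
    if rng.random() >= infect_prob:
        return []

    offsets = []
    # Walk through the page, drawing an exponentially distributed gap
    # before each injection until the end of the page is passed.
    pos = rng.expovariate(1.0 / mean_gap)
    while pos < page_size:
        offsets.append(int(pos))
        pos += rng.expovariate(1.0 / mean_gap)

    # Assumption: an infected page always receives at least one injection.
    if not offsets and page_size > 0:
        offsets.append(rng.randrange(page_size))
    return offsets
```

Under this model, an infected page of about 11KB with a mean gap of 100 bytes
would receive on the order of a hundred injections, so a larger inject_gap
may be appropriate if only a few keywords per page are wanted.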
A .tdb file is simply a text file. You can use your favorite
editor to maintain this database. New lines separate individual entries.
Currently, there is no way to specify an entry that spans multiple lines.
Please let us know if we should add such a feature. Entries can contain
arbitrary text, including white space and HTML tags.
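
If the entries are produced by a script rather than typed by hand, a small
check helps to keep the one-entry-per-line rule intact. The Python sketch
below is only an illustration: the function name format_tdb is hypothetical,
and the sample entries are made up.

```python
def format_tdb(entries):
    """Return the text of a .tdb file with one entry per line.

    Raises ValueError for an entry that contains a line break, because
    a .tdb entry cannot span multiple lines.
    """
    lines = []
    for entry in entries:
        if "\n" in entry or "\r" in entry:
            raise ValueError(f"entry spans multiple lines: {entry!r}")
        lines.append(entry)
    # Terminate every entry, including the last one, with a new line.
    return "".join(line + "\n" for line in lines)
```

For example, format_tdb(["blocked keyword", "<b>free offer</b>"]) returns two
lines of text that can be written to a file such as "inj.tdb".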
A small injection database used for the demo is available.