On Realism
Is Web Polygraph a realistic benchmark? Should you care?
Table of Contents
1. Introduction
2. Realistic workloads
3. Realistic results
4. Who cares?
1. Introduction
Realism: concern for fact or reality and rejection of the
impractical and visionary. Fidelity to real life and to accurate
representation without idealization. -- Merriam-Webster.
When it comes to Web environments, there are many realities. A caching
proxy engineer has a very different perspective on the workloads than an
L2 switch engineer. Moreover, a caching proxy in a forward environment sees
very different traffic patterns compared to essentially the same caching proxy
installed as a surrogate in front of a Web site. The discussion on this Web
page applies to any given real-world environment that Polygraph can
simulate.
There are at least two kinds of realism in a benchmarking context: the
realism of workloads and the realism of the results. An ideal benchmark possesses
both qualities. Web Polygraph can exhibit each quality or both, depending on
the application as explained below. Since the primary goal of a benchmark is
to produce realistic results (regardless of the means), the second kind of
realism should be the primary concern of a benchmark designer.
2. Realistic workloads
Web Polygraph simulates a variety of environments. Since many simulation
models are supported, it is often possible to instruct Polygraph to emit a
traffic stream with characteristics observed in reality. For example, complex
content type models can be used to closely match the object size distribution
of a given origin server. Of course, as you move from one real server to
another, the model parameters would have to be adjusted as no two real Web
sites are identical.
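As a rough illustration of such a content model, here is a Python sketch that picks a content type and then draws an object size from a per-type distribution. The type names, probabilities, and distribution parameters below are all invented for illustration; they are not Polygraph's actual model or defaults.

```python
import math
import random

# Hypothetical content-type mix: (probability, mean size in KB, lognormal sigma).
# These names and numbers are illustrative only, not Polygraph defaults.
CONTENT_MODEL = {
    "image":    (0.65, 4.5,   1.0),
    "html":     (0.25, 8.5,   0.8),
    "download": (0.10, 300.0, 1.5),
}

def sample_object_size_kb(rng: random.Random) -> float:
    """Pick a content type by probability, then draw a size from its
    lognormal distribution (mu chosen so the mean equals mean_kb)."""
    r = rng.random()
    cumulative = 0.0
    for _ctype, (prob, mean_kb, sigma) in CONTENT_MODEL.items():
        cumulative += prob
        if r <= cumulative:
            mu = math.log(mean_kb) - sigma * sigma / 2.0
            return rng.lognormvariate(mu, sigma)
    # Fallback for floating-point roundoff at the last boundary.
    return rng.lognormvariate(math.log(4.5), 1.0)

rng = random.Random(1)
sizes = [sample_object_size_kb(rng) for _ in range(20_000)]
mean_kb = sum(sizes) / len(sizes)  # near the mix's weighted mean (~35 KB here)
```

Matching a real server then amounts to re-fitting the mix probabilities and per-type parameters to that server's logs.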
It is probably impossible to match all given traffic parameters
without explicitly specifying all properties of each individual transaction.
The latter is simply not practical for most Polygraph applications due to the
prohibitively large number of transactions and their properties.
Moreover, several workload parameters are closely related in Polygraph
workloads while their real equivalents are, in theory, independent. For
example, until Polygraph 2.7, byte hit ratio (BHR) was essentially determined
by document hit ratio (DHR) for a given content distribution, while real DHR
and BHR can vary more independently. We try to eliminate artificial
dependencies whenever possible.
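To make the DHR/BHR distinction concrete, here is a toy calculation over a made-up transaction log. The numbers are invented purely to show that the two ratios measure different things: a single large miss can keep BHR far below DHR.

```python
# Toy transaction log: (was_hit, object_size_bytes). Invented numbers.
transactions = [
    (True,  2_000),    # small hot object: each hit counts fully toward DHR,
    (True,  2_000),    # but contributes only a few bytes toward BHR
    (True,  2_000),
    (False, 500_000),  # one large miss dominates the byte totals
    (False, 2_000),
]

requests = len(transactions)
hits = sum(1 for hit, _ in transactions if hit)
total_bytes = sum(size for _, size in transactions)
hit_bytes = sum(size for hit, size in transactions if hit)

dhr = hits / requests          # 3/5 = 0.60
bhr = hit_bytes / total_bytes  # 6000/508000, roughly 0.01
```

Once a workload fixes the content-size distribution and which objects repeat, both ratios are largely determined together; decoupling them requires extra degrees of freedom in the model.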
So, are Polygraph workloads realistic? The answer depends on the workload.
We feel that Polygraph can create very realistic workloads. Many properties of
standard workloads are very realistic. As discussed below, constraints of the
testing environment may create a conflict between realistic workload
characteristics and obtainability of realistic (or any!) results. Standard
Polygraph workloads resolve these conflicts in favor of creating more
realistic results.
3. Realistic results
Producing results that match product performance in a real environment is
our primary goal. We have anecdotal evidence that indicates some standard
Polygraph workloads report realistic performance. We also have anecdotal
evidence of significant discrepancies between reported results and real
performance. The situation is further complicated by the fact that standard
workloads represent a "typical" environment while every real environment is
atypical in some respects.
A naive approach to producing realistic results would be to use realistic
workloads. Indeed, if a device under test is subjected to real (or very
realistic) traffic the results must be, by definition, realistic. While this
obvious approach works well on paper, it is often impractical due to
testing constraints (which are also real!).
The major constraint is time. A real cache may take 7 days to go from
"empty" to "full". We may have only a few hours worth of test time. Using a
realistic workload, we would never be able to report any reasonable results
for the given cache because several hours are not enough to reach steady state
conditions. Thus, we must take shortcuts. We have to make certain workload
characteristics unrealistic to "compress time". These shortcuts may, and often
do, affect many workload characteristics.
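The time constraint can be sized with back-of-the-envelope arithmetic. The cache size and fill rates below are invented, but they show why a realistic miss-traffic rate needs days to fill a cache while a lab test has only hours:

```python
SECONDS_PER_DAY = 86_400

def fill_time_days(cache_bytes: float, miss_bytes_per_sec: float) -> float:
    """Days needed for miss traffic to write one full cache's worth of data."""
    return cache_bytes / miss_bytes_per_sec / SECONDS_PER_DAY

cache = 100e9                             # assumed 100 GB cache
realistic = fill_time_days(cache, 150e3)  # ~150 KB/s of misses: close to a week
compressed = fill_time_days(cache, 10e6)  # 10 MB/s in the lab: a few hours
```

Raising the fill rate by two orders of magnitude is exactly the kind of "time compression" that makes other workload characteristics drift away from reality.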
A classic example of the above phenomenon is Polygraph's use of the Zipf
distribution to model object access patterns. Most standard Polygraph
workloads do not use Zipf because using Zipf in conjunction with other
workload parameters and constraints leads to unrealistic memory hit
ratios and incorrect performance results. We could name the currently used
model "Zipf" to virtually eliminate all complaints, but we prefer to improve
Polygraph and testing methodology so that true Zipf distributions become
usable. Until then, you are likely to hear that Polygraph "does not model
Zipf" and, hence, produces "wrong" results (mostly from folks who have not
actually tried to benchmark with "Zipf", or at all).
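For intuition about why a true Zipf access pattern inflates memory hit ratios in a short test, consider how concentrated Zipf popularity is. This is a generic sketch of the distribution itself, not Polygraph's replacement model:

```python
def top_k_share(n_objects: int, k: int, alpha: float = 1.0) -> float:
    """Fraction of requests hitting the k most popular objects when the
    object at popularity rank r is requested with weight 1/r**alpha."""
    weights = [1.0 / (r ** alpha) for r in range(1, n_objects + 1)]
    return sum(weights[:k]) / sum(weights)

# With a million objects, the top 1% draws roughly two thirds of all
# requests, so even a small RAM cache serves most hits from memory.
share = top_k_share(1_000_000, 10_000)
```

Combined with a compressed test timeline and a modest working set, that concentration lets memory absorb an unrealistic share of hits, skewing measured performance upward.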
So does Polygraph produce realistic results? We hope that standard
Polygraph workloads do, but do not have sufficient data to prove it. We
believe that Polygraph produces true relative comparisons of products
even if absolute measurements are skewed. Finally, there is no better
benchmark, so we work with imperfect data and constantly strive to make the
results more realistic.
4. Who cares?
Folks using Polygraph for product comparison do. We believe that you can
rely on Polygraph's relative measurements. The "worst" product in a Polygraph
test is likely to be the "worst" product in a similar real environment (all
non-measured factors being equal). The best products are good candidates for
the final selection based on non-performance characteristics.
Folks using Polygraph for capacity planning do. We believe that
Polygraph is good for stress testing a product. The bugs and limits you find
during a Polygraph capacity test are likely to be present in reality. Absolute
numbers may vary. As a short-term trick, having real numbers to compare with
can help to derive a "fudge factor" to adjust absolute performance reported by
Polygraph to match real performance levels. Better workloads and models are
the long-term answer, of course.
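The "fudge factor" is just a ratio taken from one paired measurement and applied to others. The request rates below are invented for illustration:

```python
# Hypothetical numbers: one box measured both in production and under Polygraph.
real_rps = 850.0    # requests/sec observed in the real environment (assumed)
poly_rps = 1_000.0  # requests/sec Polygraph reported for the same box (assumed)
fudge = real_rps / poly_rps  # 0.85

def adjust(polygraph_result: float) -> float:
    """Scale another Polygraph measurement by the empirically derived factor."""
    return polygraph_result * fudge
```

The factor is only trustworthy for configurations close to the one it was derived from; hence better workloads remain the long-term answer.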
Folks using Polygraph for research do. Please understand Polygraph's
primary goals and real-life testing constraints. Please share your specific
improvement ideas with us!
Marketing people using Polygraph results do. If your product performs
well, you can and should claim that Polygraph results are the most realistic
benchmarking results in the industry. If your product sucks, blame
Polygraph.
Polygraph authors do. We are working on making Polygraph better and are
open to any practical suggestions and ideas, not to mention bug
reports.
Our competitors do not. Realism does not sell.