On Realism
Is Web Polygraph a realistic benchmark? Should you care?
Table of Contents
1. Introduction
2. Realistic workloads
3. Realistic results
4. Who cares?
1. Introduction
Realism: concern for fact or reality and rejection of the
impractical and visionary. Fidelity to real life and to accurate
representation without idealization. -- Merriam-Webster.
When it comes to Web environments, there are many realities. A caching
proxy engineer has a very different perspective on the workloads than an
L2 switch engineer. Moreover, a caching proxy in a forward environment sees
very different traffic patterns compared to essentially the same caching proxy
installed as a surrogate in front of a Web site. The discussion on this Web
page applies to any given real-world environment that Polygraph can
simulate.
There are at least two kinds of realism in a benchmarking context: the
realism of workloads and the realism of the results. An ideal benchmark possesses
both qualities. Web Polygraph can exhibit each quality or both, depending on
the application as explained below. Since the primary goal of a benchmark is
to produce realistic results (regardless of the means), the second kind of
realism should be the primary concern of a benchmark designer.
2. Realistic workloads
Web Polygraph simulates a variety of environments. Since many simulation
models are supported, it is often possible to instruct Polygraph to emit a
traffic stream with characteristics observed in reality. For example, complex
content type models can be used to closely match the object size distribution
of a given origin server. Of course, as you move from one real server to
another, the model parameters would have to be adjusted as no two real Web
sites are identical.
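As a rough illustration of such a content model, here is a Python sketch that picks a content type and then draws an object size from a per-type distribution. The type names, probabilities, and distribution parameters below are all invented for illustration; they are not Polygraph's actual model or defaults.

```python
import math
import random

# Hypothetical content-type mix: (probability, mean size in KB, lognormal sigma).
# These names and numbers are illustrative only, not Polygraph defaults.
CONTENT_MODEL = {
    "image":    (0.65, 4.5,   1.0),
    "html":     (0.25, 8.5,   0.8),
    "download": (0.10, 300.0, 1.5),
}

def sample_object_size_kb(rng: random.Random) -> float:
    """Pick a content type by probability, then draw a size from its
    lognormal distribution (mu chosen so the mean equals mean_kb)."""
    r = rng.random()
    cumulative = 0.0
    for _ctype, (prob, mean_kb, sigma) in CONTENT_MODEL.items():
        cumulative += prob
        if r <= cumulative:
            mu = math.log(mean_kb) - sigma * sigma / 2.0
            return rng.lognormvariate(mu, sigma)
    # Fallback for floating-point roundoff at the last boundary.
    return rng.lognormvariate(math.log(4.5), 1.0)

rng = random.Random(1)
sizes = [sample_object_size_kb(rng) for _ in range(20_000)]
mean_kb = sum(sizes) / len(sizes)  # near the mix's weighted mean (~35 KB here)
```

Matching a real server then amounts to re-fitting the mix probabilities and per-type parameters to that server's logs.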
It is probably impossible to match all given traffic parameters
without explicitly specifying all properties of each individual transaction.
The latter is simply not practical for most Polygraph applications due to the
prohibitively large number of transactions and their properties.
Moreover, several workload parameters are closely related in Polygraph
workloads while their real equivalents are, in theory, independent. For
example, until Polygraph 2.7, byte hit ratio (BHR) was essentially determined
by document hit ratio (DHR) for a given content distribution, while real DHR
and BHR can vary more independently. We try to eliminate artificial
dependencies whenever possible.
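To make the DHR/BHR distinction concrete, here is a toy calculation over a made-up transaction log. The numbers are invented purely to show that the two ratios measure different things: a single large miss can keep BHR far below DHR.

```python
# Toy transaction log: (was_hit, object_size_bytes). Invented numbers.
transactions = [
    (True,  2_000),    # small hot object: each hit counts fully toward DHR,
    (True,  2_000),    # but contributes only a few bytes toward BHR
    (True,  2_000),
    (False, 500_000),  # one large miss dominates the byte totals
    (False, 2_000),
]

requests = len(transactions)
hits = sum(1 for hit, _ in transactions if hit)
total_bytes = sum(size for _, size in transactions)
hit_bytes = sum(size for hit, size in transactions if hit)

dhr = hits / requests          # 3/5 = 0.60
bhr = hit_bytes / total_bytes  # 6000/508000, roughly 0.01
```

Once a workload fixes the content-size distribution and which objects repeat, both ratios are largely determined together; decoupling them requires extra degrees of freedom in the model.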
So, are Polygraph workloads realistic? The answer depends on the workload.
We feel that Polygraph can create very realistic workloads. Many properties of
standard workloads are very realistic. As discussed below, constraints of the
testing environment may create a conflict between realistic workload
characteristics and obtainability of realistic (or any!) results. Standard
Polygraph workloads resolve these conflicts in favor of creating more
realistic results.
3. Realistic results
Producing results that match product performance in a real environment is
our primary goal. We have anecdotal evidence that indicates some standard
Polygraph workloads report realistic performance. We also have anecdotal
evidence of significant discrepancies between reported results and real
performance. The situation is further complicated by the fact that standard
workloads represent a "typical" environment while every real environment is
atypical in some respects.
A naive approach to producing realistic results would be to use realistic
workloads. Indeed, if a device under test is subjected to real (or very
realistic) traffic the results must be, by definition, realistic. While this
obvious approach works well on paper, it is often impractical due to
testing constraints (which are also real!).
The major constraint is time. A real cache may take 7 days to go from
"empty" to "full". We may have only a few hours worth of test time. Using a
realistic workload, we would never be able to report any reasonable results
for the given cache because several hours are not enough to reach steady state
conditions. Thus, we must take shortcuts. We have to make certain workload
characteristics unrealistic to "compress time". These shortcuts may, and often
do, affect many workload characteristics.
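The time constraint can be sized with back-of-the-envelope arithmetic. The cache size and fill rates below are invented, but they show why a realistic miss-traffic rate needs days to fill a cache while a lab test has only hours:

```python
SECONDS_PER_DAY = 86_400

def fill_time_days(cache_bytes: float, miss_bytes_per_sec: float) -> float:
    """Days needed for miss traffic to write one full cache's worth of data."""
    return cache_bytes / miss_bytes_per_sec / SECONDS_PER_DAY

cache = 100e9                             # assumed 100 GB cache
realistic = fill_time_days(cache, 150e3)  # ~150 KB/s of misses: close to a week
compressed = fill_time_days(cache, 10e6)  # 10 MB/s in the lab: a few hours
```

Raising the fill rate by two orders of magnitude is exactly the kind of "time compression" that makes other workload characteristics drift away from reality.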
A classic example of the above phenomenon is Polygraph's use of the Zipf
distribution to model object access patterns. Most standard Polygraph
workloads do not use Zipf because using Zipf in conjunction with other
workload parameters and constraints leads to unrealistic memory hit
ratios and incorrect performance results. We could name the currently used
model "Zipf" to virtually eliminate all complaints, but we prefer to improve
Polygraph and testing methodology so that true Zipf distributions become
usable. Until then, you are likely to hear that Polygraph "does not model
Zipf" and, hence, produces "wrong" results (mostly from folks who have not
actually tried to benchmark with "Zipf", or at all).
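For intuition about why a true Zipf access pattern inflates memory hit ratios in a short test, consider how concentrated Zipf popularity is. This is a generic sketch of the distribution itself, not Polygraph's replacement model:

```python
def top_k_share(n_objects: int, k: int, alpha: float = 1.0) -> float:
    """Fraction of requests hitting the k most popular objects when the
    object at popularity rank r is requested with weight 1/r**alpha."""
    weights = [1.0 / (r ** alpha) for r in range(1, n_objects + 1)]
    return sum(weights[:k]) / sum(weights)

# With a million objects, the top 1% draws roughly two thirds of all
# requests, so even a small RAM cache serves most hits from memory.
share = top_k_share(1_000_000, 10_000)
```

Combined with a compressed test timeline and a modest working set, that concentration lets memory absorb an unrealistic share of hits, skewing measured performance upward.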
So does Polygraph produce realistic results? We hope that standard
Polygraph workloads do, but do not have sufficient data to prove it. We
believe that Polygraph produces true relative comparisons of products
even if absolute measurements are skewed. Finally, there is no better
benchmark, so we work with imperfect data and constantly strive to make the
results more realistic.
4. Who cares?
Folks using Polygraph for product comparison do. We believe that you can
rely on Polygraph's relative measurements. The "worst" product in a Polygraph
test is likely to be the "worst" product in a similar real environment (all
non-measured factors being equal). The best products are good candidates for
the final selection based on non-performance characteristics.
Folks using Polygraph for capacity planning do. We believe that
Polygraph is good for stress testing a product. The bugs and limits you find
during a Polygraph capacity test are likely to be present in reality. Absolute
numbers may vary. As a short-term trick, having real numbers to compare with
can help to derive a "fudge factor" to adjust absolute performance reported by
Polygraph to match real performance levels. Better workloads and models are
the long-term answer, of course.
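The "fudge factor" is just a ratio taken from one paired measurement and applied to others. The request rates below are invented for illustration:

```python
# Hypothetical numbers: one box measured both in production and under Polygraph.
real_rps = 850.0    # requests/sec observed in the real environment (assumed)
poly_rps = 1_000.0  # requests/sec Polygraph reported for the same box (assumed)
fudge = real_rps / poly_rps  # 0.85

def adjust(polygraph_result: float) -> float:
    """Scale another Polygraph measurement by the empirically derived factor."""
    return polygraph_result * fudge
```

The factor is only trustworthy for configurations close to the one it was derived from; hence better workloads remain the long-term answer.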
Folks using Polygraph for research do. Please understand Polygraph's
primary goals and real-life testing constraints. Please share your specific
improvement ideas with us!
Marketing people using Polygraph results do. If your product performs
well, you can and should claim that Polygraph results are the most realistic
benchmarking results in the industry. If your product sucks, blame
Polygraph.
Polygraph authors do. We are working on making Polygraph better and are
open to any practical suggestions and ideas, not to mention bug
reports.
Our competitors do not. Realism does not sell.