Author Duffield, Nick ♦ Lund, Carsten ♦ Thorup, Mikkel
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©2007
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Subset sum estimation ♦ Reservoir sampling ♦ Sampling without replacement ♦ Weighted sampling
Abstract From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is later determined. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes. Priority sampling is simple to define and implement: we consider a stream of items $i = 0, \ldots, n - 1$ with weights $w_i$. For each item $i$, we generate a random number $\alpha_i \in (0,1]$ and create a priority $q_i = w_i/\alpha_i$. The sample $S$ consists of the $k$ highest priority items. Let $\tau$ be the $(k+1)$th highest priority. Each sampled item $i$ in $S$ gets a weight estimate $\hat{w}_i = \max\{w_i, \tau\}$, while nonsampled items get weight estimate $\hat{w}_i = 0$.
Magically, it turns out that the weight estimates are unbiased, that is, $E[\hat{w}_i] = w_i$, and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset. Also, we can estimate the variance of the estimates, and find, surprisingly, that the covariance between estimates $\hat{w}_i$ and $\hat{w}_j$ of different weights is zero. Finally, we conjecture an extremely strong near-optimality; namely that for any weight sequence, there exists no specialized scheme for sampling $k$ items with unbiased weight estimators that gets smaller variance sum than priority sampling with $k + 1$ items. Szegedy settled this conjecture at STOC'06.
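The sampling scheme described in the abstract can be sketched in a few lines. This is an illustrative, non-streaming sketch (the function name `priority_sample` and the example weights are assumptions, not from the paper); a production streaming version would keep only the top $k+1$ priorities in a min-heap rather than sorting all items.

```python
import random

def priority_sample(weights, k, rng=random):
    """Priority sampling sketch (Duffield, Lund, Thorup): draw k items
    without replacement and return unbiased per-item weight estimates.
    Returns a dict {item index: estimate}; nonsampled items implicitly
    have estimate 0."""
    # Assign each item i a priority q_i = w_i / alpha_i, alpha_i in (0, 1].
    # (1.0 - rng.random() lies in (0, 1], avoiding division by zero.)
    prioritized = [(w / (1.0 - rng.random()), i, w)
                   for i, w in enumerate(weights)]
    prioritized.sort(reverse=True)  # highest priority first
    if len(prioritized) <= k:
        # Fewer than k+1 items: every item is kept with its exact weight.
        return {i: w for _, i, w in prioritized}
    tau = prioritized[k][0]  # the (k+1)th highest priority
    # Each of the k sampled items gets the estimate max(w_i, tau).
    return {i: max(w, tau) for _, i, w in prioritized[:k]}

# Usage: averaging the total-estimate over many runs should approach the
# true total weight, since each per-item estimate is unbiased.
rng = random.Random(42)
weights = [1.0, 2.0, 3.0, 10.0]
trials = 20000
avg_total = sum(sum(priority_sample(weights, 2, rng).values())
                for _ in range(trials)) / trials
```

Estimating any subset sum then amounts to adding the estimates of the sampled items that fall in the subset, which is why a single sample can later answer queries about subsets (such as time intervals of a worm attack) unknown at sampling time.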
ISSN 0004-5411
Age Range 18 to 22 years ♦ above 22 years
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2007-12-01
Publisher Place New York
e-ISSN 1557-735X
Journal Journal of the ACM (JACM)
Volume Number 54
Issue Number 6
