Priority sampling for estimation of arbitrary subset sumsPriority sampling for estimation of arbitrary subset sums

Access Restriction
Subscribed

 Author Duffield, Nick ♦ Lund, Carsten ♦ Thorup, Mikkel Source ACM Digital Library Content type Text Publisher Association for Computing Machinery (ACM) File Format PDF Copyright Year ©2007 Language English
 Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science Subject Keyword Subset sum estimation ♦ Reservoir sampling ♦ Sampling without replacement ♦ Weighted sampling Abstract From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is later determined. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes. Priority sampling is simple to define and implement: we consider a steam of items $\textit{i}$ = $0,…,\textit{n}$ ™ 1 with weights $w_{i}.$ For each item $\textit{i},$ we generate a random number $α_{i}$ ∈ (0,1] and create a priority $q_{i}$ = $w_{i}/α_{i}.$ The sample $\textit{S}$ consists of the $\textit{k}$ highest priority items. Let τ be the $(\textit{k}$ + 1)th highest priority. Each sampled item $\textit{i}$ in $\textit{S}$ gets a weight estimate $ŵ_{i}$ = $max{w_{i},$ τ}, while nonsampled items get weight estimate $ŵ_{i}$ = 0. Magically, it turns out that the weight estimates are unbiased, that is, $E[ŵ_{i}]$ = $w_{i},$ and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset. Also, we can estimate the variance of the estimates, and find, surprisingly, that the covariance between estimates $ŵ_{i}$ and $ŵ_{j}$ of different weights is zero. Finally, we conjecture an extremely strong near-optimality; namely that for any weight sequence, there exists no specialized scheme for sampling $\textit{k}$ items with unbiased weight estimators that gets smaller variance sum than priority sampling with $\textit{k}$ + 1 items. Szegedy settled this conjecture at STOC'06. ISSN 00045411 Age Range 18 to 22 years ♦ above 22 year Educational Use Research Education Level UG and PG Learning Resource Type Article Publisher Date 2007-12-01 Publisher Place New York e-ISSN 1557735X Journal Journal of the ACM (JACM) Volume Number 54 Issue Number 6

Open content in new tab

Source: ACM Digital Library