Thumbnail
Access Restriction
Subscribed

Author Cormode, Graham ♦ Muthukrishnan, S. ♦ Yi, Ke ♦ Zhang, Qin
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©2012
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Distributed tracking ♦ Random sampling
Abstract A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this article, we present communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from $\textit{k}$ distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the $\textit{W}$ most recent elements, or arrivals within the last $\textit{w}$ time units. We show that our protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.
ISSN 00045411
Age Range 18 to 22 years ♦ above 22 year
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2012-05-03
Publisher Place New York
e-ISSN 1557735X
Journal Journal of the ACM (JACM)
Volume Number 59
Issue Number 2
Page Count 25
Starting Page 1
Ending Page 25


Open content in new tab

   Open content in new tab
Source: ACM Digital Library