Department of Computer and Information Science
Research
TECHNICAL REPORT ARCHIVE

Sampling Multi-Dimensional Data
TR-CIS-2006-01 (02/08/2006)
Huseyin Akcan, Alex Astashyn, Herve Bronnimann and Leonid Bukhman
pdf version of this paper
Abstract
A wide range of mining and analysis problems involve extracting knowledge from count data. A general approach that scales well with the data is sampling, and a deterministic algorithm for efficiently reducing such data is EASE. We present a vastly simplified and more powerful algorithm (Biased-L2), and show how to use it for deterministically sampling streaming count data and iceberg cubes. The algorithm produces samples better than EASE, and more than an order of magnitude better than random samples (as measured by RMS errors of the frequency vectors over items). Coupling with other methods such as Lossy Counting further improves on the memory footprint and accuracy. We also present a related deterministic variant of reservoir sampling (DRS) which produces samples of similar quality. It can be used to maintain a sample for a data stream, for a table in a relational database or data cube under insertions, and has an excellent recovery rate in case the distribution of item changes. We show how to engineer DRS to sample over distributed sources of data such as arise from distributed databases, multiple data streams, and sensor networks.
Back to previous page
|