Department of Computer Science and Engineering

Research

CIS TECHNICAL REPORTS

Home People Undergraduate Graduate Research Contact
 

Department of Computer and Information Science

Research
TECHNICAL REPORT ARCHIVE



Sampling Multi-Dimensional Data

 

TR-CIS-2006-01 (02/08/2006)
Huseyin Akcan, Alex Astashyn, Herve Bronnimann and Leonid Bukhman

pdf version of this paper

Abstract

A wide range of mining and analysis problems involve extracting knowledge from count data. A general approach that scales well with the data is sampling, and a deterministic algorithm for efficiently reducing such data is EASE. We present a vastly simplified and more powerful algorithm (Biased-L2), and show how to use it for deterministically sampling streaming count data and iceberg cubes. The algorithm produces samples better than EASE, and more than an order of magnitude better than random samples (as measured by RMS errors of the frequency vectors over items). Coupling with other methods such as Lossy Counting further improves on the memory footprint and accuracy. We also present a related deterministic variant of reservoir sampling (DRS) which produces samples of similar quality. It can be used to maintain a sample for a data stream, for a table in a relational database or data cube under insertions, and has an excellent recovery rate in case the distribution of item changes. We show how to engineer DRS to sample over distributed sources of data such as arise from distributed databases, multiple data streams, and sensor networks.

Back to previous page

 
  poly thinking