Computer Science

Randomized error removal for online spread estimation in data streaming

Document Type

Article

Abstract

Measuring flow spread in real time from large, high-rate data streams has numerous practical applications, where a data stream is modeled as a sequence of data items from different flows and the spread of a flow is the number of distinct items in the flow. Past decades have witnessed tremendous performance improvement for single-flow spread estimation. However, when dealing with numerous flows in a data stream, it remains a significant challenge to measure per-flow spread accurately while reducing memory footprint. The goal of this paper is to introduce new multi-flow spread estimation designs that incur much smaller processing overhead and query overhead than the state of the art, yet achieve significant accuracy improvement in spread estimation. We formally analyze the performance of these new designs. We implement them in both hardware and software, and use real-world data traces to evaluate their performance in comparison with the state of the art. The experimental results show that our best sketch significantly improves over the best existing work in terms of estimation accuracy, data item processing throughput, and online query throughput.

Publication Title

Proceedings of the VLDB Endowment

Publication Date

2021

Volume

14

Issue

6

First Page

1040

Last Page

1052

ISSN

2150-8097

DOI

10.14778/3447689.3447707

Keywords

accuracy improvement, data streaming, error removal, hardware and software, memory footprint, multi flows, processing overhead, state of the art

APA Citation

Wang, H., Ma, C., Odegbile, O. O., Chen, S., & Peir, J. K. (2021). Randomized error removal for online spread estimation in data streaming. Proceedings of the VLDB Endowment, 14(6), 1040-1052. https://doi.org/10.14778/3447689.3447707
