In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item.
Contents |
History
Though streaming algorithms had already been studied by Munro and Paterson[1] as well as Flajolet and Martin[2], the field of streaming algorithms was first formalized and popularized in a paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.
Models
Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector
(initialized to the zero vector
) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of
using considerably less space than it would take to represent
precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[4]
In the cash register model each update is of the form
, so that ai is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).
In the turnstile model each update is of the form
, so that ai is incremented by some (possible negative) integer c. In the "strict turnstile" model, no ai at any time may be less than zero.
Several papers also consider the "sliding window" model. In this model, the function of interest is computing over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.
Applications
Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.[5] They also have applications in databases, such estimating the size of a join.
Some streaming problems
Frequency moments
The kth frequency moment of a set of frequencies
is defined as
.
The first moment F1 is simply the sum of the frequencies (i.e., the total count). The second moment F2is useful for computing statistical properties of the data, such as the Gini coefficient of variation.
is defined as the frequency of the most frequent item(s).
The seminal paper of Alon, Matias, and Szegedy paper dealt with the problem of estimating the frequency moments.
Heavy hitters
Find all elements i whose frequency ai > T, say. Some notable algorithms are:
- Karp-Papadimitriou-Shenker algorithm
- Count-min sketch
- Sticky sampling
- Lossy counting
- Sample and Hold
- Multi-stage Bloom filters
- Count-sketch
- Sketch-guided sampling
Counting distinct elements
Counting the number of distinct elements in a stream (sometimes called the F0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin, and the currently best known algorithm is due to Bar-Yossef et al.
Entropy
The (empirical) entropy of a set of frequencies
is defined as
, where
.
Estimation of this quantity in a stream has been done by:
- McGregor et al.
- Do Ba et al.
- Lall et al.
- Chakraborty et al.
Lower bounds
Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.
See also
Notes
- ^ Munro & Paterson (1980)
- ^ Flajolet & Martin (1985)
- ^ Alon, Matias & Szegedy (1996)
- ^ Gilbert et al. (2001)
- ^ Xu (2007)
References
- Gilbert, A.C.; Kotidis, Y.; Muthukrishnan, S.; Strauss, M.J. (2001), "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries", Proceedings of the International Conference on Very Large Data Bases: 79–88, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.7062.
- Xu, Jun (Jim) (2007), A Tutorial on Network Data Streaming, http://www.cc.gatech.edu/%7Ejx/reprints/talks/sigm07_tutorial.pdf. From SIGMETRICS 2007.
- Lall, A., Sekar, V., Ogihara, M., Xu, J., and Zhang, H. (2006.), "Data streaming algorithms for estimating entropy of network traffic", Proc. ACM SIGMETRICS
- Karp, R.M.; Papadimitriou, C.H.; Shenker, S. (2003), "A Simple Algorithm for Finding Frequent Elements in Streams and Bags", Transactions on Database Systems
External links
- Princeton Lecture Notes
- Streaming Algorithms for Geometric Problems, by Piotr Indyk, MIT
- Dagstuhl Workshop on Sublinear Algorithms
- IIT Kanpur Workshop on Data Streaming
- List of open problems in streaming (compiled by Andrew McGregor) from discussion at the IITK Workshop on Algorithms for Data Streams, 2006.
- StreamIt - programming language and compilation infrastructure by MIT CSAIL
- IBM Spade - Stream Processing Application Declarative Engine
- IBM InfoSphere Streams
- Tutorials and surveys
- Data Stream Algorithms and Applications by S. Muthu Muthukrishnan
- Stanford STREAM project survey
- Network Applications of Bloom filters, by Broder and Mitzenmacher
- Xu's SIGMETRICS 2007 tutorial
- Lecture notes from Data Streams course at Barbados in 2009, by Andrew McGregor and S. Muthu Muthukrishnan
- Courses
| This computer-related article is a stub. You can help Wikipedia by expanding it. |
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)




