
Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.

We consider a stream formed by arranging a set of n distinct values in random order, i.e., the stream is x_π(1), …, x_π(n) for a permutation π chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).

Here’s a simple algorithm that maintains a set S of s elements and two counters l and h, both initialized to zero. Store the first s values from the stream in S and, as each new element x arrives, do one of the following:

- If x > max(S): Increment h
- If x < min(S): Increment l
- If min(S) < x < max(S) and l < h: Add x to S, remove the smallest element from S, and increment l
- If min(S) < x < max(S) and l ≥ h: Add x to S, remove the largest element from S, and increment h

At the end of the stream, let m = ⌈n/2⌉ be the rank of the median. If l < m ≤ l + s then return the (m − l)-th smallest element in S. Otherwise fail.
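Under the reconstruction above (a sorted window S of s elements plus counters l and h), the scheme might be sketched in Python as follows; the function name and list-based window are illustrative choices, not from the paper:

```python
from bisect import insort

def random_order_median(stream, s):
    """One-pass candidate-median algorithm for a randomly ordered stream
    of distinct values.  Keeps a sorted window S of s elements plus
    counters l, h of the values discarded below/above the window.
    Returns the median, or None if the algorithm fails."""
    n = len(stream)
    S = sorted(stream[:s])       # seed the window with the first s values
    l = h = 0
    for x in stream[s:]:
        if x > S[-1]:            # above the window: just count it
            h += 1
        elif x < S[0]:           # below the window: just count it
            l += 1
        elif l < h:              # inside the window, fewer discards below:
            S.pop(0)             #   drop min(S) and count it as a low discard
            insort(S, x)
            l += 1
        else:                    # inside the window, otherwise:
            S.pop()              #   drop max(S) and count it as a high discard
            insort(S, x)
            h += 1
    m = (n + 1) // 2             # rank of the median (n odd, for simplicity)
    if l < m <= l + s:
        return S[m - l - 1]      # (m - l)-th smallest element of S
    return None
```

With s = n the window holds everything and the true median is always returned; with smaller s the call either returns the exact median or fails.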

Why does the algorithm make sense? It’s easy to show that, at any point, the set of elements that has arrived thus far consists of S, the l values less than min(S), and the h values greater than max(S). Hence, if at the end of the stream S has a (⌈n/2⌉ − l)-th smallest element, i.e., l < ⌈n/2⌉ ≤ l + s, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering D = l − h and noting that:

- Probability that |D| is incremented at any step where D ≠ 0 is at most 1/2, provided |D| < s
- D is either incremented or decremented at each step
- If |D| < s at the final step, then l < ⌈n/2⌉ ≤ l + s as required

Hence, we have a simple “random-walk-like” set-up and there’s an s = O(√(n log n)) that ensures that, with high probability, the algorithm doesn’t fail.
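One rough way to finish the random-walk argument (a lossy sketch, treating |D_t| as dominated by the absolute value of a symmetric ±1 walk W_t for as long as |D_t| < s):

```latex
\Pr\Big[\max_{t \le n} |D_t| \ge s\Big]
  \;\le\; \Pr\Big[\max_{t \le n} |W_t| \ge s\Big]
  \;\le\; \sum_{t=1}^{n} \Pr\big[|W_t| \ge s\big]
  \;\le\; 2n\, e^{-s^2/(2n)} ,
```

so taking s = √(2n ln(2n²)) = O(√(n log n)) drives the failure probability down to 1/n.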

Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].

We’re presented with a numerical stream a_1, …, a_m where each a_j ∈ [n], and we want to compute some function Σ_i g(f_i) where f_i is the frequency of i, i.e., |{j : a_j = i}|, and g is an arbitrary function with g(0) = 0. One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following *basic estimator* for this type of problem:

- Pick J uniformly at random from [m]
- Compute r = |{j ≥ J : a_j = a_J}|, the number of occurrences of a_J from position J onwards
- Let X = m(g(r) − g(r − 1))
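The three steps above can be sketched in Python as follows; the function names are mine, not from the paper (and `exact_expectation` just brute-forces E[X] by trying every choice of J, which a streaming algorithm of course wouldn’t do):

```python
import random

def basic_estimator(stream, g, rng=random):
    """One draw of the basic estimator for sum_i g(f_i), assuming g(0) = 0.
    Picks a uniform position J, counts the occurrences r of a_J from
    position J onwards, and returns X = m * (g(r) - g(r - 1))."""
    m = len(stream)
    J = rng.randrange(m)
    r = sum(1 for a in stream[J:] if a == stream[J])  # suffix count of a_J
    return m * (g(r) - g(r - 1))

def exact_expectation(stream, g):
    """E[X], computed exactly by averaging X over every choice of J."""
    m = len(stream)
    total = 0
    for J in range(m):
        r = sum(1 for a in stream[J:] if a == stream[J])
        total += m * (g(r) - g(r - 1))
    return total / m
```

For instance, with g(x) = x² on the stream [1, 2, 1, 3, 1, 2] (frequencies 3, 2, 1), `exact_expectation` returns 14.0 = 3² + 2² + 1².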

Note that X can be computed using O(log n + log m) space as the data arrives: just remember a_J and count its subsequent occurrences. It’s not hard to show that E[X] = Σ_i g(f_i) because, as J ranges over the f_i positions where i occurs, r takes each value in {1, …, f_i} exactly once, so

E[X] = (1/m) Σ_{J=1}^{m} m (g(r) − g(r − 1)) = Σ_{i ∈ [n]} Σ_{r=1}^{f_i} (g(r) − g(r − 1))

and

Σ_{r=1}^{f_i} (g(r) − g(r − 1)) = g(f_i) − g(0) = g(f_i).
Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of Σ_i g(f_i). Quite how many basic estimators are required depends on the variance of X, and this determines the total space used by the algorithm.
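To illustrate the averaging step, here is a sketch that averages independent basic estimators with g(x) = x² to approximate F_2 = Σ_i f_i²; the trial count and seed are arbitrary knobs for the demo, not the variance-based bound from the theory:

```python
import random

def estimate_F2(stream, trials=5000, seed=1):
    """Average 'trials' independent basic estimators with g(x) = x^2,
    giving an approximation of F_2 = sum_i f_i^2."""
    rng = random.Random(seed)
    m = len(stream)
    total = 0.0
    for _ in range(trials):
        J = rng.randrange(m)
        r = sum(1 for a in stream[J:] if a == stream[J])
        total += m * (r * r - (r - 1) * (r - 1))  # m * (g(r) - g(r - 1))
    return total / trials
```

In a genuine streaming implementation the estimators would run in parallel over a single pass; here each trial rescans the stream for simplicity, and a median of several such averages would boost the success probability further.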

Using this approach, it was shown that the k-th order frequency moment, F_k = Σ_i f_i^k, can be approximated up to a factor (1 + ε) with probability at least 1 − δ using only O(ε^{−2} k n^{1−1/k} (log n + log m) log(1/δ)) bits of space. This was subsequently improved to an (essentially) optimal Õ(n^{1−2/k}) in [Indyk, Woodruff].
