Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.
We consider a stream formed by arranging a set of $n$ distinct values in random order, i.e., $\langle x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(n)}\rangle$ for a permutation $\sigma$ chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).
Here’s a simple algorithm that maintains a set $S$ of $s$ elements and two counters $l$ and $h$ which are initialized to zero. Store the first $s$ values from the stream in $S$ and, as each new element $x$ arrives, do one of the following:
- If $x < \min(S)$: Increment $l$
- If $x > \max(S)$: Increment $h$
- If $\min(S) < x < \max(S)$ and $l < h$: Add $x$ to $S$, remove the smallest element from $S$, and increment $l$
- If $\min(S) < x < \max(S)$ and $l \geq h$: Add $x$ to $S$, remove the largest element from $S$, and increment $h$
At the end of the stream, if $l < (n+1)/2 \leq l+s$ then return the $((n+1)/2 - l)$-th smallest element in $S$. Otherwise fail. (Assume for simplicity that $n$, the length of the stream, is odd, so the median has rank $(n+1)/2$.)
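For concreteness, here’s a minimal Python sketch of the algorithm above (the function name and the sorted-list buffer are my choices, not from the paper; a real implementation would want a balanced search tree or similar for the buffer):

```python
import bisect

def random_order_median(stream, s):
    """One-pass exact-median sketch for a randomly ordered stream.

    S is a sorted buffer of s elements that are consecutive (in sorted
    order) among those seen so far; l counts elements seen below min(S)
    and h counts elements seen above max(S).
    """
    it = iter(stream)
    S = sorted(next(it) for _ in range(s))  # the first s values
    l = h = 0
    n = s
    for x in it:
        n += 1
        if x < S[0]:
            l += 1                 # below min(S)
        elif x > S[-1]:
            h += 1                 # above max(S)
        elif l < h:
            bisect.insort(S, x)    # x falls strictly inside S
            S.pop(0)               # evict the smallest element of S
            l += 1
        else:
            bisect.insort(S, x)
            S.pop()                # evict the largest element of S
            h += 1
    m = (n + 1) // 2               # rank of the median (n assumed odd)
    if l < m <= l + s:
        return S[m - l - 1]        # the (m - l)-th smallest element of S
    return None                    # fail: the median drifted outside S
```

For example, on a random permutation of $\{1, \ldots, 1001\}$ with a buffer of a few hundred elements, this returns the true median $501$ with overwhelming probability.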
Why does the algorithm make sense? It’s easy to show that at any point, the set of elements that has arrived thus far consists of $S$, plus $l$ values less than $\min(S)$ and $h$ values greater than $\max(S)$. Hence, if at the end of the stream $S$ has a $((n+1)/2 - l)$-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering $d = l - h$ and noting that, for a suitable constant $c$ in the choice of $s$:
- The probability that a non-zero $|d|$ is incremented at any step is at most $1/2$ if $|d| \leq s$
- $d$ is either incremented or decremented at each step
- If $|d| < s$ at the final step, then $l \leq (n-1)/2$ and $h \leq (n-1)/2$, so the median lies in $S$ as required
Hence, we have a simple “random-walk-like” set-up and there’s an $s = c\sqrt{n \log n}$ that ensures that, with high probability, the algorithm doesn’t fail.
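To get a feel for the random-walk argument, note that because the order is a uniform random permutation, the rank of the $(i+1)$-th arrival among the first $i+1$ elements is uniform on $\{1, \ldots, i+1\}$, and that rank alone determines which branch of the algorithm fires. So we can simulate the counters without generating any values. A quick Monte Carlo sketch (the function name and parameters are mine):

```python
import random

def max_imbalance(n, s, rng):
    """Track the maximum |l - h| while simulating only the counters.

    The (i+1)-th arrival's rank r among the first i+1 elements is uniform:
    r <= l + 1 puts it below min(S), r >= l + s + 1 puts it above max(S),
    and the s - 1 remaining ranks fall strictly inside S.
    """
    l = h = 0
    worst = 0
    for i in range(s, n):
        r = rng.randint(1, i + 1)
        if r <= l + 1:           # below min(S)
            l += 1
        elif r >= l + s + 1:     # above max(S)
            h += 1
        elif l < h:              # inside S: evict smallest, so l grows
            l += 1
        else:                    # inside S: evict largest, so h grows
            h += 1
        worst = max(worst, abs(l - h))
    return worst
```

Running this with, say, $n = 100001$ typically gives a maximum imbalance of a few hundred, comfortably of order $\sqrt{n \log n}$, which is what makes a buffer of that size sufficient.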
Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].