Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.

We consider a stream A=\langle a_1, \ldots, a_m\rangle formed by arranging a set of distinct values B=\{b_1, \ldots, b_m\} in random order, i.e., a_i=b_{\pi(i)} for a permutation \pi of [m] chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).

Here’s a simple algorithm that maintains a set S of s=O(\sqrt{m}) elements and two counters h and l which are initialized to zero. Store the first s values from the stream in S and as each new element a arrives, do one of the following:

  1. If a <\min S: Increment l
  2. If a >\max S: Increment h
  3. If \min S<a<\max S and h \geq l: Add a to S, remove smallest element from S, and increment l
  4. If \min S<a<\max S and h<l: Add a to S, remove largest element from S, and increment h

At the end of the stream, if 1\leq (m+1)/2-l\leq s then return the ((m+1)/2-l)-th smallest element in S. Otherwise fail.

Why does the algorithm make sense? It’s easy to show that at any point, the set of elements that has arrived thus far consists of S and l values less than \min S and h values greater that \max S. Hence, if at the end of the stream, S has a ((m+1)/2-l)-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering r=|h-l| and noting that, for some constant c>0,

  1. Probability a non-zero r is incremented at any step is at most 1/2 if s\geq cr
  2. r is either incremented or decremented at each step
  3. If s\geq cr at the final step, then 1\leq (m+1)/2-l\leq s as required

Hence, we have a simple “random-walk-like” set-up and there’s a s=O(\sqrt{m}) that ensures that, with high probability, the algorithm doesn’t fail.

Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].