You are currently browsing the tag archive for the ‘bite-sized streams’ tag.

Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.

We consider a stream $A=\langle a_1, \ldots, a_m\rangle$ formed by arranging a set of distinct values $B=\{b_1, \ldots, b_m\}$ in random order, i.e., $a_i=b_{\pi(i)}$ for a permutation $\pi$ of $[m]$ chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).

Here’s a simple algorithm that maintains a set $S$ of $s=O(\sqrt{m})$ elements and two counters $h$ and $l$ which are initialized to zero. Store the first $s$ values from the stream in $S$ and as each new element $a$ arrives, do one of the following:

1. If $a <\min S:$ Increment $l$
2. If $a >\max S:$ Increment $h$
3. If $\min S and $h \geq l$: Add $a$ to $S$, remove smallest element from $S$, and increment $l$
4. If $\min S and $h: Add $a$ to $S$, remove largest element from $S$, and increment $h$

At the end of the stream, if $1\leq (m+1)/2-l\leq s$ then return the $((m+1)/2-l)$-th smallest element in $S$. Otherwise fail.

Why does the algorithm make sense? It’s easy to show that at any point, the set of elements that has arrived thus far consists of $S$ and $l$ values less than $\min S$ and $h$ values greater that $\max S$. Hence, if at the end of the stream, $S$ has a $((m+1)/2-l)$-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering $r=|h-l|$ and noting that, for some constant $c>0$,

1. Probability a non-zero $r$ is incremented at any step is at most 1/2 if $s\geq cr$
2. $r$ is either incremented or decremented at each step
3. If $s\geq cr$ at the final step, then $1\leq (m+1)/2-l\leq s$ as required

Hence, we have a simple “random-walk-like” set-up and there’s a $s=O(\sqrt{m})$ that ensures that, with high probability, the algorithm doesn’t fail.

Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].

We’re presented with a numerical stream $A=\langle a_1, \ldots, a_m\rangle\in [n]^m$ and we want to compute some function $\sum_{i\in [n]} g(f_i)$ where $f_i=|\{j:a_j=i\}|$ is the frequency of $i$ and $g$ is an arbitrary function with $g(0)=0$. One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following basic estimator for this type of problem:

1. Pick $J$ uniformly at random from $[m]$
2. Compute $R=|\{j\geq J:a_j=a_J\}|$
3. Let $X=m(g(R)-g(R-1))$

Note that $X$ can be computed using $O(\log m+\log n)$ space as the data arrives. It’s not hard to show that $E[X]=\sum_{i\in [n]} g(f_i)$ because, $E[X]= \sum_{i} E[X|a_J=i] \Pr[a_J=i]=\sum_{i} E[g(R)-g(R-1)|a_J=i] f_i$

and $E[g(R)-g(R-1)|a_J=i] = \frac{g(1)-g(0)}{f_i}+\ldots + \frac{g(f_i)-g(f_i-1)}{f_i}=\frac{g(f_i)}{f_i}\ .$

Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of $\sum_{i\in [n]} g(f_i)$. Quite how many basic estimators are required depends on the variance of $X$ and this determines the total space used by the algorithm.

Using this approach, it was shown that the $k$-th order frequency moment, $F_k=\sum_{i\in [n]} f_i^k$, can be approximated up to a factor $(1+\epsilon)$ with probability $(1-\delta)$ using only $O(\epsilon^{-2} n^{1-1/k} \log \delta^{-1} (\log m+\log n))$ bits of space. This was subsequently improved to an (essentially) optimal result in [Indyk, Woodruff].