You are currently browsing the monthly archive for July 2009.

We’re presented with a numerical stream $A=\langle a_1, \ldots, a_m\rangle\in [n]^m$ and we want to compute some function $\sum_{i\in [n]} g(f_i)$ where $f_i=|\{j:a_j=i\}|$ is the frequency of $i$ and $g$ is an arbitrary function with $g(0)=0$. One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following basic estimator for this type of problem:

1. Pick $J$ uniformly at random from $[m]$
2. Compute $R=|\{j\geq J:a_j=a_J\}|$
3. Let $X=m(g(R)-g(R-1))$

Note that $X$ can be computed using $O(\log m+\log n)$ space as the data arrives. It’s not hard to show that $E[X]=\sum_{i\in [n]} g(f_i)$ because, $E[X]= \sum_{i} E[X|a_J=i] \Pr[a_J=i]=\sum_{i} E[g(R)-g(R-1)|a_J=i] f_i$

and $E[g(R)-g(R-1)|a_J=i] = \frac{g(1)-g(0)}{f_i}+\ldots + \frac{g(f_i)-g(f_i-1)}{f_i}=\frac{g(f_i)}{f_i}\ .$

Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of $\sum_{i\in [n]} g(f_i)$. Quite how many basic estimators are required depends on the variance of $X$ and this determines the total space used by the algorithm.

Using this approach, it was shown that the $k$-th order frequency moment, $F_k=\sum_{i\in [n]} f_i^k$, can be approximated up to a factor $(1+\epsilon)$ with probability $(1-\delta)$ using only $O(\epsilon^{-2} n^{1-1/k} \log \delta^{-1} (\log m+\log n))$ bits of space. This was subsequently improved to an (essentially) optimal result in [Indyk, Woodruff].

### About

A research blog about data streams and related topics.