We’re presented with a numerical stream $A=\langle a_1, \ldots, a_m\rangle\in [n]^m$ and we want to compute some small-space approximation of $\mathbf{f}= (f_1, \ldots, f_n)$ where $f_i=|\{j:a_j=i\}|$ is the frequency of $i$. In particular, we want to be able to return estimates $\tilde{f}_i$ that satisfy $f_i\leq \tilde{f}_i \leq f_i+\epsilon m$

where $i$ is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in $\tilde{O}(\epsilon^{-2} )$ space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)

Let $p_1, p_2, \ldots, p_t$ be the first $t$ prime numbers and define “hash” functions $h_j(x)= x \bmod p_j$. Based on these functions, the CR-Precis algorithm maintains $p_t t=O(t^2 \log t)$ counters $c_{i,j}$ which are initially zero. When we observe $a$ in the stream we increment each of $c_{j,h_j(a)}$ for $j\in [t]$.

Say we’re interested in the frequency of $i$. At the end of the day, we end up with $t$ overestimates for $f_i$, i.e., $c_{1,h_1(i)}, c_{2,h_2(i)}, \ldots, c_{t,h_t(i)}$. Hence, it makes sense to return $\tilde{f}_i=\min_j c_{j,h_j(i)}$. How much of an overestimate can this be?

Note that for any $i'\neq i$, there are at most $\log n$ functions $h_j$ under which $i,i'$ collide, i.e., $h_j(i)=h_j(i')$. This follows from the Chinese Remainder Theorem. Hence, we deduce that, $\sum_j c_{j,h_j(i)} \leq t f_i+ (\sum_{i'\neq i} f_{i'}) \log n$

and therefore $\tilde{f}_i\leq f_i+(m\log n)/t$. Setting $t=O(\epsilon^{-1} \log n)$ gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only $\tilde{O}(\epsilon^{-1})$ space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a $O(t^2 \log t) \times n$ matrix $S$ in a natural way such that the algorithm described above is simply computing $S\mathbf{f}^T$. This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream $A=\langle a_1, \ldots, a_m\rangle\in (\{-,+\}\times [n])^m$

and define $f_i=|\{j:a_j=(+,i)\}|-|\{j:a_j=(-,i)\}|$.

Assuming that all $f_i\geq 0$, the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing $(-,i)$ and incrementing when processing $(+,i)$.