We’re presented with a numerical stream A=\langle a_1, \ldots, a_m\rangle\in [n]^m and we want to compute some small-space approximation of \mathbf{f}= (f_1, \ldots, f_n) where f_i=|\{j:a_j=i\}| is the frequency of i. In particular, we want to be able to return estimates \tilde{f}_i that satisfy

f_i\leq \tilde{f}_i \leq f_i+\epsilon m

where i is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in \tilde{O}(\epsilon^{-2} ) space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)

Let p_1, p_2, \ldots, p_t be the first t prime numbers and define “hash” functions h_j(x)= x \bmod p_j. Based on these functions, the CR-Precis algorithm maintains p_t t=O(t^2 \log t) counters c_{i,j} which are initially zero. When we observe a in the stream we increment each of c_{j,h_j(a)} for j\in [t].

Say we’re interested in the frequency of i. At the end of the day, we end up with t overestimates for f_i, i.e., c_{1,h_1(i)}, c_{2,h_2(i)}, \ldots, c_{t,h_t(i)}. Hence, it makes sense to return \tilde{f}_i=\min_j c_{j,h_j(i)}. How much of an overestimate can this be?

Note that for any i'\neq i, there are at most \log n functions h_j under which i,i' collide, i.e., h_j(i)=h_j(i'). This follows from the Chinese Remainder Theorem. Hence, we deduce that,

\sum_j c_{j,h_j(i)} \leq t f_i+ (\sum_{i'\neq i} f_{i'}) \log n

and therefore \tilde{f}_i\leq f_i+(m\log n)/t. Setting t=O(\epsilon^{-1} \log n) gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only \tilde{O}(\epsilon^{-1}) space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a O(t^2 \log t) \times n matrix S in a natural way such that the algorithm described above is simply computing S\mathbf{f}^T. This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream

A=\langle a_1, \ldots, a_m\rangle\in (\{-,+\}\times [n])^m

and define

f_i=|\{j:a_j=(+,i)\}|-|\{j:a_j=(-,i)\}|.

Assuming that all f_i\geq 0, the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing (-,i) and incrementing when processing (+,i).