We’re presented with a numerical stream and we want to compute some small-space approximation of where is the frequency of . In particular, we want to be able to return estimates that satisfy

where is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)

Let be the first prime numbers and define “hash” functions . Based on these functions, the CR-Precis algorithm maintains counters which are initially zero. When we observe in the stream we increment each of for .

Say we’re interested in the frequency of . At the end of the day, we end up with overestimates for , i.e., . Hence, it makes sense to return . How much of an overestimate can this be?

Note that for any , there are at most functions under which collide, i.e., . This follows from the Chinese Remainder Theorem. Hence, we deduce that,

and therefore . Setting gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a *sketch* algorithm, i.e., it’s possible to transform the hash functions into a matrix in a natural way such that the algorithm described above is simply computing . This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream

and define

.

Assuming that all , the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing and incrementing when processing .

### Like this:

Like Loading...

*Related*

## 8 comments

Comments feed for this article

September 22, 2009 at 1:41 pm

Graham CormodeI like the construction, it’s very neat. However, it’s also worth mentioning that there are also randomized structures that meet the “sketch” requirement and have the eps^{-1} space usage. In this context, I tend to think of the CR-precis as a derandomization of results which can use sketches (giving deterministic streaming results for point queries, quantiles, etc.). The various overheads mean that you are unlikely to want to ever actually implement it, unless you really really want a deterministic solution.

September 22, 2009 at 1:49 pm

polylogblogQuite right. We’ll be discussing some of those sketches soon :)

September 24, 2009 at 12:16 am

Piotr IndykAlternatively, one can take a look at a recent article in Communications of the ACM , describing some of those algorithms.

September 24, 2009 at 1:05 am

polylogblogCool. I knew the CACM article had been written but I hadn’t realized that it had appeared.

Graham: Should I replace the link to your VLDB paper with a link to your CACM paper?

September 24, 2009 at 12:23 am

Amit ChakrabartiThe step where you mentioned CRT doesn’t actually need CRT; it’s just a straighforward consequence of the fact that a natural number $N$ has at most $\lg N$ prime factors.

September 24, 2009 at 12:54 am

polylogblogGood point Amit. Of course, since the CR is CR-precis stands for “Chinese Remainder” perhaps we need a new name for this sketch. Can anyone suggest a backronym for “GM-sketch”?

September 24, 2009 at 1:50 pm

Piotr IndykBtw: Andrew, welcome to the Planet Blog! (not that I live on that planet :)

October 14, 2009 at 5:52 am

metooConcur with AC. All you need is a natural number $N$ has at most $\lg N$ prime factors. This is how the proof is phrased in [Pg 31] you refer to.

Also, let me point out that these prime number based constructions have been used in “Improved Bounds for a Deterministic Sublinear-Time Sparse Fourier Algorithm” by Mark Iwen and C.V. Spencer. http://www.ima.umn.edu/~iwen/WebPage/improved08.pdf and other places.

Finally, when I was thinking about that Pg31 result, I had in mind a bigger technical problem: coding and decoding in sublinear space deterministically for suitable codes (Group testing, k-selectors, Reed-Muller, ….). Progress on that problem is here: Efficiently Decodable Non-adaptive Group Testing. Piotr Indyk, Hung Q. Ngo and Atri Rudra

To Appear in SODA 2010.

- Metoo