We’re presented with a numerical stream and we want to compute some small-space approximation of where is the frequency of . In particular, we want to be able to return estimates that satisfy

where is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)

Let be the first prime numbers and define “hash” functions . Based on these functions, the CR-Precis algorithm maintains counters which are initially zero. When we observe in the stream we increment each of for .

Say we’re interested in the frequency of . At the end of the day, we end up with overestimates for , i.e., . Hence, it makes sense to return . How much of an overestimate can this be?

Note that for any , there are at most functions under which collide, i.e., . This follows from the Chinese Remainder Theorem. Hence, we deduce that,

and therefore . Setting gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a *sketch* algorithm, i.e., it’s possible to transform the hash functions into a matrix in a natural way such that the algorithm described above is simply computing . This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream

and define

.

Assuming that all , the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing and incrementing when processing .

### Like this:

Like Loading...

*Related*

## 10 comments

Comments feed for this article

September 22, 2009 at 1:41 pm

Graham CormodeI like the construction, it’s very neat. However, it’s also worth mentioning that there are also randomized structures that meet the “sketch” requirement and have the eps^{-1} space usage. In this context, I tend to think of the CR-precis as a derandomization of results which can use sketches (giving deterministic streaming results for point queries, quantiles, etc.). The various overheads mean that you are unlikely to want to ever actually implement it, unless you really really want a deterministic solution.

September 22, 2009 at 1:49 pm

polylogblogQuite right. We’ll be discussing some of those sketches soon :)

September 24, 2009 at 12:16 am

Piotr IndykAlternatively, one can take a look at a recent article in Communications of the ACM , describing some of those algorithms.

September 24, 2009 at 1:05 am

polylogblogCool. I knew the CACM article had been written but I hadn’t realized that it had appeared.

Graham: Should I replace the link to your VLDB paper with a link to your CACM paper?

September 24, 2009 at 12:23 am

Amit ChakrabartiThe step where you mentioned CRT doesn’t actually need CRT; it’s just a straighforward consequence of the fact that a natural number $N$ has at most $\lg N$ prime factors.

September 24, 2009 at 12:54 am

polylogblogGood point Amit. Of course, since the CR is CR-precis stands for “Chinese Remainder” perhaps we need a new name for this sketch. Can anyone suggest a backronym for “GM-sketch”?

September 24, 2009 at 1:50 pm

Piotr IndykBtw: Andrew, welcome to the Planet Blog! (not that I live on that planet :)

October 14, 2009 at 5:52 am

metooConcur with AC. All you need is a natural number $N$ has at most $\lg N$ prime factors. This is how the proof is phrased in [Pg 31] you refer to.

Also, let me point out that these prime number based constructions have been used in “Improved Bounds for a Deterministic Sublinear-Time Sparse Fourier Algorithm” by Mark Iwen and C.V. Spencer. http://www.ima.umn.edu/~iwen/WebPage/improved08.pdf and other places.

Finally, when I was thinking about that Pg31 result, I had in mind a bigger technical problem: coding and decoding in sublinear space deterministically for suitable codes (Group testing, k-selectors, Reed-Muller, ….). Progress on that problem is here: Efficiently Decodable Non-adaptive Group Testing. Piotr Indyk, Hung Q. Ngo and Atri Rudra

To Appear in SODA 2010.

– Metoo

September 17, 2014 at 10:48 pm

Jelani(reminded of this post from a friend’s notes)

For anyone who finds the post in the future, just thought I’d point out that what’s really going on is that you’re sketching with an incoherent matrix A. That is, the columns A_1,…,A_n of A have unit norm and || < eps for i != j. To recover x from y = Ax, you set x~ = A^T y. Then the incoherence property easily implies ||x~ – x||_infty < eps * ||x||_1, which is exactly the l1 point query guarantee. See http://arxiv.org/abs/1206.5725 for details.

You can form such a matrix from codes, or via other ways. It turns out that you can replace this "divisor code" in CR-Precis with Reed-Solomon codes and get a strictly better space bound. The main goal is to minimize (block length) x (alphabet size) (which will be the # of rows of A) subject to the code needing to have at least n codewords and relative distance 1 – eps (essentially you break the rows of A into (block length) blocks each of length (alphabet size), then each column of A is a codeword written in unary then normalized by 1/sqrt(block length) to get unit norm).

September 17, 2014 at 10:48 pm

JelaniThe above should have said “| A_i dot product A_j | < eps for i != j". I guess there's an issue with misparsing as HTML tags.