You are currently browsing the monthly archive for September 2009.

My Biased Coin recently discussed a new paper extending some work I’d done a few years back. I’ll briefly mention this work at the end of the post but first here’s another bite-sized result from [Alon, Matias, Szegedy] that is closely related.

Consider a numerical stream that defines a frequency vector \mathbf{f}=(f_1, \ldots, f_n) where f_i is the frequency of i in the stream. Here’s a simple sketch algorithm that allows you to approximate F_2(\mathbf{f})=\sum_i f_i^2. Let \mathbf{x}=(x_1, \ldots, x_n)\in \{-1,1\}^n be a random vector where the x_i are 4-wise independent and unbiased. Consider r=\mathbf{x}\cdot \mathbf{f} and note that r can be computed incrementally as the stream arrives (given that \mathbf{x} is implicitly stored by the algorithm.)

By the weaker assumption of 2-wise independence, we observe that E[x_i x_j]=0 if i\neq j and so:

E[r^2]=\sum_{i,j\in [n]} f_i f_j E[x_i x_j]=F_2

By the assumption of 4-wise independence, we also observe that E[x_i x_j x_k x_l]=0 unless i=j, k=l or i=k, j=l or i=l, j=k and so:

\mbox{Var}[r^2]= \sum_{i,j,k,l\in [n]} f_i f_j f_k f_l E[x_i x_j x_k x_l]-F_2^2=4\sum_{i<j} f_i^2f_j^2\leq 2 F_2^2

Hence, if we repeat the process with O(\epsilon^{-2} \log \delta^{-1}) independent copies of x, it’s possible to show (via Chebyshev and Chernoff bounds) that by appropriately averaging the results, we get a value that is within a factor (1+\epsilon) of F_2 with probability at least 1-\delta. Note that it was lucky that we didn’t need full coordinate independence because that would have required O(n) bits just to remember \mathbf{x}. It can be shown that remembering O(\log n) bits is sufficient if we only need 4-wise independence.

BONUS! The recent work… Having at least 4-wise independence seemed pretty important to getting a good bound on the variance of r^2. However, it was observed in [Indyk, McGregor] that the following also worked. First pick \mathbf{y,z}\in \{-1,1\}^{\sqrt{n}} where \mathbf{y,z} are independent and the coordinates of each are 4-wise independent. Then let the coordinate values of \mathbf{x} be \{y_i z_j:1\leq i,j\leq\sqrt{n}\}. It’s no longer the case that the coordinates are 4-wise independent but it’s still possible to show that \mbox{Var}[r^2]=O(F_2^2) and this is good enough for our purposes.

In follow up work by [Braverman, Ostrovsky] and [Chung, Liu, Mitzenmacher], it was shown that you can push this idea further and define \mathbf{x} based on k random vectors of length n^{1/k}. The culmination of this work shows that the variance increases to at most 3^k F_2^2 and the resultant algorithm uses \tilde{O}(\epsilon^{-2} 3^k \log \delta^{-1}) space.

At this point, you’d be excused for asking why we all cared about such a construction. The reason is that it’s an important technical step in solving the following problem: given a stream of tuples from [n]^k, can we determine if the k coordinates are independent? In particular, how “far” is the joint distribution from the product distribution defined by considering the frequency of each value in each coordinate separately. When “far” is measured in terms of the Euclidean distance, a neat solution is based on the above analysis. Check out the papers for the details.

If you want to measure independence in terms of the variational distance, check out [Braverman, Ostrovsky]. In the case k=2, measuring independence in terms of the KL-divergence gives the mutual information. For this, see [Indyk, McGregor] again.

We’re presented with a numerical stream A=\langle a_1, \ldots, a_m\rangle\in [n]^m and we want to compute some small-space approximation of \mathbf{f}= (f_1, \ldots, f_n) where f_i=|\{j:a_j=i\}| is the frequency of i. In particular, we want to be able to return estimates \tilde{f}_i that satisfy

f_i\leq \tilde{f}_i \leq f_i+\epsilon m

where i is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in \tilde{O}(\epsilon^{-2} ) space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)

Let p_1, p_2, \ldots, p_t be the first t prime numbers and define “hash” functions h_j(x)= x \bmod p_j. Based on these functions, the CR-Precis algorithm maintains p_t t=O(t^2 \log t) counters c_{i,j} which are initially zero. When we observe a in the stream we increment each of c_{j,h_j(a)} for j\in [t].

Say we’re interested in the frequency of i. At the end of the day, we end up with t overestimates for f_i, i.e., c_{1,h_1(i)}, c_{2,h_2(i)}, \ldots, c_{t,h_t(i)}. Hence, it makes sense to return \tilde{f}_i=\min_j c_{j,h_j(i)}. How much of an overestimate can this be?

Note that for any i'\neq i, there are at most \log n functions h_j under which i,i' collide, i.e., h_j(i)=h_j(i'). This follows from the Chinese Remainder Theorem. Hence, we deduce that,

\sum_j c_{j,h_j(i)} \leq t f_i+ (\sum_{i'\neq i} f_{i'}) \log n

and therefore \tilde{f}_i\leq f_i+(m\log n)/t. Setting t=O(\epsilon^{-1} \log n) gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only \tilde{O}(\epsilon^{-1}) space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a O(t^2 \log t) \times n matrix S in a natural way such that the algorithm described above is simply computing S\mathbf{f}^T. This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream

A=\langle a_1, \ldots, a_m\rangle\in (\{-,+\}\times [n])^m

and define

f_i=|\{j:a_j=(+,i)\}|-|\{j:a_j=(-,i)\}|.

Assuming that all f_i\geq 0, the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing (-,i) and incrementing when processing (+,i).

As noted here, here, and here, the SODA accepts have been announced. There are numerous cool looking papers. Particularly relevant to this blog are:

(If you spot a version of any of the above papers online, please send me a link so I can add it. Cheers.)

UPDATE: Abstracts have now been posted. Also, the Geomblog has a “behind the scenes” look at the workings of this year’s SODA committee.

About

A research blog about data streams and related topics.

Recently Tweeted

Follow

Get every new post delivered to your Inbox.

Join 124 other followers