(Amit apologizes for not actually posting this while at FOCS)

Today there were no streaming papers per se, but we did see some neat work on communication complexity, a topic that looms large in streaming lower bounds. The specific results I’ll talk about don’t have streaming implications I know of, but they all involve polylogs, which I think makes them on topic for this blog. :)

  1. Paul Beame spoke about work with his student Trinh Huynh on lower bounds on multiparty communication complexity (in the number on the forehead model). I can’t do justice to this lovely and challenging topic here. Let me just say that a nontrivial lower bound for any explicit function with \omega(\log n) players remains open. The paper’s result is a nontrivial bound on an AC^0 function that holds for up to O(\log n) players. The AC^0-ness is the crucial new part; the now-classic result of Babai-Nissan-Szegedy was for a function involving PARITY.
  2. Eyal Kushilevitz spoke about joint work with his postdoc Enav Weinreb on DISJOINTNESS with O(\log n)-sized subsets of the universe \{1,\ldots,n\}, with a promise that the intersection size is at most 1. Clearly, the communication complexity of this problem is at most O(\log^2 n), but is it in fact that high? A randomized protocol can achieve O(\log n) communication, by a neat idea due to Hastad and Wigderson. The authors here show that the deterministic complexity is \Omega(\log^2 n / \log\log n). This has some connections to some long-standing open problems such as CLIQUE-vs-IND-SET (see Kushilevitz-Nisan for more on that).
  3. Arkadev Chattopadhyay didn’t talk about communication complexity per se, but about work, joint with Avi Wigderson, motivated by the quest for lower bounds against ACC^0, a goal shared by a well-known line of work in communication complexity. They gave the first exponential size lower bound for certain depth-3 circuits allowing MOD_m gates, with m a product of two distinct primes. To cut a long story short, the reason MOD_m gates make things challenging is that the integers modulo m do not form a field.

On the non-technical front, lunch was on the top (25th) floor of the hotel, in a room affording nice panoramic views of the city. Breakfast, however, left much to be desired, being comprised of just coffee and cupcakes. Thankfully, I had had a sumptuous meal the previous night, at Mary Mac’s.

Anybody who hasn’t heard about the new theory conference Innovations in Computer Science (ICS 2010) has clearly been working too diligently and not reading their theory blogs (see here, there, this, that, those, and these.) Well, it seems like the PC have been busy: the accepted paper list and abstracts have just been posted.

Papers particularly relevant to this blog are:

  • Derandomizing Algorithms on Product Distributions and Other Applications of Order-Based Extraction [Gabizon, Hassidim]
  • Space-Efficient Estimation of Robust Statistics and Distribution Testing [Chien, Ligett, McGregor]
  • Pan-Private Streaming Algorithms [Dwork, Naor, Pitassi, Rothblum, Yekhanin]

Looks like privacy could be a hot topic in forthcoming data streams research (see also this recent post from Muthu.)

As before, if you spot a version of any of the above papers online, please send me a link so I can add it. Cheers.

(Amit Chakrabarti reports from Atlanta on FOCS 2009)

The FOCS 2009 programme proper begins tomorrow. Today, at Georgia Tech, we listened to four hour-long talks by distinguished researchers, celebrating the 50th anniversary of FOCS and the 20th of the university’s ACO program.

  1. We heard Dick Karp talk about what makes an algorithm “great”. The talk surveyed many of the significant milestones in our ongoing journey of discovery. In passing, he mentioned a number of (mostly recently-developed) subareas he didn’t get to dwell upon, one of which was “streaming algorithms”.
  2. Mihalis Yannakakis spoke about the computational aspects of finding equilibria, and the related issues of local search algorithms and convergence. The SQUARE-ROOT-SUM problem made an appearance, as did a joke about the throwing-overboard of Hippasus.
  3. Noga Alon talked about isoperimetric inequalities and their applications in algorithms (focussing on the DISJOINT-PATHS problem) and combinatorics (focussing on finding economical toric spines). Noga recalled publishing his first FOCS paper exactly 25 years ago, and thanked 21 of the 25 PCs since then for treating well his many submissions. He consulted his notes to list the four exceptions. This was easily the most technical of today’s talks. Noga even apologized for this mid-talk, to my surprise (doesn’t this community enjoy the technical details?).
  4. Rounding out the day was a talk by Manuel Blum who hardly talked TCS (as we know it) at all, focussing instead of the deeply philosophical matter of the definition of consciousness. He did not provide an attempt at a crisp answer, but instead gave five vignettes, one of which introduced us to Global Workspace Theory. This talk was unlike any we are used to hearing from a computer scientist. Though it did not suggest any research directions to me, it had the stimulating effect of a good Borges short.

Over dinner, I couldn’t help but comment to my companions on what I wished I’d heard today: What were the controversies in the field 20, 30, 40 years ago? I’m sure the more senior among us could regale us with tales that make the Conceptual Wars of today seem tame!

The last four BSS posts considered streams of numerical values. While there are still many more bite-sized results to cover for such streams, I thought we’d mix things up a bit and switch to graph streams for the next four posts. We’ll consider a stream of unweighted, undirected edges \langle e_1, \ldots, e_m\rangle and try to estimate properties of the graph G=(V,E) defined by this stream subject to the usual constraints of stream computation (sequential access to the stream and limited memory). For this post, we focus on estimating distances.

The results we’ll discuss revolve around constructing a t-spanner, a sub-graph H=(V,E') of G such that for any u,v\in V

d_G(u,v)\leq d_H(u,v) \leq t d_G(u,v)

where d_X(u,v) is the length of the shortest path between u, v\in V in graph X. Obviously, given such a graph we can approximate any distance up to a factor of t.

So how can we construct a t-spanner in the data-stream model? Here’s a very simple algorithm: Intialize H to \emptyset and on receiving edge e=(u,v), add the edge to H if and only if d_H(u,v)>t.

It’s easy to show that H is indeed a t-spanner because, for each edge e=(u,v) on a shortest path in G, H either includes e or a path between u and v of length at most t. The space taken by the algorithm is just the space required to store H. How large can this be? Well, note that H has girth at least t+2, i.e., it contains no cycles of length t+1 or shorter. It can be shown that such a graph has at most O(n^{1+2/(t+1)}) edges if t is odd (see e.g., [Bollobás] or [Indyk's lecture slides]) and hence we have a stream algorithm using \tilde{O}(n^{1+2/(t+1)}) bits of memory that finds a t approximation for any shortest path distance.

Unfortunately, the time to process each edge in the above algorithm is rather large and this spurred some nice follow-up work including [Feigenbaum et al.], [Baswana], and [Elkin]. It’s now known how to construct a t-spanner in \tilde{O}(n^{1+2/(t+1)}) space and constant worst-case time per edge.

It’s not hard to extend these algorithms to weighted graphs (consider geometrically grouping the edge weights and constructing a spanner for each distinct edge weight) but there’s bad news for directed graphs: even determining if there exists a directed path between two nodes requires \Omega(n^2) bits of memory.

BONUS! Some British English… Back in the UK, we call this

Spanner.

a spanner. Apparently that’s not the usual term in US English and I found this out after giving a talk at SODA back in 2005. The talk went reasonably well but, during the coffee break, numerous people asked why I’d kept on displaying pictures of wrenches throughout my talk. Whoops.

In technical writing, I normally endeavor [sic] to use US English (although I did once engage in some “colouring of paths” in a paper with a Canadian co-author). But I still think “math and stat” sounds funny :)

My Biased Coin recently discussed a new paper extending some work I’d done a few years back. I’ll briefly mention this work at the end of the post but first here’s another bite-sized result from [Alon, Matias, Szegedy] that is closely related.

Consider a numerical stream that defines a frequency vector \mathbf{f}=(f_1, \ldots, f_n) where f_i is the frequency of i in the stream. Here’s a simple sketch algorithm that allows you to approximate F_2(\mathbf{f})=\sum_i f_i^2. Let \mathbf{x}=(x_1, \ldots, x_n)\in \{-1,1\}^n be a random vector where the x_i are 4-wise independent and unbiased. Consider r=\mathbf{x}\cdot \mathbf{f} and note that r can be computed incrementally as the stream arrives (given that \mathbf{x} is implicitly stored by the algorithm.)

By the weaker assumption of 2-wise independence, we observe that E[x_i x_j]=0 if i\neq j and so:

E[r^2]=\sum_{i,j\in [n]} f_i f_j E[x_i x_j]=F_2

By the assumption of 4-wise independence, we also observe that E[x_i x_j x_k x_l]=0 unless i=j, k=l or i=k, j=l or i=l, j=k and so:

\mbox{Var}[r^2]= \sum_{i,j,k,l\in [n]} f_i f_j f_k f_l E[x_i x_j x_k x_l]-F_2^2=4\sum_{i<j} f_i^2f_j^2\leq 2 F_2^2

Hence, if we repeat the process with O(\epsilon^{-2} \log \delta^{-1}) independent copies of x, it’s possible to show (via Chebyshev and Chernoff bounds) that by appropriately averaging the results, we get a value that is within a factor (1+\epsilon) of F_2 with probability at least 1-\delta. Note that it was lucky that we didn’t need full coordinate independence because that would have required O(n) bits just to remember \mathbf{x}. It can be shown that remembering O(\log n) bits is sufficient if we only need 4-wise independence.

BONUS! The recent work… Having at least 4-wise independence seemed pretty important to getting a good bound on the variance of r^2. However, it was observed in [Indyk, McGregor] that the following also worked. First pick \mathbf{y,z}\in \{-1,1\}^{\sqrt{n}} where \mathbf{y,z} are independent and the coordinates of each are 4-wise independent. Then let the coordinate values of \mathbf{x} be \{y_i z_j:1\leq i,j\leq\sqrt{n}\}. It’s no longer the case that the coordinates are 4-wise independent but it’s still possible to show that \mbox{Var}[r^2]=O(F_2^2) and this is good enough for our purposes.

In follow up work by [Braverman, Ostrovsky] and [Chung, Liu, Mitzenmacher], it was shown that you can push this idea further and define \mathbf{x} based on k random vectors of length n^{1/k}. The culmination of this work shows that the variance increases to at most 3^k F_2^2 and the resultant algorithm uses \tilde{O}(\epsilon^{-2} 3^k \log \delta^{-1}) space.

At this point, you’d be excused for asking why we all cared about such a construction. The reason is that it’s an important technical step in solving the following problem: given a stream of tuples from [n]^k, can we determine if the k coordinates are independent? In particular, how “far” is the joint distribution from the product distribution defined by considering the frequency of each value in each coordinate separately. When “far” is measured in terms of the Euclidean distance, a neat solution is based on the above analysis. Check out the papers for the details.

If you want to measure independence in terms of the variational distance, check out [Braverman, Ostrovsky]. In the case k=2, measuring independence in terms of the KL-divergence gives the mutual information. For this, see [Indyk, McGregor] again.

We’re presented with a numerical stream A=\langle a_1, \ldots, a_m\rangle\in [n]^m and we want to compute some small-space approximation of \mathbf{f}= (f_1, \ldots, f_n) where f_i=|\{j:a_j=i\}| is the frequency of i. In particular, we want to be able to return estimates \tilde{f}_i that satisfy

f_i\leq \tilde{f}_i \leq f_i+\epsilon m

where i is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in \tilde{O}(\epsilon^{-2} ) space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)

Let p_1, p_2, \ldots, p_t be the first t prime numbers and define “hash” functions h_j(x)= x \bmod p_j. Based on these functions, the CR-Precis algorithm maintains p_t t=O(t^2 \log t) counters c_{i,j} which are initially zero. When we observe a in the stream we increment each of c_{j,h_j(a)} for j\in [t].

Say we’re interested in the frequency of i. At the end of the day, we end up with t overestimates for f_i, i.e., c_{1,h_1(i)}, c_{2,h_2(i)}, \ldots, c_{t,h_t(i)}. Hence, it makes sense to return \tilde{f}_i=\min_j c_{j,h_j(i)}. How much of an overestimate can this be?

Note that for any i'\neq i, there are at most \log n functions h_j under which i,i' collide, i.e., h_j(i)=h_j(i'). This follows from the Chinese Remainder Theorem. Hence, we deduce that,

\sum_j c_{j,h_j(i)} \leq t f_i+ (\sum_{i'\neq i} f_{i'}) \log n

and therefore \tilde{f}_i\leq f_i+(m\log n)/t. Setting t=O(\epsilon^{-1} \log n) gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only \tilde{O}(\epsilon^{-1}) space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a O(t^2 \log t) \times n matrix S in a natural way such that the algorithm described above is simply computing S\mathbf{f}^T. This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream

A=\langle a_1, \ldots, a_m\rangle\in (\{-,+\}\times [n])^m

and define

f_i=|\{j:a_j=(+,i)\}|-|\{j:a_j=(-,i)\}|.

Assuming that all f_i\geq 0, the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing (-,i) and incrementing when processing (+,i).

As noted here, here, and here, the SODA accepts have been announced. There are numerous cool looking papers. Particularly relevant to this blog are:

(If you spot a version of any of the above papers online, please send me a link so I can add it. Cheers.)

UPDATE: Abstracts have now been posted. Also, the Geomblog has a “behind the scenes” look at the workings of this year’s SODA committee.

Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.

We consider a stream A=\langle a_1, \ldots, a_m\rangle formed by arranging a set of distinct values B=\{b_1, \ldots, b_m\} in random order, i.e., a_i=b_{\pi(i)} for a permutation \pi of [m] chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).

Here’s a simple algorithm that maintains a set S of s=O(\sqrt{m}) elements and two counters h and l which are initialized to zero. Store the first s values from the stream in S and as each new element a arrives, do one of the following:

  1. If a <\min S: Increment l
  2. If a >\max S: Increment h
  3. If \min S<a<\max S and h \geq l: Add a to S, remove smallest element from S, and increment l
  4. If \min S<a<\max S and h<l: Add a to S, remove largest element from S, and increment h

At the end of the stream, if 1\leq (m+1)/2-l\leq s then return the ((m+1)/2-l)-th smallest element in S. Otherwise fail.

Why does the algorithm make sense? It’s easy to show that at any point, the set of elements that has arrived thus far consists of S and l values less than \min S and h values greater that \max S. Hence, if at the end of the stream, S has a ((m+1)/2-l)-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering r=|h-l| and noting that, for some constant c>0,

  1. Probability a non-zero r is incremented at any step is at most 1/2 if s\geq cr
  2. r is either incremented or decremented at each step
  3. If s\geq cr at the final step, then 1\leq (m+1)/2-l\leq s as required

Hence, we have a simple “random-walk-like” set-up and there’s a s=O(\sqrt{m}) that ensures that, with high probability, the algorithm doesn’t fail.

Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].

We’re presented with a numerical stream A=\langle a_1, \ldots, a_m\rangle\in [n]^m and we want to compute some function \sum_{i\in [n]} g(f_i) where f_i=|\{j:a_j=i\}| is the frequency of i and g is an arbitrary function with g(0)=0. One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following basic estimator for this type of problem:

  1. Pick J uniformly at random from [m]
  2. Compute R=|\{j\geq J:a_j=a_J\}|
  3. Let X=m(g(R)-g(R-1))

Note that X can be computed using O(\log m+\log n) space as the data arrives. It’s not hard to show that E[X]=\sum_{i\in [n]} g(f_i) because,

E[X]= \sum_{i} E[X|a_J=i] \Pr[a_J=i]=\sum_{i} E[g(R)-g(R-1)|a_J=i] f_i

and

E[g(R)-g(R-1)|a_J=i] = \frac{g(1)-g(0)}{f_i}+\ldots + \frac{g(f_i)-g(f_i-1)}{f_i}=\frac{g(f_i)}{f_i}\ .

Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of \sum_{i\in [n]} g(f_i). Quite how many basic estimators are required depends on the variance of X and this determines the total space used by the algorithm.

Using this approach, it was shown that the k-th order frequency moment, F_k=\sum_{i\in [n]} f_i^k, can be approximated up to a factor (1+\epsilon) with probability (1-\delta) using only O(\epsilon^{-2} n^{1-1/k} \log \delta^{-1} (\log m+\log n)) bits of space. This was subsequently improved to an (essentially) optimal result in [Indyk, Woodruff].

Welcome to the polylogblog! For reasons that are still not entirely clear to me, I decided to start a research blog. This is it. Topics of interest include data streams, sketches, compressed sensing, communication complexity, and information theory. We’ll see what happens.

To get the ball rolling, I’m planning a bunch of bite-sized posts on various results related to data streams. Most will be easy, some will be harder, some obscure, some well known. Clarification and/or corrections more than welcome!