Next up in the mini-course on data streams (first two lectures here and here) were lower bounds and communication complexity. The slides are here:

The outline was:

  1. Basic Framework: If you have a small-space algorithm for stream problem Q, you can turn this into a low-communication protocol for a related communication problem P. Hence, a lower bound on the communication required to solve P implies a lower bound on the space required to solve Q. Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation.
  2. Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.
  3. Hamming Distance: Lastly, we presented a lower bound on the Hamming approximation problem using the ingenious but simple proof [Jayram et al.]

Tout le monde! Here’s the group photo from this year’s L’Ecole de Printemps d’Informatique Théorique.

Piotr Indyk asked if I could post the following announcement and I’m happy to oblige.

Piotr’s Post-Doc Position: Applications are invited for a Postdoctoral Research Assistant position for the MIT-Shell-Draper Lab research project

“Applications of compressive sensing and sparse structure exploitation in hydrocarbon reservoir exploration and surveillance”

The goal of the project is to develop novel compressive sensing algorithms for geoscience problems in the area of hydrocarbon field exploration and surveillance. The appointment is initially for one-year, with the possibility of renewal for up to 3 years. The appointment should start either during the summer (the preferred option) or the fall of 2012.

Duties and Responsibilities:

  • Provide expertise on and contribute to the development of compressive sensing and sparse recovery algorithms for geoscience applications
  • Help MIT faculty in coordination of research projects, including periodic reporting and documentation as required by the program timeline
  • Frequent travel to Shell facilities in Houston

Minimum Qualifications

  • Ph.D. in Computer Science, Electrical Engineering, Mathematics or related disciplines

Preferred Qualifications

  • Expertise in sparse recovery, compressive sensing, streaming/sketching, machine learning and statistical inference algorithms
  • Experience and/or interest in research projects of interdisciplinary nature
  • Programming experience (MATLAB)

Applications (including CV and three reference letters) should be submitted to

ideally by April 15, 2012. However, applications will be considered until the position is filled.

In the second lecture, we discussed algorithms for graph streams. Here we observe a sequence of edges and the goal is to estimate properties of the underlying graph without storing all the edges. The slides are here:

We covered:

  1. Spanners: If you judiciously store only a subset of the edges, you can ensure that the distance between any two nodes in the subgraph is only a constant factor larger than the distance in the entire graph.
  2. Sparsifiers: Here you also remember a subset of edges but also apply weights to these edges such that the capacity of every cut is preserved up to a small factor. We show how to do this incrementally.
  3. Sketching: All this talk of remembering a subset of the edges is all very well but what if we’re dealing with a fully dynamic graph where edges are both added and deleted. Can we still compute interesting graph properties if we can’t be sure the edges we remember wouldn’t subsequently be deleted? You’ll have to see the slides to find out. Or see this SODA paper.

Lunch Break! If you want to fully recreate the experience of a French workshop, I suggest you now find yourself a plate of oysters and break for lunch. Lectures will resume shortly.

Just back from EPIT 2012 where I gave four lectures on data streams and a bit of communication complexity. Other lecturers discussed semi-definite programming, inapproximability, and quantum computing. Thanks to Iordanis Kerenidis, Claire Mathieu, and Frédéric Magniez for a great week.

Anyhow, I wanted to post my slides along with some editorial comments that may or may not be of interest. Here’s the first lecture:

The goal was to cover some of the basic algorithmic techniques such as different forms of sampling and random linear projections. The basic narrative was:

  1. Sampling: Sure, you can uniformly sample from the stream and try to make some inferences from the sample. But there are better ways to use your limited memory.
  2. Sketching: You can compute random linear projections of the data. Hash-based sketches like Count-Min and Count-Sketch are versatile and solve many problems including quantile estimation, range queries, and heavy hitters.
  3. Sampling Again: A useful primitives enabled by sketching, is non-uniform sampling such as \ell_p sampling where the probability of returning a stream element of type i is proportional to the p-th power of the frequency of i.

I bookended the lecture with algorithms for frequency moments, first a suboptimal result via AMS sampling (from Alon et al. STOC 1996) and then a near-optimal result via \ell_p sampling (from Andoni et al. FOCS 2011 and Jowhari et al. PODS 2011). I was tempted to use another example but that seemed churlish given the importance of frequency moments in the development of data streams theory. To keep things simple I assumed algorithms had an unlimited source of random bits. Other than that I was quite happy that I managed to present most of the details without drowning in a sea of definitions and notation.

Everyone loves the union bound. It is the most dependable of all probability bounds. No matter what mess of dependencies threaten your analysis, the union bound will ride in and save the day. Except when it doesn’t.

I’ve been giving a talk about some recent results (here are some slides) and have received a particular question from a couple of different audiences. The exact details aren’t important but the basic gist was why I didn’t simplify some of the analysis by just applying the union bound. The short answer was it didn’t apply. But how could that be? My longer answer was longer and probably less helpful. So let me distill things down to the following paradox (which isn’t really a paradox of course).

The Geography Test. Suppose you have a geography test tomorrow. In the test you’ll be asked two questions out of five possible questions (and you already know the list of possible questions). If you get both right, you’ll pass the course. However, rather than revise everything you adopt a random strategy that ensures that for any question, you’ll have a correct answer with probability at least 4/5. Hence, by the union bound, you’ll get both questions right with probability at least 3/5.

Right? Sure, it’s a straight-forward application of the union bound: the probability of being wrong on one or both questions is at most the probability of being wrong on the first (at most 1/5) plus the probability of being wrong on the second (at most 1/5). And the union bound never let’s you down.

The Bad News. You’re going to fail both your geography and probability classes. Consider the following scenario. The five possible exam questions are: 1) name a city in the US, 2) name a city in France, 3) name a city in Germany, 4) name a city in the UK, and 5) name a city in Italy. Your cunning strategy is to memorize one of the rows in the following table.

US France Germany UK Italy
New York Paris Berlin Glasgow Reykjavik
Philadelphia Paris Berlin Reykjavik Rome
Los Angeles Paris Reykjavik Glasgow Rome
Boston Reykjavik Berlin Glasgow Rome
Reykjavik Paris Berlin Glasgow Rome

You choose the row to remember uniformly at random thus guaranteeing that you can answer any question correctly with probability 4/5. Of course the professor doesn’t know which row you’ll pick but he does know your strategy. (This is equivalent to the usual assumption we make about adversaries knowing your algorithm but not your random bits.)

Unfortunately, you stay up so late studying your row that you sleep in and miss the exam. You go to the professor’s office to plead for a make-up exam. He is initially reluctant but then a devious smile crosses his face. Sure, he’ll still give you the exam. You’re relieved that you’ll still pass the course with probability at least 3/5. The professor doesn’t look so sure.

The first question he asks is to name a city in the US. “Philadelphia” you answer with confidence. As the professor nods you see that devious smile again. The second question… name a city in the UK.

And then you realize that the probability of success was never as large as 3/5 since the professor could always guarantee that you fail by first asking for a US city and then choosing his second question accordingly.

Morals of the story: Don’t miss exams, study hard, and don’t blindly trust the union bound.

That time of year again… the list of accepted papers for SODA has been posted. Some of the papers that might be of (direct or indirect*) interest to the streaming/sketching crowd include:

  • Sublinear Time, Measurement-Optimal, Sparse Recovery For All [Porat, Strauss]
  • Analyzing Graph Structure via Linear Measurements [Ahn, Guha, McGregor]
  • Sparser Johnson-Lindenstrauss Transforms [Kane, Nelson]
  • On the communication and streaming complexity of maximum bipartite matching [Goel, Kapralov, Khanna]
  • The Shifting Sands Algorithm [McGregor, Valiant]
  • Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy [Phillips, Verbin, Zhang]
  • A Near-Optimal Sublinear-Time Algorithm for Approximating the Minimum Vertex Cover Size [Onak, Ron, Rosen, Rubinfeld]
  • Optimal Column-Based Low-Rank Matrix Reconstruction [Guruswami, Sinop]
  • Simple and Practical Algorithm for Sparse Fourier Transform [Hassanieh, Indyk, Katabi, Price]
  • Sketching Valuation Functions [Badanidiyuru, Dobzinski, Fu, Kleinberg, Nisan, Roughgarden]
  • Width of Points in the Streaming Model [Andoni, Nguyen]

(*) I’m not entirely sure what metric I’m using to determine “indirect interest” but rest assured, once I know, I’ll let you know. There also might be some additions to the above list once I’ve seen the actual papers.

BONUS! Comments from the chair… The PC chair, Yuval Rabani, discusses the selection process. Other posts on the selected papers include this and that and can be found here and there.

A few weeks ago I attended a workshop on Coding, Complexity, and Sparsity at the University of Michigan. Thanks to the organizers (Anna Gilbert, Martin Strauss, Atri Rudra, Hung Ngo, Ely Porat, and S. Muthukrishnan) for not only putting together a great program but also for treating the speakers like celebrities! The place swarmed with autograph hunters, roads were closed, and security kept the paparazzi at bay… at least I think this was all for our benefit.

I gave a brief tutorial on data streams and my slides can be found here if you’re interested. One of the results I went through was for the \ell_0-sampling problem introduced by [Cormode, Muthukrishnan, Rozenbaum] and [Frahling, Indyk, Sohler]. See also [Monemizadeh and Woodruff] and [Jowhari, Saglam, Tardos]. Here the set-up is that you see a sequence of m updates to a length n vector v. These updates can increment or decrement entries of v although for the talk I assumed that the entires themselves always remained non-negative. E.g., for m=5, n=6 the sequence

(6,+),~ (3,+),~ (6,-),~ (5,+),~ (2,+)

would result in the vector v=(0,1,1,0,1,0). An algorithm for \ell_0-sampling should return an element chosen uniformly at random from the set \{i: v_i> 0\}. See the slides (or the original papers) for an algorithm using \textrm{polylog} (m,n) space. Anyhow, during the talk I mentioned there was a simple trick to determine whether

\ell_0(v):=|\{i: v_i> 0\}|=1 .

But I decided to leave it as an easy puzzle for which I’d give the answer later. Of course, I forgot. See the comments to this post for the answer.

Other things to check out:

Piotr Indyk, Ilan Newman, Krzysztof Onak, and I have just finished up a new open problems list. Here it is! It mainly covers topics in data streams and property testing but there’s also some other cool stuff. The list is compiled from the recent Bertinoro workshop and the slightly less recent Kanpur workshop.

(Cartoon from XKCD obviously.)

Streaming and sketching papers in the RANDOM/APPROX accepted list include:

  • Streaming Algorithms with One-Sided Estimation [Brody, Woodruff]
  • Everywhere-Tight Information Cost Tradeoffs for Augmented Index [Chakrabarti, Kondapally]
  • Almost Optimal Explicit Johnson-Lindenstrauss Transformations [Kane, Meka, Nelson]
  • Periodicity and Cyclic Shifts via Linear Sketches [Crouch, McGregor]
  • Sparse recovery with partial support knowledge [Do Ba, Indyk]

The sun is setting on La Rocca di Bertinoro. The pasta has run dry and the wifi signal is fading. Theorists are scurrying towards taxis and preparing themselves for their overnight flights. Sublinear Algorithms 2011 is all over. But fear not! There are still three more talks to report…

First Robi Krauthgamer discussed “Polylogarithmic Approximation for Edit Distance” [Andoni, Krauthgamer, Onak]. Given two length n sequences, the edit distance is the minimum number of character operations (inserts, deletions, substitutions) that are sufficient to transform the first string into the latter. Masek and Paterson showed that the edit distance could be computed exactly in O(n^2/\log^2 n) time. Recent work has focused on approximation in near-linear time. Robi presented an algorithm that achieves a (\log n)^{O(\epsilon^{-1})} approximation in n^{1+\epsilon} time. This was a significant improvement over the previous best of 2^{\tilde{O}(\sqrt{\log n})} approximation. [Andoni, Krauthgamer, Onak] is also the paper that sowed the seeds for the precision-sampling technique that Alex presented on Wednesday (one of my favorite talks from the workshop).


Next, Artur Czumaj gave a talk entitled “Planar Graphs: Random Walks and Bipartiteness Testing” [Czumaj, Monemizadeh, Onak, Sohler]. This is one of the few results for property testing in general graphs, i.e., not dense graphs or graphs with constant degree. In the dense graph model you can query entries of the adjacency matrix and need to accept input graphs with the property of interest and reject graphs if \epsilon n^2 entries of the adjacency matrix would need to be changed in order to have a graph that has the property. After about ten years of research we have a very good understanding of what is possible in this model. Attention has since focussed on the bounded-degree adjacency list model where every node has degree at most d and we need to reject graphs if \epsilon dn changes would be necessary to satisfy the property. In both models, testing bipartiteness has been an important problem: in the dense graph model the complexity is constant (we’re treating \epsilon as constant) and in the bounded-degree model the complexity is \Theta(\textrm{polylog}(n) \sqrt{n}). In the general graph model, there is no bound on the max-degree and we may query the graph by a) asking for the i-th neighbor of a specific node or b) asking for the degree of a specific node. In practice it seems like asking for a random neighbor of a specific node is sufficiently powerful. Artur showed that it was possible to test bipartiteness in planar graphs with constant queries.

The very last talk was given Ronitt Rubinfeld on “Testing Collections of Properties” [Levi, Ron, Rubinfeld]. Previous work on property testing of distributions considered single distributions or pairs of distributions: we’d take iid samples from the distributions and try to determine if, e.g., the distributions where identical or far from identical (i.e., the \ell_1 distance between the distributions is at least \epsilon.). Ronitt considered a more general model in which there are m distributions D_1, \ldots, D_m over [n] and a (known or unknown) distribution \nu over [m]. On querying the distributions we get (i,x) where i\sim \nu and x \sim D_i. The goals included determining whether all D_i are identical or whether they can be clustered into k clusters such that within a cluster, all distributions are close. Results included a tight bound if n>m of \tilde{\Theta}(n^{2/3}m^{1/3}) for the question of testing if all distributions where identical. Using the observation that (i,x) are independent iff all D_i‘s are equal, this also gives a tight lower bound for independence testing, resolving an open question from previous work.

BONUS! More Photos… Here are a few more photos from the workshop.

By day we’d listen to theory talks in the Fresco room…

By night we’d talk about theory over dinner…

On our afternoon off, we admired cathedrals and talked more theory.

About

A research blog about data streams and related topics.

Recently Tweeted

Follow

Get every new post delivered to your Inbox.

Join 73 other followers