While on the topic of STOC, I also wanted to mention a STOC workshop on “Algorithms for Distributed and Streaming Data” that will hopefully be of interest. It will take place on the afternoon of Saturday, May 19th, at NYU. The schedule can be found here.

So here’s the pitch: At this point it’s readily apparent that big data has become big news (e.g., see here and here). But what does this mean for the STOC/FOCS/SODA community? What are the algorithmic problems we could be solving? What are the appropriate computational models? Are there opportunities for industrial impact? What should we be teaching our undergraduate and graduate students? To address these questions, we’ve lined up a great set of speakers including Sergei Vassilvitskii, John Langford, Piotr Indyk, and Ashish Goel. Hope to see you there.

If you’re a student presenting at STOC, just a reminder that this year (for the first time) there’ll be an award for the best student presentation. More details here. In addition to the cash, a reputation for giving great talks can be very helpful when applying for jobs.

We’ll be giving preference to talks that are “clear, compelling, and appeal to a broad cross-section of the STOC audience.” My suggestion of giving extra credit for incorporating fire juggling, 3D slides, and celebrity guests fell on deaf ears.

Next up in the mini-course on data streams (first two lectures here and here) were lower bounds and communication complexity. The slides are here:

The outline was:

  1. Basic Framework: If you have a small-space algorithm for stream problem Q, you can turn it into a low-communication protocol for a related communication problem P. Hence, a lower bound on the communication required to solve P implies a lower bound on the space required to solve Q. Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation. (A toy version of this simulation argument is sketched after the list.)
  2. Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.
  3. Hamming Distance: Lastly, we presented a lower bound on the Hamming approximation problem using the ingenious but simple proof of [Jayram et al.].
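To make the simulation argument in item 1 concrete, here is a toy Python sketch for one-way indexing. The MembershipAlg class is purely illustrative (a deliberately naive exact algorithm), but the shape of the reduction is the one from the lecture: Alice runs the streaming algorithm on her part of the stream and ships its memory contents to Bob.

```python
# If a one-pass algorithm could decide membership in o(n) bits of space,
# the protocol below would solve one-way indexing with o(n) bits of
# communication, contradicting its Omega(n) lower bound.
import pickle

class MembershipAlg:
    """A deliberately naive one-pass algorithm: its memory is the set of
    items seen so far, so its state takes Theta(n) bits."""
    def __init__(self):
        self.seen = set()
    def process(self, item):
        self.seen.add(item)
    def query(self, item):
        return item in self.seen

def index_protocol(x, i):
    """Alice holds the bit-string x, Bob holds the index i; the only
    communication is one message containing the algorithm's memory."""
    alice_alg = MembershipAlg()
    for j, bit in enumerate(x):              # Alice streams her 1-positions
        if bit:
            alice_alg.process(j)
    message = pickle.dumps(alice_alg.seen)   # the state crosses the wire
    bob_alg = MembershipAlg()
    bob_alg.seen = pickle.loads(message)     # Bob resumes the algorithm
    return bob_alg.query(i)                  # ... and answers x_i

assert index_protocol([0, 1, 1, 0], 2) is True
assert index_protocol([0, 1, 1, 0], 3) is False
```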

Everyone! Here’s the group photo from this year’s L’Ecole de Printemps d’Informatique Théorique.

Piotr Indyk asked if I could post the following announcement and I’m happy to oblige.

Piotr’s Post-Doc Position: Applications are invited for a Postdoctoral Research Assistant position for the MIT-Shell-Draper Lab research project

“Applications of compressive sensing and sparse structure exploitation in hydrocarbon reservoir exploration and surveillance”

The goal of the project is to develop novel compressive sensing algorithms for geoscience problems in the area of hydrocarbon field exploration and surveillance. The appointment is initially for one year, with the possibility of renewal for up to 3 years. The appointment should start either during the summer (the preferred option) or the fall of 2012.

Duties and Responsibilities:

  • Provide expertise on and contribute to the development of compressive sensing and sparse recovery algorithms for geoscience applications
  • Help MIT faculty in coordination of research projects, including periodic reporting and documentation as required by the program timeline
  • Frequent travel to Shell facilities in Houston

Minimum Qualifications

  • Ph.D. in Computer Science, Electrical Engineering, Mathematics or related disciplines

Preferred Qualifications

  • Expertise in sparse recovery, compressive sensing, streaming/sketching, machine learning and statistical inference algorithms
  • Experience and/or interest in research projects of interdisciplinary nature
  • Programming experience (MATLAB)

Applications (including CV and three reference letters) should be submitted to

ideally by April 15, 2012. However, applications will be considered until the position is filled.

In the second lecture, we discussed algorithms for graph streams. Here we observe a sequence of edges and the goal is to estimate properties of the underlying graph without storing all the edges. The slides are here:

We covered:

  1. Spanners: If you judiciously store only a subset of the edges, you can ensure that the distance between any two nodes in the subgraph is only a constant factor larger than the distance in the entire graph. (A sketch of the classic greedy construction appears after this list.)
  2. Sparsifiers: Here you again store only a subset of the edges, but you also assign weights to these edges such that the capacity of every cut is preserved up to a small factor. We showed how to do this incrementally.
  3. Sketching: All this talk of remembering a subset of the edges is all very well, but what if we’re dealing with a fully dynamic graph where edges are both added and deleted? Can we still compute interesting graph properties if we can’t be sure the edges we remember won’t subsequently be deleted? You’ll have to see the slides to find out. Or see this SODA paper.
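To give a flavor of item 1, here is a minimal sketch of the classic greedy spanner construction (the function names and the BFS helper are mine; the constructions in the slides are more refined). An edge is kept only if its endpoints are currently more than 2k-1 hops apart in the stored subgraph, which guarantees that every distance is stretched by a factor of at most 2k-1.

```python
from collections import defaultdict, deque

def bfs_distance(adj, source, target, limit):
    """Distance from source to target in the stored subgraph, exploring
    at most `limit` hops; returns limit + 1 if target is farther."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return limit + 1

def greedy_spanner(edge_stream, k):
    """Keep edge (u, v) only if u and v are currently more than
    2k - 1 hops apart; the kept edges form a (2k - 1)-spanner."""
    adj = defaultdict(set)
    kept = []
    for u, v in edge_stream:
        if bfs_distance(adj, u, v, 2 * k - 1) > 2 * k - 1:
            kept.append((u, v))
            adj[u].add(v)
            adj[v].add(u)
    return kept

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
print(greedy_spanner(edges, k=2))  # [(1, 2), (2, 3), (3, 4)]
```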

Lunch Break! If you want to fully recreate the experience of a French workshop, I suggest you now find yourself a plate of oysters and break for lunch. Lectures will resume shortly.

Just back from EPIT 2012 where I gave four lectures on data streams and a bit of communication complexity. Other lecturers discussed semi-definite programming, inapproximability, and quantum computing. Thanks to Iordanis Kerenidis, Claire Mathieu, and Frédéric Magniez for a great week.

Anyhow, I wanted to post my slides along with some editorial comments that may or may not be of interest. Here’s the first lecture:

The goal was to cover some of the basic algorithmic techniques such as different forms of sampling and random linear projections. The basic narrative was:

  1. Sampling: Sure, you can uniformly sample from the stream and try to make some inferences from the sample. But there are better ways to use your limited memory.
  2. Sketching: You can compute random linear projections of the data. Hash-based sketches like Count-Min and Count-Sketch are versatile and solve many problems including quantile estimation, range queries, and heavy hitters. (A minimal Count-Min implementation is sketched after this list.)
  3. Sampling Again: A useful primitive enabled by sketching is non-uniform sampling, such as \ell_p sampling, where the probability of returning a stream element of type i is proportional to the p-th power of the frequency of i.
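Since item 2 mentions Count-Min, here is a minimal sketch of it. Python’s builtin hash, salted per row, stands in for the pairwise-independent hash family that the analysis assumes, and the width/depth parameters below are illustrative.

```python
import random

class CountMin:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(depth)]  # one per row
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def update(self, item, count=1):
        # Add the count to one cell in each row.
        for row, salt in enumerate(self.salts):
            self.table[row][hash((salt, item)) % self.width] += count

    def estimate(self, item):
        # With non-negative counts each cell only overestimates, so the
        # minimum over rows is the tightest upper bound on the frequency.
        return min(self.table[row][hash((salt, item)) % self.width]
                   for row, salt in enumerate(self.salts))

cm = CountMin(width=200, depth=5)
for token in "to be or not to be".split():
    cm.update(token)
assert cm.estimate("to") >= 2  # never underestimates given +1 updates
```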

I bookended the lecture with algorithms for frequency moments, first a suboptimal result via AMS sampling (from Alon et al. STOC 1996) and then a near-optimal result via \ell_p sampling (from Andoni et al. FOCS 2011 and Jowhari et al. PODS 2011). I was tempted to use another example but that seemed churlish given the importance of frequency moments in the development of data streams theory. To keep things simple I assumed algorithms had an unlimited source of random bits. Other than that I was quite happy that I managed to present most of the details without drowning in a sea of definitions and notation.
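For reference, here is roughly what one copy of the AMS estimator looks like. For exposition the stream is materialized as a list; a genuine one-pass implementation would instead pick the sampled position by reservoir sampling. The estimate is unbiased, and averaging many independent copies (and taking medians of averages) gives the guarantees from the lecture.

```python
import random

def ams_estimate(stream, k):
    """One copy of the AMS estimator for the k-th frequency moment F_k."""
    stream = list(stream)
    m = len(stream)
    j = random.randrange(m)      # sample a uniformly random position
    a = stream[j]
    r = sum(1 for x in stream[j:] if x == a)  # occurrences of a from j on
    return m * (r ** k - (r - 1) ** k)        # E[estimate] = F_k

# Averaging reduces the variance:
samples = [ams_estimate([1, 2, 1, 3, 1, 2], k=2) for _ in range(10000)]
print(sum(samples) / len(samples))  # close to F_2 = 9 + 4 + 1 = 14
```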

Everyone loves the union bound. It is the most dependable of all probability bounds. No matter what mess of dependencies threaten your analysis, the union bound will ride in and save the day. Except when it doesn’t.

I’ve been giving a talk about some recent results (here are some slides) and have received a particular question from a couple of different audiences. The exact details aren’t important but the basic gist was why I didn’t simplify some of the analysis by just applying the union bound. The short answer was that it didn’t apply. But how could that be? My longer answer was longer and probably less helpful. So let me distill things down to the following paradox (which isn’t really a paradox of course).

The Geography Test. Suppose you have a geography test tomorrow. In the test you’ll be asked two questions out of five possible questions (and you already know the list of possible questions). If you get both right, you’ll pass the course. However, rather than revise everything you adopt a random strategy that ensures that for any question, you’ll have a correct answer with probability at least 4/5. Hence, by the union bound, you’ll get both questions right with probability at least 3/5.

Right? Sure, it’s a straightforward application of the union bound: the probability of being wrong on one or both questions is at most the probability of being wrong on the first (at most 1/5) plus the probability of being wrong on the second (at most 1/5). And the union bound never lets you down.

The Bad News. You’re going to fail both your geography and probability classes. Consider the following scenario. The five possible exam questions are: 1) name a city in the US, 2) name a city in France, 3) name a city in Germany, 4) name a city in the UK, and 5) name a city in Italy. Your cunning strategy is to memorize one of the rows in the following table.

US            France     Germany    UK         Italy
New York      Paris      Berlin     Glasgow    Reykjavik
Philadelphia  Paris      Berlin     Reykjavik  Rome
Los Angeles   Paris      Reykjavik  Glasgow    Rome
Boston        Reykjavik  Berlin     Glasgow    Rome
Reykjavik     Paris      Berlin     Glasgow    Rome

You choose the row to remember uniformly at random thus guaranteeing that you can answer any question correctly with probability 4/5. Of course the professor doesn’t know which row you’ll pick but he does know your strategy. (This is equivalent to the usual assumption we make about adversaries knowing your algorithm but not your random bits.)

Unfortunately, you stay up so late studying your row that you sleep in and miss the exam. You go to the professor’s office to plead for a make-up exam. He is initially reluctant but then a devious smile crosses his face. Sure, he’ll still give you the exam. You’re relieved that you’ll still pass the course with probability at least 3/5. The professor doesn’t look so sure.

The first question he asks is to name a city in the US. “Philadelphia” you answer with confidence. As the professor nods you see that devious smile again. The second question… name a city in the UK.

And then you realize that the probability of success was never as large as 3/5 since the professor could always guarantee that you fail by first asking for a US city and then choosing his second question accordingly.
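If you don’t believe me, here is a short simulation of the professor’s adaptive strategy (the table above hard-coded):

```python
# Reykjavik is a real city, but it is never in the right country for
# any of the five questions.
ROWS = [
    ("New York",     "Paris",     "Berlin",    "Glasgow",   "Reykjavik"),
    ("Philadelphia", "Paris",     "Berlin",    "Reykjavik", "Rome"),
    ("Los Angeles",  "Paris",     "Reykjavik", "Glasgow",   "Rome"),
    ("Boston",       "Reykjavik", "Berlin",    "Glasgow",   "Rome"),
    ("Reykjavik",    "Paris",     "Berlin",    "Glasgow",   "Rome"),
]
WRONG = "Reykjavik"

passes = 0
for row in ROWS:                 # the student's uniformly random choice
    if row[0] == WRONG:          # question 1: name a city in the US
        continue                 # wrong already, so the student fails
    # The US answer uniquely identifies the row, so the professor asks
    # the question whose memorized answer is Reykjavik.
    if row[row.index(WRONG)] != WRONG:
        passes += 1
print(f"pass probability against the adaptive professor: {passes}/5")  # 0/5
```

Against a professor who fixes both questions in advance, the union bound calculation is perfectly fine; it is the adaptivity that kills it.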

Morals of the story: Don’t miss exams, study hard, and don’t blindly trust the union bound.

That time of year again… the list of accepted papers for SODA has been posted. Some of the papers that might be of (direct or indirect*) interest to the streaming/sketching crowd include:

  • Sublinear Time, Measurement-Optimal, Sparse Recovery For All [Porat, Strauss]
  • Analyzing Graph Structure via Linear Measurements [Ahn, Guha, McGregor]
  • Sparser Johnson-Lindenstrauss Transforms [Kane, Nelson]
  • On the communication and streaming complexity of maximum bipartite matching [Goel, Kapralov, Khanna]
  • The Shifting Sands Algorithm [McGregor, Valiant]
  • Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy [Phillips, Verbin, Zhang]
  • A Near-Optimal Sublinear-Time Algorithm for Approximating the Minimum Vertex Cover Size [Onak, Ron, Rosen, Rubinfeld]
  • Optimal Column-Based Low-Rank Matrix Reconstruction [Guruswami, Sinop]
  • Simple and Practical Algorithm for Sparse Fourier Transform [Hassanieh, Indyk, Katabi, Price]
  • Sketching Valuation Functions [Badanidiyuru, Dobzinski, Fu, Kleinberg, Nisan, Roughgarden]
  • Width of Points in the Streaming Model [Andoni, Nguyen]

(*) I’m not entirely sure what metric I’m using to determine “indirect interest” but rest assured, once I know, I’ll let you know. There also might be some additions to the above list once I’ve seen the actual papers.

BONUS! Comments from the chair… The PC chair, Yuval Rabani, discusses the selection process. Other posts on the selected papers include this and that and can be found here and there.

A few weeks ago I attended a workshop on Coding, Complexity, and Sparsity at the University of Michigan. Thanks to the organizers (Anna Gilbert, Martin Strauss, Atri Rudra, Hung Ngo, Ely Porat, and S. Muthukrishnan) for not only putting together a great program but also for treating the speakers like celebrities! The place swarmed with autograph hunters, roads were closed, and security kept the paparazzi at bay… at least I think this was all for our benefit.

I gave a brief tutorial on data streams and my slides can be found here if you’re interested. One of the results I went through was for the \ell_0-sampling problem introduced by [Cormode, Muthukrishnan, Rozenbaum] and [Frahling, Indyk, Sohler]. See also [Monemizadeh and Woodruff] and [Jowhari, Saglam, Tardos]. Here the set-up is that you see a sequence of m updates to a length-n vector v. These updates can increment or decrement entries of v, although for the talk I assumed that the entries themselves always remained non-negative. E.g., for m=5, n=6 the sequence

(6,+), (3,+), (6,-), (5,+), (2,+)

would result in the vector v=(0,1,1,0,1,0). An algorithm for \ell_0-sampling should return an element chosen uniformly at random from the set \{i: v_i > 0\}. See the slides (or the original papers) for an algorithm using \textrm{polylog}(m,n) space. Anyhow, during the talk I mentioned there was a simple trick to determine whether

\ell_0(v) := |\{i : v_i > 0\}| = 1.

But I decided to leave it as an easy puzzle for which I’d give the answer later. Of course, I forgot. See the comments to this post for the answer.
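For the impatient, here is one standard trick, sketched in Python (no promises that it matches the answer in the comments). Maintain three counters under the updates and use the Cauchy-Schwarz equality condition; this relies on the assumption above that the entries of v stay non-negative.

```python
def ell0_is_one(updates):
    """updates: iterable of (i, delta) pairs with delta in {+1, -1}.
    Returns True iff the resulting non-negative vector v has exactly
    one positive entry."""
    s0 = s1 = s2 = 0             # running sums of v_i, i*v_i, i^2*v_i
    for i, delta in updates:
        s0 += delta
        s1 += i * delta
        s2 += i * i * delta
    # By Cauchy-Schwarz, s1^2 <= s0 * s2 for non-negative v, with
    # equality iff v is supported on a single coordinate.
    return s0 > 0 and s1 * s1 == s0 * s2

assert ell0_is_one([(6, +1), (3, +1), (6, -1)])               # only v_3 > 0
assert not ell0_is_one([(6, +1), (3, +1), (6, -1), (5, +1)])  # v_3, v_5 > 0
```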


Piotr Indyk, Ilan Newman, Krzysztof Onak, and I have just finished up a new open problems list. Here it is! It mainly covers topics in data streams and property testing but there’s also some other cool stuff. The list is compiled from the recent Bertinoro workshop and the slightly less recent Kanpur workshop.

(Cartoon from XKCD obviously.)
