Ely Porat asked me remind everyone that the deadline for the 20th String Processing and Information Retrieval Symposium (SPIRE) is 2nd May, about a month from now. More details at websrv.cs.biu.ac.il/spire2013/.

Update: The deadline has been extended to 9th May.

A guest post from Krzysztof Onak:

A few recent workshops on sublinear algorithms compiled lists of open problems suggested by participants. During the last of them, in July in Dortmund, we realized that it would be great to have a single repository with all those problems. After followup discussions (with Alex Andoni, Piotr Indyk, and Andrew McGregor), we created a wiki page at http://sublinear.info/. Currently, it only contains open problems from the aforementioned workshops, but we invite submissions of inspiring problems from all areas of sublinear algorithms (sublinear time, sublinear space, etc.). Additionally, we want to compile a list of books, surveys, lecture notes, and slides that can be useful for learning about different areas of sublinear algorithms. We hope that this wiki will not serve only spambots, which have already been raiding it for a while, but it will also be a great source of inspiration for the whole community.

After a very successful hiring season last year, the department is now focusing on hiring in theory, NLP, robotics, and vision (that’s four separate searches rather than one extreme interdisciplinary position).  So please apply! The official ad is here and note that, unlike previous years, we’re able to hire in theory at either the assistant or associate level. We’ll start reviewing applications December 3.

Continuing the report from the Dortmund Workshop on Algorithms for Data Streams, here are the happenings from Day 3. Previous posts: Day 1 and Day 2.

Michael Kapralov started the day with new results on computing matching large matchings in the semi-streaming model, one of my favorite pet problems. You are presented with a stream of unweighted edges on n nodes and want to approximate the size of the maximum matching given the constraint that you only have O(n polylog n) bits of memory. It’s trivial to get a 1/2 approximation by constructing a maximal matching greedily. Michael shows that it’s impossible to beat a 1-1/e factor even if the graph is bipartite and the edges are grouped by their right endpoint. In this model, he also shows a matching (no pun intended) 1-1/e approximation and an extension to a $1-e^{-p}p^{p-1}/(p-1)!$ approximation given p passes.

Next up, Mert Seglam talked about $\ell_p$ sampling. Here the stream consists of a sequence of updates to an underlying vector $\mathbf{x}\in {\mathbb R}^n$ and the goal is to randomly select an index where $i$ is chosen with probability proportional to $|x_i|^p$. It’s a really nice primitive that gives rise to simple algorithms for a range of problems including frequency moments and finding duplicates. I’ve been including the result in recent tutorials. Mert’s result simplifies and improves an earlier result by Andoni et al.

The next two talks focused on communication complexity, the evil nemesis of the honest data stream algorithm. First, Xiaoming Sun talked about space-bounded communication complexity. The standard method to prove a data stream memory lower bound is to consider two players corresponding to the first and second halves of the data stream. A data stream algorithm gives rise to a communication protocol where the players emulate the algorithm and transmit the memory state when necessary. In particular, multi-pass stream algorithms give rise to multi-round communication protocols. Hence a communication lower bound gives rise to a memory lower bound. However, in the standard communication setting we suppose that the two players may maintain unlimited state between rounds. The fact that stream algorithms can’t do this may lead to suboptimal data stream bounds. To address this, Xiaoming’s work outlines a communication model where the players may maintain only a limited amount of state between the sending of each message and establishes bounds on classical problems including equality and inner-product.

In the final talk of the day, Amit Chakrabarti extolled the virtues of Talagrand’s inequality and explained why every data stream researcher should know it. In particular, Amit reviewed the history on proving lower bounds for the Gap-Hamming communication problem (Alice and Bob each have a length n string and wish to determine whether the Hamming distance is less than n/2-√n or greater than n/2+√n) and ventured that the history wouldn’t have been so long if the community had had a deeper familiarity with Talagrand’s inequality. It was a really gracious talk in which Amit actually spent most of the time discussing Sasha Sherstov’s recent proof of the lower bound rather than his own work.

BONUS! Spot the theorist… After the talks, we headed off to Revierpark Wischlingen to contemplate some tree-traversal problems. If you think your observation skills are up to it, click on the picture below to play “spot the theorist.” It may take some time, so keep looking until you find him or her.

This week, I’m at the Workshop on Algorithms for Data Streams in Dortmund, Germany. It’s a continuation in spirit of the great Kanpur workshops from 2006 and 2009.

The first day went very well despite the widespread jet lag (if only jet lag from those traveling from the east could cancel out with those traveling from the west.) Sudipto Guha kicked things off with a talk on combinatorial optimization problems in the (multiple-pass) data stream model. There was a nice parallel between Sudipto’s talk and a later talk by David Woodruff and both were representative of a growing number of papers that have used ideas developed in the context of data streams to design more efficient algorithms in the usual RAM model. In the case of Sudipto’s talk, this was a faster algorithm to approximate $b$-matchings while David’s result was a faster algorithm for least-squares regression.

Other talks included Christiane Lammersen presenting a new result for facility location in data streams; Melanie Schmidt talking about constant-size coresets for $k$-means and projective clustering; and Dan Feldman discussing the data stream challenges that arise when trying to transform real-time GPS data from your smart-phone into a human-readable diary of your life. I spoke about work on constructing a combinatorial sparsifier for an $n^2$-dimensional graph via a single random linear projection into roughly $n$ dimensions. Rina Panigrahy wrapped things up with an exploration of different distance measures in social networks, i.e., how to quantify how closely-connected you are to your favorite celebrity. This included proposing a new measure based on the probability that two individuals remained connected if every edge was deleted with some probability. He then related this to electrical resistance and spectral sparsification. He refused to be drawn on which of his co-authors had the closest connection to the Kardashians.

To be continued… Tomorrow, Suresh will post about day 2 across at the Geomblog.

As promised, here are the slides from the STOC Workshop on Algorithms for Distributed and Streaming Data. The workshop was standing-room only so here’s your chance to review the slides while sitting down. More generally, all the workshops seemed to be a great success and I’m happy to see that the experiment will be repeated at FOCS. Deadline for proposals is 20 June.

Thanks again to the speakers and everyone who came along.

© Copyright Keith Edkins and licensed for reuse under this Creative Commons Licence

Atri Rudra asked me to post an announcement for this year’s

Coding, Complexity, and Sparsity Workshop.

It’ll take place at the University of Michigan from July 30th to August 2nd. I really enjoyed last year’s workshop.

The Blurb. Efficient and effective transmission, storage, and retrieval of information on a large-scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critical, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms).

Coding theory and computational complexity are both well established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress on the SA/CS problem has been made, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques, but a systematic computational complexity treatment of these problems is sorely lacking. The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. We will have several tutorial lectures that will be directed to graduate students and postdocs.

These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms.

We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation.

Confirmed speakers:

• Eric Allender, Rutgers
• Mark Braverman, Princeton
• Mahdi Cheraghchi, Carnegie Mellon University
• Anna Gal, The University of Texas at Austin
• Piotr Indyk, MIT
• Swastik Kopparty, Rutgers
• Dick Lipton, Georgia Tech
• Andrew McGregor, University of Massachusetts, Amherst
• Raghu Meka, IAS
• Eric Price, MIT
• Ronitt Rubinfeld MIT
• Shubhangi Saraf, IAS
• Chris Umans, Caltech
• David Woodruff, IBM

We have some funding for graduate students and postdocs. For registration and other details, please look at the workshop webpage:

While on the topic of STOC, I also wanted to mention a STOC workshop on “Algorithms for Distributed and Streaming Data” that will hopefully be of interest. It will take place Saturday afternoon, 19th May in NYU. The schedule can be found here.

So here’s the pitch: At this point it’s readily apparent that big data has become big news (e.g., see here and here). But what does this mean for the STOC/FOCS/SODA community? What are the algorithmic problems we could be solving? What are the appropriate computational models? Are there opportunities for industrial impact? What should we be teaching our undergraduate and graduate students? To address the relevant topics, we’ve lined-up a great set of speakers including Sergei Vassilvitskii, John Langford, Piotr Indyk, and Ashish Goel. Hope to see you there.

If yes, just a reminder that this year (for the first time) there’ll be an award for the best student presentation. More details here. In addition to the cash, the reputation for giving great talks can be very helpful when applying for jobs.

We’ll be giving preference to talks that are “clear, compelling, and appeal to a broad cross-section of the STOC audience.” My suggestion of giving extra credit for incorporating fire juggling, 3D slides, and celebrity guests fell on deaf ears.

Next up in the mini-course on data streams (first two lectures here and here) were lower bounds and communication complexity. The slides are here:

The outline was:

1. Basic Framework: If you have a small-space algorithm for stream problem $Q$, you can turn this into a low-communication protocol for a related communication problem $P$. Hence, a lower bound on the communication required to solve $P$ implies a lower bound on the space required to solve $Q$. Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation.
2. Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.
3. Hamming Distance: Lastly, we presented a lower bound on the Hamming approximation problem using the ingenious but simple proof [Jayram et al.]

Tout le monde! Here’s the group photo from this year’s L’Ecole de Printemps d’Informatique Théorique.

### About

A research blog about data streams and related topics.