Continuing the report from the Dortmund Workshop on Algorithms for Data Streams, here are the happenings from Day 3. Previous posts: Day 1 and Day 2.

Michael Kapralov started the day with new results on computing matching large matchings in the semi-streaming model, one of my favorite pet problems. You are presented with a stream of unweighted edges on n nodes and want to approximate the size of the maximum matching given the constraint that you only have O(n polylog n) bits of memory. It’s trivial to get a 1/2 approximation by constructing a maximal matching greedily. Michael shows that it’s impossible to beat a 1-1/e factor even if the graph is bipartite and the edges are grouped by their right endpoint. In this model, he also shows a matching (no pun intended) 1-1/e approximation and an extension to a approximation given p passes.

Next up, Mert Seglam talked about sampling. Here the stream consists of a sequence of updates to an underlying vector and the goal is to randomly select an index where is chosen with probability proportional to . It’s a really nice primitive that gives rise to simple algorithms for a range of problems including frequency moments and finding duplicates. I’ve been including the result in recent tutorials. Mert’s result simplifies and improves an earlier result by Andoni et al.

The next two talks focused on communication complexity, the evil nemesis of the honest data stream algorithm. First, Xiaoming Sun talked about space-bounded communication complexity. The standard method to prove a data stream memory lower bound is to consider two players corresponding to the first and second halves of the data stream. A data stream algorithm gives rise to a communication protocol where the players emulate the algorithm and transmit the memory state when necessary. In particular, multi-pass stream algorithms give rise to multi-round communication protocols. Hence a communication lower bound gives rise to a memory lower bound. However, in the standard communication setting we suppose that the two players may maintain unlimited state between rounds. The fact that stream algorithms can’t do this may lead to suboptimal data stream bounds. To address this, Xiaoming’s work outlines a communication model where the players may maintain only a limited amount of state between the sending of each message and establishes bounds on classical problems including equality and inner-product.

In the final talk of the day, Amit Chakrabarti extolled the virtues of Talagrand’s inequality and explained why every data stream researcher should know it. In particular, Amit reviewed the history on proving lower bounds for the Gap-Hamming communication problem (Alice and Bob each have a length n string and wish to determine whether the Hamming distance is less than n/2-√n or greater than n/2+√n) and ventured that the history wouldn’t have been so long if the community had had a deeper familiarity with Talagrand’s inequality. It was a really gracious talk in which Amit actually spent most of the time discussing Sasha Sherstov’s recent proof of the lower bound rather than his own work.

**BONUS! Spot the theorist…** After the talks, we headed off to Revierpark Wischlingen to contemplate some tree-traversal problems. If you think your observation skills are up to it, click on the picture below to play “spot the theorist.” It may take some time, so keep looking until you find him or her.

## 2 comments

Comments feed for this article

August 11, 2012 at 9:31 pm

Themes in streaming algorithms (workshop at TU Dortmund) « Another Word For It[...] have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a [...]

September 17, 2012 at 9:44 am

Thought this was cool: Streaming Data Mining Tutorial slides (and more) « CWYAlpha[...] Day 3 here. [...]