You are currently browsing the monthly archive for February 2010.

Readers with a particularly long memory may recall that we’re in the midst of a series of “bite-sized results” in data stream algorithms. Back in October, we discussed graph spanners and now it’s time for some graph matchings.

So consider a stream of undirected weighted edges \langle (e_1,w_{e_i}), \ldots, (e_m,w_{e_m})\rangle where w_{e} is the weight of edge e. Can we find a matching, i.e., a subset of the edges that don’t share any endpoints, such that the sum of the associated weights is maximized?

Here’s a really simple algorithm that maintains a matching as the edges are processed:

  • The initial matching is the empty matching
  • When processing edge e, the weight of e is compared with the total weight W of the conflicting edges, i.e., the edges in the currently stored matching that share an endpoint with e. If w_e>(1+\gamma) W then e is added to the matching and the conflicting edges are removed.

It makes a nice exercise to prove that the above algorithm achieves a 3+1/\gamma+2\gamma approximation ratio (see pages 149 and 150 of this thing if you get stuck) and choosing the optimal \gamma gives you a 5.828 approximation. Incidentally, this algorithm later found use in a paper on auctioning ad slots [Constantin, Feldman, Muthukrishnan, Pal].

Subsequently, a 5.585 approximation algorithm was presented in [Zelke]. The main idea was to remember some of the edges that were ejected from the matching in the second step of the above algorithm. These edges are potentially added back into the matching at some later point. While a natural idea, it requires substantially more analysis to show the improved approximation ratio.

(Graham Cormode reports from Day 2 of the fourth Bristol Algorithms Days)

Day 2, and I seem to be lacking energy: maybe it’s the lack of adrenaline from giving a talk, the jet lag kicking in, or the after effects of the workshop banquet the night before (the first dinner I have had at a swimming pool). But energy is the order of the day: two talks on energy efficiency. With political emphasis on sustainability and energy efficiency, expect to see more work using these keywords. Prudence Wong of Liverpool, and Lap-Kei Lee at MPI talked about related works on processor scheduling: now processors can go faster or slower (or sleep), with different energy costs. Since the tradeoff is non-linear (it’s more like cubic), scheduling problems get revisited with the added goal of minimizing energy costs while time objectives are met. This seems to have inspired the scheduling community, and while techniques look familiar (choosing the potential function, obtaining competitive ratios) the results are new.

The question I am left with is how should the rest of theory contribute to research on energy efficiency (which they surely will, given funding for this topic)? In the sublinear world, we could take the lazy route and claim “because our algorithms are fast and use small memory, they are more efficient than running the exact algorithm”. But this does not lead us to new challenges. Maybe we need a different way of counting the energy cost of an algorithm, which we can evaluate in big-Oh notation along with the time and space bounds.

The current buzzphrase in UK funding is “economic impact”: work is now evaluated in terms of its broader impact on the economy. Naturally, this change to the status quo has caused some consternation, especially to the more theoretically minded. “How can we put a price on fundamental research?” the question goes. But it is not such a bad exercise to ask what impact (if any) our work could have on the world. Again, a lazy route is to pontificate on past glories of theory in general, but I’d to see people form more coherent arguments about how their work has had impact. I’m reminded of Muthu’s comments about billion dollar problems and trillion dollar problems. It is easy to claim to work in areas that contribution $$$s to the economy: but does your work have a non-zero (and non-negative) contribution to this sum?

The last session has talks from two stringmasters from Bar-Ilan who need no surnames (Ami and Moshe), and a combinatorial tour-de-force from Alex Tiskin. Nothing streaming here, so I refrain from summarizing to save my energies…

Thanks go to Raphael Clifford and Malcolm MacCallum for organizing, along with a large band of helpers. Algorithms (whether streamy, stringy, energetic or otherwise) are active in the UK, and are surely having some economic impact.

(end of Graham’s post)

BONUS! Some toponymy… So why is Bristol, called Bristol? Well, according to the web, the name comes from “Brycgstow” which means something like “the meeting place by the bridge” in Anglo-Saxon. Fascinating, I hear you say.
Well, more intriguingly, if you believe this BBC story then it could have been named by the Ossetians (in the guise of the Sarmatians or the Alans) along with London (“standing water”), Belfast (“broken spade”), Crewe, Hove, and a bunch of other places in the UK. Yes, we’re talking about the Ossetia that’s 2255 miles from Bristol. But should I believe everything the BBC tells me?

(Graham Cormode reports from the fourth Bristol Algorithms Days)

To Bristol, UK, England, for a 2 day workshop of stringy, streamy, energetic and generally algorithmic talks and chat. This is a “Feasibility workshop” aimed at “Bridging the gaps” which I learn is UK-funding speak for “let’s throw a bunch of people together from different areas, and see what they come up with”. So not altogether different from most workshops, except that the funding agency wants some credit if any interesting collaborations emerge.

The morning session was the most streamy: I talked about efforts on streaming interactive proofs, Patrick McSharry gave a stats-y view of streams, and Mark Feblowitz talked streaming systems, from an IBM perspective. I think there’s lots of room to bring these different views closer together. The stats view outlined by Patrick has many familiar notions. A core concept is ensembles of simple classifiers for prediction: these are easy to train and quick to extract answers from, compared to trying to make one monolithic model to rule them all. What I don’t yet get is how to best learn the weights to apply to the different classifiers. Familiar tricks keep the sufficient statistics fast to update: rolling windows and exponential decay. I wonder, how much of this is convenience rather than being “the right thing” to do?

InfosphereStreams is the IBM name for a system that reminds me a lot of early research systems (Stream, Aurora, Borealis): impressive looking operator graphs that transform and reduce streams, spread across multiple machines. A high-level view makes it seem like this is a solved problem, but I feel sure that there are many algorithmic challenges (scheduling, operator placement) lurking under the surface. Moreover, there is a temptation with such systems to achieve higher performance by throwing more hardware at the problem instead of looking for an alternate algorithm — what can we do to identify such options, or at least argue for them?

After lunch, a survey of string matching in streaming data, with hard time constraints. Most intriguing to me was discussion of the Porat and Porat FOCS09 paper, which I have so far found inaccessible: for the most part, since it is not possible to find a copy on the web (even behind IEEE’s paywall), but also because it is reputedly tricky to digest. Rumours swirl at lunch that there is a cleaner, almost “book” level proof in preparation, possibly with improvements. I hope that this shows up soon.

BONUS! A game… The final talk had nothing to do with streams, but everything to do with Flood-It, an addictive flash game which unsurprisingly turns out to be NP hard. A bunch of Bristolians have taken it apart and shown some great analysis — it’s a lot of FUN. Check it out here.

The accepted list from STOC has just been posted (thanks My Brain is Open). Streaming and communication complexity papers that caught my eye include:

Godel’s Lost Letter had a great post about [Magniez, Mathieu, Nayak] and I previously mentioned the main problem from [Braverman, Ostrovsky]. I have a soft spot for both papers because they resolve open problems from a couple of my previous papers (this one and that one if you’re interested.)


A research blog about data streams and related topics.

Recently Tweeted