
Piotr Indyk, Ilan Newman, Krzysztof Onak, and I have just finished up a new open problems list. Here it is! It mainly covers topics in data streams and property testing but there’s also some other cool stuff. The list is compiled from the recent Bertinoro workshop and the slightly less recent Kanpur workshop.

(Cartoon from XKCD obviously.)

Streaming and sketching papers in the RANDOM/APPROX accepted list include:

  • Streaming Algorithms with One-Sided Estimation [Brody, Woodruff]
  • Everywhere-Tight Information Cost Tradeoffs for Augmented Index [Chakrabarti, Kondapally]
  • Almost Optimal Explicit Johnson-Lindenstrauss Transformations [Kane, Meka, Nelson]
  • Periodicity and Cyclic Shifts via Linear Sketches [Crouch, McGregor]
  • Sparse recovery with partial support knowledge [Do Ba, Indyk]

The sun is setting on La Rocca di Bertinoro. The pasta has run dry and the wifi signal is fading. Theorists are scurrying towards taxis and preparing themselves for their overnight flights. Sublinear Algorithms 2011 is all over. But fear not! There are still three more talks to report…

First Robi Krauthgamer discussed “Polylogarithmic Approximation for Edit Distance” [Andoni, Krauthgamer, Onak]. Given two length-n strings, the edit distance is the minimum number of character operations (insertions, deletions, substitutions) needed to transform the first string into the second. Masek and Paterson showed that the edit distance can be computed exactly in O(n^2/\log^2 n) time. Recent work has focused on approximation in near-linear time. Robi presented an algorithm that achieves a (\log n)^{O(\epsilon^{-1})} approximation in n^{1+\epsilon} time. This is a significant improvement over the previous best approximation of 2^{\tilde{O}(\sqrt{\log n})}. [Andoni, Krauthgamer, Onak] is also the paper that sowed the seeds for the precision-sampling technique that Alex presented on Wednesday (one of my favorite talks of the workshop).
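The polylog approximation algorithm is well beyond a blog snippet, but for reference here is the textbook quadratic-time dynamic program that the Masek–Paterson speed-up and the approximation algorithms improve upon (a sketch; the function name is mine):

```python
def edit_distance(s, t):
    """Classic O(nm)-time, O(m)-space dynamic program for edit distance."""
    n, m = len(s), len(t)
    # prev[j] = edit distance between s[:i-1] and t[:j] for the previous row.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[m]
```

For example, `edit_distance("kitten", "sitting")` returns 3 (two substitutions and one insertion).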

Next, Artur Czumaj gave a talk entitled “Planar Graphs: Random Walks and Bipartiteness Testing” [Czumaj, Monemizadeh, Onak, Sohler]. This is one of the few results for property testing in general graphs, i.e., graphs that are neither dense nor of constant degree. In the dense graph model you can query entries of the adjacency matrix; you must accept input graphs with the property of interest and reject graphs for which \epsilon n^2 entries of the adjacency matrix would need to be changed in order to obtain a graph with the property. After about ten years of research we have a very good understanding of what is possible in this model. Attention has since focused on the bounded-degree adjacency-list model, where every node has degree at most d and we must reject graphs for which \epsilon dn changes would be necessary to satisfy the property. In both models, testing bipartiteness has been an important problem: in the dense graph model the complexity is constant (we’re treating \epsilon as constant) and in the bounded-degree model the complexity is \Theta(\textrm{polylog}(n) \sqrt{n}). In the general graph model there is no bound on the max-degree and we may query the graph by a) asking for the i-th neighbor of a specific node or b) asking for the degree of a specific node. In practice it seems that asking for a random neighbor of a specific node is sufficiently powerful. Artur showed that it is possible to test bipartiteness in planar graphs with a constant number of queries.
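The random-walk idea is a natural fit for this query model. Below is a rough heuristic sketch, my own simplification with made-up parameters rather than the algorithm from the paper: take random walks from a sampled vertex and report an odd cycle if some vertex is reachable at both even and odd parity. The two callbacks correspond exactly to the two query types of the general graph model.

```python
import random

def looks_bipartite(neighbors, degree, n, starts=10, walks=50, length=20):
    """Heuristic odd-cycle detection via random walks.

    neighbors(v, i) returns the i-th neighbor of v; degree(v) its degree.
    Returns False only on a genuine witness of non-bipartiteness;
    returning True means no witness was found (the graph could still
    be far from bipartite -- this is only an illustrative sketch).
    """
    for _ in range(starts):
        s = random.randrange(n)
        even, odd = set(), set()           # vertices reached at each parity
        for _ in range(walks):
            v, parity = s, 0
            for _ in range(length):
                d = degree(v)
                if d == 0:
                    break
                v = neighbors(v, random.randrange(d))
                parity ^= 1
                (even if parity == 0 else odd).add(v)
        if even & odd:   # same vertex at both parities => odd closed walk
            return False
    return True
```

On a triangle this finds an odd-cycle witness almost surely; on an even cycle the parity classes never collide, so it always returns True.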

The very last talk was given by Ronitt Rubinfeld on “Testing Collections of Properties” [Levi, Ron, Rubinfeld]. Previous work on property testing of distributions considered single distributions or pairs of distributions: we take iid samples from the distributions and try to determine if, e.g., the distributions are identical or far from identical (i.e., the \ell_1 distance between the distributions is at least \epsilon). Ronitt considered a more general model in which there are m distributions D_1, \ldots, D_m over [n] and a (known or unknown) distribution \nu over [m]. On querying the distributions we get (i,x) where i\sim \nu and x \sim D_i. The goals included determining whether all the D_i are identical, and whether they can be clustered into k clusters such that within each cluster all distributions are close. Results included a tight bound of \tilde{\Theta}(n^{2/3}m^{1/3}) when n>m for testing whether all the distributions are identical. Using the observation that (i,x) are independent iff all the D_i are equal, this also gives a tight lower bound for independence testing, resolving an open question from previous work.
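To make the sampling model concrete, here is a toy implementation of the oracle (names and representation are mine, not from the paper). It also illustrates the independence observation: when the D_i are distinct point masses, the sample x completely determines i, the extreme opposite of independence.

```python
import random

def sample_oracle(nu, dists):
    """One sample from the collection model: draw i ~ nu over [m],
    then x ~ D_i over [n]. Distributions are given as weight lists."""
    i = random.choices(range(len(nu)), weights=nu)[0]
    x = random.choices(range(len(dists[i])), weights=dists[i])[0]
    return i, x
```

For instance, with `nu = [0.5, 0.5]` and point masses `dists = [[1.0, 0.0], [0.0, 1.0]]`, every sample satisfies x = i, so i and x are maximally dependent; when all the D_i are equal, x carries no information about i.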

BONUS! More Photos… Here are a few more photos from the workshop.

By day we’d listen to theory talks in the Fresco room…

By night we’d talk about theory over dinner…

On our afternoon off, we admired cathedrals and talked more theory.

It’s the last day of Sublinear Algorithms 2011. Thanks to Artur Czumaj, Piotr Indyk, Ronitt Rubinfeld, and Robi Krauthgamer for organizing such a great workshop. Thanks also to all the guest bloggers (Jelani Nelson, Krzysztof Onak, Seshadhri, Piotr Indyk, Sariel Har-Peled) this week for doing such a great job. My only concern is that readers of this blog might start to expect posts of such a high quality. To minimize this risk, I’ll finish up the workshop blogging.

The day started with two property testing talks. First, Oded Lachish talked about “Testing a Language Accepted by a Fixed Boolean Formula”. Here we assume full knowledge of a formula \phi(x_1,\ldots, x_n) and are given oracle access to an assignment of the variables. By querying only a few of the variable assignments we wish to distinguish between the case when the assignment satisfies \phi and the case when it is far from doing so, i.e., all satisfying assignments differ in at least \epsilon n positions. Oded presented quasi-polynomial algorithms for a variety of cases including binary read-once formulas and binary, monotone, read-k-times formulas.

Next up was Gilad Tsur, who talked about “Approximating the Number of Relevant Variables in a Function” [Ron, Tsur]. Here there is also a boolean function f:\{0,1\}^n\rightarrow \{0,1\}, but in contrast to the previous talk we don’t know f. Rather, we can query the value of f on inputs of our choosing. The goal is to multiplicatively estimate how many variables of the input actually play a role in determining the function. Unfortunately this is hard in general: consider a function that always evaluates to 0 except on one specific input. However, Gilad showed it is possible for certain families of functions such as linear functions. He then considered the relaxation of distinguishing between the case when the number of relevant variables is at most k and the case when f is far from any function in which the number of relevant variables is at most (1+g)k, for some parameter g.
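For the special case of linear (parity) functions mentioned above, relevance is easy to decide exactly, which hints at why this family is tractable: flipping coordinate i flips a parity iff x_i appears in it, so two queries per variable suffice. A minimal sketch of this observation (my own illustration; the paper’s estimation algorithm is more sophisticated and query-efficient):

```python
def relevant_variables_linear(f, n):
    """Exactly recover the relevant variables of a parity function
    f(x) = XOR of x_i over some subset S, using 2n queries."""
    zero = [0] * n
    base = f(zero)
    relevant = []
    for i in range(n):
        flipped = zero[:]
        flipped[i] = 1
        if f(flipped) != base:   # flipping x_i changes a parity iff i in S
            relevant.append(i)
    return relevant
```

For example, with f(x) = x_0 XOR x_2 on four variables, this returns [0, 2].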

After a short break, I talked about “Data Streams, Dyck Languages, and Detecting Dubious Data Structures” [Chakrabarti, Cormode, Kondapally, McGregor]. Here’s the problem… You are watching your friend interact with a priority queue: at each step she either inserts a new value into the queue or asks for the minimum value currently in the queue to be extracted. However, you’re suspicious that the priority queue may not always be extracting the correct minimum. Can you help your friend recognize when this happens without having to remember all the values she has inserted? In the talk we showed how this was possible. We also discussed recognizing whether a string of brackets was balanced and concluded that reading backwards can be a very useful skill. Slides are here.
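For context, recognizing balanced brackets is trivial with a stack and \Theta(n) space in one forward pass; the whole game in the paper is what is achievable with sublinear space, and whether an additional backward pass helps. The linear-space baseline, as a sketch:

```python
def is_balanced(s):
    """One-pass stack check for the two-type Dyck language over ()[]."""
    pairs = {')': '(', ']': '['}
    stack = []
    for c in s:
        if c in '([':                          # opening bracket: remember it
            stack.append(c)
        elif not stack or stack.pop() != pairs[c]:
            return False                       # close with no matching open
    return not stack                           # leftover opens => unbalanced
```

So `is_balanced("([()])")` is True while `is_balanced("([)]")` is False; the streaming question is how well this can be approximated when the stack cannot be stored.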

