(It’s Day 3 at Sublinear Algorithms and today Seshadhri is taking on blogging duties. Thanks Sesh!)

How to win Netflix: The survey of the day was “Compressed sensing to matrix completion and beyond” by Benjamin Recht. We started by discussing the basic matrix completion problem. Think of a matrix with the rows being movies and columns being Netflix users. The entries are the respective ratings. Suppose we got few of those ratings. Can we reasonably guess what the remaining entries are (and thereby win \$1,000,000)? Formally, we have some $k \times n$ matrix $M$ with some entries that are provided. We assume that the matrix is of low rank, meaning that $M = LR$, where $L$ is $k \times r$ and $R$ is $r \times n$. The total number of free variables is $r(k+n)$, which for small $r$ is much smaller than the trivial bound of $kn$. Ben showed how techniques from compressed sensing could be used to attack this problem. His work gave a strong mathematical basis for certain heuristics that people used to solve such problems.

The importance of being coherent: That segued into the next talk, the only board talk of the workshop. Michael Mahoney talked about “Fast Approximations for Matrix Coherence and Statistical Leverage”. The concept of matrix coherence was introduced by compressed sensing folks (and the related notion of statistical leverage was present in earlier work by Michael). Consider a low-rank matrix M with the SVD $M = U \Sigma V^*$ (so $U$ is an $n \times r$ matrix). The coherence is the maximum length of the projection of a standard basis vector onto the subspace spanned by the $r$ column vectors of $U$. Formally, $\mu(U) = \frac{n}{r}\max_{i \leq n}\| {\bf P}e_i\|^2$

where $e_i$‘s are the standard basis vectors and ${\bf P}$ is the projection matrix onto $U$. Michael talked about fast algorithms to compute this quantity, involving many linear algebraic manipulations and the recent fast Johnson-Lindenstrauss algorithms.

Big brother is watching you(r net traffic): After that, it was time for more traditional theory CS. Ely Porat talked about the classic problem of pattern matching in the streaming model. We are given a text T that we stream through, and a fixed pattern P of length $m$. The fingerprinting algorithm of Karp-Rabin solves this problem, but requires $O(m)$ space (since it basically maintains a sliding window of that size). Ely’s aim was to get this down to $\textrm{polylog} m$ space, so that it would be appropriate to be blogged about here. He discussed some clever tricks to break this pattern into smaller sub-patterns, and using recursive fingerprinting schemes. One open problem was the get space lower bounds for this rather fundamental problem. He also mentioned the open problem of arranging his sabbatical for next year! Contact him for details.

How to find your Dacians: Alexandr Andoni gave a talk on “Precision and Sampling Algorithms”. The title was changed in real time to “Sublinear Algorithms via Precision Sampling” (and hence us attendees will never know what his original talk was about). After giving the motivation of the emperor of ancient Rome attempting to estimate the total number of Dacians in various provinces, Alex described a very cute sampling problem. We want to estimate the values of $S = \sum_{1 \leq i \leq n} a_i$, where $a_i \in [0,1]$. Unfortunately, we can only get an estimate $\widetilde{a_i}$ with a precision $u_i$ (so $|\widetilde{a_i} - a_i| \leq u_i$). Our power is that we can choose the precisions $u_i$ but have to pay a cost of $(1/n)\sum_i 1/u_i$. How do we get an estimate $\widetilde{S}$ satisfying $S-\epsilon < \widetilde{S} < (1+\epsilon)S + O(1)$

for constant $\epsilon$? Alex gave a nifty algorithm that has a cost of $O(\epsilon^{-3}\log n)$. This had applications in approximating the edit distance and also in streaming algorithms for large frequence moments.

How to bake your cake and eat it too: Nir Halman talked about “FPTASs for Stochastic Optimization Problems”. He introduced his talk using a baking analogy. First, he told us why the result (or cake) was going to be tasty, and then gave us the ingredients and recipe for the result. The problem was a stochastic version of the classic discrete knapsack problem. We have $T$ objects with the $i$th object having volume $v_i$ and profit $p_i$. The volume of the knapsack is $B$, and we can solve this problem using dynamic programming is $O(BT)$ time. This is, of course, not polynomial, and the aim to get a $(1+\epsilon)$-approximate solution and only pay an overhead of $poly(1/\epsilon)$. The general framework of this result is strong enough to generate FPTASs for various versions of knapsack. One of the important cases Nir discussed was the stochastic version, where the volume $v_i$ comes from some explicitly known distribution. Another interesting case was the non-linear knapsack, where we can choose integer copies of each element.

The Arthur Cayley approach to odd-cycles: Asaf Shapira was next, telling us about “Testing Odd-Cycle Freeness of Boolean Functions”. Given a boolean function $f$, an odd cycle a set of domain points $x_1, x_2, \ldots, x_t$ such that $t$ is odd, $f(x_1) = \ldots = f(x_t) = 1$ and $\sum_i x_i = 0$. This is a property that falls into the broad class of linear-invariant properties defined by simple constraints, but uncharacteristically, does not have an bounded constraint size. The main result is that this is testable in $O(1/\epsilon^2)$ queries. Previous testers used general characterizations that gave the standard “tower of $(1/\epsilon)$” query complexity (Arnab talked about this result on the previous day). Asaf gave an excellent talk on relating this problem to testing bipartiteness of the Cayley graph of $f$, which was really the heart of the result. He gave details on how to combine Fourier analysis ideas with a spectral approach, connecting the odd cycles in $f$ to the bipartiteness (or lack thereof) of the Cayley graph. This was done extremely well, so even though he walked us through the technical approach, the main ideas were very clear (a rather tough thing to do, IMHO). Finally, he also explained that there exists a canonical $poly(1/\epsilon)$-query tester for this problem. This tester randomly samples $x_1, \ldots, x_{\log (1/\epsilon)}$ and only queries the subspace spanned by them. It is an interesting open question whether for any linear invariant property that is testable in $q(\epsilon)$ queries, there exists a canonical tester running in $poly(q(\epsilon))$ time.

After lunch, it was time for the excursion to the ancient town of Ravenna. The highlights were the mosaics of the Basilica of San Vitale and the final resting place of Dante Aligheri. These involved little to no new theoretical results, and hence the blog post shall end here.