
Next up in the mini-course on data streams (first two lectures here and here) were lower bounds and communication complexity. The slides are here:

The outline was:

1. Basic Framework: If you have a small-space algorithm for stream problem $Q$, you can turn this into a low-communication protocol for a related communication problem $P$. Hence, a lower bound on the communication required to solve $P$ implies a lower bound on the space required to solve $Q$. Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation.
2. Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.
3. Hamming Distance: Lastly, we presented a lower bound for the Hamming approximation problem using the ingenious but simple proof of [Jayram et al.].

Tout le monde! Here’s the group photo from this year’s L’Ecole de Printemps d’Informatique Théorique.

Everyone loves the union bound. It is the most dependable of all probability bounds. No matter what mess of dependencies threaten your analysis, the union bound will ride in and save the day. Except when it doesn’t.

I’ve been giving a talk about some recent results (here are some slides) and have received a particular question from a couple of different audiences. The exact details aren’t important but the basic gist was why I didn’t simplify some of the analysis by just applying the union bound. The short answer was it didn’t apply. But how could that be? My longer answer was longer and probably less helpful. So let me distill things down to the following paradox (which isn’t really a paradox of course).

The Geography Test. Suppose you have a geography test tomorrow. In the test you’ll be asked two questions out of five possible questions (and you already know the list of possible questions). If you get both right, you’ll pass the course. However, rather than revise everything you adopt a random strategy that ensures that for any question, you’ll have a correct answer with probability at least 4/5. Hence, by the union bound, you’ll get both questions right with probability at least 3/5.

Right? Sure, it’s a straightforward application of the union bound: the probability of being wrong on one or both questions is at most the probability of being wrong on the first (at most 1/5) plus the probability of being wrong on the second (at most 1/5). And the union bound never lets you down.

The Bad News. You’re going to fail both your geography and probability classes. Consider the following scenario. The five possible exam questions are: 1) name a city in the US, 2) name a city in France, 3) name a city in Germany, 4) name a city in the UK, and 5) name a city in Italy. Your cunning strategy is to memorize one of the rows in the following table.

US          France     Germany    UK         Italy
New York    Paris      Berlin     Glasgow    Reykjavik
Chicago     Paris      Berlin     Reykjavik  Rome
Los Angeles Paris      Reykjavik  Glasgow    Rome
Boston      Reykjavik  Berlin     Glasgow    Rome
Reykjavik   Paris      Berlin     Glasgow    Rome

You choose the row to remember uniformly at random thus guaranteeing that you can answer any question correctly with probability 4/5. Of course the professor doesn’t know which row you’ll pick but he does know your strategy. (This is equivalent to the usual assumption we make about adversaries knowing your algorithm but not your random bits.)

Unfortunately, you stay up so late studying your row that you sleep in and miss the exam. You go to the professor’s office to plead for a make-up exam. He is initially reluctant but then a devious smile crosses his face. Sure, he’ll still give you the exam. You’re relieved that you’ll still pass the course with probability at least 3/5. The professor doesn’t look so sure.

The first question he asks is to name a city in the US. “Philadelphia” you answer with confidence. As the professor nods you see that devious smile again. The second question… name a city in the UK.

And then you realize that the probability of success was never as large as 3/5 since the professor could always guarantee that you fail by first asking for a US city and then choosing his second question accordingly.
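For the skeptical, here’s a quick Python simulation of the story. The five-row study table is reproduced below for illustration (each row is wrong on exactly one question; the exact city names don’t matter): against an oblivious professor who picks two random questions you pass with probability exactly 3/5, but against the adaptive professor you fail every time.

```python
import random

QUESTIONS = ["US", "France", "Germany", "UK", "Italy"]
# Illustrative study table: each row answers exactly one question incorrectly.
ROWS = [
    {"US": "New York", "France": "Paris", "Germany": "Berlin", "UK": "Glasgow", "Italy": "Reykjavik"},
    {"US": "Chicago", "France": "Paris", "Germany": "Berlin", "UK": "Reykjavik", "Italy": "Rome"},
    {"US": "Los Angeles", "France": "Paris", "Germany": "Reykjavik", "UK": "Glasgow", "Italy": "Rome"},
    {"US": "Boston", "France": "Reykjavik", "Germany": "Berlin", "UK": "Glasgow", "Italy": "Rome"},
    {"US": "Reykjavik", "France": "Paris", "Germany": "Berlin", "UK": "Glasgow", "Italy": "Rome"},
]

def wrong_question(row):
    # The single question this row answers incorrectly.
    return next(q for q in QUESTIONS if row[q] == "Reykjavik")

def oblivious_exam(rng):
    # Professor fixes two distinct questions without seeing any answers.
    row = rng.choice(ROWS)
    q1, q2 = rng.sample(QUESTIONS, 2)
    return row[q1] != "Reykjavik" and row[q2] != "Reykjavik"

def adaptive_exam(rng):
    # Professor asks the US question first, then chooses the second question
    # based on the answer.
    row = rng.choice(ROWS)
    if row["US"] == "Reykjavik":
        return False  # already wrong on the first question
    # The US answers are all distinct, so the first answer reveals the row;
    # the professor then asks the one question that row gets wrong.
    return row[wrong_question(row)] != "Reykjavik"  # always False
```

Note the union bound is perfectly valid for any *fixed* pair of questions; the adaptive professor breaks the independence between your marginal guarantees and which questions get asked.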

Morals of the story: Don’t miss exams, study hard, and don’t blindly trust the union bound.

Here’s another bite-sized stream algorithm for your delectation. This time we want to simulate a random walk from a given node $u$ in a graph whose edges arrive as an arbitrarily-ordered stream. I’ll allow you multiple passes and semi-streaming space, i.e., $\tilde{O}(n)$ space where $n$ is the number of nodes in the graph. You need to return the final vertex of a length $t=o(n)$ random walk.

This is trivial if you take $t$ passes: in each pass pick a random neighbor of the node picked in the last pass. Can you do it in fewer passes?

Well, here’s an algorithm from [Das Sarma, Gollapudi, Panigrahy] that simulates a random walk of length $t$ in $\tilde{O}(n)$ space while only taking $O(\sqrt{t})$ passes. As in the trivial algorithm, we build up the random walk sequentially. But rather than making a single hop of progress in each pass, we’ll construct the random walk by stitching together shorter random walks.

1. We first compute short random walks from each node. Using the trivial algorithm, do a length $\sqrt{t}$ walk from each node $v$ and let $T[v]$ be the end point.
2. We can’t reuse short random walks (otherwise the steps in the random walk won’t be independent) so let $S$ be the set of nodes from which we’ve already taken a short random walk. To start, let $S\leftarrow \{u\}$ and $v\leftarrow T[u], \ell\leftarrow\sqrt{t}$ where $v$ is the vertex reached by the random walk constructed so far and $\ell$ is the length of this random walk.
3. While $\ell < t-\sqrt{t}$
1. If $v\not \in S$ then set $S\leftarrow S\cup \{v\}, v\leftarrow T[v], \ell \leftarrow \ell+\sqrt{t}$
2. Otherwise, sample $\sqrt{t}$ edges (with replacement) incident on each node in $S$. Find the maximal path from $v$ such that on the $i$-th visit to node $x$, we take the $i$-th edge that was sampled for node $x$. The path terminates either when a node in $S$ is visited more than $\sqrt{t}$ times or we reach a node that isn’t in $S$. Reset $v$ to be the final node of this path and increase $\ell$ by the length of the path. (If we complete the length $t$ random walk during this process we may terminate at this point and return the current node.)
4. Perform the remaining $O(\sqrt{t})$ steps of the walk using the trivial algorithm.

So why does it work? First note that the maximum size of $S$ is $O(\sqrt{t})$ because $|S|$ is only incremented when $\ell$ increases by at least $\sqrt{t}$ and we know that $\ell \leq t$. The total space required to store the vertices $T$ is $\tilde{O}(n)$. When we sample $\sqrt{t}$ edges incident on each node in $S$, this requires $\tilde{O}(|S|\sqrt{t})=\tilde{O}(t)$ space. Hence the total space is $\tilde{O}(n)$. For the number of passes, note that when we need to take a pass to sample edges incident on $S$, we make $O(\sqrt{t})$ hops of progress because either we reach a node with an unused short walk or the walk uses $\Omega(\sqrt{t})$ sampled edges. Hence, including the $O(\sqrt{t})$ passes used at the start and end of the algorithm, the total number of passes is $O(\sqrt{t})$.
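If it helps to see the stitching spelled out, here’s a sketch in Python. The stream passes are replaced by direct access to an adjacency list (each loop over nodes or over sampled edges corresponds to passes over the stream); the names $T$, $S$, and $\ell$ follow the description above.

```python
import math
import random

def stitched_walk(adj, u, t, rng):
    """Simulate a length-t random walk from u by stitching length-s walks, s ~ sqrt(t)."""
    s = max(1, math.isqrt(t))
    # Phase 1 (s passes): a length-s walk from every node; T[v] is its endpoint.
    T = {}
    for start in adj:
        x = start
        for _ in range(s):
            x = rng.choice(adj[x])
        T[start] = x
    S = {u}                # nodes whose short walk has already been used
    v, ell = T[u], s       # current endpoint and length of the walk so far
    while ell < t - s:
        if v not in S:     # unused short walk available: stitch it on
            S.add(v)
            v, ell = T[v], ell + s
        else:              # one pass: sample s edges (with replacement) per node of S
            samples = {x: [rng.choice(adj[x]) for _ in range(s)] for x in S}
            visits = {x: 0 for x in S}
            while v in S and visits[v] < s and ell < t:
                nxt = samples[v][visits[v]]  # i-th sampled edge on the i-th visit
                visits[v] += 1
                v, ell = nxt, ell + 1
    while ell < t:         # finish the remaining O(sqrt(t)) steps trivially
        v, ell = rng.choice(adj[v]), ell + 1
    return v
```

Every hop taken is a fresh uniform choice of neighbor, so the returned vertex has exactly the distribution of the endpoint of a length-$t$ random walk.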

Das Sarma et al. also present a trade-off result that reduces the space to $\tilde{O}(n\alpha+\sqrt{t/\alpha})$ for any $\alpha\in (0,1]$ at the expense of increasing the number of passes to $\tilde{O}(\sqrt{t/\alpha})$. They then use this for estimating the PageRank vector of the graph.

Following on from the last post on maximum weighted matchings in the stream model, I wanted to briefly mention a result that appeared in STACS last week. The set-up is the same as before: there’s a stream of undirected weighted edges $\langle (e_1,w_{e_1}), \ldots, (e_m,w_{e_m})\rangle$ where $w_{e}$ is the weight of edge $e$ and we want to find an approximate maximum weighted matching. Last time, we showed a simple algorithm that gave a 5.828 approximation that was “preemptive online”, i.e., at any point in time the algorithm remembered no information other than the feasible matching found so far. The follow-up work achieved a 5.585 approximation but didn’t have this property. This raises numerous questions, including whether 5-ish is the “right answer” for this problem and whether there is a separation between preemptive-online algorithms and stream algorithms that use roughly the same amount of memory.

Which brings us to [Epstein, Levin, Mestre, Segev]. They showed a lower bound of 4.967 for any deterministic preemptive online algorithm. Furthermore, they show that there exists a deterministic stream algorithm that breaks through this bound and achieves a 4.911 approximation. As a commenter pointed out, the algorithm is nice and simple. Furthermore, it makes a good teaching example of how algorithms don’t necessarily appear from thin air, but rather can be developed in a series of less mysterious steps. Here’s what they do in the paper:

• Design a simple deterministic 8-approximation: Geometrically group edge weights by rounding edge weight to the nearest power of 2 and ignore the “very light” edges. Greedily find a maximal matching among each group of edges. At the end of the stream, find a maximum matching among the stored edges.
• Randomize the algorithm to get a 4.911-approximation: Instead of grouping to the nearest power of 2, do randomized geometric grouping with the appropriate parameters.
• Derandomize!

Thought… It might be interesting to investigate which other problems exhibit a separation between the preemptive online model and the stream model. You may need to be flexible with your definition of “preemptive online.”

Readers with a particularly long memory may recall that we’re in the midst of a series of “bite-sized results” in data stream algorithms. Back in October, we discussed graph spanners and now it’s time for some graph matchings.

So consider a stream of undirected weighted edges $\langle (e_1,w_{e_1}), \ldots, (e_m,w_{e_m})\rangle$ where $w_{e}$ is the weight of edge $e$. Can we find a matching, i.e., a subset of the edges that don’t share any endpoints, such that the sum of the associated weights is maximized?

Here’s a really simple algorithm that maintains a matching as the edges are processed:

• The initial matching is the empty matching
• When processing edge $e$, the weight of $e$ is compared with the total weight $W$ of the conflicting edges, i.e., the edges in the currently stored matching that share an endpoint with $e$. If $w_e>(1+\gamma) W$ then $e$ is added to the matching and the conflicting edges are removed.

It makes a nice exercise to prove that the above algorithm achieves a $3+1/\gamma+2\gamma$ approximation ratio (see pages 149 and 150 of this thing if you get stuck) and choosing the optimal $\gamma$ gives you a 5.828 approximation. Incidentally, this algorithm later found use in a paper on auctioning ad slots [Constantin, Feldman, Muthukrishnan, Pal].
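For concreteness, here’s the one-pass algorithm as a short Python sketch (the function name and the representation of the matching are mine):

```python
def preemptive_matching(stream, gamma):
    """One-pass preemptive matching: keep edge e iff w_e > (1+gamma) * (weight of conflicts)."""
    matching = {}  # endpoint -> the stored edge (u, v, w) covering it
    for u, v, w in stream:
        # Edges in the current matching sharing an endpoint with (u, v).
        conflicts = {matching[x] for x in (u, v) if x in matching}
        W = sum(cw for _, _, cw in conflicts)
        if w > (1 + gamma) * W:
            for cu, cv, _ in conflicts:   # evict the conflicting edges
                del matching[cu]
                del matching[cv]
            matching[u] = matching[v] = (u, v, w)
    return set(matching.values())
```

At any point the algorithm stores nothing beyond the current feasible matching, which is exactly the “preemptive online” property discussed above.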

Subsequently, a 5.585 approximation algorithm was presented in [Zelke]. The main idea was to remember some of the edges that were ejected from the matching in the second step of the above algorithm. These edges are potentially added back into the matching at some later point. While a natural idea, it requires substantially more analysis to show the improved approximation ratio.

The last four BSS posts considered streams of numerical values. While there are still many more bite-sized results to cover for such streams, I thought we’d mix things up a bit and switch to graph streams for the next four posts. We’ll consider a stream of unweighted, undirected edges $\langle e_1, \ldots, e_m\rangle$ and try to estimate properties of the graph $G=(V,E)$ defined by this stream subject to the usual constraints of stream computation (sequential access to the stream and limited memory). For this post, we focus on estimating distances.

The results we’ll discuss revolve around constructing a $t$-spanner, a sub-graph $H=(V,E')$ of $G$ such that for any $u,v\in V$

$d_G(u,v)\leq d_H(u,v) \leq t d_G(u,v)$

where $d_X(u,v)$ is the length of the shortest path between $u, v\in V$ in graph $X$. Obviously, given such a graph we can approximate any distance up to a factor of $t.$

So how can we construct a $t$-spanner in the data-stream model? Here’s a very simple algorithm: Initialize $H$ to $\emptyset$ and on receiving edge $e=(u,v)$, add the edge to $H$ if and only if $d_H(u,v)>t$.

It’s easy to show that $H$ is indeed a $t$-spanner because, for each edge $e=(u,v)$ on a shortest path in $G$, $H$ either includes $e$ or a path between $u$ and $v$ of length at most $t$. The space taken by the algorithm is just the space required to store $H$. How large can this be? Well, note that $H$ has girth at least $t+2$, i.e., it contains no cycles of length $t+1$ or shorter. It can be shown that such a graph has at most $O(n^{1+2/(t+1)})$ edges if $t$ is odd (see e.g., [Bollobás] or [Indyk’s lecture slides]) and hence we have a stream algorithm using $\tilde{O}(n^{1+2/(t+1)})$ bits of memory that finds a $t$ approximation for any shortest path distance.
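Here’s a minimal Python sketch of this greedy construction; a depth-bounded BFS stands in for the distance query $d_H(u,v)>t$:

```python
from collections import defaultdict, deque

def bounded_distance(H, u, v, limit):
    """BFS distance from u to v in H, capped at limit (returns limit+1 if farther)."""
    if u == v:
        return 0
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == limit:
            continue  # don't expand beyond the depth cap
        for y in H[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                if y == v:
                    return dist[y]
                queue.append(y)
    return limit + 1

def stream_spanner(edges, t):
    """Add edge (u, v) to H only if u and v are currently more than t apart in H."""
    H = defaultdict(set)
    for u, v in edges:
        if bounded_distance(H, u, v, t) > t:
            H[u].add(v)
            H[v].add(u)
    return H
```

For example, feeding in the edges of a complete graph keeps only a sparse subgraph, yet every pair of adjacent vertices in $G$ stays within distance $t$ in $H$.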

Unfortunately, the time to process each edge in the above algorithm is rather large and this spurred some nice follow-up work including [Feigenbaum et al.], [Baswana], and [Elkin]. It’s now known how to construct a $t$-spanner in $\tilde{O}(n^{1+2/(t+1)})$ space and constant worst-case time per edge.

It’s not hard to extend these algorithms to weighted graphs (consider geometrically grouping the edge weights and constructing a spanner for each distinct edge weight) but there’s bad news for directed graphs: even determining if there exists a directed path between two nodes requires $\Omega(n^2)$ bits of memory.

BONUS! Some British English… Back in the UK, we call this

a spanner. Apparently that’s not the usual term in US English and I found this out after giving a talk at SODA back in 2005. The talk went reasonably well but, during the coffee break, numerous people asked why I’d kept on displaying pictures of wrenches throughout my talk. Whoops.

In technical writing, I normally endeavor [sic] to use US English (although I did once engage in some “colouring of paths” in a paper with a Canadian co-author). But I still think “math and stat” sounds funny :)

My Biased Coin recently discussed a new paper extending some work I’d done a few years back. I’ll briefly mention this work at the end of the post but first here’s another bite-sized result from [Alon, Matias, Szegedy] that is closely related.

Consider a numerical stream that defines a frequency vector $\mathbf{f}=(f_1, \ldots, f_n)$ where $f_i$ is the frequency of $i$ in the stream. Here’s a simple sketch algorithm that allows you to approximate $F_2(\mathbf{f})=\sum_i f_i^2.$ Let $\mathbf{x}=(x_1, \ldots, x_n)\in \{-1,1\}^n$ be a random vector where the $x_i$ are 4-wise independent and unbiased. Consider $r=\mathbf{x}\cdot \mathbf{f}$ and note that $r$ can be computed incrementally as the stream arrives (given that $\mathbf{x}$ is implicitly stored by the algorithm.)

By the weaker assumption of 2-wise independence, we observe that $E[x_i x_j]=0$ if $i\neq j$ and so:

$E[r^2]=\sum_{i,j\in [n]} f_i f_j E[x_i x_j]=F_2$

By the assumption of 4-wise independence, we also observe that $E[x_i x_j x_k x_l]=0$ unless $i=j$ and $k=l$, or $i=k$ and $j=l$, or $i=l$ and $j=k$, and so:

$\mbox{Var}[r^2]= \sum_{i,j,k,l\in [n]} f_i f_j f_k f_l E[x_i x_j x_k x_l]-F_2^2=4\sum_{i<j} f_i^2 f_j^2\leq 2F_2^2$

Hence, if we repeat the process with $O(\epsilon^{-2} \log \delta^{-1})$ independent copies of $x$, it’s possible to show (via Chebyshev and Chernoff bounds) that by appropriately averaging the results, we get a value that is within a factor $(1+\epsilon)$ of $F_2$ with probability at least $1-\delta$. Note that it was lucky that we didn’t need full coordinate independence because that would have required $O(n)$ bits just to remember $\mathbf{x}$. It can be shown that remembering $O(\log n)$ bits is sufficient if we only need 4-wise independence.
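Here’s a self-contained Python sketch of the scheme. The 4-wise independent signs come from a random degree-3 polynomial over a prime field (a standard construction); taking the low-order bit gives an essentially unbiased $\pm 1$ value. For simplicity this sketch just averages the $r^2$ values rather than doing the full median-of-means argument:

```python
import random

P = 2 ** 31 - 1  # prime modulus for the hash functions

def make_sign_hash(rng):
    """Random degree-3 polynomial over GF(P): its values are 4-wise independent."""
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    def sign(i):
        h = (((a * i + b) * i + c) * i + d) % P
        return 1 if h & 1 else -1  # low-order bit -> (essentially) unbiased +/-1
    return sign

def f2_estimate(stream, copies, rng):
    """Estimate F_2 = sum_i f_i^2 by averaging independent copies of r^2."""
    signs = [make_sign_hash(rng) for _ in range(copies)]
    r = [0] * copies  # r[k] = x_k . f, maintained incrementally as items arrive
    for item in stream:
        for k, sign in enumerate(signs):
            r[k] += sign(item)
    return sum(rk * rk for rk in r) / copies
```

Note that each copy needs only the four coefficients $(a,b,c,d)$, i.e., $O(\log n)$ bits, rather than an explicit length-$n$ sign vector.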

BONUS! The recent work… Having at least 4-wise independence seemed pretty important to getting a good bound on the variance of $r^2$. However, it was observed in [Indyk, McGregor] that the following also worked. First pick $\mathbf{y,z}\in \{-1,1\}^{\sqrt{n}}$ where $\mathbf{y,z}$ are independent and the coordinates of each are 4-wise independent. Then let the coordinate values of $\mathbf{x}$ be $\{y_i z_j:1\leq i,j\leq\sqrt{n}\}$. It’s no longer the case that the coordinates are 4-wise independent but it’s still possible to show that $\mbox{Var}[r^2]=O(F_2^2)$ and this is good enough for our purposes.

In follow up work by [Braverman, Ostrovsky] and [Chung, Liu, Mitzenmacher], it was shown that you can push this idea further and define $\mathbf{x}$ based on $k$ random vectors of length $n^{1/k}$. The culmination of this work shows that the variance increases to at most $3^k F_2^2$ and the resultant algorithm uses $\tilde{O}(\epsilon^{-2} 3^k \log \delta^{-1})$ space.

At this point, you’d be excused for asking why we all cared about such a construction. The reason is that it’s an important technical step in solving the following problem: given a stream of tuples from $[n]^k$, can we determine if the $k$ coordinates are independent? In particular, how “far” is the joint distribution from the product distribution defined by considering the frequency of each value in each coordinate separately. When “far” is measured in terms of the Euclidean distance, a neat solution is based on the above analysis. Check out the papers for the details.

If you want to measure independence in terms of the variational distance, check out [Braverman, Ostrovsky]. In the case $k=2$, measuring independence in terms of the KL-divergence gives the mutual information. For this, see [Indyk, McGregor] again.

We’re presented with a numerical stream $A=\langle a_1, \ldots, a_m\rangle\in [n]^m$ and we want to compute some small-space approximation of $\mathbf{f}= (f_1, \ldots, f_n)$ where $f_i=|\{j:a_j=i\}|$ is the frequency of $i$. In particular, we want to be able to return estimates $\tilde{f}_i$ that satisfy

$f_i\leq \tilde{f}_i \leq f_i+\epsilon m$

where $i$ is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in $\tilde{O}(\epsilon^{-2})$ space. (Also check out [pg. 31, Gasieniec, Muthukrishnan] for a special case.)

Let $p_1, p_2, \ldots, p_t$ be the first $t$ prime numbers and define “hash” functions $h_j(x)= x \bmod p_j$. Based on these functions, the CR-Precis algorithm maintains $\sum_{j\in [t]} p_j\leq p_t t=O(t^2 \log t)$ counters $c_{j,k}$ which are initially zero. When we observe $a$ in the stream we increment each of $c_{j,h_j(a)}$ for $j\in [t]$.

Say we’re interested in the frequency of $i$. At the end of the day, we end up with $t$ overestimates for $f_i$, i.e., $c_{1,h_1(i)}, c_{2,h_2(i)}, \ldots, c_{t,h_t(i)}$. Hence, it makes sense to return $\tilde{f}_i=\min_j c_{j,h_j(i)}$. How much of an overestimate can this be?

Note that for any $i'\neq i$, there are at most $\log n$ functions $h_j$ under which $i,i'$ collide, i.e., $h_j(i)=h_j(i')$. This follows from the Chinese Remainder Theorem. Hence, we deduce that,

$\sum_j c_{j,h_j(i)} \leq t f_i+ (\sum_{i'\neq i} f_{i'}) \log n$

and therefore $\tilde{f}_i\leq f_i+(m\log n)/t$. Setting $t=O(\epsilon^{-1} \log n)$ gives the desired result.

But wait, there’s more! There are actually other deterministic algorithms for this problem that use only $\tilde{O}(\epsilon^{-1})$ space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a $O(t^2 \log t) \times n$ matrix $S$ in a natural way such that the algorithm described above is simply computing $S\mathbf{f}^T$. This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream

$A=\langle a_1, \ldots, a_m\rangle\in (\{-,+\}\times [n])^m$

and define

$f_i=|\{j:a_j=(+,i)\}|-|\{j:a_j=(-,i)\}|$.

Assuming that all $f_i\geq 0$, the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing $(-,i)$ and incrementing when processing $(+,i)$.
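The whole algorithm, deletions included, fits in a few lines of Python (the class and method names are mine):

```python
class CRPrecis:
    def __init__(self, t):
        # Generate the first t primes by trial division.
        self.primes = []
        n = 2
        while len(self.primes) < t:
            if all(n % p for p in self.primes):
                self.primes.append(n)
            n += 1
        # One row of p_j counters per hash function h_j(x) = x mod p_j.
        self.counters = [[0] * p for p in self.primes]

    def update(self, a, delta=1):
        """Process item a; delta=-1 handles deletions (assuming all f_i stay >= 0)."""
        for row, p in zip(self.counters, self.primes):
            row[a % p] += delta

    def estimate(self, i):
        """Return the smallest of the t overestimates of f_i."""
        return min(row[i % p] for row, p in zip(self.counters, self.primes))
```

Since each update just adds $\pm 1$ to one counter per row, the data structure is linear in the frequency vector, which is exactly the sketch property discussed above.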

Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.

We consider a stream $A=\langle a_1, \ldots, a_m\rangle$ formed by arranging a set of distinct values $B=\{b_1, \ldots, b_m\}$ in random order, i.e., $a_i=b_{\pi(i)}$ for a permutation $\pi$ of $[m]$ chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).

Here’s a simple algorithm that maintains a set $S$ of $s=O(\sqrt{m})$ elements and two counters $h$ and $l$ which are initialized to zero. Store the first $s$ values from the stream in $S$ and as each new element $a$ arrives, do one of the following:

1. If $a <\min S$: Increment $l$
2. If $a >\max S$: Increment $h$
3. If $\min S < a < \max S$ and $h \geq l$: Add $a$ to $S$, remove smallest element from $S$, and increment $l$
4. If $\min S < a < \max S$ and $h < l$: Add $a$ to $S$, remove largest element from $S$, and increment $h$

At the end of the stream, if $1\leq (m+1)/2-l\leq s$ then return the $((m+1)/2-l)$-th smallest element in $S$. Otherwise fail.

Why does the algorithm make sense? It’s easy to show that at any point, the set of elements that has arrived thus far consists of $S$ and $l$ values less than $\min S$ and $h$ values greater than $\max S$. Hence, if at the end of the stream, $S$ has a $((m+1)/2-l)$-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering $r=|h-l|$ and noting that, for some constant $c>0$,

1. Probability a non-zero $r$ is incremented at any step is at most 1/2 if $s\geq cr$
2. $r$ is either incremented or decremented at each step
3. If $s\geq cr$ at the final step, then $1\leq (m+1)/2-l\leq s$ as required

Hence, we have a simple “random-walk-like” set-up and there’s an $s=O(\sqrt{m})$ that ensures that, with high probability, the algorithm doesn’t fail.
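Here’s the algorithm as a short Python sketch, using a sorted list for $S$ and leaving the window size $s$ to the caller:

```python
import bisect
import random

def random_order_median(stream, s):
    """Exact median of a randomly ordered stream using a window of s elements."""
    S = sorted(stream[:s])  # buffer the first s values
    l = h = 0               # counts of values below min(S) and above max(S)
    for a in stream[s:]:
        if a < S[0]:
            l += 1
        elif a > S[-1]:
            h += 1
        elif h >= l:
            bisect.insort(S, a)  # keep the window centred: drop from the low end
            S.pop(0)
            l += 1
        else:
            bisect.insort(S, a)  # drop from the high end
            S.pop()
            h += 1
    k = (len(stream) + 1) // 2 - l  # rank of the median inside S, if it's there
    return S[k - 1] if 1 <= k <= len(S) else None
```

The invariant is exactly the one used in the analysis: the elements seen so far are $S$, plus $l$ values below $\min S$ and $h$ values above $\max S$, so whenever the final rank lands inside $S$ the answer is exact.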

Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].

We’re presented with a numerical stream $A=\langle a_1, \ldots, a_m\rangle\in [n]^m$ and we want to compute some function $\sum_{i\in [n]} g(f_i)$ where $f_i=|\{j:a_j=i\}|$ is the frequency of $i$ and $g$ is an arbitrary function with $g(0)=0$. One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following basic estimator for this type of problem:

1. Pick $J$ uniformly at random from $[m]$
2. Compute $R=|\{j\geq J:a_j=a_J\}|$
3. Let $X=m(g(R)-g(R-1))$

Note that $X$ can be computed using $O(\log m+\log n)$ space as the data arrives. It’s not hard to show that $E[X]=\sum_{i\in [n]} g(f_i)$ because,

$E[X]= \sum_{i} E[X|a_J=i] \Pr[a_J=i]=\sum_{i} E[g(R)-g(R-1)|a_J=i] f_i$

and

$E[g(R)-g(R-1)|a_J=i] = \frac{g(1)-g(0)}{f_i}+\ldots + \frac{g(f_i)-g(f_i-1)}{f_i}=\frac{g(f_i)}{f_i}\ .$

Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of $\sum_{i\in [n]} g(f_i)$. Quite how many basic estimators are required depends on the variance of $X$ and this determines the total space used by the algorithm.
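As a sanity check on the identity $E[X]=\sum_{i\in[n]} g(f_i)$, here’s the basic estimator in Python; averaging it over every possible choice of $J$ recovers the sum exactly, since the differences $g(r)-g(r-1)$ telescope:

```python
def basic_estimator(stream, g, J):
    """AMS basic estimator: pick position J, count occurrences of a_J from J onwards."""
    a = stream[J]
    R = sum(1 for j in range(J, len(stream)) if stream[j] == a)
    return len(stream) * (g(R) - g(R - 1))
```

In the streaming setting $J$ is drawn uniformly at random and $R$ is maintained with a single counter as the rest of the stream goes by.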

Using this approach, it was shown that the $k$-th order frequency moment, $F_k=\sum_{i\in [n]} f_i^k$, can be approximated up to a factor $(1+\epsilon)$ with probability $(1-\delta)$ using only $O(\epsilon^{-2} n^{1-1/k} \log \delta^{-1} (\log m+\log n))$ bits of space. This was subsequently improved to an (essentially) optimal result in [Indyk, Woodruff].
