The outline was:
- Basic Framework: If you have a small-space algorithm for a stream problem P, you can turn it into a low-communication protocol for a related communication problem Q. Hence, a lower bound on the communication required to solve Q implies a lower bound on the space required to solve P. Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation.
- Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.
- Hamming Distance: Lastly, we presented a lower bound on the Hamming approximation problem using the ingenious but simple proof of [Jayram et al.]
Hello everyone! Here's the group photo from this year's L'Ecole de Printemps d'Informatique Théorique.
Everyone loves the union bound. It is the most dependable of all probability bounds. No matter what mess of dependencies threaten your analysis, the union bound will ride in and save the day. Except when it doesn’t.
I’ve been giving a talk about some recent results (here are some slides) and have received a particular question from a couple of different audiences. The exact details aren’t important but the basic gist was why I didn’t simplify some of the analysis by just applying the union bound. The short answer was it didn’t apply. But how could that be? My longer answer was longer and probably less helpful. So let me distill things down to the following paradox (which isn’t really a paradox of course).
The Geography Test. Suppose you have a geography test tomorrow. In the test you’ll be asked two questions out of five possible questions (and you already know the list of possible questions). If you get both right, you’ll pass the course. However, rather than revise everything you adopt a random strategy that ensures that for any question, you’ll have a correct answer with probability at least 4/5. Hence, by the union bound, you’ll get both questions right with probability at least 3/5.
Right? Sure, it's a straightforward application of the union bound: the probability of being wrong on one or both questions is at most the probability of being wrong on the first (at most 1/5) plus the probability of being wrong on the second (at most 1/5). And the union bound never lets you down.
The Bad News. You're going to fail both your geography and probability classes. Consider the following scenario. The five possible exam questions are: 1) name a city in the US, 2) name a city in France, 3) name a city in Germany, 4) name a city in the UK, and 5) name a city in Italy. Your cunning strategy is to memorize one of the rows of a table with five rows, where each row lists a city in four of the five countries (each row omits a different country and no city appears in more than one row).
You choose the row to remember uniformly at random thus guaranteeing that you can answer any question correctly with probability 4/5. Of course the professor doesn’t know which row you’ll pick but he does know your strategy. (This is equivalent to the usual assumption we make about adversaries knowing your algorithm but not your random bits.)
Unfortunately, you stay up so late studying your row that you sleep in and miss the exam. You go to the professor’s office to plead for a make-up exam. He is initially reluctant but then a devious smile crosses his face. Sure, he’ll still give you the exam. You’re relieved that you’ll still pass the course with probability at least 3/5. The professor doesn’t look so sure.
The first question he asks is to name a city in the US. “Philadelphia” you answer with confidence. As the professor nods you see that devious smile again. The second question… name a city in the UK.
And then you realize that the probability of success was never as large as 3/5 since the professor could always guarantee that you fail by first asking for a US city and then choosing his second question accordingly.
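For the skeptical, the whole scenario fits in a few lines of code. The table here is my own illustrative reconstruction (row i omits country i and uses cities whose names reveal the row), but any table of that shape behaves the same way: each single question is answered with probability 4/5, yet the adaptive professor wins every time.

```python
COUNTRIES = ["US", "France", "Germany", "UK", "Italy"]

# Hypothetical table: row i lists a (distinct) city for every country
# except COUNTRIES[i], so a uniformly random row answers any one
# question with probability 4/5, but the answer betrays the row.
ROWS = [{c: f"{c}-city-{i}" for c in COUNTRIES if c != COUNTRIES[i]}
        for i in range(5)]

def single_question_success(country):
    # probability of answering `country` over a uniformly random row
    return sum(country in row for row in ROWS) / len(ROWS)

def adaptive_exam(row):
    # The professor asks for a US city first. If you answer, the city
    # itself reveals which row you memorized, and the second question
    # targets exactly the country that row omits.
    if "US" not in row:
        return False
    revealed = int(row["US"].rsplit("-", 1)[1])
    return COUNTRIES[revealed] in row   # always False by construction

pass_probability = sum(adaptive_exam(r) for r in ROWS) / len(ROWS)
```

Each individual question is answered with probability 4/5, but `pass_probability` comes out to 0, nowhere near the 3/5 the union bound seemed to promise.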
Morals of the story: Don’t miss exams, study hard, and don’t blindly trust the union bound.
Here’s another bite-sized stream algorithm for your delectation. This time we want to simulate a random walk from a given node in a graph whose edges arrive as an arbitrarily-ordered stream. I’ll allow you multiple passes and semi-streaming space, i.e., space where is the number of nodes in the graph. You need to return the final vertex of a length random walk.
This is trivial if you take l passes: in each pass, pick a random neighbor of the node picked in the last pass. Can you do it in fewer passes?
Well, here’s an algorithm from [Das Sarma, Gollapudi, Panigrahy] that simulates a random walk of length in space while only taking passes. As in the trivial algorithm, we build up the random walk sequentially. But rather than making a single hop of progress in each pass, we’ll construct the random walk by stitching together shorter random walks.
- We first compute short random walks from each node: using the trivial algorithm, do a length-√l walk from each node v and let T[v] be the end point.
- We can’t reuse short random walks (otherwise the steps in the random walk won’t be independent) so let be the set of nodes from which we’re already taken a short random walk. To start, let and where is the vertex that is reached by the random walk constructed so far and is the length of this random walk.
- If v ∉ S, then set S ← S ∪ {v}, v ← T[v], and L ← L + √l.
- Otherwise, sample Θ(√l) edges (with replacement) incident on each node in S. Find the maximal path from v such that, on the i-th visit to node w, we take the i-th edge that was sampled for node w. The path terminates either when some node in S has been visited more times than it has sampled edges or when we reach a node that isn't in S. Reset v to be the final node of this path and increase L by the length of the path. (If we complete the length-l random walk during this process, we may terminate at this point and return the current node.)
- Perform the remaining steps of the walk using the trivial algorithm.
So why does it work? First note that the maximum size of S is √l because a node is only added to S when L increases by √l, and we know that L ≤ l. The total space required to store the vertices T[v] is O(n log n) bits. When we sample Θ(√l) edges incident on each node in S, this requires O(l log n) bits since |S| ≤ √l. Hence the total space is O((n + l) log n) bits. For the number of passes, note that whenever we take a pass to sample edges incident on S, we make roughly √l hops of progress: either we reach a node with an unused short walk or the path uses up the Θ(√l) edges sampled for some node. Hence, including the √l passes used at the start and end of the algorithm, the total number of passes is O(√l).
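Here's a minimal in-memory sketch of the stitching idea. It's not an actual stream implementation (the graph sits in an adjacency list, and "passes" are only implicit), and the parameter choice w = ⌊√l⌋ samples per node is illustrative:

```python
import math
import random

def stitched_walk(adj, start, length, seed=0):
    """In-memory sketch of the stitching idea: precompute one short walk
    of length w ~ sqrt(length) per node, then stitch short walks
    together, falling back on sampled edges once a node's short walk
    has been spent. Not a real stream implementation."""
    rng = random.Random(seed)
    w = max(1, math.isqrt(length))
    # "First phase": one short walk per node (w passes in stream terms).
    T = {}
    for node in adj:
        cur = node
        for _ in range(w):
            cur = rng.choice(adj[cur])
        T[node] = cur
    v, L, used = start, 0, set()
    while L < length:
        if v not in used and L + w <= length:
            used.add(v)                    # spend v's short walk
            v, L = T[v], L + w
        elif v not in used:
            for _ in range(length - L):    # fewer than w steps remain
                v = rng.choice(adj[v])
            L = length
        else:
            # One extra pass: sample w edges per spent node, then follow
            # them, taking the i-th sample on the i-th visit to a node.
            samples = {u: [rng.choice(adj[u]) for _ in range(w)]
                       for u in used}
            visits = {u: 0 for u in used}
            while v in used and visits[v] < w and L < length:
                nxt = samples[v][visits[v]]
                visits[v] += 1
                v, L = nxt, L + 1
    return v
```

Since short walks are each used at most once, the stitched walk takes exactly `length` genuine random steps.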
Das Sarma et al. also present a trade-off result that further reduces the space at the expense of increasing the number of passes. They then use this for estimating the PageRank vector of the graph.
Following on from the last post on maximum weighted matchings in the stream model, I wanted to briefly mention a result that appeared in STACS last week. The set-up is the same as before: there's a stream of undirected weighted edges, where w(e) is the weight of edge e, and we want to find an approximate maximum weighted matching. Last time, we showed a simple algorithm that gave a 5.828 approximation that was "preemptive online", i.e., at any point in time the algorithm remembered no information other than the feasible matching found so far. The follow-up work achieved a 5.585 approximation but didn't have this property. This raises numerous questions, including whether 5-ish is the "right answer" for this problem and whether there is a separation between preemptive-online algorithms and stream algorithms that use roughly the same amount of memory.
Which brings us to [Epstein, Levin, Mestre, Segev]. They showed a lower bound of 4.967 for any deterministic preemptive online algorithm. Furthermore, they show that there exists a deterministic stream algorithm that breaks through this bound and achieves a 4.911 approximation. As a commenter pointed out, the algorithm is nice and simple. Furthermore, it makes a good teaching example of how algorithms don't necessarily appear from thin air, but rather can be developed in a series of less mysterious steps. Here's what they do in the paper:
- Design a simple deterministic 8-approximation: Geometrically group the edge weights by rounding each weight to the nearest power of 2, and ignore the "very light" edges. Greedily find a maximal matching among each group of edges. At the end of the stream, find a maximum matching among the stored edges.
- Randomize the algorithm to get a 4.911-approximation: Instead of grouping to the nearest power of 2, do randomized geometric grouping with the appropriate parameters.
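Here's a rough sketch of the deterministic grouping step in code. The light-edge threshold and the brute-force final matching are my own simplifications for illustration, not the parameters from the paper:

```python
import itertools
import math

def max_weight_matching(edges):
    """Exact maximum-weight matching by brute force; fine for the
    handful of edges the grouping step stores in a small example."""
    best, best_w = [], 0.0
    for r in range(1, len(edges) + 1):
        for sub in itertools.combinations(edges, r):
            nodes = [x for u, v, _ in sub for x in (u, v)]
            if len(nodes) == len(set(nodes)):     # no shared endpoints
                wt = sum(w for _, _, w in sub)
                if wt > best_w:
                    best, best_w = list(sub), wt
    return best, best_w

def grouped_matching(edges, base=2.0, eps=0.1):
    """Sketch of the grouping idea: bucket weights by powers of `base`,
    drop very light edges, keep a greedy maximal matching per bucket,
    then take a maximum-weight matching over everything stored.
    (Threshold and parameters are illustrative, not the paper's.)"""
    wmax = max(w for _, _, w in edges)    # offline simplification
    classes = {}
    for u, v, w in edges:                 # one pass over the "stream"
        if w < eps * wmax / len(edges):
            continue                      # ignore "very light" edges
        c = math.floor(math.log(w, base))
        grp = classes.setdefault(c, [])
        if all(u not in (a, b) and v not in (a, b) for a, b, _ in grp):
            grp.append((u, v, w))         # greedy maximal matching in class c
    stored = [e for grp in classes.values() for e in grp]
    return max_weight_matching(stored)
```

The randomized version replaces the fixed bucket boundaries with randomly shifted geometric boundaries; the skeleton is otherwise the same.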
Thought… It might be interesting to investigate which other problems exhibit a separation between the preemptive online model and the stream model. You may need to be flexible with your definition of “preemptive online.”
Readers with a particularly long memory may recall that we’re in the midst of a series of “bite-sized results” in data stream algorithms. Back in October, we discussed graph spanners and now it’s time for some graph matchings.
So consider a stream of undirected weighted edges, where w(e) is the weight of edge e. Can we find a matching, i.e., a subset of the edges that don't share any endpoints, such that the sum of the associated weights is maximized?
Here’s a really simple algorithm that maintains a matching as the edges are processed:
- The initial matching is the empty matching
- When processing edge e, the weight w(e) is compared with the total weight w(C) of the conflicting edges C, i.e., the edges in the currently stored matching that share an endpoint with e. If w(e) > (1 + γ)w(C), then e is added to the matching and the conflicting edges are removed.
It makes a nice exercise to prove the approximation ratio achieved by the above algorithm as a function of γ (see pages 149 and 150 of this thing if you get stuck); choosing the optimal γ gives you a 5.828 approximation. Incidentally, this algorithm later found use in a paper on auctioning ad slots [Constantin, Feldman, Muthukrishnan, Pal].
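The algorithm is short enough to state in code. A sketch, taking γ = 1/√2 as the default (as far as I recall this is the optimizing choice, but any γ > 0 gives some constant ratio):

```python
import math

def preemptive_matching(stream, gamma=1 / math.sqrt(2)):
    """One-pass preemptive algorithm: a new edge evicts its conflicting
    stored edges iff its weight exceeds (1 + gamma) times their total
    weight. The state is exactly the current feasible matching."""
    matching = {}                          # frozenset({u, v}) -> weight
    for u, v, w in stream:
        conflicts = [e for e in matching if u in e or v in e]
        if w > (1 + gamma) * sum(matching[e] for e in conflicts):
            for e in conflicts:
                del matching[e]            # preempt the lighter edges
            matching[frozenset((u, v))] = w
    return matching
```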
Subsequently, a 5.585 approximation algorithm was presented in [Zelke]. The main idea was to remember some of the edges that were ejected from the matching in the second step of the above algorithm. These edges are potentially added back into the matching at some later point. While a natural idea, it requires substantially more analysis to show the improved approximation ratio.
The last four BSS posts considered streams of numerical values. While there are still many more bite-sized results to cover for such streams, I thought we’d mix things up a bit and switch to graph streams for the next four posts. We’ll consider a stream of unweighted, undirected edges and try to estimate properties of the graph defined by this stream subject to the usual constraints of stream computation (sequential access to the stream and limited memory). For this post, we focus on estimating distances.
The results we’ll discuss revolve around constructing a -spanner, a sub-graph of such that for any
where is the length of the shortest path between in graph . Obviously, given such a graph we can approximate any distance up to a factor of
So how can we construct a (2t−1)-spanner in the data-stream model? Here's a very simple algorithm: initialize H to the empty graph and, on receiving edge (u,v), add the edge to H if and only if d_H(u,v) ≥ 2t.
It’s easy to show that is indeed a -spanner because, for each edge on a shortest path in , either includes or a path between and of length at most . The space taken by the algorithm is just the space required to store . How large can this be? Well, note that has girth at least , i.e., it contains no cycles of length or shorter. It can be shown that such a graph has at most edges if is odd (see e.g., [Bollobás] or [Indyk's lecture slides]) and hence we have a stream algorithm using bits of memory that finds a approximation for any shortest path distance.
Unfortunately, the time to process each edge in the above algorithm is rather large and this spurred some nice follow-up work including [Feigenbaum et al.], [Baswana], and [Elkin]. It's now known how to construct a (2t−1)-spanner in O(n^{1+1/t} polylog n) space with constant worst-case time per edge.
It’s not hard to extend these algorithms to weighted graphs (consider geometrically grouping the edge weights and constructing a spanner for each distinct edge weight) but there’s bad news for directed graphs: even determining if there exists a directed path between two nodes requires bits of memory.
BONUS! Some British English… Back in the UK, we call a wrench a "spanner". Apparently that's not the usual term in US English and I found this out after giving a talk at SODA back in 2005. The talk went reasonably well but, during the coffee break, numerous people asked why I'd kept on displaying pictures of wrenches throughout my talk. Whoops.
In technical writing, I normally endeavor [sic] to use US English (although I did once engage in some “colouring of paths” in a paper with a Canadian co-author). But I still think “math and stat” sounds funny :)
My Biased Coin recently discussed a new paper extending some work I’d done a few years back. I’ll briefly mention this work at the end of the post but first here’s another bite-sized result from [Alon, Matias, Szegedy] that is closely related.
Consider a numerical stream a_1, …, a_n, with each a_j ∈ [m], that defines a frequency vector f = (f_1, …, f_m) where f_i is the frequency of i in the stream. Here's a simple sketch algorithm that allows you to approximate F_2 = f_1^2 + … + f_m^2. Let x ∈ {−1, +1}^m be a random vector where the x_i are 4-wise independent and unbiased. Consider Z = x · f = Σ_i x_i f_i and note that Z can be computed incrementally as the stream arrives (given that x is implicitly stored by the algorithm).
By the weaker assumption of 2-wise independence, we observe that E[x_i x_j] = 0 if i ≠ j and so:

E[Z^2] = Σ_i f_i^2 = F_2
By the assumption of 4-wise independence, we also observe that E[x_i x_j x_k x_l] = 0 unless i = j = k = l or the indices pair up (e.g., i = j ≠ k = l), and so:

E[Z^4] = Σ_i f_i^4 + 3 Σ_{i≠j} f_i^2 f_j^2, which gives Var[Z^2] ≤ 2 F_2^2
Hence, if we repeat the process with O(ε^{−2} log δ^{−1}) independent copies of x, it's possible to show (via Chebyshev and Chernoff bounds) that by appropriately averaging the results, we get a value that is within a factor 1 + ε of F_2 with probability at least 1 − δ. Note that it was lucky that we didn't need full coordinate independence because that would have required Ω(m) bits just to remember x. It can be shown that remembering O(log m) bits is sufficient if we only need 4-wise independence.
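Here's a sketch of the estimator in code. The 4-wise independent signs come from the standard trick of evaluating a random degree-3 polynomial over a prime field, and a plain average of the copies stands in for the median-of-averages that gives the log δ^{−1} dependence:

```python
import random

P = 2**31 - 1   # prime modulus for the hash family

def fourwise_sign(coeffs, i):
    # A uniformly random degree-3 polynomial over GF(P) is a 4-wise
    # independent family; the parity of its value gives a sign that is
    # unbiased up to a negligible 1/P error.
    a, b, c, d = coeffs
    return 1 if ((((a * i + b) * i + c) * i + d) % P) % 2 == 0 else -1

class AMSSketch:
    def __init__(self, copies=1000, seed=42):
        rng = random.Random(seed)
        self.coeffs = [tuple(rng.randrange(P) for _ in range(4))
                       for _ in range(copies)]
        self.z = [0] * copies   # z[j] = sum_i x_j(i) * f_i, kept online

    def update(self, item, delta=1):
        for j, cf in enumerate(self.coeffs):
            self.z[j] += fourwise_sign(cf, item) * delta

    def estimate(self):
        # averaging the copies of Z^2 estimates F_2; in the full
        # algorithm a median of such averages boosts the confidence
        return sum(v * v for v in self.z) / len(self.z)
```

Note that only the four polynomial coefficients per copy are stored, never the vector x itself.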
BONUS! The recent work… Having at least 4-wise independence seemed pretty important to getting a good bound on the variance of Z^2. However, it was observed in [Indyk, McGregor] that the following also worked. First pick y, z ∈ {−1, +1}^{√m} where y and z are independent and the coordinates of each are 4-wise independent. Then let the coordinate values of x be x_{(i,j)} = y_i z_j. It's no longer the case that the coordinates of x are 4-wise independent, but it's still possible to show that Var[Z^2] = O(F_2^2) and this is good enough for our purposes.
In follow-up work by [Braverman, Ostrovsky] and [Chung, Liu, Mitzenmacher], it was shown that you can push this idea further and define x based on k random vectors of length m^{1/k}. The culmination of this work shows that the variance increases to at most 3^k F_2^2 and bounds the space of the resultant algorithm accordingly.
At this point, you'd be excused for asking why we all cared about such a construction. The reason is that it's an important technical step in solving the following problem: given a stream of tuples from [m] × ⋯ × [m], can we determine if the coordinates are independent? In particular, how "far" is the joint distribution from the product distribution defined by considering the frequency of each value in each coordinate separately? When "far" is measured in terms of the Euclidean distance, a neat solution is based on the above analysis. Check out the papers for the details.
If you want to measure independence in terms of the variational distance, check out [Braverman, Ostrovsky]. In the case of two coordinates, measuring independence in terms of the KL-divergence gives the mutual information. For this, see [Indyk, McGregor] again.
We’re presented with a numerical stream and we want to compute some small-space approximation of where is the frequency of . In particular, we want to be able to return estimates that satisfy
where is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)
Let p_1 < p_2 < … < p_k be the first k prime numbers and define "hash" functions h_j(x) = x mod p_j for j ∈ [k]. Based on these functions, the CR-Precis algorithm maintains counters c_{j,r} for j ∈ [k] and 0 ≤ r < p_j, which are initially zero. When we observe a in the stream, we increment each of c_{j, h_j(a)} for j ∈ [k].
Say we’re interested in the frequency of . At the end of the day, we end up with overestimates for , i.e., . Hence, it makes sense to return . How much of an overestimate can this be?
Note that for any y ≠ x, there are at most log_2 m functions under which x and y collide, i.e., h_j(x) = h_j(y). This follows from the Chinese Remainder Theorem: if x and y collide under t of the (distinct) primes then the product of these primes, which is at least 2^t, divides x − y, and x − y < m. Hence, we deduce that,

Σ_j c_{j, h_j(x)} ≤ k f_x + (log_2 m)(n − f_x)

and therefore f̂_x ≤ (1/k) Σ_j c_{j, h_j(x)} ≤ f_x + (log_2 m / k)(n − f_x). Setting k = ε^{−1} log_2 m gives the desired result.
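The whole algorithm is a few lines of code; here's a sketch with k chosen as in the analysis above:

```python
from math import log2

def first_primes(k):
    """The first k primes, by trial division (fine for small k)."""
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

class CRPrecis:
    def __init__(self, eps, m):
        # k = eps^{-1} log2(m) hash functions h_j(x) = x mod p_j
        self.k = max(1, int(log2(m) / eps) + 1)
        self.primes = first_primes(self.k)
        self.counters = [[0] * p for p in self.primes]

    def update(self, x, delta=1):
        # delta = -1 handles "deletes", as discussed below
        for j, p in enumerate(self.primes):
            self.counters[j][x % p] += delta

    def query(self, x):
        # every counter is an overestimate of f_x, so take the minimum
        return min(c[x % p] for c, p in zip(self.counters, self.primes))
```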
But wait, there’s more! There are actually other deterministic algorithms for this problem that use only space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a matrix in a natural way such that the algorithm described above is simply computing . This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream
Assuming that all the frequencies remain non-negative, the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing (a, −1) and incrementing when processing (a, +1).
Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.
We consider a stream formed by arranging a set of n distinct values in random order, i.e., the stream is x_{σ(1)}, …, x_{σ(n)} for a permutation σ chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).
Here’s a simple algorithm that maintains a set of elements and two counters and which are initialized to zero. Store the first values from the stream in and as each new element arrives, do one of the following:
- If x < min(S): increment l
- If x > max(S): increment h
- If min(S) < x < max(S) and l < h: add x to S, remove the smallest element from S, and increment l
- If min(S) < x < max(S) and l ≥ h: add x to S, remove the largest element from S, and increment h
At the end of the stream, if 1 ≤ ⌈n/2⌉ − l ≤ s then return the (⌈n/2⌉ − l)-th smallest element in S. Otherwise fail.
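A sketch of the algorithm in code (assuming distinct values and, for the guarantee to hold, a randomly ordered stream):

```python
import random
from bisect import insort

def random_order_median(stream, s):
    """Single pass with a buffer of s elements plus two counters.
    Returns the exact median, or None if the buffer drifted too far.
    Whenever a value is returned, it is guaranteed to be the median."""
    it = iter(stream)
    S = sorted(next(it) for _ in range(s))   # buffer, kept sorted
    l = h = 0       # counts of seen values below min(S) / above max(S)
    n = s
    for x in it:
        n += 1
        if x < S[0]:
            l += 1
        elif x > S[-1]:
            h += 1
        elif l < h:
            insort(S, x); S.pop(0); l += 1   # slide the window up
        else:
            insort(S, x); S.pop(); h += 1    # slide the window down
    rank = (n + 1) // 2 - l                  # median's 1-based rank in S
    return S[rank - 1] if 1 <= rank <= s else None
```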
Why does the algorithm make sense? It's easy to show that at any point, the set of elements that has arrived thus far consists of S, l values less than min(S), and h values greater than max(S). Hence, if at the end of the stream S has a (⌈n/2⌉ − l)-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering l − h and noting that:
- The probability that a non-zero |l − h| is incremented at any step is at most 1/2
- |l − h| is either incremented or decremented at each step
- If |l − h| < s − 1 at the final step, then 1 ≤ ⌈n/2⌉ − l ≤ s as required
Hence, we have a simple "random-walk-like" set-up, and there's an s = O(√(n log n)) that ensures that, with high probability, the algorithm doesn't fail.
Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].
We’re presented with a numerical stream and we want to compute some function where is the frequency of and is an arbitrary function with . One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following basic estimator for this type of problem:
- Pick J uniformly at random from {1, …, n}
- Compute R = |{j ≥ J : a_j = a_J}|, the number of occurrences of a_J at or after position J
- Output X = n(g(R) − g(R − 1))
Note that R can be computed using O(log n + log m) space as the data arrives. It's not hard to show that E[X] = Σ_i g(f_i) because,

E[X] = (1/n) Σ_{j=1}^n n(g(R_j) − g(R_j − 1)) = Σ_i Σ_{r=1}^{f_i} (g(r) − g(r − 1)) = Σ_i g(f_i)

where R_j is the value of R when J = j and the inner sum telescopes.
Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of Σ_i g(f_i). Quite how many basic estimators are required depends on the variance of X and this determines the total space used by the algorithm.
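Here's a sketch of the basic estimator and the averaging step in code (the choice of g, the number of copies, and the seed are illustrative):

```python
import random

def basic_estimator(stream, g, rng):
    # pick J uniformly, count the occurrences of a_J from position J
    # onwards, and output n * (g(R) - g(R-1)); the expectation
    # telescopes to sum_i g(f_i)
    n = len(stream)
    J = rng.randrange(n)
    R = sum(1 for x in stream[J:] if x == stream[J])
    return n * (g(R) - g(R - 1))

def estimate(stream, g, copies=2000, seed=1):
    """Average many independent basic estimators (in the streaming
    setting these run in parallel over a single pass)."""
    rng = random.Random(seed)
    return sum(basic_estimator(stream, g, rng)
               for _ in range(copies)) / copies
```

For example, with g(r) = r^2 this estimates F_2: on the stream [1,1,1,1,2,2,3] the true value is 4^2 + 2^2 + 1^2 = 21 and the average of a couple of thousand basic estimators lands close by.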
Using this approach, it was shown that the k-th frequency moment, F_k = Σ_i f_i^k, can be approximated up to a factor 1 + ε with probability 1 − δ using only O(ε^{−2} k m^{1−1/k} (log m + log n) log δ^{−1}) bits of space. This was subsequently improved to an (essentially) optimal Õ(m^{1−2/k}) result in [Indyk, Woodruff].