You are currently browsing the category archive for the ‘educational’ category.
Next up in the mini-course on data streams (first two lectures here and here) were lower bounds and communication complexity. The slides are here:
The outline was:
- Basic Framework: If you have a small-space algorithm for stream problem
, you can turn this into a low-communication protocol for a related communication problem
. Hence, a lower bound on the communication required to solve
implies a lower bound on the space required to solve
. Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation.
- Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.
- Hamming Distance: Lastly, we presented a lower bound on the Hamming approximation problem using the ingenious but simple proof [Jayram et al.]
Tout le monde! Here’s the group photo from this year’s L’Ecole de Printemps d’Informatique Théorique.
Everyone loves the union bound. It is the most dependable of all probability bounds. No matter what mess of dependencies threaten your analysis, the union bound will ride in and save the day. Except when it doesn’t.
I’ve been giving a talk about some recent results (here are some slides) and have received a particular question from a couple of different audiences. The exact details aren’t important but the basic gist was why I didn’t simplify some of the analysis by just applying the union bound. The short answer was it didn’t apply. But how could that be? My longer answer was longer and probably less helpful. So let me distill things down to the following paradox (which isn’t really a paradox of course).
The Geography Test. Suppose you have a geography test tomorrow. In the test you’ll be asked two questions out of five possible questions (and you already know the list of possible questions). If you get both right, you’ll pass the course. However, rather than revise everything you adopt a random strategy that ensures that for any question, you’ll have a correct answer with probability at least 4/5. Hence, by the union bound, you’ll get both questions right with probability at least 3/5.
Right? Sure, it’s a straight-forward application of the union bound: the probability of being wrong on one or both questions is at most the probability of being wrong on the first (at most 1/5) plus the probability of being wrong on the second (at most 1/5). And the union bound never let’s you down.
The Bad News. You’re going to fail both your geography and probability classes. Consider the following scenario. The five possible exam questions are: 1) name a city in the US, 2) name a city in France, 3) name a city in Germany, 4) name a city in the UK, and 5) name a city in Italy. Your cunning strategy is to memorize one of the rows in the following table.
| US | France | Germany | UK | Italy |
|---|---|---|---|---|
| New York | Paris | Berlin | Glasgow | Reykjavik |
| Philadelphia | Paris | Berlin | Reykjavik | Rome |
| Los Angeles | Paris | Reykjavik | Glasgow | Rome |
| Boston | Reykjavik | Berlin | Glasgow | Rome |
| Reykjavik | Paris | Berlin | Glasgow | Rome |
You choose the row to remember uniformly at random thus guaranteeing that you can answer any question correctly with probability 4/5. Of course the professor doesn’t know which row you’ll pick but he does know your strategy. (This is equivalent to the usual assumption we make about adversaries knowing your algorithm but not your random bits.)
Unfortunately, you stay up so late studying your row that you sleep in and miss the exam. You go to the professor’s office to plead for a make-up exam. He is initially reluctant but then a devious smile crosses his face. Sure, he’ll still give you the exam. You’re relieved that you’ll still pass the course with probability at least 3/5. The professor doesn’t look so sure.
The first question he asks is to name a city in the US. “Philadelphia” you answer with confidence. As the professor nods you see that devious smile again. The second question… name a city in the UK.
And then you realize that the probability of success was never as large as 3/5 since the professor could always guarantee that you fail by first asking for a US city and then choosing his second question accordingly.
Morals of the story: Don’t miss exams, study hard, and don’t blindly trust the union bound.
Here’s another bite-sized stream algorithm for your delectation. This time we want to simulate a random walk from a given node in a graph whose edges arrive as an arbitrarily-ordered stream. I’ll allow you multiple passes and semi-streaming space, i.e.,
space where
is the number of nodes in the graph. You need to return the final vertex of a length
random walk.
This is trivial if you take passes: in each pass pick a random neighbor of the node picked in the last pass. Can you do it in fewer passes?
Well, here’s an algorithm from [Das Sarma, Gollapudi, Panigrahy] that simulates a random walk of length in
space while only taking
passes. As in the trivial algorithm, we build up the random walk sequentially. But rather than making a single hop of progress in each pass, we’ll construct the random walk by stitching together shorter random walks.
- We first compute short random walks from each node. Using the trivial algorithm, do a length
walk from each node
and let
be the end point.
- We can’t reuse short random walks (otherwise the steps in the random walk won’t be independent) so let
be the set of nodes from which we’re already taken a short random walk. To start, let
and
where
is the vertex that is reached by the random walk constructed so far and
is the length of this random walk.
- While
- If
then set
- Otherwise, sample
edges (with replacement) incident on each node in
. Find the maximal path from
such that on the
-th visit to node
, we take the
-th edge that was sampled for node
. The path terminates either when a node in
is visited more than
times or we reach a node that isn’t in
. Reset
to be the final node of this path and increase
by the length of the path. (If we complete the length
random walk during this process we may terminate at this point and return the current node.)
- Perform the remaining
steps of the walk using the trivial algorithm.
So why does it work? First note that the maximum size of is
because
is only incremented when
increases by at least
and we know that
. The total space required to store the vertices
is
. When we sample
edges incident on each node in
, this requires
space. Hence the total space is
. For the number of passes, note that when we need to take a pass to sample edges incident on
, we make
hops of progress because either we reach a node with an unused short walk or the walk uses
samples edges. Hence, including the
passes used at the start and end of the algorithm, the total number of passes is
.
Das Sarma et al. also present a trade-off result that reduces the space to for any
at the expense of increasing the number of passes to
. They then use this for estimating the PageRank vector of the graph.
Following on from the last post on maximum weighted matchings in the stream model, I wanted to briefly mention a result that appeared in STACS last week. The set-up is the same as before: there’s a stream of undirected weighted edges where
is the weight of edge
and we want to find an approximate maximum weighted matching. Last time, we showed a simple algorithm that gave a 5.828 approximation that was “preemptive online”, i.e., at any point in time the algorithm remembered no information other than the feasible matching found so far. The follow-up work achieved a 5.585 approximation but didn’t have this property. This begs numerous questions including if 5-ish is the “right answer” for this problem and whether there is a separation between preemptive-online algorithms and stream algorithms that use roughly the same amount of memory.
Which brings us to [Epstein, Levin, Mestre, Segev]. They showed a lower bound of 4.967 for any deterministic preemptive online algorithm. Furthermore, they show that there exists a deterministic stream algorithm that breaks through this bound and achieves a 4.911 approximation. As a commenter pointed out, the algorithm is nice and simple. Furthermore, it makes a good teaching example of how algorithms don’t necessary appear from thin air, but rather can be developed in a series of less mysterious steps. Here’s what they do in the paper:
- Design a simple deterministic 8-approximation: Geometrically group edge weights by rounding edge weight to the nearest power of 2 and ignore the “very light” edges. Greedily find a maximal matching among each group of edges. At the end of the stream, find a maximum matching among the stored edges.
- Randomize the algorithm to get a 4.911-approximation: Instead of grouping to the nearest power of 2, do randomized geometric grouping with the appropriate parameters.
- Derandomize!

Thought… It might be interesting to investigate which other problems exhibit a separation between the preemptive online model and the stream model. You may need to be flexible with your definition of “preemptive online.”
Readers with a particularly long memory may recall that we’re in the midst of a series of “bite-sized results” in data stream algorithms. Back in October, we discussed graph spanners and now it’s time for some graph matchings.
So consider a stream of undirected weighted edges where
is the weight of edge
. Can we find a matching, i.e., a subset of the edges that don’t share any endpoints, such that the sum of the associated weights is maximized?
Here’s a really simple algorithm that maintains a matching as the edges are processed:
- The initial matching is the empty matching
- When processing edge
, the weight of
is compared with the total weight
of the conflicting edges, i.e., the edges in the currently stored matching that share an endpoint with
. If
then
is added to the matching and the conflicting edges are removed.
It makes a nice exercise to prove that the above algorithm achieves a approximation ratio (see pages 149 and 150 of this thing if you get stuck) and choosing the optimal
gives you a 5.828 approximation. Incidentally, this algorithm later found use in a paper on auctioning ad slots [Constantin, Feldman, Muthukrishnan, Pal].
Subsequently, a 5.585 approximation algorithm was presented in [Zelke]. The main idea was to remember some of the edges that were ejected from the matching in the second step of the above algorithm. These edges are potentially added back into the matching at some later point. While a natural idea, it requires substantially more analysis to show the improved approximation ratio.
The last four BSS posts considered streams of numerical values. While there are still many more bite-sized results to cover for such streams, I thought we’d mix things up a bit and switch to graph streams for the next four posts. We’ll consider a stream of unweighted, undirected edges and try to estimate properties of the graph
defined by this stream subject to the usual constraints of stream computation (sequential access to the stream and limited memory). For this post, we focus on estimating distances.
The results we’ll discuss revolve around constructing a -spanner, a sub-graph
of
such that for any
where is the length of the shortest path between
in graph
. Obviously, given such a graph we can approximate any distance up to a factor of
So how can we construct a -spanner in the data-stream model? Here’s a very simple algorithm: Intialize
to
and on receiving edge
, add the edge to
if and only if
.
It’s easy to show that is indeed a
-spanner because, for each edge
on a shortest path in
,
either includes
or a path between
and
of length at most
. The space taken by the algorithm is just the space required to store
. How large can this be? Well, note that
has girth at least
, i.e., it contains no cycles of length
or shorter. It can be shown that such a graph has at most
edges if
is odd (see e.g., [Bollobás] or [Indyk's lecture slides]) and hence we have a stream algorithm using
bits of memory that finds a
approximation for any shortest path distance.
Unfortunately, the time to process each edge in the above algorithm is rather large and this spurred some nice follow-up work including [Feigenbaum et al.], [Baswana], and [Elkin]. It’s now known how to construct a -spanner in
space and constant worst-case time per edge.
It’s not hard to extend these algorithms to weighted graphs (consider geometrically grouping the edge weights and constructing a spanner for each distinct edge weight) but there’s bad news for directed graphs: even determining if there exists a directed path between two nodes requires bits of memory.
BONUS! Some British English… Back in the UK, we call this

a spanner. Apparently that’s not the usual term in US English and I found this out after giving a talk at SODA back in 2005. The talk went reasonably well but, during the coffee break, numerous people asked why I’d kept on displaying pictures of wrenches throughout my talk. Whoops.
In technical writing, I normally endeavor [sic] to use US English (although I did once engage in some “colouring of paths” in a paper with a Canadian co-author). But I still think “math and stat” sounds funny :)
My Biased Coin recently discussed a new paper extending some work I’d done a few years back. I’ll briefly mention this work at the end of the post but first here’s another bite-sized result from [Alon, Matias, Szegedy] that is closely related.
Consider a numerical stream that defines a frequency vector where
is the frequency of
in the stream. Here’s a simple sketch algorithm that allows you to approximate
Let
be a random vector where the
are 4-wise independent and unbiased. Consider
and note that
can be computed incrementally as the stream arrives (given that
is implicitly stored by the algorithm.)
By the weaker assumption of 2-wise independence, we observe that if
and so:
By the assumption of 4-wise independence, we also observe that unless
or
or
and so:
Hence, if we repeat the process with independent copies of
, it’s possible to show (via Chebyshev and Chernoff bounds) that by appropriately averaging the results, we get a value that is within a factor
of
with probability at least
. Note that it was lucky that we didn’t need full coordinate independence because that would have required
bits just to remember
. It can be shown that remembering
bits is sufficient if we only need 4-wise independence.
BONUS! The recent work… Having at least 4-wise independence seemed pretty important to getting a good bound on the variance of . However, it was observed in [Indyk, McGregor] that the following also worked. First pick
where
are independent and the coordinates of each are 4-wise independent. Then let the coordinate values of
be
. It’s no longer the case that the coordinates are 4-wise independent but it’s still possible to show that
and this is good enough for our purposes.
In follow up work by [Braverman, Ostrovsky] and [Chung, Liu, Mitzenmacher], it was shown that you can push this idea further and define based on
random vectors of length
. The culmination of this work shows that the variance increases to at most
and the resultant algorithm uses
space.
At this point, you’d be excused for asking why we all cared about such a construction. The reason is that it’s an important technical step in solving the following problem: given a stream of tuples from , can we determine if the
coordinates are independent? In particular, how “far” is the joint distribution from the product distribution defined by considering the frequency of each value in each coordinate separately. When “far” is measured in terms of the Euclidean distance, a neat solution is based on the above analysis. Check out the papers for the details.
If you want to measure independence in terms of the variational distance, check out [Braverman, Ostrovsky]. In the case , measuring independence in terms of the KL-divergence gives the mutual information. For this, see [Indyk, McGregor] again.
We’re presented with a numerical stream and we want to compute some small-space approximation of
where
is the frequency of
. In particular, we want to be able to return estimates
that satisfy
where is not known ahead of time. CR-Precis [Ganguly, Majumder] is a neat little deterministic stream algorithm that achieves this in
space. (Also check out [pg. 31, Gasieniec, Muthukrisnan] for a special case.)
Let be the first
prime numbers and define “hash” functions
. Based on these functions, the CR-Precis algorithm maintains
counters
which are initially zero. When we observe
in the stream we increment each of
for
.
Say we’re interested in the frequency of . At the end of the day, we end up with
overestimates for
, i.e.,
. Hence, it makes sense to return
. How much of an overestimate can this be?
Note that for any , there are at most
functions
under which
collide, i.e.,
. This follows from the Chinese Remainder Theorem. Hence, we deduce that,
and therefore . Setting
gives the desired result.
But wait, there’s more! There are actually other deterministic algorithms for this problem that use only space: e.g. “Misra-Gries”, “Space-Saving”, and “Lossy-Counting” algorithms (see [Cormode, Hadjieleftheriou] for a survey and experimental comparison). What makes CR-Precis more useful is that it’s a sketch algorithm, i.e., it’s possible to transform the hash functions into a
matrix
in a natural way such that the algorithm described above is simply computing
. This confers numerous advantages to the algorithm including the ability to handle “deletes”, i.e., we consider a stream
and define
.
Assuming that all , the above algorithm is easily adapted to this new setting by decrementing the appropriate counters when processing
and incrementing when processing
.
Gödel’s Lost Letter recently mentioned one of my favourite papers from the prehistory of data streams: Selection and Sorting with Limited Storage [Munro, Paterson]. It’s a great paper that prefigured many interesting research directions including multi-pass algorithms and processing streams in random order. Here’s one of the nice “bite-sized” results.
We consider a stream formed by arranging a set of distinct values
in random order, i.e.,
for a permutation
of
chosen uniformly at random. The goal is to process the stream in small-ish space and yet find the exact median with high probability (over the random ordering).
Here’s a simple algorithm that maintains a set of
elements and two counters
and
which are initialized to zero. Store the first
values from the stream in
and as each new element
arrives, do one of the following:
- If
Increment
- If
Increment
- If
and
: Add
to
, remove smallest element from
, and increment
- If
and
: Add
to
, remove largest element from
, and increment
At the end of the stream, if then return the
-th smallest element in
. Otherwise fail.
Why does the algorithm make sense? It’s easy to show that at any point, the set of elements that has arrived thus far consists of and
values less than
and
values greater that
. Hence, if at the end of the stream,
has a
-th smallest element, this element is the median. All that remains is to show that the algorithm fails with small probability. This follows (after a bit of analysis) by considering
and noting that, for some constant
,
- Probability a non-zero
is incremented at any step is at most 1/2 if
is either incremented or decremented at each step
- If
at the final step, then
as required
Hence, we have a simple “random-walk-like” set-up and there’s a that ensures that, with high probability, the algorithm doesn’t fail.
Of course you could ask about finding approximate medians of randomly ordered streams in one or more passes. If you do, check out [Guha, McGregor], [Chakrabarti, Jayram, Patrascu], and [Chakrabarti, Cormode, McGregor].
We’re presented with a numerical stream and we want to compute some function
where
is the frequency of
and
is an arbitrary function with
. One of the many neat ideas in [Alon, Matias, Szegedy] is to use the following basic estimator for this type of problem:
- Pick
uniformly at random from
- Compute
- Let
Note that can be computed using
space as the data arrives. It’s not hard to show that
because,
and
Hence, computing multiple basic estimators in parallel and averaging appropriately will give a good estimate of . Quite how many basic estimators are required depends on the variance of
and this determines the total space used by the algorithm.
Using this approach, it was shown that the -th order frequency moment,
, can be approximated up to a factor
with probability
using only
bits of space. This was subsequently improved to an (essentially) optimal result in [Indyk, Woodruff].



Recent Comments