You are currently browsing the monthly archive for July 2010.
A rather long time ago, I mentioned that Krzysztof Onak and I were compiling a list of open research problems for data stream algorithms and related topics. We’re starting with some of the problems that were explicitly raised at the IITK Workshop on Algorithms for Processing Massive Data Sets, but we’d also like to add questions from the rest of the community. Please email me (mcgregor at cs.umass.edu) if you have a question that you’d like us to include (see the previous list for some examples). I’ll also be posting some of the problems here while we work on compiling the final document. Here’s one now…
Given a stream , how much space is required to approximate the length of the longest increasing subsequence up to a factor ?
Background. [Gopalan, Jayram, Krauthgamer, Kumar] presented a single-pass algorithm that uses space, and a matching (in terms of ) lower bound for deterministic algorithms was proven by [Gal, Gopalan] and [Ergun, Jowhari]. Is there a randomized algorithm that uses less space, or can the lower bound be extended to randomized algorithms? Very recently, [Chakrabarti] presented an “anti-lowerbound”, i.e., he showed that the randomized communication complexity of the communication problems used to establish the lower bounds is very small. Hence, if it is possible to extend the lower bound to randomized algorithms, this will require new ideas.
Note that solving the problem exactly is known to require space [Gopalan, Jayram, Krauthgamer, Kumar] and [Sun, Woodruff]. The related problem of finding an increasing subsequence of length has been resolved: space is known to be necessary [Guha, McGregor] and sufficient [Liben-Nowell, Vee, Zhu] where is the number of passes permitted. However, there are no results on finding “most” of the elements.
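For reference, here is a minimal sketch of the exact version of the problem using the textbook patience-sorting idea. It reads the stream once, but its `tails` list grows as long as the answer itself, which is the kind of space usage the question above hopes to avoid by settling for an approximation. This is a standard baseline, not the algorithm from any of the papers cited; the function name `lis_length` is my own.

```python
from bisect import bisect_left


def lis_length(stream):
    """Exact length of the longest strictly increasing subsequence.

    Makes a single pass over `stream`, but stores one value per unit of
    LIS length, so the space used is proportional to the answer.
    """
    # tails[i] is the smallest possible last element of an increasing
    # subsequence of length i + 1 seen so far; tails is always sorted.
    tails = []
    for x in stream:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[i] = x      # x is a smaller tail for length i + 1
    return len(tails)
```

For example, `lis_length([3, 1, 4, 1, 5, 9, 2, 6])` returns 4, witnessed by the subsequence 1, 4, 5, 9.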
Here’s another bite-sized stream algorithm for your delectation. This time we want to simulate a random walk from a given node in a graph whose edges arrive as an arbitrarily-ordered stream. I’ll allow you multiple passes and semi-streaming space, i.e., space where is the number of nodes in the graph. You need to return the final vertex of a length random walk.
This is trivial if you take passes: in each pass pick a random neighbor of the node picked in the last pass. Can you do it in fewer passes?
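A minimal sketch of the trivial algorithm, under some assumptions of mine: the graph is undirected with no self-loops, the stream is a Python list of `(a, b)` pairs, and iterating over that list once stands in for one pass. The names `random_neighbor` and `trivial_walk` are hypothetical. Each pass uses reservoir sampling to pick a uniformly random edge incident on the current node while storing only a single candidate.

```python
import random


def random_neighbor(edge_stream, v, rng):
    """One pass: return a uniformly random neighbour of v (None if isolated)."""
    choice, seen = None, 0
    for a, b in edge_stream:
        if a == v or b == v:
            seen += 1
            # Keep the seen-th incident edge with probability 1/seen, so
            # every incident edge ends up chosen with equal probability.
            if rng.randrange(seen) == 0:
                choice = b if a == v else a
    return choice


def trivial_walk(edges, start, t, seed=None):
    """Simulate a t-step random walk from `start`, one pass per step."""
    rng = random.Random(seed)
    v = start
    for _ in range(t):
        nxt = random_neighbor(iter(edges), v, rng)
        if nxt is None:   # no incident edges: the walk cannot move
            break
        v = nxt
    return v
```

On the single-edge graph `[(0, 1)]` the walk just bounces back and forth, so `trivial_walk([(0, 1)], 0, 3)` returns 1.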
Well, here’s an algorithm from [Das Sarma, Gollapudi, Panigrahy] that simulates a random walk of length in space while only taking passes. As in the trivial algorithm, we build up the random walk sequentially. But rather than making a single hop of progress in each pass, we’ll construct the random walk by stitching together shorter random walks.
- We first compute short random walks from each node. Using the trivial algorithm, do a length walk from each node and let be the end point.
- We can’t reuse short random walks (otherwise the steps in the random walk won’t be independent) so let be the set of nodes from which we’ve already taken a short random walk. To start, let and where is the vertex that is reached by the random walk constructed so far and is the length of this random walk.
- While
  - If then set
  - Otherwise, sample edges (with replacement) incident on each node in . Find the maximal path from such that on the -th visit to node , we take the -th edge that was sampled for node . The path terminates either when a node in is visited more than times or we reach a node that isn’t in . Reset to be the final node of this path and increase by the length of the path. (If we complete the length random walk during this process we may terminate at this point and return the current node.)
- Perform the remaining steps of the walk using the trivial algorithm.
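The steps above can be sketched roughly as follows. This is an illustration under assumptions of mine, not the authors' implementation: the graph is undirected with no self-loops, `edges` is a list whose every full iteration counts as one pass, the walk length is called `t`, and both the short-walk length and the number of edges sampled per node are set to `w = isqrt(t)`. All function and variable names are hypothetical, and no effort is made to match the paper's space bound.

```python
import math
import random


def step_all(edges, pos, rng):
    """One pass: move every walker in `pos` (walker -> node) to an
    independently chosen uniform neighbour of its node (reservoir sampling).
    Walkers on nodes with no incident edges stay where they are."""
    at = {}
    for walker, v in pos.items():
        at.setdefault(v, []).append(walker)
    nxt, deg = dict(pos), {v: 0 for v in at}
    for a, b in edges:
        for v, other in ((a, b), (b, a)):
            if v in at:
                deg[v] += 1
                for walker in at[v]:
                    if rng.randrange(deg[v]) == 0:
                        nxt[walker] = other
    return nxt


def stitched_walk(edges, nodes, start, t, seed=None):
    """Return (final node of a t-step walk from `start`, passes used)."""
    rng = random.Random(seed)
    w = max(1, math.isqrt(t))      # assumed short-walk length
    passes = 0
    # Phase 1: a length-w walk from every node, advanced together in w passes.
    T = {v: v for v in nodes}
    for _ in range(w):
        T = step_all(edges, T, rng)
        passes += 1
    used = set()                   # nodes whose short walk is already consumed
    v, length = start, 0
    # Phase 2: stitch short walks; when stuck, sample fresh edges in one pass.
    while t - length >= w:
        if v not in used:
            used.add(v)
            v, length = T[v], length + w
            continue
        # w independent neighbour samples for each used node, in one pass.
        sampled = step_all(edges, {(x, j): x for x in used for j in range(w)}, rng)
        passes += 1
        visits = {x: 0 for x in used}
        # Follow the i-th sample on the i-th visit until we leave `used`,
        # run out of samples at some node, or finish the walk.
        while v in used and visits[v] < w and length < t:
            visits[v] += 1
            v = sampled[(v, visits[v] - 1)]
            length += 1
    # Phase 3: fewer than w steps remain; take them one pass at a time.
    for _ in range(t - length):
        v = step_all(edges, {0: v}, rng)[0]
        passes += 1
    return v, passes
```

On the single-edge graph `[(0, 1)]` with `t = 100`, this sketch finishes in 15 passes rather than the 100 the trivial algorithm would take.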
So why does it work? First note that the maximum size of is because is only incremented when increases by at least and we know that . The total space required to store the vertices is . When we sample edges incident on each node in , this requires space. Hence the total space is . For the number of passes, note that when we need to take a pass to sample edges incident on , we make hops of progress because either we reach a node with an unused short walk or the walk uses sampled edges. Hence, including the passes used at the start and end of the algorithm, the total number of passes is .
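As a rough companion to this accounting, here is the calculation in notation that is my own assumption rather than the post's: $t$ is the walk length, $S$ is the set of nodes whose short walk has been used, and $w$ is both the short-walk length and the number of edges sampled per node. It is a simplified sketch of the argument, not a restatement of the paper's bounds.

```latex
\begin{align*}
|S| &\le t/w
  && \text{(each use of a short walk adds $w$ to the walk length, which is at most $t$)} \\
\text{samples stored per pass} &\le |S| \cdot w \le t \\
\text{passes}(w) &\le \underbrace{w}_{\text{short walks}}
  + \underbrace{t/w + 1}_{\text{sampling passes}}
  + \underbrace{w}_{\text{final steps}}
\end{align*}
```

Under these assumptions, choosing $w$ on the order of $\sqrt{t}$ balances the terms and gives a pass count on the order of $\sqrt{t}$.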
Das Sarma et al. also present a trade-off result that reduces the space to for any at the expense of increasing the number of passes to . They then use this for estimating the PageRank vector of the graph.
Luca has just announced the accepted papers for FOCS. Papers that have direct or indirect connections to streaming and communication complexity include:
- Lower Bounds for Near Neighbor Search via Metric Expansion [Panigrahy, Talwar, Wieder]
- The Limits of Two-Party Differential Privacy [McGregor, Mironov, Pitassi, Reingold, Talwar, Vadhan]
- Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition [Chakrabarti, Cormode, Kondapally, McGregor]
- Bounded Independence Fools Degree-2 Threshold Functions [Diakonikolas, Kane, Nelson]
- Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity [Andoni, Krauthgamer, Onak]
- Sublinear Optimization for Machine Learning [Clarkson, Hazan, Woodruff]
- The Coin Problem, and Pseudorandomness for Branching Programs [Brody, Verbin]
Abstracts can be found here. More pdfs can be found at My Brain is Open.

