You are currently browsing the category archive for the ‘open problems’ category.

Piotr Indyk, Ilan Newman, Krzysztof Onak, and I have just finished up a new open problems list. Here it is! It mainly covers topics in data streams and property testing but there’s also some other cool stuff. The list is compiled from the recent Bertinoro workshop and the slightly less recent Kanpur workshop.

(Cartoon from XKCD obviously.)

Today, we consider estimating the median of a set of n values given a single pass over a stream formed by randomly permuting the set. Seems like a pretty simple problem, right?

It’s known that we can find an O(\sqrt{n})-approximate median, i.e., an element with rank n/2 \pm O(\sqrt{n}), in polylog space, with probability 9/10 where the probability is taken over both the coin flips of the algorithm and the random permutation. However, the best known lower bound states that finding a n^\gamma-approximate median requires \Omega(n^{1/2-3\gamma/2}) space [Guha, McGregor]. If this bound is optimal, it would be possible to find a n^{1/3}-approximate median in polylogarithmic space. Does there exist such an algorithm or can the lower bound be improved?

Background. The exact median can be found in O(\sqrt{n}) space [Munro, Paterson]. It seems reasonable to conjecture that \Theta(n^{1/2-\gamma}) space is necessary and sufficient to find a O(n^\gamma)-approximate median.

(Disclaimer: I’ve suppressed some poly-log and poly-constant terms above. And, actually, you can shave-off one of the O(\log n) terms I didn’t mention in the lower bound if you consider some public random bits in the argument.)

Already solved the previous open problem? Well here’s another question that was raised at the IITK workshop.

In an earlier post, I sketched a result by [Das Sarma, Gollapudi, Panigrahy] that showed that it was possible to simulate a random walk of length t with \tilde{O}(n) space and O(\sqrt{t}) passes over the input stream where n is the number of nodes. Is it possible to simulate such a random walk using \tilde O(n) space and a much smaller number of passes, say polylogarithmic in n and t? The goal is either to show an algorithm or prove a lower bound.

Das Sarma et al. simulate random walks to approximate the probability distribution on the vertices of the graph after a random walk of length t. What is the streaming complexity of approximating this distribution? What is the streaming complexity of finding the k (approximately) most likely vertices after a walk of length k?

Finally, is it possible to sparsify the input graph so that performing random walks in the sparsified graph gives similar results to performing random walks in the original graph?

BONUS! Moving pictures… Words bore you? Well, you could check out the video of the open problems session for a richer multimedia experience. Apologies in advance that it’s not in 3D.

A rather long time ago, I mentioned that Krzysztof Onak and I were compiling a list of open research problems for data stream algorithms and related topics. We’re starting with some of the problems that were explicitly raised at the IITK Workshop on Algorithms for Processing Massive Data Sets but we’d also like to add additional questions from the rest of the community. Please email (mcgregor at if you have a question that you’d like us to include (see the previous list for some examples). I’ll also be posting some of the problems here while we work on compiling the final document. Here’s one now…

Given a stream \langle a_1, \ldots, a_m \rangle, how much space is required to approximate the length of the longest increasing subsequence up to a factor 1+\epsilon?

Background. [Gopalan, Jayram, Krauthgamer, Kumar] presented a single-pass algorithm that uses \tilde{O}((m/\epsilon)^{0.5}) space and a matching (in terms of m) lower bound for deterministic algorithms was proven by [Gal, Gopalan] and [Ergun, Jowhari]. Is there a randomized algorithm that uses less space or can the lower bound be extended to randomized algorithms? Very recently, [Chakrabarti] presented an “anti-lowerbound”, i.e., he showed that the randomized communication complexity of the communication problems used to establish the lower bounds is very small. Hence, if it is possible to extend the lower bound to randomized algorithms, this will require new ideas.

Note that solving the problem exactly is known to require \Omega(m) space [Gopalan, Jayram, Krauthgamer, Kumar] and [Sun,Woodruff]. The related problem of finding an increasing subsequence of length k has been resolved: \tilde{\Theta}(k^{1+1/(2^p-1)}) space is known to be necessary [Guha, McGregor] and sufficient [Liben-Nowell, Vee, Zhu] where p is the number of passes permitted. However, there are no results on finding “most” of the elements.


A research blog about data streams and related topics.

Recently Tweeted