A rather long time ago, I mentioned that Krzysztof Onak and I were compiling a list of open research problems for data stream algorithms and related topics. We’re starting with some of the problems that were explicitly raised at the IITK Workshop on Algorithms for Processing Massive Data Sets but we’d also like to add additional questions from the rest of the community. Please email (mcgregor at cs.umass.edu) if you have a question that you’d like us to include (see the previous list for some examples). I’ll also be posting some of the problems here while we work on compiling the final document. Here’s one now…

Given a stream $\langle a_1, \ldots, a_m \rangle$, how much space is required to approximate the length of the longest increasing subsequence up to a factor $1+\epsilon$?

Background. [Gopalan, Jayram, Krauthgamer, Kumar] presented a single-pass algorithm that uses $\tilde{O}((m/\epsilon)^{0.5})$ space and a matching (in terms of $m$) lower bound for deterministic algorithms was proven by [Gal, Gopalan] and [Ergun, Jowhari]. Is there a randomized algorithm that uses less space or can the lower bound be extended to randomized algorithms? Very recently, [Chakrabarti] presented an “anti-lowerbound”, i.e., he showed that the randomized communication complexity of the communication problems used to establish the lower bounds is very small. Hence, if it is possible to extend the lower bound to randomized algorithms, this will require new ideas.

Note that solving the problem exactly is known to require $\Omega(m)$ space [Gopalan, Jayram, Krauthgamer, Kumar] and [Sun,Woodruff]. The related problem of finding an increasing subsequence of length $k$ has been resolved: $\tilde{\Theta}(k^{1+1/(2^p-1)})$ space is known to be necessary [Guha, McGregor] and sufficient [Liben-Nowell, Vee, Zhu] where $p$ is the number of passes permitted. However, there are no results on finding “most” of the elements.