My Biased Coin recently discussed a new paper extending some work I’d done a few years back. I’ll briefly mention this work at the end of the post but first here’s another bite-sized result from [Alon, Matias, Szegedy] that is closely related.
Consider a numerical stream that defines a frequency vector $f = (f_1, \ldots, f_n)$ where $f_i$ is the frequency of $i$ in the stream. Here’s a simple sketch algorithm that allows you to approximate $F_2 = \sum_i f_i^2$. Let $x = (x_1, \ldots, x_n) \in \{-1,1\}^n$ be a random vector where the $x_i$ are 4-wise independent and unbiased. Consider $r = x \cdot f = \sum_i x_i f_i$ and note that $r$ can be computed incrementally as the stream arrives (given that $x$ is implicitly stored by the algorithm).
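By the weaker assumption of 2-wise independence, we observe that $E[x_i x_j] = 0$ if $i \neq j$ (while $x_i^2 = 1$ always) and so:

$$E[r^2] = \sum_{i,j} f_i f_j \, E[x_i x_j] = \sum_i f_i^2 = F_2 .$$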
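By the assumption of 4-wise independence, we also observe that $E[x_i x_j x_k x_l] = 0$ unless $i=j, k=l$ or $i=k, j=l$ or $i=l, j=k$ and so:

$$\mathrm{Var}[r^2] = \sum_{i,j,k,l} f_i f_j f_k f_l \, E[x_i x_j x_k x_l] - F_2^2 = 4 \sum_{i<j} f_i^2 f_j^2 \le 2 F_2^2 .$$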
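As a sanity check on the two moment calculations, here is a small brute-force sketch. It assumes the standard AMS setup (signs in $\{-1,1\}$, $r = x \cdot f$) and averages over every sign vector, which is a fully independent choice and therefore in particular 4-wise independent. The frequency vector `f` and the helper name `exact_moments` are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def exact_moments(f):
    """Return (E[r^2], Var[r^2]) for r = x . f, averaged over all x in {-1, 1}^n."""
    total2 = 0
    total4 = 0
    for x in product((-1, 1), repeat=len(f)):
        r = sum(xi * fi for xi, fi in zip(x, f))
        total2 += r ** 2
        total4 += r ** 4
    count = 2 ** len(f)
    mean = Fraction(total2, count)
    var = Fraction(total4, count) - mean ** 2
    return mean, var


f = [3, 1, 4, 1, 5]                      # hypothetical frequency vector
F2 = sum(v * v for v in f)               # second frequency moment
mean, var = exact_moments(f)

# 4 * sum_{i<j} f_i^2 f_j^2, the closed form for the variance
cross = 4 * sum(f[i] ** 2 * f[j] ** 2
                for i in range(len(f)) for j in range(i + 1, len(f)))

print(mean == F2, var == cross, var <= 2 * F2 ** 2)
```

Enumeration is only feasible for tiny $n$, but it confirms that the estimator is unbiased and that the variance matches the closed form term by term.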
Hence, if we repeat the process with $O(\epsilon^{-2} \log \delta^{-1})$ independent copies of $x$, it’s possible to show (via Chebyshev and Chernoff bounds) that by appropriately averaging the results, we get a value that is within a factor $(1+\epsilon)$ of $F_2$ with probability at least $1-\delta$. Note that it was lucky that we didn’t need full coordinate independence because that would have required $n$ bits just to remember $x$. It can be shown that remembering $O(\log n)$ bits is sufficient if we only need 4-wise independence.
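To make the repetition step concrete, here is a minimal sketch of one way to implement it. Several choices are assumptions on my part rather than anything fixed by the analysis: the signs come from random cubic polynomials over a prime field (a standard 4-wise independent family, with the low bit used as the sign, so the signs are unbiased only up to an error of about $1/P$), "appropriately averaging" is taken to mean a median of group means, and the class name `AMSSketch` and its parameters `groups` and `per_group` are hypothetical.

```python
import random
from statistics import mean, median

P = (1 << 61) - 1  # a Mersenne prime; assumes item ids are integers in [0, P)


class AMSSketch:
    """Median-of-means estimate of F_2 from counters of the form r = x . f."""

    def __init__(self, groups, per_group, seed=None):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        # One random cubic polynomial over GF(P) per counter: four
        # coefficients stand in for the whole sign vector x.
        self.coeffs = [
            tuple(rng.randrange(P) for _ in range(4))
            for _ in range(groups * per_group)
        ]
        self.counters = [0] * (groups * per_group)

    @staticmethod
    def _sign(coeff, i):
        a, b, c, d = coeff
        h = (((a * i + b) * i + c) * i + d) % P  # Horner evaluation mod P
        return 1 if h & 1 else -1                # low bit gives the sign

    def update(self, i, count=1):
        """Process `count` occurrences of item i (r is updated incrementally)."""
        for t, coeff in enumerate(self.coeffs):
            self.counters[t] += count * self._sign(coeff, i)

    def estimate(self):
        squares = [r * r for r in self.counters]
        k = self.per_group
        group_means = [mean(squares[g * k:(g + 1) * k])
                       for g in range(self.groups)]
        return median(group_means)


rng = random.Random(0)
stream = [rng.randrange(20) for _ in range(300)]  # toy stream over 20 items
sketch = AMSSketch(groups=5, per_group=40, seed=1)
for item in stream:
    sketch.update(item)
exact = sum(stream.count(v) ** 2 for v in set(stream))
print("estimate:", sketch.estimate(), "exact F_2:", exact)
```

Averaging within a group drives the variance down (the Chebyshev step), and taking the median across groups boosts the success probability (the Chernoff step). Each counter stores only four field elements instead of $n$ sign bits.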
BONUS! The recent work… Having at least 4-wise independence seemed pretty important to getting a good bound on the variance of $r^2$. However, it was observed in [Indyk, McGregor] that the following also worked. First pick $y, z \in \{-1,1\}^{\sqrt{n}}$ where $y, z$ are independent and the coordinates of each are 4-wise independent. Then let the coordinate values of $x$ be $(y_i z_j)_{i,j \in [\sqrt{n}]}$. It’s no longer the case that the coordinates are 4-wise independent but it’s still possible to show that $\mathrm{Var}[r^2] = O(F_2^2)$ and this is good enough for our purposes.
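A tiny brute-force illustration of the product construction may help. It is only a sketch under assumptions: the frequency vector is laid out as a hypothetical $2 \times 2$ grid `f`, $y$ and $z$ are drawn uniformly from $\{-1,1\}^2$ (so each is trivially 4-wise independent), and the helper name `product_sketch_moments` is mine.

```python
from fractions import Fraction
from itertools import product


def product_sketch_moments(f):
    """Exact (E[r^2], Var[r^2]) when x_(i,j) = y_i * z_j and f is an m x m grid."""
    m = len(f)
    total2 = 0
    total4 = 0
    for y in product((-1, 1), repeat=m):
        for z in product((-1, 1), repeat=m):
            r = sum(y[i] * z[j] * f[i][j] for i in range(m) for j in range(m))
            total2 += r ** 2
            total4 += r ** 4
    count = 4 ** m
    mean = Fraction(total2, count)
    return mean, Fraction(total4, count) - mean ** 2


# The four coordinates x_(0,0), x_(0,1), x_(1,0), x_(1,1) always multiply to 1,
# so they cannot be 4-wise independent.
always_one = all(
    (y[0] * z[0]) * (y[0] * z[1]) * (y[1] * z[0]) * (y[1] * z[1]) == 1
    for y in product((-1, 1), repeat=2)
    for z in product((-1, 1), repeat=2)
)

f = [[1, 1], [1, 1]]                       # hypothetical all-ones grid
F2 = sum(v * v for row in f for v in row)
mean, var = product_sketch_moments(f)
print(always_one, mean == F2, var, 2 * F2 ** 2)
```

For this grid the estimator is still unbiased, but the variance is $48 = 3F_2^2$, compared with $24$ for genuinely 4-wise independent signs on the same four coordinates. The variance is larger by a constant factor yet still $O(F_2^2)$.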
In follow-up work by [Braverman, Ostrovsky] and [Chung, Liu, Mitzenmacher], it was shown that you can push this idea further and define $x$ based on $k$ random vectors of length $n^{1/k}$. The culmination of this work shows that the variance increases to at most $3^k F_2^2$ and the resultant algorithm uses space proportional to $3^k$ (and logarithmic in $n$).
At this point, you’d be excused for asking why we all cared about such a construction. The reason is that it’s an important technical step in solving the following problem: given a stream of tuples from $[n]^k$, can we determine if the $k$ coordinates are independent? In particular, how “far” is the joint distribution from the product distribution defined by considering the frequency of each value in each coordinate separately? When “far” is measured in terms of the Euclidean distance, a neat solution is based on the above analysis. Check out the papers for the details.
If you want to measure independence in terms of the variational distance, check out [Braverman, Ostrovsky]. In the case $k=2$, measuring independence in terms of the KL-divergence gives the mutual information. For this, see [Indyk, McGregor] again.