Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:
This blog is called myNoSQL and it is written by me, Alex Popescu, a software architect with a passion for open source and communities.
It records my readings, learnings, and opinions on NoSQL databases, polyglot persistence, and distributed systems -- subjects that I'm passionate about.
The opinions expressed here are my own, and no other party necessarily agrees with them.
If you feel I'm biased, I probably am.