I couldn’t find anything that could handle many terabytes of data, though. Most Ruby implementations, like the classifier gem, offer only a simplistic implementation […]. I decided to create a better Naive Bayes implementation (using, for instance, a Laplace smoother) that could also scale to many terabytes of corpus data.
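For context, Laplace (add-one) smoothing just adds one to every word count so an unseen word never zeroes out a class's probability. Below is a minimal, in-memory Ruby sketch of that idea; the class and method names are my own illustration, not ankusa's actual API, and ankusa itself backs these counts with HBase rather than in-process Hashes.

```ruby
class TinyBayes
  def initialize
    @word_counts  = Hash.new { |h, k| h[k] = Hash.new(0) } # class => { word => count }
    @class_totals = Hash.new(0)  # class => total word occurrences
    @doc_counts   = Hash.new(0)  # class => documents seen
    @vocabulary   = {}           # distinct words across all classes
  end

  # Count the words of one training document labelled `klass`.
  def train(klass, words)
    @doc_counts[klass] += 1
    words.each do |w|
      @word_counts[klass][w] += 1
      @class_totals[klass]   += 1
      @vocabulary[w] = true
    end
  end

  # Pick the class with the highest log-probability. Laplace smoothing:
  # P(w|c) = (count(w, c) + 1) / (total(c) + |V|), so unseen words
  # contribute a small penalty instead of a hard zero.
  def classify(words)
    total_docs = @doc_counts.values.sum.to_f
    vocab_size = @vocabulary.size
    @doc_counts.keys.max_by do |klass|
      score = Math.log(@doc_counts[klass] / total_docs) # log prior
      words.each do |w|
        score += Math.log((@word_counts[klass][w] + 1.0) /
                          (@class_totals[klass] + vocab_size))
      end
      score
    end
  end
end

nb = TinyBayes.new
nb.train(:spam, %w[buy cheap pills now])
nb.train(:ham,  %w[meeting notes attached here])
nb.classify(%w[cheap pills]) # => :spam
```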
We already have a Hadoop cluster running HBase, and HBase is perfect for storing data like word counts.
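Part of why it's such a good fit: HBase exposes atomic counters, which map naturally onto per-class word counts. A hypothetical session in the HBase shell (the table, row, and column names here are made up for illustration, not ankusa's actual schema):

```
hbase> incr 'word_counts', 'spam', 'counts:cheap', 1
hbase> get_counter 'word_counts', 'spam', 'counts:cheap'
```

Because increments are applied atomically on the server side, many concurrent training processes can bump the same counters without read-modify-write races.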
I guess you can imagine what happened next. GitHub project name ☞ ankusa.