I couldn’t find anything that could handle many terabytes of data, though. Most Ruby implementations, like the classifier gem, offer only a simplistic implementation […]. I decided to create a better naive Bayes implementation (using Laplace smoothing, for instance) that could also handle many terabytes of corpus data.
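The point of Laplace (add-one) smoothing is that a word never seen in a class still gets a small nonzero likelihood, so one unseen word doesn’t zero out the whole class score. A minimal sketch in Ruby, assuming per-class word counts are already available (e.g. fetched from a store like HBase); all names here are illustrative, not part of the ankusa API:

```ruby
# Laplace (add-one) smoothed likelihood of a word given a class.
# count:       times the word appeared in documents of this class
# class_total: total word occurrences in this class
# vocab_size:  number of distinct words across the corpus
def laplace_likelihood(count, class_total, vocab_size)
  (count + 1.0) / (class_total + vocab_size)
end

# Log-space score of a document for one class, so many small
# probabilities multiply without underflowing.
def log_score(words, class_counts, class_total, vocab_size)
  words.sum do |w|
    Math.log(laplace_likelihood(class_counts.fetch(w, 0),
                                class_total, vocab_size))
  end
end

# An unseen word still contributes a small nonzero probability:
laplace_likelihood(0, 1000, 5000)  # => 1.0 / 6000
```

Classification then amounts to computing `log_score` per class (plus the log class prior) and picking the highest.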
We already have a Hadoop cluster with HBase running, and HBase is perfect for storing data like word counts.
I guess you can imagine what happened next. GitHub project name ☞ ankusa.
Original title and link: Naive Bayes Classification in Ruby using Hadoop and HBase (NoSQL databases © myNoSQL)