Tuesday, 6 March 2012
Scaling Solr Indexing With SolrCloud, Hadoop and Behemoth
Grant Ingersoll:
Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4 (i.e. SolrCloud whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren) which remove the need to figure out where to send documents when indexing, along with a convenient Hadoop-based document processing toolkit, created by Julien Nioche, called Behemoth that takes care of the need to write any Map/Reduce code and also handles things like extracting content from PDFs and Word files in a Hadoop friendly manner (think Apache Tika run in Map/Reduce) while also allowing you to output the results to things like Solr or Mahout, GATE and others as well as to annotate the intermediary results.
I have to agree with Karussell:
Scaling Solr means using Solr AND X AND Y AND… Scaling ElasticSearch means using ElasticSearch.