A problem everyone using a NoSQL database faces (nb: actually I think this applies to most storage engines that don’t support full text indexing):
The problem now is: what to use? Currently I’m toying with 3 options:
- Use Sphinx Search; it’s pretty powerful and pretty damn fast, but it requires me to feed it data through XML, and only when the indexer runs. Basically it’s quite hard to get real-time indexes going, and the delta updates are something I’d rather not mess with.
- Use Solr; I’d go for this if it wasn’t for the fact it’s Java and requires Tomcat to work. Our entire application infrastructure is basically MongoDB and Perl, and I don’t want to go and set up a Tomcat instance just for Solr; on top of which I have a pathologically deep hatred for Java, but that aside…
- Roll my own. Full text search the way we need it doesn’t actually require things like stemming or fancy analysis. What it does need is the ability to search a schema-less database… Solr and Sphinx both suffer from the fact you need to tell them what to index, and even then you run into the fact that it’ll need a double pass. The first pass gets the search results, and the second pass checks whether the user doing the search can actually see each document.
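To make the "roll my own" option a bit more concrete, here is a minimal sketch in Python of what it describes: an in-memory reverse index that walks every string in a schema-less document, plus the double pass at query time. The names `TinyIndex`, `iter_strings`, `tokenize`, and the `readers` set are made up for this illustration; they are not from the quoted post or any library, and a real setup would persist the index rather than keep it in memory.

```python
import re
from collections import defaultdict


def iter_strings(value):
    """Yield every string found anywhere inside a nested dict/list document."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_strings(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from iter_strings(v)


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory reverse index with a per-document reader list."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.readers = {}                 # doc id -> users allowed to see the doc

    def add(self, doc_id, doc, readers):
        # No schema needed: every string field, however deeply nested, is indexed.
        self.readers[doc_id] = set(readers)
        for text in iter_strings(doc):
            for token in tokenize(text):
                self.postings[token].add(doc_id)

    def search(self, query, user):
        tokens = tokenize(query)
        if not tokens:
            return []
        # First pass: documents that contain every query token.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        # Second pass: drop documents this user is not allowed to see.
        return sorted(d for d in hits if user in self.readers[d])


index = TinyIndex()
index.add(1, {"title": "Mongo notes", "tags": ["search", "perl"]}, readers={"alice"})
index.add(2, {"body": {"text": "Perl search tips"}}, readers={"alice", "bob"})

print(index.search("perl search", "alice"))  # [1, 2]
print(index.search("perl search", "bob"))    # [2]
```

Both documents match the query, but the second pass hides document 1 from `bob`. Storing the reader list next to the index is just one way to do the visibility check; looking it up in the database per hit would work too, at the cost of extra round trips.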
Couple of thoughts:
- there are a couple of solutions out there, both relational and NoSQL databases, that support different degrees of full text indexing (e.g. Riak Search, MarkLogic)
- even if your database supports some form of full text search, the implementation might not be complete/optimal.
- initially it may sound like building a reverse index is the best solution, but Twitter’s story of migrating from their own reverse indexes in MySQL to a Lucene-based solution should change your mind.
- some NoSQL databases provide good mechanisms for enabling full text indexing. Riak has post-commit hooks, CouchDB has a
Original title and link: Full Text Search: What to Use? (NoSQL database©myNoSQL)