Lucene: All content tagged as Lucene in NoSQL databases and polyglot persistence
I’m not sure how I’ve managed to be the last one to arrive at the Riak 0.13 party :(. And I can tell you it is a big party.
Riak 0.13, ☞ announced a couple of days ago, brings quite a few new exciting features:
- Riak search
- MapReduce improvements
- Bitcask storage backend improvements
- improvements to the riak_core and riak_kv modules — the building blocks of Dynamo-like distributed systems — and better code organization allowing easier use of these modules
While everything in this release sounds like an important step forward for Riak, what sets it apart is Riak Search, a feature that is currently unique in the NoSQL databases space.
Riak Search uses Lucene and builds a Solr-like API on top of it (nb: I think that reusing known interfaces and protocols is, most of the time, the right approach).
At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.
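To make the query side concrete, here is a minimal sketch of talking to the Solr-like HTTP interface that Riak Search exposes at `/solr/<index>/select`. The index name, host, and port below are illustrative assumptions (the port is Riak's usual HTTP default), not something prescribed by the release notes:

```python
from urllib.parse import urlencode

def riak_search_url(index, query, host="localhost", port=8098, rows=10):
    """Build a query URL for Riak Search's Solr-like select handler.

    The /solr/<index>/select path mirrors Solr's query interface;
    host and port are assumed defaults — adjust for your cluster.
    """
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return "http://%s:%d/solr/%s/select?%s" % (host, port, index, params)

# Hypothetical example: search a "books" index for titles mentioning riak
url = riak_search_url("books", "title:riak")
```

The URL could then be fetched with any HTTP client, and the returned bucket/key pairs fed into a Riak MapReduce job as described above.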
Riak Search shows a lot of great decisions made by the Basho team, as it avoids reinventing the wheel or creating some new protocols/interfaces. I’ve stressed these aspects a couple of times already, when writing that NoSQL databases should follow the Unix Philosophy and also when writing about how important NoSQL protocols are. Mathias Meyer has a ☞ post detailing why these are important.
Last, but not least, the Ruby Riak library, Ripple, ☞ got updated too, but I’m not sure it supports all the new features in Riak 0.13.
Here is Rusty Klophaus (Basho) talking about Riak Search at the Berlin Buzzwords NoSQL event:
- My first post about Riak Search, the Notes on scaling out with Riak and Riak Search podcast, dates back to December 14th, 2009, just a couple of days after setting up myNoSQL. (↩)
While waiting for the release of Riak Search, I think that you can already start doing full text indexing using one of the existing indexing solutions (Lucene, Solr, ElasticSearch, etc.) and Riak post-commit hooks.
Simply put, all you’ll have to do is to create a Riak post-commit hook that feeds data into your indexing system.
The downside of this solution is that:
- you’ll still have to make sure that your indexing system is scalable, elastic, etc.
- you’ll not be able to use indexed data directly from Riak mapreduce functions, a feature that will be available through Riak Search.
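As a sketch of the "feed data into your indexing system" step: a Riak post-commit hook (which itself is written in Erlang) could POST a Solr update message for each stored object to Solr's `/update` handler. The helper below only renders that XML payload; the field names and the bucket/key-as-id convention are my own illustrative assumptions:

```python
from xml.sax.saxutils import escape

def solr_add_doc(doc_id, fields):
    """Render a Solr <add> update message for a single document.

    doc_id would typically be the Riak bucket/key pair, so that a
    search hit can be mapped back to the original Riak object.
    """
    parts = ['<add><doc>']
    parts.append('<field name="id">%s</field>' % escape(doc_id))
    for name, value in fields.items():
        parts.append('<field name="%s">%s</field>' % (escape(name), escape(str(value))))
    parts.append('</doc></add>')
    return "".join(parts)

# Hypothetical example: index a Riak object's value under bucket/key
xml = solr_add_doc("articles/key1", {"title": "Full text search with Riak"})
```

Posting the resulting XML to Solr (and a matching delete on object removal) is all the hook really needs to do.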
Anyway, until Riak Search is out, why not have some fun!
Update: Embedded below a presentation on Riak Search providing some more details about this upcoming Basho product:
Update: Looks like the other presentation is not available anymore, so here is another one on Riak Search:
Now, even if there are some basic ways (see  and ) to achieve this with MongoDB alone, people are still looking for more scalable solutions, as shown by this thread ☞ covering Solr integration with MongoDB. The thread also mentions a couple of existing Ruby and Rails plugins for this integration.
One concern that I’ve expressed about integrating with Lucene alone is that you’ll have to deal with its scalability yourself. Solr is one way to handle that automatically. Lately I have heard of a new solution for scalable search: ☞ ElasticSearch, which sounds quite interesting (nb: I haven’t yet gone through its docs or played with it, but the creator of the project has a long search/indexing history behind him. You can find more details about ElasticSearch here).
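For comparison with the Solr route, indexing a MongoDB document into ElasticSearch is just a PUT of JSON to `/<index>/<type>/<id>`. The helper below builds that request; the index/type names and default port are illustrative assumptions:

```python
import json

def es_index_request(index, doc_type, doc_id, document,
                     host="localhost", port=9200):
    """Build the URL and JSON body for an ElasticSearch index request.

    ElasticSearch stores a JSON document PUT at /<index>/<type>/<id>;
    host and port here are the usual defaults — adjust as needed.
    """
    url = "http://%s:%d/%s/%s/%s" % (host, port, index, doc_type, doc_id)
    return url, json.dumps(document)

# Hypothetical example: mirror a MongoDB document into ElasticSearch
url, body = es_index_request("articles", "post", "42",
                             {"title": "MongoDB and Solr"})
```

A MongoDB-side trigger or application-level write-through would then issue the PUT whenever a document changes.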
In a previous post, I wrote that Neo4j, like CouchDB, uses Lucene for full text indexing. While agreeing that this is definitely better than reinventing the wheel, I also raised my concern about the complexity and scalability of this approach.
Now it looks like there is some work to integrate Neo4j with Solr, the standalone full-text search server based on Lucene. This would definitely address the issue I have raised. Anyway, it is not yet clear from the original message  how this integration will work (it sounds like a two-way integration, but I may be misinterpreting the details). The code is available on Neo4j ☞ SVN.
It looks like CouchDB is not the only NoSQL store that uses Lucene for full text indexing. Neo4j, the graph database, has no built-in indexing features, but provides a pluggable mechanism for supporting it. You can read more about this integration on the ☞ Neo4j wiki.
There is also a post from Arin Sarkissian providing ☞ a quick example of how node indexing should be implemented.
While I do appreciate the fact that these projects are not suffering from the “not invented here” syndrome (and I read that Lucene can scale), I would definitely find it very useful to see some good references/recommendations on how to deal with scaling Lucene once Lucene-based full text/node indexing is used.
Update: Neo4j is getting closer to its 1.0 release and the latest RCs include some improvements to node indexing. You can read more about it in the ☞ changelog.
As its name implies, couchdb-lucene is based on the well-known Lucene library. While I think that such a solution provides a lot of features and flexibility, my concern is that it also brings additional complexity in terms of scalability, as you’ll not only need to take care of scaling CouchDB, but also your Lucene indexes. On the other hand, indexer uses a much simpler approach and stores the indexes directly in CouchDB, but it is still a prototype.
But that’s just my opinion, so I’m wondering which one of these would you favor?
To learn more about these projects you can check the following resources:
And for more libraries and projects make sure you check the NoSQL Libraries.