


Lucene: All content tagged as Lucene in NoSQL databases and polyglot persistence

Full text search with MongoDB and Lucene analyzers

Johan Rask:

It is important to understand that for a full-fledged full text search engine, Lucene or Solr is still your choice since it has many other powerful features. This example only includes simple text searching and not e.g. phrase searching or other types of text searches, nor does it include ranking of hits. For many occasions this is all you need, but then you must be aware that write performance in particular will be worse or much worse depending on the size of the data you are indexing. I have not yet done any search performance tests for this so I am currently totally unaware of this but I will publish this as soon as I can.

Just a couple of thoughts:

  • Besides Lucene and Solr, ☞ ElasticSearch is another option you should keep in mind
  • your application will have to deal with maintaining the index (adding, updating, removing). MongoDB currently lacks a notification mechanism that would help you decouple this. Something a la CouchDB _changes feed or Riak post-commit hooks (nb: leaving aside that starting with version 0.13 Riak search is available)
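
To make the index maintenance point concrete, here is a minimal sketch of the keyword-array approach, not Johan Rask's actual code. The `analyze` function is a crude stand-in for a Lucene analyzer, the `_keywords` field name and the stop word list are assumptions made for illustration, and `matches` only imitates in memory what MongoDB's `$all` operator would do on the server:

```python
import re

# Hypothetical stop word list; a real Lucene analyzer ships its own.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "for", "to", "in"}

def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words, de-duplicate."""
    seen = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in STOP_WORDS and token not in seen:
            seen.append(token)
    return seen

def with_keywords(doc, fields):
    """Return a copy of doc with a `_keywords` array built from the text fields.

    The application has to call this on every insert and update itself."""
    text = " ".join(str(doc.get(field, "")) for field in fields)
    return {**doc, "_keywords": analyze(text)}

def build_query(search_text):
    """Query document requiring every search term, using the $all operator."""
    return {"_keywords": {"$all": analyze(search_text)}}

def matches(doc, query):
    """In-memory imitation of the match; an empty query matches nothing here."""
    wanted = query["_keywords"]["$all"]
    return bool(wanted) and all(term in doc.get("_keywords", []) for term in wanted)

post = with_keywords(
    {"title": "Full text search with MongoDB", "body": "Using Lucene analyzers"},
    ["title", "body"],
)
print(post["_keywords"])
print(matches(post, build_query("lucene mongodb")))
```

There is no ranking and no phrase search in this sketch, which is exactly the limitation the quote above warns about.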

Original title and link: Full text search with Mongodb and Lucene analyzers (NoSQL databases © myNoSQL)


Riak 0.13, Featuring Riak Search

I’m not very sure how I’ve managed to be the last to the Riak 0.13 party :(. And I can tell you it is a big party.

After writing about Riak Search a couple of times already[1], I managed to miss the very release of Riak that includes it.

Riak 0.13, ☞ announced a couple of days ago, brings quite a few new exciting features:

  • Riak search
  • MapReduce improvements
  • Bitcask storage backend improvements
  • improvements to the riak_core and riak_kv modules (the building blocks of Dynamo-like distributed systems) and better code organization allowing easier use of these modules

While everything in this release sounds like an important step forward for Riak, what sets it apart is Riak Search, a feature that is currently unique in the NoSQL databases space.

Riak Search uses Lucene and builds a Solr-like API on top of it (nb: I think that reusing known interfaces and protocols is, most of the time, the right approach).

At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.

☞ The Basho Blog
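
The flow described in the quote can be modeled in a few lines. This is only an in-memory sketch of the idea, not the Riak client API: `Store`, `SearchIndex`, `enable_search` and `map_over` are hypothetical names, and re-indexing an updated object (removing its old terms) is left out to keep it short:

```python
import re
from collections import defaultdict

def terms_of(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class SearchIndex:
    """Maps each term to the set of (bucket, key) pairs whose value contains it."""
    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, bucket, key, value):
        for term in terms_of(value):
            self.postings[term].add((bucket, key))

    def search(self, query):
        """Return the sorted bucket/key pairs containing every query term."""
        terms = terms_of(query)
        if not terms:
            return []
        hits = set.intersection(*[self.postings.get(t, set()) for t in terms])
        return sorted(hits)

class Store:
    """Toy key/value store with per-bucket pre-commit hooks."""
    def __init__(self):
        self.objects = {}
        self.precommit = defaultdict(list)

    def enable_search(self, bucket, index):
        # "Installing the Search pre-commit hook" for one bucket.
        self.precommit[bucket].append(index.index)

    def put(self, bucket, key, value):
        for hook in self.precommit[bucket]:
            hook(bucket, key, value)      # runs before the write is stored
        self.objects[(bucket, key)] = value

    def map_over(self, pairs, fn):
        """Stand-in for a map phase fed with search results."""
        return [fn(self.objects[pair]) for pair in pairs]

store, index = Store(), SearchIndex()
store.enable_search("posts", index)
store.put("posts", "p1", "Riak Search builds a Solr like API")
store.put("posts", "p2", "Bitcask storage backend improvements")
store.put("drafts", "d1", "Riak search draft")   # bucket not enabled, not indexed
pairs = index.search("riak search")
print(pairs, store.map_over(pairs, str.upper))
```

Only the bucket with the hook installed gets indexed, and the search result is a plain list of bucket/key pairs, which is what makes it usable as MapReduce input.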

The Basho blog explains this feature extensively ☞ here and ☞ here.

Riak Search shows a lot of great decisions made by the Basho team, as it avoids reinventing the wheel or creating some new protocols/interfaces. I’ve stressed these aspects a couple of times already, when writing that NoSQL databases should follow the Unix Philosophy and also when writing about how important NoSQL protocols are. Mathias Meyer has a ☞ post detailing why these are important.

Last but not least, the Ruby Riak library, ripple, ☞ got updated too, but I'm not sure it supports all the new features in Riak 0.13.

Here is Rusty Klophaus (Basho) talking about Riak Search at the Berlin Buzzwords NoSQL event:

  1. The first post about Riak Search, "Notes on scaling out with Riak and Riak search podcast", dates back to December 14th, 2009, just a couple of days after setting up myNoSQL.

Original title and link: Riak 0.13, Featuring Riak Search (NoSQL databases © myNoSQL)

Riak Search and Riak Full Text Indexing

Announced a while back and ☞ not quite here yet, Riak Search is Basho’s solution to the full text indexing problem.

While waiting for the release of Riak Search, I think that you can already start doing full text indexing using one of the existing indexing solutions (Lucene[1], Solr[2], ElasticSearch[3], etc.) and Riak post-commit hooks.

Simply put, all you’ll have to do is to create a Riak post-commit hook that feeds data into your indexing system.

The downsides of this solution are:

  1. you’ll still have to make sure that your indexing system is scalable, elastic, etc.
  2. you’ll not be able to use indexed data directly from Riak mapreduce functions, a feature that will be available through Riak Search.
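
A rough sketch of the hook-feeds-indexer idea follows. It only models the data flow in Python: as far as I know, real Riak post-commit hooks are Erlang functions configured on the bucket, and `ExternalIndex`, `postcommit_hook` and `drain` are hypothetical names, with `ExternalIndex` standing in for Lucene, Solr or ElasticSearch. The hook just records the change, and a separate step applies it, so a slow indexer does not hold up writes:

```python
import queue
import re

class ExternalIndex:
    """Stand-in for the real indexing system: document id -> set of terms."""
    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        # Adding an existing id replaces its terms, which covers updates.
        self.docs[doc_id] = set(re.findall(r"[a-z0-9]+", text.lower()))

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def search(self, term):
        return sorted(d for d, terms in self.docs.items() if term.lower() in terms)

changes = queue.Queue()   # pending (bucket, key, value) change events

def postcommit_hook(bucket, key, value):
    """Called after a successful write; value is None for a delete."""
    changes.put((bucket, key, value))

def drain(index):
    """Apply all pending changes to the index and return how many were applied."""
    applied = 0
    while True:
        try:
            bucket, key, value = changes.get_nowait()
        except queue.Empty:
            return applied
        doc_id = f"{bucket}/{key}"
        if value is None:
            index.delete(doc_id)
        else:
            index.add(doc_id, value)
        applied += 1

demo_index = ExternalIndex()
postcommit_hook("posts", "p1", "Riak post-commit hooks")
postcommit_hook("posts", "p1", None)   # the object is deleted again
drain(demo_index)
print(demo_index.search("riak"))
```

Both downsides listed above still apply: the external index has to be scaled separately, and nothing in it is reachable from inside a Riak mapreduce job.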

Anyway, until Riak Search is out, why not have some fun?

Update: Embedded below is a presentation on Riak Search that provides some more details about this upcoming Basho product:

Update: Looks like the other presentation is not available anymore, so here is another on Riak search:


Presentation: CouchDB and Lucene

We’ve looked in the past at two possible approaches to deal with full text indexing in CouchDB. Now, I’ve found a great slidedeck from Martin Rehfeld on the subject:

Integrating MongoDB with Solr

Sounds like quite a few NoSQL projects are externalizing the full text indexing to either Lucene or Solr (take for example CouchDB integration with Lucene or Neo4j integration with Lucene and Solr).

Now even if there are some basic ways (see [1] and [2]) to achieve this with MongoDB alone, people are still looking for more scalable solutions as shown by this thread ☞ covering Solr integration with MongoDB. The thread also mentions a couple of existing Ruby or Rails plugins for this integration.

One concern that I’ve expressed about the integration with Lucene alone is that you’ll have to deal with its scalability. Solr is one way to do that automatically. Lately I have heard of a new solution for scalable search: ☞ ElasticSearch, which sounds quite interesting (nb: I haven’t yet gone through its docs or played with it, but the creator of the project has a long search/indexing history behind him. You can find more details about ElasticSearch here[3]).

Neo4j Extending Integration with Lucene Family. Now Solr

In a previous post, I wrote that Neo4j, like CouchDB, uses Lucene for full text indexing. While agreeing that this is definitely better than reinventing the wheel, I also raised my concern about the complexity and scalability of this approach.

Now it looks like there is some work to integrate Neo4j with Solr, the standalone full-text search server based on Lucene [1]. This would definitely address the issue I have raised. It is not yet clear from the original message [2] how this integration will work, though (it sounds like a two-way integration, but I may be misinterpreting the details). The code is available on Neo4j ☞ SVN.

Neo4j Node Indexing

It looks like CouchDB is not the only NoSQL store that uses Lucene for full text indexing. Neo4j, the graph database, has no built-in indexing features, but provides a pluggable mechanism for supporting it. You can read more about this integration on the ☞ Neo4j wiki.

There is also a post from Arin Sarkissian providing ☞ a quick example of how node indexing should be implemented.

While I do appreciate the fact that these projects are not suffering from the “not invented here” syndrome (and I read that Lucene can scale), I would definitely find it very useful to see some good references/recommendations on how to deal with Lucene scaling once Lucene-based full text/node indexing is used.

Update: Neo4j is getting closer to its 1.0 release and the latest RCs include some improvements on the node indexing. You can read more about it in the ☞ changelog

CouchDB Full Text Indexing

Currently there seem to be two approaches to get full text indexing in CouchDB: couchdb-lucene [1] and indexer [2].

As its name implies, couchdb-lucene is based on the well known Lucene library. While I think that such a solution provides a lot of features and flexibility, my concern is that it also brings additional complexity in terms of scalability, as you’ll not only need to take care of scaling CouchDB, but also your Lucene indexes. On the other hand, indexer uses a much simpler approach and stores the indexes directly in CouchDB, but it is still a prototype version.

But that’s just my opinion, so I’m wondering which one of these would you favor?

To learn more about these projects you can check the following resources:

And for more libraries and projects make sure you check the NoSQL Libraries.

CouchDB Full Text Indexing Prototype and Riak Search

A prototype for CouchDB full text indexing based on Joe Armstrong’s code from ☞ Programming Erlang: Software for a Concurrent World

The implementation is quite naive, using a couch database to store the inverted index, but it works surprisingly well for my use case and is very simple.
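
For readers who have not seen one, here is what a naive inverted index stored as documents might look like. This is a sketch of the general technique only, neither Joe Armstrong's code nor the prototype's: a plain dict plays the role of the couch database, with one hypothetical document per term of the shape `{"_id": term, "docs": [...]}`:

```python
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class InvertedIndex:
    def __init__(self, index_db=None):
        # index_db stands in for the database holding the index documents.
        self.index_db = {} if index_db is None else index_db

    def add(self, doc_id, text):
        """Record doc_id in the index document of every term in text."""
        for term in tokenize(text):
            entry = self.index_db.setdefault(term, {"_id": term, "docs": []})
            if doc_id not in entry["docs"]:
                entry["docs"].append(doc_id)

    def remove(self, doc_id):
        """Drop doc_id everywhere; delete index documents that become empty."""
        for term in list(self.index_db):
            entry = self.index_db[term]
            if doc_id in entry["docs"]:
                entry["docs"].remove(doc_id)
            if not entry["docs"]:
                del self.index_db[term]

    def search(self, query):
        """Return the sorted ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        result = None
        for term in terms:
            docs = set(self.index_db.get(term, {"docs": []})["docs"])
            result = docs if result is None else result & docs
        return sorted(result)

idx = InvertedIndex()
idx.add("a", "Erlang inverted index")
idx.add("b", "CouchDB inverted index prototype")
print(idx.search("inverted index"), idx.search("erlang index"))
```

The naivety shows in `remove`, which scans every index document, and in the fact that a popular term turns into one large, frequently rewritten document.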

I'm not sure, though, that this prototype would have stopped ☞ the guys from Collecta from migrating to Riak and Riak Search.

The CouchDB full text indexing prototype code can be accessed on ☞ GitHub.