full text indexing: All content tagged as full text indexing in NoSQL databases and polyglot persistence

Lucene and Solr Development Merged

Full text indexing in NoSQL databases has so far been addressed natively only by Riak Search; the others rely on integrations with Lucene, Solr, or ElasticSearch.

With merged development there is now a single set of committers across both projects, and everyone in both communities can drive releases: when Solr releases, Lucene will also release, easing concerns about shipping Solr on a development version of Lucene. Solr will always be on the latest trunk version of Lucene, and code can be easily shared between the projects; Lucene will likely gain Analyzers and QueryParsers that were previously available only to Solr users. Lucene will also benefit from greater test coverage, since a single change in Lucene can now be tested against both projects, giving immediate feedback from an application that uses the Lucene libraries extensively. Finally, both projects gain a wider development community, as this change will foster more cross-pollination between Lucene and Solr devs (now just Lucene/Solr devs).

Hopefully NoSQL databases will benefit from this merge too, by having a more solid product to rely on.

Original title and link: Lucene and Solr Development Merged (NoSQL databases © myNoSQL)

via: http://www.lucidimagination.com/blog/2010/03/26/lucene-and-solr-development-have-merged/


Full text search with MongoDB and Lucene analyzers

Johan Rask:

It is important to understand that for a full-fledged full text search engine, Lucene or Solr is still your choice, since they have many other powerful features. This example only covers simple text searching, not e.g. phrase searching or other types of text searches, nor does it include ranking of hits. For many occasions this is all you need, but you must be aware that write performance in particular will be worse, or much worse, depending on the size of the data you are indexing. I have not yet done any search performance tests for this, so I am currently unaware of the impact, but I will publish results as soon as I can.

Just a couple of thoughts:

  • Besides Lucene and Solr, ☞ ElasticSearch is another option you should keep in mind
  • your application will have to deal with maintaining the index (adding, updating, removing). MongoDB currently lacks a notification mechanism that would help you decouple this, something à la the CouchDB _changes feed or Riak post-commit hooks (nb: leaving aside that starting with version 0.13, Riak Search is available)
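To make the second point concrete, here is a minimal sketch of the coupling your application has to maintain itself when the datastore offers no change notifications. The store and the search index below are in-memory stand-ins (plain dicts and sets) for MongoDB and a Lucene/Solr/ElasticSearch index; all names are illustrative, not any library's API.

```python
# Sketch of application-side index maintenance: every write to the
# primary store must also be applied to the search index by hand.
class IndexedStore:
    def __init__(self):
        self.store = {}   # stands in for a MongoDB collection
        self.index = {}   # term -> set of doc ids (inverted index)

    def _index(self, doc_id, text):
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(doc_id)

    def _unindex(self, doc_id):
        for ids in self.index.values():
            ids.discard(doc_id)

    def save(self, doc_id, text):
        # without a notification feed, the app must do both writes itself
        if doc_id in self.store:
            self._unindex(doc_id)   # updates require re-indexing
        self.store[doc_id] = text
        self._index(doc_id, text)

    def delete(self, doc_id):
        self.store.pop(doc_id, None)
        self._unindex(doc_id)

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))
```

With a _changes-style feed, the `_index`/`_unindex` calls would move out of the write path into a separate consumer, which is exactly the decoupling the post is asking for.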

Original title and link: Full text search with Mongodb and Lucene analyzers (NoSQL databases © myNoSQL)

via: http://blog.jayway.com/2010/11/14/full-text-search-with-mongodb-and-lucene-analyzers/


Real-Time Searching of Big Data with Solr and Hadoop

Great presentation on searching BigData in real-time integrating Solr and Hadoop from ☞ OpenLogic’s Rod Cope:

And they are definitely not the only ones using Hadoop and HBase for search. I guess this would also serve as a counter-example to Beyond Hadoop - Next-Generation Big Data Architectures.

Original title and link: Real-Time Searching of Big Data with Solr and Hadoop (NoSQL databases © myNoSQL)


Integrating ElasticSearch and CouchDB

This tutorial explains the process of setting up ElasticSearch to automatically index data in CouchDB and make it searchable. ElasticSearch 0.11 introduced a feature named The River, which allows it to connect to external systems and listen for document updates. On receiving a notification, ElasticSearch indexes the data and makes it available for search.

In a nutshell, the solution uses what I’ve mentioned in previous posts: a combination of CouchDB _changes and an ElasticSearch automatic pull mechanism.
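For a flavor of what the setup looks like, a river is registered by PUT-ing a configuration document into ElasticSearch's `_river` index. The field names below follow the CouchDB river plugin's documented format of the time, but treat the host/port/db values as placeholders for your own setup:

```python
import json

# Build the CouchDB river configuration document. ElasticSearch will
# start tailing the given CouchDB database's _changes feed and index
# every updated document.
def couchdb_river_config(db="my_db", host="localhost", port=5984):
    return json.dumps({
        "type": "couchdb",
        "couchdb": {"host": host, "port": port, "db": db},
        # where the indexed documents end up inside ElasticSearch
        "index": {"index": db, "type": db},
    })

# The config would then be sent with something like:
#   curl -XPUT 'http://localhost:9200/_river/my_db/_meta' -d "$CONFIG"
```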

Original title and link: Integrating ElasticSearch and CouchDB (NoSQL databases © myNoSQL)

via: http://github.com/elasticsearch/elasticsearch/wiki/Couchdb-integration


Riak 0.13, Featuring Riak Search

I’m not sure how I managed to be the last one to the Riak 0.13 party :(. And I can tell you, it is a big party.

After writing about Riak Search a couple of times already[1], I somehow missed exactly the release of Riak that includes it.

Riak 0.13, ☞ announced a couple of days ago, brings quite a few new exciting features:

  • Riak search
  • MapReduce improvements
  • Bitcask storage backend improvements
  • improvements to the riak_core and riak_kv modules — the building blocks of Dynamo-like distributed systems — and better code organization allowing easier use of these modules

While everything in this release sounds like an important step forward for Riak, what sets it apart is Riak Search, a feature that is currently unique in the NoSQL database space.

Riak Search uses Lucene and builds a Solr-like API on top of it (nb: I think that reusing known interfaces and protocols is, most of the time, the right approach).

At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.

☞ The Basho Blog

The Basho blog explains this feature extensively ☞ here and ☞ here.
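The two sides of the integration described above can be sketched as follows. Enabling Search on a bucket amounts to installing the Search pre-commit hook in the bucket's properties; the module/function names below match the hook Riak Search shipped with at the time (Basho also provided a `search-cmd install <bucket>` helper), but verify them against your Riak version. The query URL shape assumes the Solr-like HTTP endpoint on Riak's default port:

```python
import json
from urllib.parse import urlencode

# Bucket properties that install the Riak Search pre-commit hook, so
# every object written to the bucket is also indexed.
def search_hook_props():
    return json.dumps({
        "props": {
            "precommit": [{"mod": "riak_search_kv_hook", "fun": "precommit"}]
        }
    })

# Queries then go through the Solr-like select endpoint.
def solr_query_url(index, query, host="localhost", port=8098):
    return "http://%s:%d/solr/%s/select?%s" % (
        host, port, index, urlencode({"q": query, "wt": "json"}))
```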

Riak Search shows a lot of great decisions made by the Basho team, as it avoids reinventing the wheel or creating some new protocols/interfaces. I’ve stressed these aspects a couple of times already, when writing that NoSQL databases should follow the Unix Philosophy and also when writing about how important NoSQL protocols are. Mathias Meyer has a ☞ post detailing why these are important.

Last but not least, Ripple, the Ruby library for Riak, ☞ got updated too, though I’m not sure it supports all the new features in Riak 0.13.

Here is Rusty Klophaus (Basho) talking about Riak Search at the Berlin Buzzwords NoSQL event:


  1. The first post about Riak Search, Notes on scaling out with Riak and Riak search podcast, dates back to December 14th, 2009, just a couple of days after setting up myNoSQL.

Original title and link: Riak 0.13, Featuring Riak Search (NoSQL databases © myNoSQL)


Searchable CouchDB with ElasticSearch

Shay Banon (@kimchy) about the ElasticSearch integration with CouchDB:

The CouchDB River allows one to automatically index CouchDB and make it searchable using the excellent _changes stream CouchDB provides. […] On top of that, in case of a failover, the CouchDB river will automatically be started on another ElasticSearch node and continue indexing from the last indexed seq.

Full text indexing in the NoSQL space is seeing some interesting solutions.

Update: if you are interested in finding out more about CouchDB _changes, you should check the video below:

Original title and link: Searchable CouchDB with ElasticSearch (NoSQL databases © myNoSQL)

via: http://www.elasticsearch.com/blog/2010/09/28/the_river_searchable_couchdb.html


CouchDB: Relaxed Searching with Bitstore

Missing from CouchDB are two key components, the ability to search over documents and the ability to relate documents to one another.

[…]

So Bitstore proposes adding these two components as extensions to CouchDB. […] This newest prototype of Bitstore adds the ability to filter searches over the field names of the documents, along with a number of other minor packaging features to help it play better with CouchDB.

Bitstore can be found on ☞ GitHub.

Original title and link: CouchDB: Relaxed Searching with Bitstore (NoSQL databases © myNoSQL)

via: http://dionne.posterous.com/relaxed-searching-in-couchdb


Neo4j: Advanced Indexes Using Multiple Keys

There’s a prototype implementation of a new index which solves this (and some other issues as well, f.ex. indexing for relationships). The code is at https://svn.neo4j.org/laboratory/components/lucene-index/ and it’s built and deployed over at http://m2.neo4j.org/org/neo4j/neo4j-lucene-index/

The new index isn’t compatible with the old one so you’ll have to index your data with the new index framework to be able to use it.

Previously, you were only able to search by a single property.

Original title and link for this post: Neo4j: Advanced Indexes Using Multiple Keys (published on the NoSQL blog: myNoSQL)

via: http://lists.neo4j.org/pipermail/user/2010-August/004781.html


Building a search engine using Redis

This idea is definitely not new, but the post shares quite a few good principles on how to build a search engine using Redis. Plus there’s some ☞ code available:

I know what you are thinking. Why would we want to build a search engine from scratch when Lucene, Xapian, and other software is available? What could possibly be gained? To start: simplicity, speed, and flexibility. We’re going to be building a search engine implementing TF/IDF search using Redis, redis-py, and just a few lines of Python. With a few small changes to what I provide, you can integrate your own document importance scoring, and if one of my patches gets merged into Redis, you could combine TF/IDF with your pre-computed Pagerank… Building an index and search engine using Redis offers so much more flexibility out of the box than is available using any of the provided options. Convinced?
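As a rough illustration of the approach (a sketch, not the author's code), here is a minimal TF/IDF index in plain Python. Each entry of `postings` corresponds to one Redis sorted set keyed by term, with document ids as members and term frequencies as scores (maintained with ZINCRBY in the Redis version); the final scoring would map onto a weighted ZUNIONSTORE:

```python
import math
from collections import defaultdict

# Minimal TF/IDF index sketch. One "sorted set" per term: member =
# doc id, score = term frequency.
class TfIdfIndex:
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: tf}
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for w in text.lower().split():
            # ZINCRBY term 1 doc_id in the Redis version
            self.postings[w][doc_id] = self.postings[w].get(doc_id, 0) + 1

    def search(self, query):
        # score = sum over query terms of tf * idf; in Redis this is a
        # ZUNIONSTORE over the per-term sorted sets with idf weights
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)
```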

Anyways, as I’ve already said myself, there are a couple of things you should be aware of:

Using Redis to build search is great for your personal site, your company intranet, your internal customer search, maybe even one of your core products. But be aware that Redis keeps everything in memory, so as your index grows, so do your machine requirements. Naive sharding tricks may work to a point, but there will be a point where your merging will have to turn into a tree, and your layers of merges start increasing your latency to scary levels.

via: http://dr-josiah.blogspot.com/2010/07/building-search-engine-using-redis-and.html


Redis and a Full Text Indexing Solution

The guys from PlayNice.ly, who are building a bug tracker that uses Redis for storing all app data (users, projects, bugs, comments, audit data, etc.), have recently posted ☞ here and ☞ here about their work to support search within their product.

While the general idea is to simply store the inverted index into Redis, there are a couple of interesting things to be noted:

  1. Redis’ native support for the SET data type and its set operations (union, intersection, difference) makes working with Redis-stored inverted indexes pretty handy
  2. While you might be tempted to use every term as an index key, this will not work for fuzzy searches (e.g. a search for the word “numbers” will not match documents containing the word “number”). Using “smart keys” — the article mentions using phonetic algorithms for calculating the keys; another solution could employ stemming algorithms — will help you reduce the number of index keys and also perform fuzzy searches
  3. Building a good API for working with a custom solution will make things feel more natural.
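The set-based mechanics in points 1 and 2 can be sketched in a few lines of plain Python. The set intersections and unions below correspond one-to-one to Redis' SINTER and SUNION commands, and `normalize` is a deliberately crude suffix-stripping stand-in for a real phonetic or stemming algorithm:

```python
from collections import defaultdict

# Inverted index sketch: one set of doc ids per normalized term.
# In Redis each entry would be a SET key (SADD to index, SINTER and
# SUNION to query).
index = defaultdict(set)

def normalize(word):
    # crude stand-in for a phonetic/stemming key, so that "number"
    # and "numbers" share the same index key
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

def add_document(doc_id, text):
    for word in text.split():
        index[normalize(word)].add(doc_id)   # SADD term doc_id

def search_all(*words):
    # documents containing every word: SINTER in Redis
    sets = [index[normalize(w)] for w in words]
    return set.intersection(*sets) if sets else set()

def search_any(*words):
    # documents containing any word: SUNION in Redis
    return set().union(*(index[normalize(w)] for w in words))
```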

Anyways, before considering this problem completely solved, there are a couple of additional things you should keep in mind:

  1. index updates: there are many different scenarios in which you’ll have to update the inverted index, and this can raise different problems, starting with:
    1. the increased number of operations (write explosion) and round trips to the storage
    2. dealing with concurrent updates
  2. index size (or data explosion): even if the number of keys in the index is limited, the total amount of data stored will grow over time with the number of source documents. Keeping in mind that Redis stores all data in memory, the hardware requirements for your machine will be higher. The upcoming Redis version will help alleviate this issue by introducing Redis virtual memory, about which you can read more here.

Full text indexing is definitely not a new problem in the NoSQL space and there are different approaches to tackle it. Pick yours carefully!


Riak Search and Riak Full Text Indexing

Announced a while back and ☞ not quite here yet, Riak Search is Basho’s solution to the full text indexing problem.

While waiting for the release of Riak Search, I think that you can already start doing full text indexing using one of the existing indexing solutions (Lucene[1], Solr[2], ElasticSearch[3], etc.) and Riak post-commit hooks.

Simply put, all you’ll have to do is to create a Riak post-commit hook that feeds data into your indexing system.
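A hypothetical sketch of the "feed" side: the post-commit hook itself is written in Erlang inside Riak, and would forward each written object to a small service like the one below that turns it into a Solr add command. The document layout here is illustrative; Solr's JSON update format wraps each document in an `{"add": {"doc": …}}` envelope posted to the update endpoint:

```python
import json

# Turn a Riak object (bucket, key, value) into a Solr JSON add command.
# The bucket/key pair doubles as the Solr document id so updates from
# the hook overwrite the previous version of the document.
def to_solr_add(bucket, key, value):
    doc = {"id": "%s/%s" % (bucket, key)}
    doc.update(value)   # flatten the Riak object's fields into the doc
    return json.dumps({"add": {"doc": doc}})

# The command would then be POSTed to your Solr core's JSON update
# endpoint, e.g. http://localhost:8983/solr/update/json
```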

The downside of this solution is that:

  1. you’ll still have to make sure that your indexing system is scalable, elastic, etc.
  2. you’ll not be able to use indexed data directly from Riak mapreduce functions, a feature that will be available through Riak Search.

Anyways, until Riak Search is out, why not have some fun!

Update: Embedded below a presentation on Riak Search providing some more details about this upcoming Basho product:

Update: Looks like the other presentation is not available anymore, so here is another on Riak search:

References


Presentation: CouchDB and Lucene

We’ve looked in the past at two possible approaches to dealing with full text indexing in CouchDB. Now I’ve found a great slide deck from Martin Rehfeld on the subject: