full text indexing: All content tagged as full text indexing in NoSQL databases and polyglot persistence
Great presentation on searching BigData in real-time integrating Solr and Hadoop from ☞ OpenLogic’s Rod Cope:
And they are definitely not the only ones using Hadoop and HBase for search. I guess this would also be a counter-example to Beyond Hadoop - Next-Generation Big Data Architectures.
Original title and link: Real-Time Searching of Big Data with Solr and Hadoop (NoSQL databases © myNoSQL)
I’m not sure how I’ve managed to be the last one to arrive at the Riak 0.13 party :(. And I can tell you, it is a big party.
Riak 0.13, ☞ announced a couple of days ago, brings quite a few new exciting features:
- Riak search
- MapReduce improvements
- Bitcask storage backend improvements
- improvements to the riak_core and riak_kv modules — the building blocks of Dynamo-like distributed systems — and better code organization allowing easier use of these modules
While everything in this release sounds like an important step forward for Riak, what sets it apart is Riak Search, a feature that is currently unique in the NoSQL databases space.
Riak Search uses Lucene and builds a Solr-like API on top of it (nb: I think that reusing known interfaces and protocols is most of the time the right approach).
At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.
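To make that flow concrete, here is a minimal in-memory sketch of the pre-commit-hook idea described above. All the names here (`Bucket`, `search_precommit`, `search`) are hypothetical stand-ins for illustration, not Riak's actual API; in Riak the hook is registered on the bucket and runs inside the database.

```python
# In-memory sketch of the Search pre-commit flow: a hook registered on
# a bucket indexes every stored object's terms, so objects can later be
# found by their values. Names are hypothetical, not Riak's API.

INDEX = {}  # term -> set of (bucket, key) pairs


class Bucket:
    def __init__(self, name, precommit=None):
        self.name = name
        self.data = {}            # key -> value
        self.precommit = precommit

    def put(self, key, value):
        if self.precommit:        # the hook runs as part of the write
            self.precommit(self.name, key, value)
        self.data[key] = value


def search_precommit(bucket, key, value):
    """Index each term of the object so it is searchable by content."""
    for term in value.lower().split():
        INDEX.setdefault(term, set()).add((bucket, key))


def search(term):
    """Return the bucket/key pairs matching the query term; this result
    list is exactly what could be fed into a MapReduce operation."""
    return INDEX.get(term.lower(), set())


posts = Bucket("posts", precommit=search_precommit)
posts.put("p1", "Riak Search indexes objects seamlessly")
posts.put("p2", "MapReduce can consume search results")

print(search("search"))  # both objects match
```

The point of the sketch is the shape of the integration: the write path triggers indexing, and queries return bucket/key pairs rather than documents, which is what makes chaining into MapReduce natural.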
Riak Search shows a lot of great decisions made by the Basho team, as it avoids reinventing the wheel or creating some new protocols/interfaces. I’ve stressed these aspects a couple of times already, when writing that NoSQL databases should follow the Unix Philosophy and also when writing about how important NoSQL protocols are. Mathias Meyer has a ☞ post detailing why these are important.
Last, but not least, the Ruby Riak library, Ripple, ☞ got updated too, but I’m not sure it supports all the new features in Riak 0.13.
Here is Rusty Klophaus (Basho) talking about Riak Search at the Berlin Buzzwords NoSQL event:
- The first post about Riak Search, the Notes on scaling out with Riak and Riak search podcast, dates back to December 14th, 2009, just a couple of days after setting up myNoSQL. (↩)
The guys from PlayNice.ly, who are building a bug tracker that uses Redis for storing all app data (users, projects, bugs, comments, audit data, etc.), have recently posted ☞ here and ☞ here about their work to support search within their product.
While the general idea is simply to store the inverted index in Redis, there are a couple of interesting things to be noted:
- Redis’ native support for the SET data type and its set operations (union, intersection, difference) makes working with Redis-stored inverted indexes pretty handy
- While you might be tempted to use every term as an index key, this will not work for fuzzy searches (e.g. a search for the word “numbers” will not return documents containing the word “number”). Using “smart keys” — the article mentions using phonetic algorithms for calculating the keys; another solution could employ stemming algorithms — will help you both reduce the number of index keys and perform fuzzy searches
- Building a good API for working with a custom solution will make things feel more natural.
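The first two points can be sketched in a few lines. This is not PlayNice.ly's actual code: plain Python sets stand in for Redis SETs (set intersection playing the role of SINTER), and the deliberately crude suffix-stripping `stem` function is a placeholder for a real stemming algorithm (e.g. Porter) or a phonetic one (e.g. Metaphone):

```python
# Inverted-index sketch: Python sets stand in for Redis SETs, and
# set.intersection for Redis' SINTER. stem() is a crude placeholder
# "smart key" function, not a production stemmer.

INDEX = {}  # smart key (stemmed term) -> set of document ids


def stem(word):
    """Collapse trivial suffixes so 'numbers' and 'number' share one
    index key -- a toy stand-in for stemming/phonetic smart keys."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_doc(doc_id, text):
    """Add every (stemmed) term of the document to the inverted index."""
    for word in text.split():
        INDEX.setdefault(stem(word), set()).add(doc_id)


def search_all(*terms):
    """AND query: intersect the term sets, like Redis' SINTER command."""
    sets = [INDEX.get(stem(t), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


index_doc("d1", "tracking bug numbers")
index_doc("d2", "each bug has a number")

print(search_all("bug", "number"))  # fuzzy: matches both documents
```

Note how “numbers” and “number” collapse to the same key, which is exactly what makes the fuzzy match work while also keeping the key count down. In real Redis you would store each set under its smart key and let SINTER/SUNION/SDIFF do the query-side set algebra server-side.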
Anyway, before considering this problem completely solved, there are a couple of additional things that you should keep in mind:
- index updates: there are many different scenarios in which you’ll have to update the inverted index, and this can raise different problems, starting with:
- the increased number of operations (write explosion) and roundtrips to the storage
- dealing with concurrent updates
- index size (or data explosion): even if the number of keys in the index is limited, the total amount of data stored will grow over time with the number of source documents. Keeping in mind that Redis stores all data in memory, the hardware requirements for your machine will be higher. The upcoming Redis version will help alleviate this issue by introducing Redis virtual memory, about which you can read more here.
While waiting for the release of Riak Search, I think that you can already start doing full text indexing using one of the existing indexing solutions (Lucene, Solr, ElasticSearch, etc.) and Riak post-commit hooks.
Simply put, all you’ll have to do is to create a Riak post-commit hook that feeds data into your indexing system.
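As a rough sketch of that wiring (all names here are hypothetical: a real Riak hook would be an Erlang function registered on the bucket, and the indexer call would be an HTTP POST to Solr or ElasticSearch rather than an in-process method call):

```python
# Sketch of a post-commit hook feeding writes into an external indexer.
# A toy store and a fake indexer stand in for Riak and Solr/ElasticSearch;
# the shape of the integration is the point, not the classes themselves.

class FakeIndexer:
    """Stand-in for an external indexing system (Solr, ElasticSearch...)."""

    def __init__(self):
        self.docs = {}

    def submit(self, key, value):
        # Real code would issue an HTTP POST to the indexer's update
        # endpoint here; we just record the document.
        self.docs[key] = value


class Store:
    """Toy key/value store supporting post-commit hooks."""

    def __init__(self):
        self.data = {}
        self.postcommit = []

    def put(self, key, value):
        self.data[key] = value        # the write completes first...
        for hook in self.postcommit:  # ...then the hooks fire
            hook(key, value)


indexer = FakeIndexer()
store = Store()
store.postcommit.append(indexer.submit)  # register the post-commit hook

store.put("doc1", "full text indexing with Riak hooks")
```

Because the hook runs after the commit, a write succeeds even if the indexer is down, which is also why you have to think about re-feeding missed documents yourself.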
The downside of this solution is that:
- you’ll still have to make sure that your indexing system is scalable, elastic, etc.
- you won’t be able to use indexed data directly from Riak MapReduce functions, a feature that will be available through Riak Search.
Anyway, until Riak Search is out, why not have some fun!
Update: Embedded below is a presentation on Riak Search providing some more details about this upcoming Basho product:
Update: Looks like the other presentation is not available anymore, so here is another one on Riak Search: