solr: All content tagged as solr in NoSQL databases and polyglot persistence
Today LinkedIn has announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank was offering a hosted, scalable full-text search API.
When reading the announcement, I’ve asked myself two questions: what is IndexTank and how does IndexTank compare to Lucene and Solr.
The answer to the the first one is provided in the post.
What is Index Tank? IndexTank is mainly three things:
- IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
- API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
- Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.
For the second, I’ve reached out the the old IndexTank FAQ.
How does IndexTank compare to Lucene and Solr?
- IndexTank was a hosted, scalable service
- IndexTank can add documents to the index
- IndexTank supports updating document variables without re-indexing
- IndexTank supports geolocation functions
For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.
Original title and link: LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr ( ©myNoSQL)
Great presentation on searching BigData in real-time integrating Solr and Hadoop from ☞ OpenLogic’s Rod Cope:
And they are definitely not the only one using Hadoop and HBase for search. I guess this would also be a counter-example to Beyond Hadoop - Next-Generation Big Data Architectures.
Original title and link: Real-Time Searching of Big Data with Solr and Hadoop (NoSQL databases © myNoSQL)
I’m not very sure how I’ve managed to be the last to the Riak 0.13 party :(. And I can tell you it is a big party.
Riak 0.13, ☞ announced a couple of days ago, brings quite a few new exciting features:
- Riak search
- MapReduce improvements
- Bitcask storage backend improvements
- improvements to the riak_code and riak_kv modules — the building blocks of Dynamo-like distributed systems — and better code organization allowing easier use of these modules
While everything in this release sounds like an important step forward for Riak, what sets it aside the Riak search a feature that is currently unique in the NoSQL databases space.
Riak search is using Lucene and builds a Solr like API on top of it (nb I think that reusing known interfaces and protocols is most of the time the right approach).
At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.
Riak Search shows a lot of great decisions made by the Basho team, as it avoids reinventing the wheel or creating some new protocols/interfaces. I’ve stressed these aspects a couple of times already, when writing that NoSQL databases should follow the Unix Philosophy and also when writing about how important NoSQL protocols are. Mathias Meyer has a ☞ post detailing why these are important.
Last, but not least the Ruby Riak ripple library ☞ got updated too, but not sure it supports all the new features in Riak 0.13.
Here is a Rusty Klophaus (Basho) talking about Riak search at Berlin Buzzwords NoSQL event:
- First post about Riak search Notes on scaling out with Riak and Riak search podcast dates back to December 14th, 2009, just a couple of days after setting up myNoSQL. (↩)
Given a set of requirements (prepared to scale, data models can evolve, data must be searchable, common access to entities), a data definition language (think Protocol Buffers
), a NoSQL database, how do you build a searchable, evolvable entity store?
Sam Pullara explains how he solved these while ☞ creating HAvroBase:
The first choice you have to make against these requirements is which data definition language are you going to use?
Whereas the data definition choice is basically commodity at this point and your choice can be somewhat arbitrary, the choice of storage technology will likely be something that has more trade-offs to consider.
When it comes to text search you really don’t get better than Lucene in open source and the features that Solr builds on top of Lucene make it even better. I don’t think there is reasonable argument for using something besides Solr at this point. Especially with their support for sharding and replication that comes with Solr Cloud.
The only remark is that the solution might also use other NoSQL databases especially key-value stores (basically, once entities are encoded with Avro, data will become opaque to HBase so its wide-column data model is not a strong requirement).
Source code is available on ☞ GitHub.