solr: All content tagged as solr in NoSQL databases and polyglot persistence

IndexTank vs Thinking Sphinx vs WebSolr

In the light of IndexTank being open sourced by LinkedIn, here is a post in which Gautam Rege compares IndexTank with Thinking Sphinx and WebSolr. Feature-wise, IndexTank has some advantages over Solr and almost none when compared with Thinking Sphinx.

When I first needed full-text search, I used Solr. It was pretty good, though re-indexing took ages, and to ensure consistency I had to re-index every day via cron. Then I found Thinking Sphinx – and loved it because it managed delta indexes! Wow – no more daily re-index cron jobs. Even the re-indexing was way quicker.

The big issue with both Solr and TS was that they required tight integration with my models and my database. For example, in TS, if a relationship changed, I had to make sure to trigger the parent / child delta index to ensure it got indexed too. Both TS and Solr add methods to ActiveRecord, which I find a little annoying. These nuances make my code too dependent on TS or Solr, and switching from them to something else becomes a big pain!

Original title and link: IndexTank vs Thinking Sphinx vs WebSolr (NoSQL databases © myNoSQL)


LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr

Today LinkedIn has announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank was offering a hosted, scalable full-text search API.

The projects can already be found on GitHub: indextank-engine (the indexing engine) and indextank-service (the API, BackOffice, Storefront, and Nebulizer).

When reading the announcement, I asked myself two questions: what is IndexTank, and how does it compare to Lucene and Solr?

The answer to the first one is provided in the post.

What is IndexTank? IndexTank is mainly three things:

  • IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
  • API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
  • Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.
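To make the text/signal separation concrete, here is a minimal Python sketch of a document payload for an IndexEngine-style API. The field names (`docid`, `fields`, `variables`) are loosely modeled on IndexTank client conventions and should be treated as assumptions, not the definitive API:

```python
import json

def make_document(doc_id, text, likes=0, shares=0):
    """Build an IndexTank-style payload: the document text is indexed,
    while numeric variables carry relevance signals that can change
    independently of the text."""
    return {
        "docid": doc_id,
        "fields": {"text": text},            # full text, indexed once
        "variables": {0: likes, 1: shares},  # mutable relevance signals
    }

doc = make_document("post-42", "Full-text search with separated signals", likes=3)
payload = json.dumps(doc)  # what an HTTP client would send to the API
```

Because the variables live outside the text fields, a spike of likes or RTs only requires a variables update; the document text itself never needs re-indexing.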

For the second, I’ve reached out to the old IndexTank FAQ.

How does IndexTank compare to Lucene and Solr?

  1. IndexTank was a hosted, scalable service
  2. IndexTank adds documents to the index in real time
  3. IndexTank supports updating document variables without re-indexing
  4. IndexTank supports geolocation functions

For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.

Happy hacking!

Original title and link: LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr (NoSQL databases © myNoSQL)

Factual API Powered by Node.js and Redis

Continuing my search for non-trivial Node.js + NoSQL database applications, here’s Factual’s stack for serving their API:

Factual API Stack

Factual architectural components:

  • Varnish
  • HAProxy
  • Node.js
  • Redis
  • Solr

Why Node.js?

We chose Node because of three F’s: it’s fast, flexible, and familiar. In particular, the flexibility is  what allowed us to use our Node layer to handle things like caching logic and load balancing, in addition to the aforementioned authentication and authorization. To make our Node layer scalable, we use multiple instances of Node tied together with Redis to keep things in sync.
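The pattern described, several stateless front-ends kept in sync through Redis, is language-agnostic; here is a toy Python model (an in-memory bus standing in for Redis pub/sub, not Factual's actual code) showing how a cache invalidation published by one instance reaches all of them:

```python
from collections import defaultdict

class Bus:
    """In-memory stand-in for Redis pub/sub."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

class ApiInstance:
    """One stateless front-end process with a local cache; invalidations
    published by any instance reach every instance through the bus."""
    def __init__(self, bus):
        self.cache = {}
        bus.subscribe("invalidate", lambda key: self.cache.pop(key, None))

bus = Bus()
a, b = ApiInstance(bus), ApiInstance(bus)
a.cache["user:1"] = {"name": "Ada"}
b.cache["user:1"] = {"name": "Ada"}
bus.publish("invalidate", "user:1")  # one instance evicts, all stay in sync
```

With real Redis, each Node process would hold a subscriber connection and the publish would be a single `PUBLISH` command; the shape of the flow is the same.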

Also worth mentioning: the data served through the Factual API is always JSON, so having a server-side JavaScript engine also reduces the need to convert data between formats.

Original title and link: Factual API Powered by Node.js and Redis (NoSQL databases © myNoSQL)


Seven Java Projects That Changed the World

Over the last decade, several projects have traveled beyond mere adoption: their effects have dominated the Java world, spread into software development in general, and in some cases reached even further, into the daily lives of users.

Not sure how Edd Dumbill came up with the list[1], but it includes Solr (and implicitly Lucene) and Hadoop. I concur.

  1. The list looks good to me.  

Original title and link: Seven Java Projects That Changed the World (NoSQL databases © myNoSQL)


ThriftDB: The Amazon Web Services of Search

ThriftDB presented today at TechCrunch Disrupt:

Technically speaking, ThriftDB is a flexible key-value datastore with search built in that has the flexibility, scalability, and performance of a NoSQL datastore with the capabilities of full-text search. Essentially, what this means is that, by combining the datastore and the search engine, ThriftDB is offering a service that makes it easy for developers to build fast, horizontally-scalable applications with integrated search.

The website says ThriftDB is a document database built on top of Thrift with full-text search support. I’m not really sure about the “Amazon Web Services of Search” label, but it sounds like it would go up against MarkLogic, ElasticSearch, Solr, and so on.

Original title and link: ThriftDB: The Amazon Web Services of Search (NoSQL databases © myNoSQL)


The HBase+Solr CMS Lily Reaches 1.0

Lily, the only CMS built on top of HBase and using Solr as its search engine, has reached version 1.0.

Lily is dead serious about Scale. The Lily repository has been tested to scale beyond any common content repository technology out there, due to its inherently distributed architecture, providing economically affordable, robust, and high-performing data management services for any kind of enterprise application.

Outerthought has talked in the past about their technical choices.

Original title and link: The HBase+Solr CMS Lily Reaches 1.0 (NoSQL databases © myNoSQL)


Using Solr and Hadoop as a NoSQL database

The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS.

Using Solr and Hadoop as a NoSQL database

I think the first time I heard Solr and Lucene mentioned as NoSQL-like storage was from Grant Ingersoll and the guys.

From a NoSQL perspective:

  • there’s no fixed schema
  • there’s key-value access — hopefully that’s very fast and scalable
  • even if not standardized, there’s an advanced querying language
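The key-value access point can be made concrete: a lookup by key is just a Solr query against the uniqueKey field. A small sketch, assuming a stock Solr select handler and an `id` uniqueKey (the host and field names are assumptions; adapt them to your schema):

```python
from urllib.parse import urlencode

def solr_lookup_url(base_url, key):
    """Key-value style lookup expressed as a Solr query on the uniqueKey field."""
    params = {"q": 'id:"%s"' % key, "wt": "json", "rows": 1}
    return base_url.rstrip("/") + "/select?" + urlencode(params)

url = solr_lookup_url("http://localhost:8983/solr", "doc-7")
```

An HTTP GET on that URL returns the single matching document as JSON, which is exactly the get-by-key operation a key-value store offers.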

But as the original article points out, some characteristics are missing:

  • Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance.
  • Current support for replication, fail-over, and other attributes that you’d want in a production-grade solution aren’t yet there in SolrCloud. If this matters to you, consider Katta instead.
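The first caveat suggests an easy client-side mitigation: buffer documents and commit once per batch, so each batch produces one new Lucene segment instead of one per document. A sketch of the bookkeeping (the actual HTTP POST to Solr's /update handler is left as a comment):

```python
import json

class BatchIndexer:
    """Buffer documents and flush them with a single commit per batch."""
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []
        self.commits = 0

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        body = json.dumps(self.buffer)  # would be POSTed to /update?commit=true
        self.commits += 1               # one segment-creating commit per batch
        self.buffer = []

indexer = BatchIndexer(batch_size=2)
for i in range(5):
    indexer.add({"id": str(i)})
indexer.flush()  # 3 commits total instead of 5
```

The larger the batch, the fewer segments are created and the less merge work Lucene has to do, at the cost of a longer delay before new documents become searchable.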

Original title and link: Using Solr and Hadoop as a NoSQL database (NoSQL databases © myNoSQL)


Lucene and Solr Development Merged

Full text indexing in NoSQL databases has been addressed so far only by Riak search, the others relying on integrations with Lucene, Solr, or ElasticSearch.

With merged dev, there is now a single set of committers across both projects. Everyone in both communities can now drive releases – so when Solr releases, Lucene will also release – easing concerns about releasing Solr on a development version of Lucene. So now, Solr will always be on the latest trunk version of Lucene and code can be easily shared between projects – Lucene will likely benefit from Analyzers and QueryParsers that were only available to Solr users in the past. Lucene will also benefit from greater test coverage, as now you can make a single change in Lucene and run tests for both projects – getting immediate feedback on the change by testing an application that extensively uses the Lucene libraries. Both projects will also gain from a wider development community, as this change will foster more cross pollination between Lucene and Solr devs (now just Lucene/Solr devs).

Hopefully NoSQL databases will benefit from this merge too, by having a more solid product to rely on.

Original title and link: Lucene and Solr Development Merged (NoSQL databases © myNoSQL)


Full text search with MongoDB and Lucene analyzers

Johan Rask:

It is important to understand that for a full-fledged full text search engine, Lucene or Solr is still your choice, since they have many other powerful features. This example only includes simple text searching, not e.g. phrase searching or other types of text searches, nor does it include ranking of hits. For many occasions this is all you need, but you must be aware that write performance in particular will be worse, or much worse, depending on the size of the data you are indexing. I have not yet done any search performance tests for this, so I am currently totally unaware of the impact, but I will publish results as soon as I can.
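The underlying approach, run text through an analyzer and store the resulting terms alongside the document so an ordinary index can answer text queries, can be sketched with a trivial lowercasing tokenizer standing in for a real Lucene analyzer (in MongoDB you would put an index on the `_terms` array; a plain dict stands in for the collection here):

```python
import re

def analyze(text):
    """Stand-in for a Lucene analyzer: lowercase word tokens, no stemming."""
    return sorted(set(re.findall(r"[a-z0-9]+", text.lower())))

def store(collection, doc_id, text):
    # Store the original text plus its analyzed terms in one document.
    collection[doc_id] = {"text": text, "_terms": analyze(text)}

def search(collection, word):
    term = word.lower()
    return [doc_id for doc_id, doc in collection.items() if term in doc["_terms"]]

db = {}
store(db, 1, "Full text search with MongoDB")
store(db, 2, "Lucene analyzers explained")
```

This gives case-insensitive single-word matching only; phrase queries, ranking, and stemming are exactly the parts you would still need Lucene or Solr for, as the quote notes.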

Just a couple of thoughts:

  • Besides Lucene and Solr, ☞ ElasticSearch is another option you should keep in mind
  • your application will have to deal with maintaining the index (adding, updating, removing). MongoDB currently lacks a notification mechanism that would help you decouple this. Something à la the CouchDB _changes feed or Riak post-commit hooks (nb: leaving aside that starting with version 0.13, Riak search is available)

Original title and link: Full text search with MongoDB and Lucene analyzers (NoSQL databases © myNoSQL)


Real-Time Searching of Big Data with Solr and Hadoop

Great presentation from ☞ OpenLogic’s Rod Cope on searching BigData in real time by integrating Solr and Hadoop:

And they are definitely not the only ones using Hadoop and HBase for search. I guess this would also be a counter-example to Beyond Hadoop - Next-Generation Big Data Architectures.

Original title and link: Real-Time Searching of Big Data with Solr and Hadoop (NoSQL databases © myNoSQL)

Riak 0.13, Featuring Riak Search

I’m not sure how I managed to be the last one at the Riak 0.13 party :(. And I can tell you, it is a big party.

After writing about Riak search a couple of times already[1], I managed to miss exactly the release of Riak that includes Riak search.

Riak 0.13, ☞ announced a couple of days ago, brings quite a few new exciting features:

  • Riak search
  • MapReduce improvements
  • Bitcask storage backend improvements
  • improvements to the riak_core and riak_kv modules — the building blocks of Dynamo-like distributed systems — and better code organization allowing easier use of these modules

While everything in this release sounds like an important step forward for Riak, what sets it apart is Riak search, a feature that is currently unique in the NoSQL databases space.

Riak search uses Lucene and builds a Solr-like API on top of it (nb: I think that reusing known interfaces and protocols is most of the time the right approach).

At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.

☞ The Basho Blog

The Basho blog explains this feature extensively ☞ here and ☞ here.
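The flow those posts describe, a pre-commit hook that indexes every object written to a Search-enabled bucket and queries that return bucket/key pairs, can be modeled in a few lines. This is an in-memory toy illustrating the shape of the flow, not the Riak client API:

```python
class SearchableBucket:
    """Toy model of Riak's pre-commit-hook indexing: every put also
    updates an inverted index; search returns (bucket, key) pairs."""
    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.index = {}  # term -> set of keys

    def put(self, key, value):
        self.objects[key] = value           # the normal KV write ...
        for term in value.lower().split():  # ... plus the "hook" indexing step
            self.index.setdefault(term, set()).add(key)

    def search(self, term):
        return [(self.name, k) for k in sorted(self.index.get(term.lower(), ()))]

b = SearchableBucket("posts")
b.put("a", "Riak search is built on Lucene")
b.put("b", "Lucene powers many search engines")
```

In real Riak the returned bucket/key pairs are what you would feed into a MapReduce job or fetch individually through the client API.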

Riak Search shows a lot of great decisions made by the Basho team, as it avoids reinventing the wheel or creating some new protocols/interfaces. I’ve stressed these aspects a couple of times already, when writing that NoSQL databases should follow the Unix Philosophy and also when writing about how important NoSQL protocols are. Mathias Meyer has a ☞ post detailing why these are important.

Last, but not least, the Ruby Riak library, ripple, ☞ got updated too, though I’m not sure it supports all the new features in Riak 0.13.

Here is Rusty Klophaus (Basho) talking about Riak search at the Berlin Buzzwords NoSQL event:

  1. The first posts about Riak search (Notes on scaling out with Riak, and the Riak search podcast) date back to December 14th, 2009, just a couple of days after setting up myNoSQL.

Original title and link: Riak 0.13, Featuring Riak Search (NoSQL databases © myNoSQL)

How to build a searchable, evolvable entity store?

Given a set of requirements (prepared to scale, data models that can evolve, data that must be searchable, common access to entities), a data definition language (think Protocol Buffers[1], Thrift[2], Avro[3], JSON[4], BSON[5]), and a NoSQL database, how do you build a searchable, evolvable entity store?

Sam Pullara explains how he solved these while ☞ creating HAvroBase:

The first choice you have to make against these requirements is which data definition language are you going to use?


Whereas the data definition choice is basically commodity at this point and your choice can be somewhat arbitrary, the choice of storage technology will likely be something that has more trade-offs to consider.


When it comes to text search you really don’t get better than Lucene in open source, and the features that Solr builds on top of Lucene make it even better. I don’t think there is a reasonable argument for using something besides Solr at this point, especially with the support for sharding and replication that comes with SolrCloud.

The only remark is that the solution might also use other NoSQL databases, especially key-value stores (basically, once entities are encoded with Avro, the data becomes opaque to HBase, so its wide-column data model is not a strong requirement).
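That remark can be made concrete: once an entity is serialized, the store only ever sees opaque bytes under a key, while the searchable fields are extracted into a separate index at write time. A sketch with JSON standing in for Avro and plain dicts standing in for the key-value store and Solr:

```python
import json

def put_entity(store, index, key, entity, searchable_fields):
    """Store the entity as an opaque blob; index chosen fields separately."""
    store[key] = json.dumps(entity).encode("utf-8")  # opaque to the store
    index[key] = {f: entity[f] for f in searchable_fields if f in entity}

def get_entity(store, key):
    return json.loads(store[key].decode("utf-8"))

store, index = {}, {}
put_entity(store, index, "user:1",
           {"name": "Ada", "email": "ada@example.com", "bio": "..."},
           searchable_fields=["name"])
```

Nothing here depends on the store understanding the blob's structure, which is why any key-value store could play HBase's role in this design.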

Source code is available on ☞ GitHub.