


SOLR: All content tagged as SOLR in NoSQL databases and polyglot persistence

Announcing Open Source, Interactive Search on Hadoop

In a webinar with all the big-name analysts listening, Cloudera announced Cloudera Search:

Cloudera Search brings full-text, interactive search and scalable indexing to your data in Hadoop. Cloudera Search adds to and extends the value of Apache Solr™, the enterprise standard for open source search. With Cloudera’s 100% open source Big Data platform, CDH, Cloudera Search gains the same fault tolerance, scale, visibility, and flexibility provided to other workloads, like MapReduce, Apache Hive™, and Cloudera Impala.

You know who did this first, right? DataStax. And it was over a year ago.

Original title and link: Announcing Open Source, Interactive Search on Hadoop (NoSQL database©myNoSQL)


NoSQL and Full Text Indexing: Two Trends

On one side:

  1. DataStax with Solr
  2. MapR with LucidWorks Search (nb: Solr)

and on the other side:

  1. Riak Search: Solr-like but custom proprietary implementation
  2. MongoDB text search: custom proprietary implementation

I’m not going to argue about the pros and cons of each of these approaches, but I’m sure you already know which of these approaches I’m in favor of.


Apache Solr Versus ElasticSearch - the Feature Smackdown

A pretty thorough comparison of the feature sets of Solr and ElasticSearch, put together by Kelvin Tan. It has five main sections (API, indexing, searching, customizability, distributed), with many features considered in each of them.

Apache Solr vs ElasticSearch

✚ The complete website source is on GitHub so if one would like to improve it, it’s easy.

✚ Feature checklists should not be used to make final technical decisions. But they are extremely useful in the early stages of the decision process, when you have to go through a lot of options.

✚ I know this Solr vs ElasticSearch comparison will evolve over time, so I’ve starred the project on GitHub and also saved the current version as a PDF.



Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr

It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and that the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:

Architecture of Big Data at Aadhaar

Big Data at Aadhaar Data Stores

The slide deck presenting architecture principles and numbers about the platform after the break.

Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0

The tl;dr version is: DataStax has announced

Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0

For the longer version, there are a couple of new things worth emphasizing in this release:

  1. Fully integrated enterprise search
  2. RDBMS data migration
  3. Snap-in application log ingestion
  4. Improvements to OpsCenter
  5. Elastic workload provisioning

Let’s take these one by one:

Fully integrated enterprise search or Solr on top of Cassandra

Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people have learned how to scale Solr, but current approaches are far from simple and don’t offer an out-of-the-box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.

Actually this has already been done in the form of Solandra (nb: the Solr integration in DataStax Enterprise 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.

DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?

There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I have yet to see a SQL-like language translated that way.

As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers so there’ll be a follow up with more details.
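To make the partitioning concern concrete, here is a minimal in-memory sketch contrasting term-based with document-based index partitioning for a multi-term AND query. It is a toy model, not how Riak Search, Solr, or DataStax Enterprise are implemented; the corpus, the function names, and the "postings shipped" counter are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

# Toy corpus; document ids and texts are made up for illustration.
DOCS = {
    1: "riak search solr",
    2: "cassandra solr search",
    3: "solr lucene",
    4: "riak cassandra",
}

def node_for_term(term, num_nodes):
    # Deterministic hash so a given term always maps to the same node.
    return zlib.crc32(term.encode("utf-8")) % num_nodes

def build_term_partitioned(docs, num_nodes):
    # Term-based: the full posting list of a term lives on a single node.
    nodes = [defaultdict(set) for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        for term in text.lower().split():
            nodes[node_for_term(term, num_nodes)][term].add(doc_id)
    return nodes

def build_doc_partitioned(docs, num_nodes):
    # Document-based: each node indexes every term of its own documents.
    nodes = [defaultdict(set) for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        node = nodes[doc_id % num_nodes]
        for term in text.lower().split():
            node[term].add(doc_id)
    return nodes

def search_term_partitioned(nodes, terms):
    # Whole posting lists travel to a coordinator, which intersects them.
    postings, contacted, shipped = [], set(), 0
    for term in terms:
        n = node_for_term(term, len(nodes))
        contacted.add(n)
        plist = nodes[n].get(term, set())
        shipped += len(plist)
        postings.append(plist)
    hits = set.intersection(*postings) if postings else set()
    return hits, len(contacted), shipped

def search_doc_partitioned(nodes, terms):
    # Every node is asked, but each intersects locally and ships only matches.
    hits, shipped = set(), 0
    for node in nodes:
        local = [node.get(t, set()) for t in terms]
        local_hits = set.intersection(*local) if local else set()
        shipped += len(local_hits)
        hits |= local_hits
    return hits, len(nodes), shipped

term_nodes = build_term_partitioned(DOCS, 3)
doc_nodes = build_doc_partitioned(DOCS, 3)
print(search_term_partitioned(term_nodes, ["solr", "search"]))
print(search_doc_partitioned(doc_nodes, ["solr", "search"]))
```

In this toy model both layouts return the same documents, but the term-based one moves entire posting lists to the coordinator before intersecting them, which is the kind of cost that grows with popular terms in multi-term queries.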

RDBMS data migration: it must be Sqoop

Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop to the rescue.

Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j

When I read this bullet point my first thought was that this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand, there’s a Cassandra plugin for Flume.

OpsCenter Enterprise 2.0

The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:

  • multi-cluster monitoring
  • visual backup
  • search monitoring

Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.

DataStax OpsCenter Enterprise:

DataStax OpsCenter

Riak Control:

Riak Control

Elastic workload provisioning

I’ve left for the end the feature that interested me most: elastic workload provisioning.

To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).

Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first step towards workload management in the NoSQL space.


Scaling Solr Indexing With SolrCloud, Hadoop and Behemoth

Grant Ingersoll:

Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4 (i.e. SolrCloud whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren) which remove the need to figure out where to send documents when indexing, along with a convenient Hadoop-based document processing toolkit, created by Julien Nioche, called Behemoth that takes care of the need to write any Map/Reduce code and also handles things like extracting content from PDFs and Word files in a Hadoop friendly manner (think Apache Tika run in Map/Reduce) while also allowing you to output the results to things like Solr or Mahout, GATE and others as well as to annotate the intermediary results.

I have to agree with Karussell:

Scaling Solr means using Solr AND X AND Y AND… Scaling ElasticSearch means using ElasticSearch



Fulltext search your CouchDB in Ruby

When choosing a library for full-text indexing of CouchDB data in a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian, and decided to go with Xapian with Xapit. Although Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations:

  • Xapit is still under active development
  • You need to trigger index updates manually
  • It doesn’t support incremental index updates at the moment

I know some are afraid of managing a Java stack, but in the land of indexing, Lucene, Solr, ElasticSearch, and IndexTank are the most powerful tools.



Getting off the CouchDB... or Lessons Learned while Experimenting in Production

The move to CouchDB went well. Pages in our web application that would occasionally time out were now loading in a couple of seconds. And, our MySQL database was much, much happier. We liked CouchDB so much that we started planning a feature that would make heavy use of CouchDB’s schema-less nature.

And that’s when the wheels came off.

Word of caution: this is not the “CouchDB sucks so we went with MongoDB” type of post. It’s more of “we thought CouchDB can solve one of our problems, but then got confused and thought it can solve world hunger. So we decided to throw a bunch of data to it to see if it sticks. Surprise! It didn’t.”

Just to be clear, I’m not defending CouchDB and everything John Wood writes about it is correct. It’s just that experimenting with CouchDB in a non-production environment or at least reading myNoSQL would have already offered all those answers.



Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud

Just a quick roundup of the latest releases and announcements.

Hortonworks Data Platform (HDP) version 2

HDP v2 will include:

  • NextGen MapReduce architecture
  • HDFS NameNode HA
  • HDFS Federation
  • up-to-date HCatalog, HBase, Hive, Pig

According to the announcement:

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.

SolrCloud Completes Phase 2

Mark Miller about the completion of phase 2:

The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.

Not there yet, but it’s coming.

DataStax Community Server 1.0.7

A new release of DataStax’s distribution of Cassandra incorporating Cassandra 1.0.7

HBase 0.92

Don’t let the version number trick you. This is an important release for HBase featuring:

  • coprocessors
  • security
  • new (self-migrating) file format
  • AWS improvements: EBS support, building a HA cluster

The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is in my opinion HBase coprocessors (Jira entry HBASE-200).

I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:

Solr Index Replication at Etsy: From HTTP to BitTorrent

Etsy went from using HTTP to BitTorrent for replicating Solr indexes:

By integrating BitTorrent protocol into Solr we could replace HTTP replication. BitTorrent supports updating and continuation of downloads, which works well for incremental index updates. When we use BitTorrent for replication, all of the slave servers seed index files allowing us to bring up new slaves (or update stale slaves) very quickly.


Our Ops team started experimenting with a BitTorrent package herd, which sits on top of BitTornado. Using herd they transferred our largest search index in 15 minutes. They spent 8 hours tweaking all the variables and making the transfer faster and faster. Using pigz for compression and herd for transfer, they cut the replication time for the biggest index from 60 minutes to just 6 minutes!

Make sure you don’t miss the part where they were experimenting with multicast UDP rsync.



Hadoop and Solr for Archiving Emails

Sunil Sitaula posted two articles on Cloudera’s blog about archiving emails on Hadoop: part 1 and part 2. But even after reading the posts twice I couldn’t find a clear answer to the question: why would one do it this way?

Sunil provides a general explanation, but the two articles fail to present the real advantages of using Hadoop for solving this problem.

Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.

Let’s look again at the problem and see what the requirements are:

  • store a large and continuously growing amount of messages
  • retrieve messages either directly (key-based access) or by searches (full text indexing)

HDFS, the underlying storage of Hadoop, would bring to the table a reliable, scalable, and cost-effective storage solution. But using HDFS would also require a custom ETL process. Transforming email messages into something that can be stored in HDFS is described in the first post:

If you are dealing with millions of files, one way of sharding (partitioning) them would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This will limit the number of message files you need to put into HDFS to something that is more suitable, 1-2 million at a time given the NameNode memory footprint of each file.
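The grouping step described in this quote could look roughly like the sketch below. It only assigns messages to per-period buckets; writing each bucket out as a sequence file in HDFS is left out. The `(message_id, sent_at, body)` tuple layout and the function names are assumptions for illustration and do not come from the Cloudera posts.

```python
from collections import defaultdict
from datetime import datetime

def bucket_key(sent_at, granularity):
    # Map a timestamp to the name of the bucket (one future sequence file).
    if granularity == "day":
        return sent_at.strftime("%Y-%m-%d")
    if granularity == "week":
        iso_year, iso_week, _ = sent_at.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if granularity == "month":
        return sent_at.strftime("%Y-%m")
    raise ValueError(f"unsupported granularity: {granularity}")

def group_messages(messages, granularity="month"):
    # messages: iterable of (message_id, sent_at, body) tuples (assumed layout).
    buckets = defaultdict(list)
    for message_id, sent_at, body in messages:
        buckets[bucket_key(sent_at, granularity)].append((message_id, body))
    return dict(buckets)

# Hypothetical sample data.
MESSAGES = [
    ("m1", datetime(2012, 1, 3), "quarterly report"),
    ("m2", datetime(2012, 1, 20), "lunch plans"),
    ("m3", datetime(2012, 2, 1), "report draft"),
]

for key, items in sorted(group_messages(MESSAGES, "month").items()):
    print(key, [message_id for message_id, _ in items])
```

Choosing a coarser granularity means fewer, larger files, which is the point of the quote: the NameNode keeps metadata for every file in memory, so the file count matters more than the total data size.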

Nonetheless a completely different system would be needed for providing access to the stored messages. The second post introduces Lucene and Solr for dealing with message retrieval, but setting them up to take advantage of the same infrastructure can get complicated:

Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.
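The "write a new index, then merge" option can be illustrated with a toy in-memory inverted index. This is only a sketch of the idea: it does not use the Lucene or HDFS APIs, and real Lucene merges operate on on-disk index segments rather than Python dictionaries. All names and sample documents are hypothetical.

```python
def build_index(docs):
    # docs: mapping of doc_id -> text; returns term -> set of doc ids.
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def merge_indexes(existing, new):
    # Build a merged copy so the existing index stays untouched and
    # searchable until the merged one is ready to replace it.
    merged = {term: set(ids) for term, ids in existing.items()}
    for term, ids in new.items():
        merged.setdefault(term, set()).update(ids)
    return merged

# The existing index covers older mail; the new batch is indexed separately.
existing_index = build_index({1: "quarterly report", 2: "lunch plans"})
new_batch_index = build_index({3: "report draft"})
merged_index = merge_indexes(existing_index, new_batch_index)
print(sorted(merged_index["report"]))
```

The design point is that the new batch never mutates the existing index in place, which mirrors writing to a new directory first and merging afterwards.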

Bottom line: the articles suggest using two almost separate tools to solve the initial problem. And that makes me think that a better solution must exist.


Lucene & Solr Year 2011 in Review

I much prefer reviews to predictions. Even more so when there are as many noteworthy things to mention as what Lucene and Solr accomplished in 2011:

  • Near Real-Time search (freshly added documents can be immediately made visible in search results)
  • Field collapsing or result grouping
  • faceting module
  • language support

Plus the promise of the SolrCloud:

In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  

On the other hand, last December LinkedIn open sourced IndexTank, a real-time full-text search-and-indexing system. Some of its features will definitely sound interesting to Lucene and Solr users.
