


elasticsearch: All content tagged as elasticsearch in NoSQL databases and polyglot persistence

Storage technologies at HipChat - CouchDB, ElasticSearch, Redis, RDS

As the list below shows, HipChat’s storage stack combines several different technologies:

  • Hosting: AWS EC2 East with 75 instances, currently all Ubuntu 12.04 LTS
  • Database: CouchDB currently for chat history, transitioning to ElasticSearch; MySQL-RDS for everything else
  • Caching: Redis
  • Search: ElasticSearch
  1. This post made me wonder what led the HipChat team to use CouchDB in the first place. I’m tempted to say it was the master-master replication and the early integration with Lucene.
  2. This is only the 2nd time in quite a while that I’ve read an article mentioning CouchDB — after the February “no-releases-but-we’re-still-merging-BigCouch” report for ASF. And according to this story, CouchDB is on the way out.

Original title and link: Storage technologies at HipChat - CouchDB, ElasticSearch, Redis, RDS (NoSQL database©myNoSQL)


Full-text Search in your Database: Algolia vs Elasticsearch

Until now, Elasticsearch has been the fall-back solution for developers. Although a beautiful product for big data analysis or document search, it hasn’t been designed for object searches. Algolia has. The purpose of this blog post is to answer a question we’re frequently asked: If Algolia brings a specific answer when Elasticsearch offers a broad set of tools, how do they compare for database search?

This is the first time I’ve heard of Algolia. Unfortunately the docs page doesn’t reveal anything of Algolia’s secret sauce. So, without knowing anything about it, I’d speculate that the performance difference comes from a highly optimized storage and retrieval approach built around short n-grams.
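To make that speculation concrete: a toy sketch of what an inverted index keyed by short prefix n-grams might look like. This is purely illustrative — nothing here reflects Algolia’s actual implementation, and all names are made up:

```python
from collections import defaultdict

def prefix_ngrams(term, max_n=3):
    """Short prefixes of a term, e.g. 'redis' -> ['r', 're', 'red']."""
    return [term[:n] for n in range(1, min(max_n, len(term)) + 1)]

class NgramIndex:
    """Toy inverted index keyed by short prefixes, for instant-search-style lookups."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index every term of the document under each of its short prefixes.
        for term in text.lower().split():
            for gram in prefix_ngrams(term):
                self.postings[gram].add(doc_id)

    def search(self, query):
        # Each query term must match some indexed term by (short) prefix;
        # intersect the posting sets across query terms.
        results = None
        for term in query.lower().split():
            hits = self.postings.get(term[:3], set())
            results = hits if results is None else results & hits
        return results or set()
```

Because every lookup is a single hash probe on a precomputed short prefix, query latency stays flat as documents grow — which would explain the kind of speed difference the post describes for small-object search.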

Original title and link: Full-text Search in your Database: Algolia vs Elasticsearch (NoSQL database©myNoSQL)


Apache Solr Versus ElasticSearch - the Feature Smackdown

Pretty thorough comparison of the feature sets of Solr and ElasticSearch put together by Kelvin Tan, organized into 5 main sections (API, indexing, searching, customizability, distributed), with many, many features considered in each.

Apache Solr vs ElasticSearch

✚ The complete website source is on GitHub, so anyone who’d like to improve it can do so easily.

✚ Feature checklists should not be used for making final technical decisions. But they are extremely useful in the early stages of the decision process, when you have to go through a lot of options.

✚ I know this Solr vs ElasticSearch comparison will evolve over time, so I’ve starred the project on GitHub and also saved the current version as a PDF.

Original title and link: Apache Solr Versus ElasticSearch - the Feature Smackdown (NoSQL database©myNoSQL)


Real-Time Search With MongoDB and Elasticsearch

Interesting usage of the MongoDB oplog to compensate for the lack of storage notifications:

ElasticSearch has a built in feature of Rivers, which are essentially plugins for specific services to constantly stream in new updates for indexing. Unfortunately, there’s no MongoDB River (probably due to the lack of built-in database triggers), so I did some research and realized that I could use the MongoDB oplog to continually capture updates to our main databases.

Kristina Chodorow has two posts—here and here—detailing what’s stored in the oplog.
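The approach described in the post can be sketched in a few lines of Python, assuming pymongo and a MongoDB replica set (the oplog only exists on replica sets). The helper below translates oplog entries into generic index/delete actions; the action shape is illustrative, not the post’s actual code:

```python
def oplog_to_action(entry):
    """Translate a MongoDB oplog entry into a generic indexing action.

    Oplog entries carry the operation type in 'op': 'i' (insert),
    'u' (update), 'd' (delete). 'o' holds the document (or update spec),
    and for updates 'o2' holds the _id of the affected document.
    """
    op = entry.get("op")
    if op == "i":
        return {"action": "index", "_id": str(entry["o"]["_id"]), "doc": entry["o"]}
    if op == "u":
        return {"action": "index", "_id": str(entry["o2"]["_id"]), "doc": entry["o"]}
    if op == "d":
        return {"action": "delete", "_id": str(entry["o"]["_id"]), "doc": None}
    return None  # ignore no-ops and commands

def tail_oplog(mongo_uri="mongodb://localhost:27017"):
    """Follow local.oplog.rs with a tailable cursor (requires a replica set)."""
    import pymongo  # imported lazily so oplog_to_action stays testable on its own
    client = pymongo.MongoClient(mongo_uri)
    oplog = client.local["oplog.rs"]
    cursor = oplog.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        action = oplog_to_action(entry)
        if action is not None:
            yield action  # hand each action to an Elasticsearch bulk indexer
```

Note that for updates the oplog’s 'o' field may contain a partial `$set` spec rather than the full document, so a real indexer would re-fetch the document by `_id` before indexing it.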

Original title and link: Real-Time Search With MongoDB and Elasticsearch (NoSQL database©myNoSQL)


Scaling Solr Indexing With SolrCloud, Hadoop and Behemoth

Grant Ingersoll:

Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4 (i.e. SolrCloud whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren) which remove the need to figure out where to send documents when indexing, along with a convenient Hadoop-based document processing toolkit, created by Julien Nioche, called Behemoth that takes care of the need to write any Map/Reduce code and also handles things like extracting content from PDFs and Word files in a Hadoop friendly manner (think Apache Tika run in Map/Reduce) while also allowing you to output the results to things like Solr or Mahout, GATE and others as well as to annotate the intermediary results.

I have to agree with Karussell:

Scaling Solr means using Solr AND X AND Y AND… Scaling ElasticSearch means using ElasticSearch

Original title and link: Scaling Solr Indexing With SolrCloud, Hadoop and Behemoth (NoSQL database©myNoSQL)


Fulltext search your CouchDB in Ruby

When choosing a library for full-text indexing of CouchDB data in a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian, and decided to go with Xapian via Xapit. While Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations:

  • Xapit is still under active development
  • You need to trigger index updates manually
  • It doesn’t support incremental index updates at the moment

I know some are afraid of managing a Java stack, but in the land of indexing, Lucene, Solr, ElasticSearch, and IndexTank are the most powerful tools.

Original title and link: Fulltext search your CouchDB in Ruby (NoSQL database©myNoSQL)


Getting off the CouchDB... or Lessons Learned while Experimenting in Production

The move to CouchDB went well. Pages in our web application that would occasionally time out were now loading in a couple of seconds. And, our MySQL database was much, much happier. We liked CouchDB so much that we started planning a feature that would make heavy use of CouchDB’s schema-less nature.

And that’s when the wheels came off.

Word of caution: this is not the “CouchDB sucks so we went with MongoDB” type of post. It’s more of a “we thought CouchDB could solve one of our problems, but then got confused and thought it could solve world hunger. So we decided to throw a bunch of data at it to see if it sticks. Surprise! It didn’t.”

Just to be clear, I’m not defending CouchDB; everything John Wood writes about it is correct. It’s just that experimenting with CouchDB in a non-production environment, or at least reading myNoSQL, would have already offered all those answers.

Original title and link: Getting off the CouchDB… or Lessons Learned while Experimenting in Production (NoSQL database©myNoSQL)


Choosing Technologies: The Library of Congress and the Twitter Archive

Remember when everyone was suggesting solutions for Twitter architecture? Now the Library of Congress is trying to figure out what technologies to use to store the Twitter archive:

The project is still very much under construction, and the team is weighing a number of different open source technologies in order to build out the storage, management and querying of the Twitter archive. While the decision hasn’t been made yet on which tools to use, the library is testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop.

Note that in terms of storage only HBase is mentioned — Twitter’s own main tweet storage is MySQL, though.

Original title and link: Choosing Technologies: The Library of Congress and the Twitter Archive (NoSQL database©myNoSQL)


ThriftDB: The Amazon Web Services of Search

ThriftDB presented today at TechCrunch Disrupt:

Technically speaking, ThriftDB is a flexible key-value datastore with search built in that has the flexibility, scalability, and performance of a NoSQL datastore with the capabilities of full-text search. Essentially, what this means is that, by combining the datastore and the search engine, ThriftDB is offering a service that makes it easy for developers to build fast, horizontally-scalable applications with integrated search.

The website says ThriftDB is a document database built on top of Thrift with full-text search support. I’m not really sure about the “Amazon Web Services of Search” label, but it sounds like it would compete against MarkLogic, ElasticSearch, Solr, and so on.

Original title and link: ThriftDB: The Amazon Web Services of Search (NoSQL databases © myNoSQL)


Full text search with MongoDB and Lucene analyzers

Johan Rask:

It is important to understand that for a full-fledged full text search engine, Lucene or Solr is still your choice since it has many other powerful features. This example only includes simple text searching and not e.g. phrase searching or other types of text searches, nor does it include ranking of hits. But for many occasions this is all you need — though you must be aware that write performance in particular will be worse, or much worse, depending on the size of the data you are indexing. I have not yet done any search performance tests for this, so I am currently totally unaware of this, but I will publish results as soon as I can.

Just a couple of thoughts:

  • Besides Lucene and Solr, ☞ ElasticSearch is another option you should keep in mind
  • your application will have to deal with maintaining the index (adding, updating, removing entries). MongoDB currently lacks a notification mechanism that would help you decouple this — something à la the CouchDB _changes feed or Riak post-commit hooks (nb: leaving aside that, starting with version 0.13, Riak Search is available)
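The write-time analysis the post describes can be sketched without Lucene at all: run the text through an analyzer, store the resulting tokens in an array field, and query that field with `$all`. The analyzer below is a crude, hypothetical stand-in for a Lucene analyzer chain, and the field names are illustrative:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords and duplicates --
    a crude stand-in for a Lucene analyzer chain."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seen, out = set(), []
    for tok in tokens:
        if tok not in STOPWORDS and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out

def to_indexed_doc(doc, field="body"):
    """Attach the analyzed tokens as a '_keywords' array so MongoDB can
    match them with $in / $all queries (field names are illustrative)."""
    return dict(doc, _keywords=analyze(doc.get(field, "")))

# A query for documents containing all given terms would then be, e.g.:
# db.articles.find({"_keywords": {"$all": analyze("mongodb search")}})
```

The key point — and the source of the write-performance cost the post mentions — is that analysis happens on every insert and update, and the `_keywords` array needs a multikey index to keep the `$all` query fast.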

Original title and link: Full text search with Mongodb and Lucene analyzers (NoSQL databases © myNoSQL)


Cassandra and ElasticSearch backends for Django-nonrel in development

Django continues its path towards NoSQL:

Rob Vaterlaus has started working on a Cassandra backend and Alberto Paro is working on an ElasticSearch backend for Django-nonrel.

The Cassandra backend is still experimental and lacks support for ListField (from djangotoolbox.fields), but overall it already looks very interesting. This backend comes with experimental secondary indexes support for Cassandra and requires a recent Cassandra 0.7 build.

Currently supported: App Engine and MongoDB.

Original title and link: Cassandra and ElasticSearch backends for Django-nonrel in development (NoSQL databases © myNoSQL)


Why Redis? And Memcached, Cassandra, Lucene, ElasticSearch

Why do we keep jumping from one storage engine to another? Can’t we make up our minds already and settle with the “best” storage engine that meets our needs?

In short, No.

A common misconception is the belief that all storage engines are created equal, all designed to simply “store stuff” and provide fast access to your data. Unless your application performs one clearly defined simple task, it is a dire mistake to expect a single storage engine will effectively fulfill all of your data warehousing and processing needs.

I don’t think I need to say that I’m a proponent of polyglot persistence, and that I believe in the Unix tools philosophy. But while adding more components to your system, you should realize that the system’s complexity is “exploding” and operational costs will grow too (nb: do you remember why Twitter started looking into using Cassandra?). Not to mention that the more components your system has, the more attention and care must be invested in figuring out critical aspects like overall system availability, latency, throughput, and consistency.

Original title and link: Why Redis? And Memcached, Cassandra, Lucene, ElasticSearch (NoSQL databases © myNoSQL)