What types of applications might a graph database be well suited for?

I found this list of use cases for graph databases in a follow-up to a Neo4j webinar:

  • Social networks
  • Collaboration programs
  • Configuration Management
  • Geo-Spatial applications
  • Impact Analysis
  • Master Data Management
  • Network Management
  • Product Line Management
  • Recommendation Engines

The more generic answer would be that graph databases can be a great fit for problems handling highly connected data.

The examples above are clear use cases involving highly connected data, but as of now I’m not aware of any social networks, network management systems, or large-scale recommendation engines built on top of one of the existing graph databases.
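To make "highly connected data" a bit more concrete, here is a minimal sketch of a friend-of-a-friend suggestion, the kind of query a recommendation engine or social network runs constantly. It uses a plain in-memory adjacency map rather than any real graph database; the `FRIENDS` data and the function names are made up for illustration.

```python
from collections import deque

# Hypothetical friendship graph stored as adjacency sets (illustrative data only).
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {node: hop distance} up to max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen


def suggest_friends(graph, person):
    """People exactly two hops away: friends of friends who are not yet friends."""
    reachable = within_hops(graph, person, 2)
    return sorted(name for name, hops in reachable.items() if hops == 2)


print(suggest_friends(FRIENDS, "ann"))
```

In a relational schema the same question usually turns into a self-join per hop, which is where a store that treats relationships as first-class data may have an edge.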

Original title and link: What types of applications might a graph database be well suited for? (NoSQL database©myNoSQL)


How Web giants store big data

A not very technical Ars Technica overview of the storage engines developed and used by Google (Google File System, BigTable), Amazon (Dynamo), and Microsoft (Azure DFS), plus the Hadoop Distributed File System (HDFS).

Original title and link: How Web giants store big data (NoSQL database©myNoSQL)

via: http://arstechnica.com/business/news/2012/01/the-big-disk-drive-in-the-sky-how-the-giants-of-the-web-store-big-data.ars/1


The document is the single source of truth

Paul Hammant:

When it comes to data storage the obvious conclusion is that the backend should save something pretty close to the document that the client presents, mutates, and sends back to the server for posterity. […] Use a document store instead. When would you use a normalized DB design today? The answer to that is: only when you have other processes reading and writing to your database.

There are a few scenarios where data is always accessed in the same format and that’s where document stores excel. For the rest of the scenarios, there’ll always be a trade-off between optimizing for the most frequent access patterns vs the additional processing required to provide different perspectives on the data.
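A minimal sketch of that trade-off, assuming a plain dict as a stand-in for a document store (the order documents, field names, and helper functions are all hypothetical): reading a document back in the shape it was saved is a single lookup, while a different perspective on the same data has to walk every document.

```python
import json

# A dict stands in for a document store: key -> serialized JSON document.
store = {}


def save_order(order):
    """Persist the document essentially as the client sent it."""
    store[order["id"]] = json.dumps(order)


def load_order(order_id):
    """The frequent access pattern: one key lookup, no joins."""
    return json.loads(store[order_id])


def units_sold_by_product():
    """A different perspective: requires scanning and re-shaping every document."""
    totals = {}
    for raw in store.values():
        for line in json.loads(raw)["lines"]:
            totals[line["sku"]] = totals.get(line["sku"], 0) + line["qty"]
    return totals


save_order({"id": "o1", "customer": "ann", "lines": [{"sku": "A", "qty": 2}]})
save_order({"id": "o2", "customer": "bob",
            "lines": [{"sku": "A", "qty": 1}, {"sku": "B", "qty": 5}]})

print(load_order("o1")["customer"])
print(units_sold_by_product())
```

Real document stores soften the second case with secondary indexes or views, but the extra processing does not disappear; it just moves.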

Original title and link: The document is the single source of truth (NoSQL database©myNoSQL)

via: http://paulhammant.com/2012/02/08/document-is-the-single-source-of-truth/


Scaling Video Analytics with Cassandra by Ilya Maykov - Powered by NoSQL

To keep with last week’s model (an educational video about Cassandra, followed by a Cassandra case study), today’s video in the Cassandra NYC 2011 video series from DataStax features Ilya Maykov describing how Cassandra is used at Ooyala to compute multi-dimensional video analytics reports for 100M+ monthly unique users in near real time.


Cassandra Data Modeling Examples with Matthew F. Dennis - NoSQL videos

Continuing the Cassandra NYC 2011 video series, made available by the folks from DataStax, this week we have Matthew F. Dennis, who covers a couple of different Cassandra data modeling use cases.


Big Data Search: Perfect Search

Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:

  • offers a unique architectural approach that significantly reduces the total computations required to query
  • creates terms and pattern indexes (basically combinations of terms at indexing time)
  • uses jump tables and bloom filters
  • heavily optimizes disk I/O
  • doesn’t require indexes in memory
  • “can often do same query with less than 1% computations”
  • “when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
    • according to the chart, the significant speed improvements are for cached results, while for first time queries I see numbers from 2 to 59
    • if Perfect Search is a search engine, why compare it with relational databases?
  • “Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
    • Google is using 100 servers for reliability and guaranteeing the speed of results
  • “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
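For readers who haven’t met the bloom filters mentioned in the list above, here is a minimal sketch of the general technique. It says nothing about how Perfect Search actually implements theirs; the class, its parameters, and the SHA-256 based hashing are my own illustrative choices.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


terms = BloomFilter()
for term in ("hadoop", "cassandra", "lucene"):
    terms.add(term)

print(terms.might_contain("cassandra"))
```

The appeal for a search engine is that a small in-memory structure can rule out most terms that are definitely absent before any disk I/O happens.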

With this preamble, you can watch the video after the break:


Hadoop Versions Take 3… Can you follow it?

I’ve just read Hortonworks’s post about the improvements in Hadoop.Next, jumped up, and screamed “Super!”:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the major functionalities of the JobTracker, resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, Graph processing, Iterative processing etc.

But then my eyes stopped on this part:

We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop.Next (0.23.1) compared to the current stable hadoop-1.0 release.

With the image of the Hadoop versions in mind, I asked myself, and then Twitter, what the plan is for the Hadoop 1.0 and Hadoop 0.23 branches. Will they get unified in a next version? Will they continue in parallel? As you’d expect, I was hoping to hear something like “once we finalize the major changes we will focus on clarifying “.

What I heard instead from Arun C. Murthy [1] is that:

  • 0.23 is the next major production ready version
  • 1.0 will become the “old” deprecated version

Are you still with me?

I’m starting to wonder if this is some sort of strategy to get everyone confused. If it’s not, then I really hope someone will do something to clarify this mess.

Update: The conversation with Arun C. Murthy trying to clarify the future direction of Hadoop continued over a series of tweets. As he posted here too, the conclusion is:

Hadoop-0.23 will soon be Hadoop-Y where Y > 1. Thus Hadoop 1.0 is currently stable release, and Hadoop-Y will be next major release continuing lots of new features etc.


  1. Arun C. Murthy is Founder and Architect at Hortonworks, Hadoop PMC 

Original title and link: Hadoop Versions Take 3… Can you follow it? (NoSQL database©myNoSQL)


Oracle NoSQL Database in Review

Daniel Abadi in probably the most detailed high level review of the Oracle NoSQL database:

Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to tradeoff consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off durability for latency and availability.

The above part has also led to a very interesting exchange between Daniel and a couple of Oracle NoSQL team members about different definitions of eventual consistency.

Original title and link: Oracle NoSQL Database in Review (NoSQL database©myNoSQL)

via: http://dbmsmusings.blogspot.com/2011/10/overview-of-oracle-nosql-database.html


The Future is Polyglot Persistence

Martin Fowler and Pramod Sadalage in an infographic promoting their upcoming book (PDF):

Polyglot persistence will occur over the enterprise as different applications use different data storage technologies. It will also occur within a single application as different parts of an application’s data store have different access characteristics.

It has been over 2 years since I began evangelizing polyglot persistence. By now, most thought leaders agree it is the future. Next on my agenda is having the top relational vendors sign off too. Actually, I’m almost there: Oracle is promoting an Oracle NoSQL Database and Microsoft is offering both relational and non-relational solutions with Azure. They just need to say it.

Original title and link: The Future is Polyglot Persistence (NoSQL database©myNoSQL)


MongoDB in Review

A high-level review of MongoDB by Andrew Glover, with bullet-point pros and cons and a MongoDB scorecard.

[Images not reproduced: MongoDB Pros and Cons; MongoDB Scorecard]

I’ve spent some time trying to figure out what’s behind these scores, but I’ve had to give up.

Original title and link: MongoDB in Review (NoSQL database©myNoSQL)

via: http://www.infoworld.com/print/185922


Tropo and CouchDB: SMS Voting App in 10 Minutes

Mark Headd:

By pairing Tropo with CouchDB and a CouchApp running in IrisCouch, you can have an SMS and phone voting app running entirely in the cloud in about 10 minutes. It should actually take you longer to write up the categories for your voting app than it should to deploy this solution.

Code available on GitHub.

Original title and link: Tropo and CouchDB: SMS Voting App in 10 Minutes (NoSQL database©myNoSQL)

via: http://blog.tropo.com/2012/02/08/sms-voting-app-in-10-minutes-with-tropo-and-couchdb/


MapReduce Patterns, Algorithms, and Use Cases

Ilya Katsov’s post enumerates an extensive set of patterns and algorithms, each accompanied by use cases and pseudocode:

  • counting and summing (log analysis, data querying)
  • collating (inverted indexes, ETL)
  • filtering, parsing, validation (log analysis, data querying, ETL, data validation)
  • distributed task execution (physical and engineering simulations, numerical analysis, performance testing)
  • sorting (ETL, data analysis)
  • iterative message passing/graph processing (graph analysis, web indexing)
  • distinct values (log analysis, uniqueness)
  • cross-correlation (text analysis, market analysis)
  • relational patterns: selection, projection, union, intersection, difference, aggregation, joining

As you can see there’s a wide range of problems that can be addressed using MapReduce algorithms. The complexity of applying MapReduce techniques comes from identifying the phases that lead to both effective and efficient analysis.
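As a taste of the first pattern in the list, counting and summing, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not taken from Ilya Katsov’s post and it does not use Hadoop; the function names and the sample log lines are my own assumptions.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (term, 1) pair for every whitespace-separated term in a record."""
    for term in record.split():
        yield term.lower(), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum all the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


log_lines = ["GET /index", "GET /about", "POST /index"]
print(run_job(log_lines))
```

Most of the other patterns in the list keep this same skeleton and change only what the mapper emits as key and value and how the reducer combines the grouped values.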

Original title and link: MapReduce Patterns, Algorithms, and Use Cases (NoSQL database©myNoSQL)

via: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/