Monday, 13 February 2012
What types of applications might a graph database be well suited for?
Found this list of use cases for graph databases in a follow-up to a Neo4j webinar:
- Social networks
- Collaboration programs
- Configuration Management
- Geo-Spatial applications
- Impact Analysis
- Master Data Management
- Network Management
- Product Line Management
- Recommendation Engines
The more generic answer would be that graph databases can be a great fit for problems handling highly connected data.
The examples above are clear use cases involving highly connected data, but as of now I’m not aware of any social networks, network management systems, or large-scale recommendation engines built on top of one of the existing graph databases.
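To make "highly connected data" concrete, here is a minimal sketch of a friends-of-friends recommendation, one of the use cases in the list above. It uses a plain in-memory adjacency map rather than any real graph database API; the `FOLLOWS` data and the `recommend` function are hypothetical names made up for illustration.

```python
from collections import Counter

# Hypothetical toy social graph: each person maps to the set of people they follow.
FOLLOWS = {
    "ann": {"bob", "cid"},
    "bob": {"cid", "dee"},
    "cid": {"dee", "eve"},
    "dee": {"eve"},
    "eve": set(),
}


def recommend(graph, person, limit=3):
    """Suggest people two hops away, ranked by the number of connecting paths."""
    direct = graph.get(person, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone they already follow.
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Sort by descending path count, then by name for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


print(recommend(FOLLOWS, "ann"))
```

The query is nothing but hops along relationships, which is the access pattern graph databases are built around; in a relational schema the same question would typically need a self-join per hop.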
Original title and link: What types of applications might a graph database be well suited for? (©myNoSQL)
How Web giants store big data
A not very technical Ars Technica overview of the storage engines developed and used by Google (Google File System, BigTable), Amazon (Dynamo), and Microsoft (Azure DFS), plus the Hadoop Distributed File System (HDFS).
Original title and link: How Web giants store big data (©myNoSQL)
The document is the single source of truth
Paul Hammant:
When it comes to data storage the obvious conclusion is that the backend should save something pretty close to the document that the client presents, mutates, and sends back to the server for posterity. […] Use a document store instead. When would you use a normalized DB design today? The answer to that is: only when you have other processes reading and writing to your database.
There are a few scenarios where data is always accessed in the same format and that’s where document stores excel. For the rest of the scenarios, there’ll always be a trade-off between optimizing for the most frequent access patterns vs the additional processing required to provide different perspectives on the data.
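The trade-off can be sketched in a few lines. This is only an illustration under stated assumptions: the `store` dict stands in for a document store, and the order documents and function names are hypothetical.

```python
import json

# Hypothetical in-memory stand-in for a document store: key -> serialized JSON.
store = {}


def save_order(order):
    """Persist the document in the same shape the client sent it."""
    store[order["id"]] = json.dumps(order)


def load_order(order_id):
    """The frequent access path: one key lookup returns the whole document."""
    return json.loads(store[order_id])


def units_sold_per_product():
    """A different perspective on the data: every document must be scanned."""
    totals = {}
    for raw in store.values():
        for line in json.loads(raw)["lines"]:
            totals[line["product"]] = totals.get(line["product"], 0) + line["qty"]
    return totals


save_order({"id": "o1", "customer": "ann", "lines": [{"product": "pen", "qty": 2}]})
save_order({
    "id": "o2",
    "customer": "bob",
    "lines": [{"product": "pen", "qty": 1}, {"product": "ink", "qty": 4}],
})

print(load_order("o1")["customer"])
print(units_sold_per_product())
```

Reading an order back in the shape it was written is a single lookup, while the per-product view has to walk every stored document, which is the additional processing mentioned above.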
Original title and link: The document is the single source of truth (©myNoSQL)
via: http://paulhammant.com/2012/02/08/document-is-the-single-source-of-truth/
Sunday, 12 February 2012
Scaling Video Analytics with Cassandra by Ilya Maykov - Powered by NoSQL
To keep with last week’s model (an educational video about Cassandra, followed by a Cassandra case study), today’s video in the Cassandra NYC 2011 video series from DataStax has Ilya Maykov describing how Cassandra is used at Ooyala for computing multi-dimensional video analytics reports for 100M+ monthly unique users in near-real-time.
Saturday, 11 February 2012
Cassandra Data Modeling Examples with Matthew F. Dennis - NoSQL videos
Continuing the Cassandra NYC 2011 video series, made available by the folks from DataStax, this week we have Matthew F. Dennis, who covers a couple of different Cassandra data modeling use cases.
Big Data Search: Perfect Search
Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:
- offers a unique architectural approach that significantly reduces the total computations required to query
- creates terms and pattern indexes (basically combinations of terms at indexing time)
- uses jump tables and bloom filters
- heavily optimizes disk I/O
- doesn’t require indexes in memory
- “can often do same query with less than 1% computations”
- “when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
  - according to the chart, the significant speed improvements are for cached results, while for first-time queries I see numbers from 2 to 59
  - if Perfect Search is a search engine, why compare it with relational databases?
- “Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
  - Google is using 100 servers for reliability and to guarantee the speed of results
- “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
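For readers unfamiliar with the Bloom filters mentioned in the list above, here is a minimal generic sketch. It is not Perfect Search’s implementation, which the talk does not detail; the class name, the bit-array size, and the number of hash functions are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


index_terms = BloomFilter()
for term in ("cassandra", "hadoop", "mongodb"):
    index_terms.add(term)

print(index_terms.might_contain("hadoop"))  # True: added items are always found
```

A search engine can consult such a filter before touching disk: a negative answer is certain, so the expensive lookup is skipped, which is one plausible way a structure like this helps reduce disk I/O.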
With this preamble, you can watch the video after the break:
Friday, 10 February 2012
Hadoop Versions Take 3… Can you follow it?
I’ve just read Hortonworks’s post about the improvements in Hadoop.Next, jumped up, and screamed “Super!”:
- Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
- NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the major functionalities of the JobTracker, resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, Graph processing, Iterative processing etc.
But then my eyes stopped on this part:
We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop.Next (0.23.1) compared to the current stable hadoop-1.0 release.
With the image of the Hadoop versions in mind, I asked myself, and then Twitter, what the plan is for the Hadoop 1.0 and Hadoop 0.23 branches. Will they get unified in a next version? Will they continue in parallel? As you’d expect, I was hoping to hear something like “once we finalize the major changes we will focus on clarifying “.
What I heard instead from Arun C. Murthy [1] is that:
- 0.23 is the next major production ready version
- 1.0 will become the “old” deprecated version
Are you still with me?
I’m starting to wonder if this is some sort of strategy to get everyone confused. If it’s not, then I really hope someone will do something to clarify this mess.
Update: The conversation with Arun C. Murthy trying to clarify the future direction of Hadoop continued over a series of tweets. As he posted here too, the conclusion is:
Hadoop-0.23 will soon be Hadoop-Y where Y > 1. Thus Hadoop 1.0 is currently stable release, and Hadoop-Y will be next major release continuing lots of new features etc.
[1] Arun C. Murthy is Founder and Architect at Hortonworks, Hadoop PMC
Original title and link: Hadoop Versions Take 3… Can you follow it? (©myNoSQL)
Oracle NoSQL Database in Review
Daniel Abadi, in probably the most detailed high-level review of the Oracle NoSQL Database:
Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to tradeoff consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off durability for latency and availability.
The above part has also led to a very interesting exchange between Daniel and a couple of Oracle NoSQL team members about different definitions of eventual consistency.
Original title and link: Oracle NoSQL Database in Review (©myNoSQL)
via: http://dbmsmusings.blogspot.com/2011/10/overview-of-oracle-nosql-database.html
Thursday, 9 February 2012
The Future is Polyglot Persistence
Martin Fowler and Pramod Sadalage in an infographic promoting their upcoming book (PDF):
Polyglot persistence will occur over the enterprise as different applications use different data storage technologies. It will also occur within a single application as different parts of an application’s data store have different access characteristics.
It has been over two years since I began evangelizing polyglot persistence. By now, most thought leaders agree it is the future. Next on my agenda is having the top relational vendors sign off too. Actually, I’m almost there: Oracle is promoting an Oracle NoSQL Database and Microsoft is offering both relational and non-relational solutions with Azure. They just need to say it.
Original title and link: The Future is Polyglot Persistence (©myNoSQL)
MongoDB in Review
A high-level review of MongoDB by Andrew Glover, with a bullet-point list of pros and cons and a MongoDB scorecard.
I’ve spent some time trying to figure out what’s behind these scores, but I’ve had to give up.
Original title and link: MongoDB in Review (©myNoSQL)
Tropo and CouchDB: SMS Voting App in 10 Minutes
Mark Headd:
By pairing Tropo with CouchDB and a CouchApp running in IrisCouch, you can have an SMS and phone voting app running entirely in the cloud in about 10 minutes. It should actually take you longer to write up the categories for your voting app than it should to deploy this solution.
Code available on GitHub.
Original title and link: Tropo and CouchDB: SMS Voting App in 10 Minutes (©myNoSQL)
via: http://blog.tropo.com/2012/02/08/sms-voting-app-in-10-minutes-with-tropo-and-couchdb/
MapReduce Patterns, Algorithms, and Use Cases
Ilya Katsov’s post enumerates an extensive set of patterns and algorithms, each accompanied by use cases and pseudocode:
- counting and summing (log analysis, data querying)
- collating (inverted indexes, ETL)
- filtering, parsing, validation (log analysis, data querying, ETL, data validation)
- distributed task execution (physical and engineering simulations, numerical analysis, performance testing)
- sorting (ETL, data analysis)
- iterative message passing/graph processing (graph analysis, web indexing)
- distinct values (log analysis, uniqueness)
- cross-correlation (text analysis, market analysis)
- relational patterns: selection, projection, union, intersection, difference, aggregation, joining
As you can see, there’s a wide range of problems that can be addressed using MapReduce algorithms. The complexity of applying MapReduce techniques comes from identifying the phases that lead to both effective and efficient analysis.
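As a rough illustration of the first two patterns in the list (counting and summing, and collating into an inverted index), here is a single-process sketch. It is not taken from Ilya Katsov’s post and it is not a Hadoop job; `map_reduce`, the mapper and reducer functions, and the `DOCS` input are hypothetical names used only to show how the map, shuffle, and reduce phases fit together.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce phases in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (document id, text) pairs.
DOCS = [
    (1, "graph databases store connected data"),
    (2, "document databases store documents"),
]


def count_mapper(record):
    # Counting and summing: emit (word, 1) for every occurrence.
    _, text = record
    for word in text.split():
        yield word, 1


def count_reducer(_key, values):
    return sum(values)


def index_mapper(record):
    # Collating: emit (word, document id) once per distinct word in a document.
    doc_id, text = record
    for word in set(text.split()):
        yield word, doc_id


def index_reducer(_key, values):
    return sorted(values)


word_counts = map_reduce(DOCS, count_mapper, count_reducer)
inverted_index = map_reduce(DOCS, index_mapper, index_reducer)

print(word_counts["databases"], word_counts["store"])
print(inverted_index["store"], inverted_index["graph"])
```

Both jobs reuse the same driver and differ only in what the mapper emits and how the reducer folds the grouped values, which is where the phase-identification work mentioned above goes.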
Original title and link: MapReduce Patterns, Algorithms, and Use Cases (©myNoSQL)
via: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/