MapR product strategy

Maria Deutscher (SiliconAngle) quoting MapR CMO Jack Norris:

The MapR strategy centers on what chief marketing officer Jack Norris described in an interview as a “proven business model of really focusing on a product, selling a product, making a product enterprise grade, utilizing the innovations of the community but providing some [additional] advantages so customers can be even more successful.”

I thought that part of a proven business model is innovating on the product, and less so utilizing the innovations of the community. Or at least finding some ways to pay back for those community innovations.

Original title and link: MapR product strategy (NoSQL database©myNoSQL)


Quick guide to CRDTs in Riak 2.0

Joel Jacobson provides a quick intro to using the new CRDT counters, sets, and maps in the Riak 2.0 preview:

Riak Data Types (also referred to as CRDTs) adds counters, sets, and maps to Riak – allowing for better conflict resolution. They enable developers to spend less time thinking about the complexities of vector clocks and sibling resolution and, instead, focusing on using familiar, distributed data types to support their applications’ data access patterns.

✚ An extra point for everyone recognizing the data sample used in the post.
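To make the idea of conflict-free merging concrete, here is a minimal sketch of a grow-only counter (a G-Counter), one of the simplest CRDTs. This is not Riak's client API or its internal implementation; the `GCounter` class and its method names are hypothetical and exist only to illustrate why replicas can converge without manual sibling resolution.

```python
class GCounter:
    """Grow-only counter CRDT (illustrative sketch, not Riak's API).

    Each node increments only its own slot. Replicas merge by taking
    the per-node maximum, so merges are commutative and idempotent.
    """

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> increments seen from that node

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        # The counter's value is the sum of all per-node slots.
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: safe to apply in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas accept writes independently, then exchange state.
a = GCounter("node-a")
b = GCounter("node-b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
```

Because the merge is a per-node maximum, both replicas end up with the same value no matter the order in which they exchange state, which is the property that removes the need for application-level conflict handling.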

Original title and link: Quick guide to CRDTs in Riak 2.0 (NoSQL database©myNoSQL)


Stranger in a strange land: HPC and Big Data

Paul Mineiro sharing his notes and thoughts after attending an HPC event:

My plan was to observe the HPC community, try to get a feel how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They’re aware of what we’re up to, but they talk about Hadoop like it’s some giant livin’ in the hillside, comin down to visit the townspeople. Listening to them mapping what we’re up to into their conceptual landscape was very enlightening, and helped me understand them better.

No more ivory towers.

Original title and link: Stranger in a Strange Land: HPC and Big Data (NoSQL database©myNoSQL)


From IBM to… IBM: The short, but complicated history of CouchDB, Cloudant, and a lot of other companies and projects

Damien Katz created CouchDB after working at IBM on Lotus Notes: CouchDB and Me. CouchDB went the Apache way. Then things got complicated…

On the West Coast, Damien Katz and a team of committers created Couchio, later renamed CouchOne, which later merged with Membase to become Couchbase, which finally dropped CouchDB. Damien Katz left Couchbase.

A confusing history with a very complicated genealogy of projects (don’t worry, this goes on) and companies. And this was only the West Coast.

On the East Coast, Cloudant took CouchDB and made it BigCouch. I thought that Cloudant would be the CouchDB company, and in a way it was. Cloudant put BigCouch on the cloud as a service and on GitHub as open source. BigCouch is supposed to be merged back into Apache CouchDB, but many months later this hasn’t materialized yet.

To complete the circle, today IBM announced that it has signed an agreement to acquire Cloudant (news coverage on GigaOm, BostInno, TechCrunch). This probably makes sense considering Cloudant’s relationship with SoftLayer and IBM’s $1 billion Platform-as-a-Service investment, but less so if you consider the IBM and MongoDB (formerly 10gen) collaboration.

Anyways, the future of Apache CouchDB is bright. Yep.

Original title and link: From IBM to… IBM: The short, but complicated history of CouchDB, Cloudant, and a lot of other companies and projects (NoSQL database©myNoSQL)

How SQL-on-JSON analytics bolstered a business

Alex Woodie (Datanami) reporting on BitYota, a SQL-based data warehouse on top of JSON:

BitYota says it designed its own hosted data warehouse from scratch, and that it’s differentiated by having a JSON access layer atop the data store. “We have some uniqueness where we operate SQL directly on JSON,” says BitYota CEO Dev Patel. “We don’t need to translate that data into a structured format like a CSV. We believe that if you transform the data, you will lose some of the data quality. And once that’s transformed, you won’t get it back.”

✚ BitYota’s tagline is Analytics for mongoDB, so I assume it’s safe to say the backend is mongoDB and they are building a SQL layer on top of it. Which SQL flavor they support and how they handle SQL’s quirks would be a very interesting story.

✚ This relates to my earlier post, Do all roads lead back to SQL?

Original title and link: How SQL-on-JSON analytics bolstered a business (NoSQL database©myNoSQL)


Do all roads lead back to SQL? Some might and some might not

Seth Proctor for Dr.Dobb’s:

Increasingly, NewSQL systems are showing scale, schema flexibility, and ease of use. Interestingly, many NoSQL and analytic systems are now putting limited transactional support or richer query languages into their roadmaps in a move to fill in the gaps around ACID and declarative programming. What that means for the evolution of these systems is yet to be seen, but clearly, the appeal of Codd’s model is as strong as ever 43 years later.

Spend a bit of time reading (really reading) the above paragraph—there are quite a few different concepts put together to make the point of the article.

SQL is indeed getting closer to the NoSQL databases, but mostly to Hadoop. I still stand by my thoughts in The premature return to SQL.

Most NoSQL databases already offer some limited ACID guarantees. And some flavors of transactions are supported or are being added. But only as long as the core principles can still be guaranteed or the trade-offs are made obvious and offered as clear choices to application developers.

The relational model stays with the relational databases. If some of its principles can be applied (e.g. data type integrity, optional schema enforcement), I see nothing wrong with supporting them. Good technical solutions know both what is needed and what is possible.

Original title and link: Do All Roads Lead Back to SQL? | Dr Dobb’s (NoSQL database©myNoSQL)


When should I use Greenplum Database versus HAWQ?

Jon Roberts on the use cases for Greenplum and HAWQ, both technologies offered by Pivotal:

Greenplum is a robust MPP database that works very well for Data Marts and Enterprise Data Warehouses that tackles historical Business Intelligence reporting as well as predictive analytical use cases. HAWQ provides the most robust SQL interface for Hadoop and can tackle data exploration and transformation in HDFS.

The first questions that popped into my mind:

  1. why isn’t HAWQ good for reporting?
  2. why isn’t HAWQ good for predictive analytics?

I don’t have a good answer for either of these. For the first, I assume that the implied answer is Hadoop’s latency. On the other hand, what I know is that Microsoft and Hortonworks are trying to bring Hadoop data into Excel with HDInsight. This is not traditional reporting, but if that’s acceptable from a latency point of view, I’m not sure why it wouldn’t work for reporting too.

For the second question, Hadoop and the tools built around it are well known for predictive analytics. So maybe this separation is due only to HAWQ. Another explanation could be product positioning.

This last part seems to be confirmed by the rest of the post, which makes the point that data stored in HDFS is temporary and, once processed with HAWQ, is moved into Greenplum.

Greenplum and HAWQ

In other words, HAWQ is just for ETL/ELT on Hadoop.

✚ I’m pretty sure that many traditional data warehouse companies that are forced to come up with coherent proposals for architectures based on their core products and Hadoop are facing the same product positioning problem: it’s difficult to admit in front of customers that Hadoop might be capable of replacing core functionality of the products you are selling.

What is the best answer to this positioning dilemma?

  1. Find a spot for Hadoop that is not hurting your core products. Let’s say ETL.
  2. Propose an architecture where your core products and Hadoop are fully complementing and interacting with each other.

You already know my answer.

Original title and link: When should I use Greenplum Database versus HAWQ? (NoSQL database©myNoSQL)


What can Big Data mean to healthcare?

Eric Baldeschwieler:

If you had such a system, learning and hypothesis testing in medicine would accelerate tremendously! Some examples: You could do quality of care of outcome metrics on all procedures over time and ID best practices; You could quickly look for interesting correlations between infected patients to ID causes of medical conditions or at least correlations; You could datamine all reported illnesses against all unusual gene sequences; You could correlate activity, demographics, diet with outcomes and ID best practices; Etc, Etc, Etc. You would see an amazing acceleration of learning, quality of care, patient-specific therapies, …

Unfortunately, on the receiving end of all this extremely private data are entities that, at least in some parts of the world, don’t have the best reputation and are very focused on their own profit.

Original title and link: What can Big Data mean to healthcare? (NoSQL database©myNoSQL)


Pig cheat sheet

Cheat sheet? Check. Pig? Check. Where do I get it?


Some of my favorite data visualization resources

Pretty much everything that contains the words visualization and data in the title gets my attention. Even more so if it promises a list of resources that could help me learn a bit of the art of visualization. Aaron Cordova’s list contains books, sites, and tools:

Visualization is more art than science at this point, although some have used it enough to be able to identify successful techniques for various purposes. Most successful visualizations are perhaps more dependent upon the decision or task at hand than the actual original structure of the data.

Original title and link: Some of my favorite data visualization resources (NoSQL database©myNoSQL)


The birth and road ahead of TokuMX, the alternative MongoDB engine

While not a MongoDB user (or expert), I find Tokutek’s work on TokuMX, their alternative engine for MongoDB, quite interesting from both a technical and a business point of view. On the technical side: what is currently broken in MongoDB? On the business side: is the InnoDB model possible in the NoSQL space? What are some possible outcomes of the “alternative core technology for free products” business model? Would a new product that adds MongoDB’s missing features while keeping MongoDB’s “friendliness” and product marketing still lead to a successful product?

Zardosht Kasheff’s post about the history of TokuMX and how the decision was made to pursue this direction sheds some light on both these areas.

But really, the BIGGEST benefit to this approach was the following: we could innovate on more of the MongoDB core server stack in ways the other approaches would not allow. Prior to TokuMX 1.4, such innovations include (but are not limited to):

  • Document level locking
  • Multi-statement transactions (on non-sharded clusters)
  • MVCC snapshot query semantics
  • Clustering indexes (although, to be fair, this was possible in other approaches)
  • Dramatically reduced I/O utilization on secondaries (which we will elaborate on in a future post)
  • Fast bulk loading
  • Enterprise hot backup

For these reasons, we chose this option, and after some hard work, TokuMX was born.

Original title and link: The birth and road ahead of TokuMX, the alternative MongoDB engine (NoSQL database©myNoSQL)


The evolution of scalable NoSQL at Viber [sponsor]

The story of Viber’s NoSQL expedition that took them from MongoDB to Couchbase Server, told in this sponsored post by Couchbase:

As one of the fastest growing VoIP services in the world, the challenge at Viber has been to build and maintain a scalable architecture that is capable of sustaining exponential growth. The first-generation architecture was built on top of a custom, in-memory database. However, within months, the database could no longer keep up with the growth the company was experiencing.

Viber DB architecture - 1st generation

The second-generation architecture was built on top of MongoDB shards. Next, Redis was added as a cache on top of the MongoDB shards to increase throughput. Still, MongoDB was unable to meet the high throughput requirements. Finally, a second Redis cluster was added independent of the MongoDB shards. The second-generation architecture was comprised of 150 MongoDB nodes and over 100 Redis nodes.

Viber DB architecture - 2nd generation

The third-generation architecture had to support 100,000+ operations per second in the short term and 1,000,000+ operations per second in the long term. Viber chose to build their third-generation architecture on top of Couchbase Server. The third-generation architecture is comprised of 100 to 120 Couchbase Server nodes.

Viber DB architecture - 3rd generation

Viber has been able to reduce the number of database nodes required while increasing the throughput with their third-generation architecture. For example, one of their Couchbase clusters (a ten node cluster) handles 100,000 to 200,000 operations per second with over 4.5 terabytes of data.
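As a rough back-of-envelope check on the figures quoted above, the per-node load of that ten-node cluster can be worked out as follows. This sketch assumes operations and data are spread evenly across nodes, which the post does not state, and it treats "over 4.5 terabytes" as exactly 4.5, so the data figure is a lower bound.

```python
# Figures quoted in the post for one ten-node Couchbase cluster.
nodes = 10
ops_low, ops_high = 100_000, 200_000  # operations per second, cluster-wide
data_tb = 4.5                         # "over 4.5 terabytes" (lower bound)

# Assumption (not from the post): load is evenly distributed across nodes.
ops_per_node_low = ops_low / nodes
ops_per_node_high = ops_high / nodes
data_per_node_gb = data_tb * 1000 / nodes

print(f"{ops_per_node_low:,.0f} to {ops_per_node_high:,.0f} ops/sec per node")
print(f"at least {data_per_node_gb:,.0f} GB of data per node")
```

Under that even-distribution assumption, each node serves on the order of 10,000 to 20,000 operations per second while holding roughly 450 GB.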

See the full story on the Viber switch.

Original title and link: The evolution of scalable NoSQL at Viber [sponsor] (NoSQL database©myNoSQL)