MongoDB: All content tagged as MongoDB in NoSQL databases and polyglot persistence

4 Reasons Perfect Market chose MongoDB

A team from Perfect Market writes about choosing MongoDB for their Digital Publishing Suite:

There are many NoSQL products out there, why did we bet on MongoDB? There are four major reasons: great performance, great features, ease of use and great support. Of course not every day with MongoDB is a sunshine day. Some tradeoffs we made are shared at the end of this post.

  1. I’m sure Perfect Market would get great support from almost every NoSQL database vendor — that’s what I’ve always heard in this market segment.
  2. By great performance I’ll assume Perfect Market got the numbers they needed. While presented as the top reason for choosing MongoDB, I think this was more in line with: “considering these other features, is MongoDB’s performance good enough for us?”.

    MongoDB is not the fastest NoSQL database.

  3. Great features and ease of use. Nobody can deny that, at least at first glance, MongoDB’s feature set is very compelling. And they’ve absolutely nailed the user experience part.

    My hypothesis for MongoDB’s adoption rate has always been that it looks familiar to people with relational database experience while removing most of those systems’ strict constraints. This is echoed in this post too:

    Although MongoDB is a NoSQL document DBMS, it bears resemblance to RDBMSs.

Original title and link: 4 Reasons Perfect Market chose MongoDB (NoSQL database©myNoSQL)

via: http://perfectmarket.com/four-reasons-perfect-market-bets-on-mongodb/


How SQL-on-JSON analytics bolstered a business

Alex Woodie (Datanami) reporting about BitYota, a SQL-based data warehouse on top of JSON:

BitYota says it designed its own hosted data warehouse from scratch, and that it’s differentiated by having a JSON access layer atop the data store. “We have some uniqueness where we operate SQL directly on JSON,” says BitYota CEO Dev Patel. “We don’t need to translate that data into a structured format like a CSV. We believe that if you transform the data, you will lose some of the data quality. And once that’s transformed, you won’t get it back.”
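✚ To make the “SQL directly on JSON” idea concrete, here is a minimal sketch using SQLite’s JSON1 functions. This is purely illustrative: the article doesn’t reveal BitYota’s engine or SQL flavor, and the schema below is made up.

```python
# SQL over raw JSON documents, with no CSV-style transformation step.
# Assumes a Python/SQLite build with the JSON1 functions available.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")  # one JSON document per row

docs = [
    {"user": "ada", "action": "click", "meta": {"page": "/home"}},
    {"user": "bob", "action": "buy", "meta": {"page": "/cart", "amount": 42}},
]
conn.executemany("INSERT INTO events VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# SQL operates on the untransformed documents via JSON path expressions.
rows = conn.execute("""
    SELECT json_extract(doc, '$.user')        AS user,
           json_extract(doc, '$.meta.amount') AS amount
    FROM events
    WHERE json_extract(doc, '$.action') = 'buy'
""").fetchall()
print(rows)  # [('bob', 42)]
```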

✚ BitYota’s tagline is Analytics for mongoDB, so I assume it’s safe to say the backend is MongoDB and they are building a SQL layer on top of it. Which SQL flavor they support and how they handle SQL’s quirks would make a very interesting story.

✚ This relates to my earlier Do all roads lead back to SQL?

Original title and link: How SQL-on-JSON analytics bolstered a business (NoSQL database©myNoSQL)

via: http://www.datanami.com/datanami/2014-02-12/how_sql-on-json_analytics_bolstered_a_business.html


The birth and road ahead of TokuMX, the alternative MongoDB engine

While not a MongoDB user (or expert), I find Tokutek’s work on TokuMX, their alternative engine for MongoDB, quite interesting both from a technical point of view (what is currently broken in MongoDB?) and from a business one: is the InnoDB model possible in the NoSQL space? What are the possible outcomes of the “alternative core technology for a free product” business model? Would a new product that combines the features MongoDB is missing with MongoDB’s “friendliness” and product marketing still be successful?

Zardosht Kasheff’s post about the history of TokuMX and how the decision was made to pursue this direction sheds some light on both areas.

But really, the BIGGEST benefit to this approach was the following: we could innovate on more of the MongoDB core server stack in ways the other approaches would not allow. Prior to TokuMX 1.4, such innovations include (but are not limited to):

  • Document level locking
  • Multi-statement transactions (on non-sharded clusters)
  • MVCC snapshot query semantics
  • Clustering indexes (although, to be fair, this was possible in other approaches)
  • Dramatically reduced I/O utilization on secondaries (which we will elaborate on in a future post)
  • Fast bulk loading
  • Enterprise hot backup

For these reasons, we chose this option, and after some hard work, TokuMX was born.
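✚ Multi-statement transactions are the item on that list with the longest shadow: stock MongoDB only gained them with version 4.0, years later. Here is a sketch of what the feature buys you, written against pymongo’s later transaction API rather than TokuMX’s own commands; the connection string and data are hypothetical.

```python
# Two updates that commit atomically or not at all -- the guarantee
# stock MongoDB could not offer at the time TokuMX added it.
# Requires a replica set; all names below are made up for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client.bank.accounts

with client.start_session() as session:
    with session.start_transaction():
        accounts.update_one({"_id": "alice"},
                            {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"_id": "bob"},
                            {"$inc": {"balance": 100}}, session=session)
```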

Original title and link: The birth and road ahead of TokuMX, the alternative MongoDB engine (NoSQL database©myNoSQL)

via: http://www.tokutek.com/2014/02/how-tokumx-was-born/


The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact

While I found the whole post very instructive (and very balanced, considering the topic), the part I’m linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, quite a few questions were bugging me. This post doesn’t answer any of them, but it adds more details about the existing tools, a completely different solution, and what seems to be an overarching theme whenever Hadoop and MongoDB appear in the same phrase:

We’re big users of Hadoop MapReduce and tend to lean on it whenever we need to make large scale migrations, especially ones with lots of transformation. That fact along with our existing conversion project from before, we used 10gen’s mongo-hadoop project which has input and output formats for Hadoop. We immediately realized that the InputFormat which connected to a MongoDB cluster was ill-suited to our usage. We had 3TB of partially-overlapping data across 2 clusters. After calculating input splits for a few hours, it began pulling documents at an uncomfortably slow pace. It was slow enough, in fact, that we developed an alternative plan.

You’ll have to read the post to learn how they accomplished their goal, but as a spoiler: it was once again more of an ETL process than an integration.
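✚ For a flavor of what “ETL rather than integration” tends to look like, here is a hedged sketch that processes mongodump’s BSON files offline instead of pulling from the live cluster through an InputFormat. This is my illustration, not FullContact’s actual pipeline; the file name and the transform are made up.

```python
# Decode a mongodump .bson file document by document and emit
# records reshaped for the migration target.
import bson  # the bson package ships with pymongo

def transform(doc):
    # Hypothetical old-schema -> new-schema reshaping.
    return {"contact_id": str(doc["_id"]), "emails": doc.get("emails", [])}

with open("dump/contacts/contacts.bson", "rb") as f:
    for doc in bson.decode_file_iter(f):
        record = transform(doc)
        # ...write `record` to the new store (Cassandra, HDFS, etc.)
```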

✚ The corresponding HN thread; it focuses mostly on the MongoDB-to-Cassandra parts.

Original title and link: The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact (NoSQL database©myNoSQL)

via: http://www.fullcontact.com/blog/mongo-to-cassandra-migration/


Mapping relational databases terms and SQL to MongoDB

A tuts+ guide to MongoDB for people familiar with SQL and relational databases:

We will start with mapping the basic relational concepts like table, row, column, etc and move to discuss indexing and joins. We will then look over the SQL queries and discuss their corresponding MongoDB database queries.
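✚ For a taste of the kind of mapping the guide covers, here is a minimal sketch of one SQL query and its MongoDB counterpart via pymongo; the collection and fields are hypothetical.

```python
from pymongo import MongoClient

db = MongoClient()["shop"]

# SQL:  SELECT name, age FROM users
#       WHERE age > 30 ORDER BY name LIMIT 10;
cursor = (db.users
          .find({"age": {"$gt": 30}},             # WHERE age > 30
                {"name": 1, "age": 1, "_id": 0})  # SELECT name, age
          .sort("name", 1)                        # ORDER BY name
          .limit(10))                             # LIMIT 10
for user in cursor:
    print(user)
```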

By the end of it you probably still won’t be able to convert your app to MongoDB, but at the next meetup or hackathon you’ll have an idea of what those Mongo guys are talking about.

Original title and link: Mapping relational databases terms and SQL to MongoDB (NoSQL database©myNoSQL)

via: http://code.tutsplus.com/articles/mapping-relational-databases-and-sql-to-mongodb--net-35650


MongoDB data storage structure, dbStats, and managing disk space

Two great posts from MongoLab covering the structure of MongoDB’s data on disk, how it is reflected in the results returned by the dbStats API, and finally some ways to reclaim disk space:

  1. How big is your MongoDB?
  2. Managing disk space in MongoDB

(Image: MongoDB data files)
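✚ If you want to poke at the same numbers yourself, here is a quick sketch; the database name is hypothetical, and the field names are from the MMAPv1-era output the posts discuss.

```python
from pymongo import MongoClient

db = MongoClient()["mydb"]
stats = db.command("dbStats")

print("dataSize:   ", stats["dataSize"])      # bytes of actual data
print("storageSize:", stats["storageSize"])   # data extents, incl. padding
print("fileSize:   ", stats.get("fileSize"))  # data files allocated on disk
```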

Original title and link: MongoDB data storage structure, dbStats, and managing disk space (NoSQL database©myNoSQL)


Top 5 syntactic weirdnesses to be aware of in MongoDB

Slava Kim, a developer using MongoDB on a daily basis:

This article is not one of those. While most of the posts focus on operations part, benchmarks and performance characteristics, I want to talk a little bit about MongoDB query interfaces. That’s right - programming interfaces, specifically about node.js native driver but those are nearly identical across different platform drivers and Mongo-shell.

You might consider some of these as corner cases. Or worse, as things you’d just get used to over time.
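✚ One classic quirk of that era, as an example of the genre (mine, not necessarily one of Kim’s five): passing update() a document without $-operators silently replaced the matched document instead of updating its fields. Modern pymongo split the call into replace_one/update_one precisely because of this.

```python
from pymongo import MongoClient

coll = MongoClient()["test"]["people"]
coll.insert_one({"_id": 1, "name": "ada", "age": 36})

# The old update({"_id": 1}, {"age": 37}) behaved like this:
coll.replace_one({"_id": 1}, {"age": 37})
print(coll.find_one({"_id": 1}))  # {'_id': 1, 'age': 37} -- name is gone

# What most people actually meant:
coll.update_one({"_id": 1}, {"$set": {"age": 37}})
```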

There is inherent complexity in developing a database. Adding such “quirks”, or allowing them to slip into your product, will just make things worse. And if you’re thinking of all those products that cut corners, or of the 80/20 principle, just to get to market sooner, I’ll let you decide whether a database is the right place to apply those principles.

Original title and link: Top 5 syntactic weirdnesses to be aware of in MongoDB (NoSQL database©myNoSQL)

via: http://devblog.me/wtf-mongo


Partitioning MongoDB Data on the Fly

I initially bookmarked this article because it mentioned the same strategy for migrating data with zero downtime.

So during the update period, there could be writes sent to the old service, which is writing to the old, single MongoDB cluster. After updating, there’s a period of time where both servers are writing before the second machine is updated.

But then I realized that this is all about MongoDB. Tony Tam of Reverb:

To partition your data with the standard MongoDB toolset, significant downtime is unavoidable. You’ll either need to write a bunch of application logic, or get creative with some third party tools. This is a problem that we’ve hit at Reverb more than once, and are the exact same tools + technique that we used to migrate across datacenters (see From the Cloud and Back).
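✚ The dual-write window described above is simple enough to sketch; this is a hypothetical wrapper, not Reverb’s code, and the connection strings are made up.

```python
# During migration, every write goes to both clusters while reads stay
# on the old one; reads flip to the new cluster at cutover.
from pymongo import MongoClient

old_coll = MongoClient("mongodb://old-cluster")["app"]["items"]
new_coll = MongoClient("mongodb://new-cluster")["app"]["items"]

def save_item(doc):
    old_coll.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    new_coll.replace_one({"_id": doc["_id"]}, doc, upsert=True)

def get_item(item_id):
    return old_coll.find_one({"_id": item_id})  # switch to new_coll later
```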

Isn’t MongoDB’s autosharding supposed to address exactly this scenario? What am I missing?

Original title and link: Partitioning MongoDB Data on the Fly (NoSQL database©myNoSQL)

via: http://developers-blog.helloreverb.com/partitioning-mongodb-data-on-the-fly/


Quick links for how to backup different NoSQL databases

After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I realized there were no links to the corresponding docs. So here they are:

Cassandra backups

Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.

You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.

After a system-wide snapshot is performed, you can enable incremental backups on each node to backup data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a /backups subdirectory of the data directory (provided JNA is enabled).
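✚ The snapshot step looks roughly like this when driven from Python over plain ssh (a stand-in for a parallel tool like pssh); host names and the keyspace are hypothetical.

```python
import subprocess

HOSTS = ["cass1.example.com", "cass2.example.com", "cass3.example.com"]

for host in HOSTS:
    # nodetool snapshot -t <tag> <keyspace> snapshots one keyspace on a node.
    subprocess.run(
        ["ssh", host, "nodetool", "snapshot", "-t", "nightly", "my_keyspace"],
        check=True,
    )
```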

MongoDB backups

Basically there are three ways to back up MongoDB:

  1. Using MMS
  2. Copying underlying files
  3. Using mongodump

Riak backups

Riak’s backup operations are quite different for its two main storage backends, Bitcask and LevelDB:

Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to backup both the ring and data directories that pertain to your configured backend.

Note: I’d be happy to update this entry with links to docs on what tools and solutions other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB) are providing.

✚ Considering that a backup is only as useful as your ability to actually restore from it, I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases this might simplify the process a bit and increase users’ confidence.
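✚ A rough sketch of what such a validation tool could do for the mongodump case: decode every document in a dump file without loading it into a server. It catches truncation and BSON corruption, though it proves less than a real restore would; the path is hypothetical.

```python
import bson  # ships with pymongo

def validate_dump(path):
    count = 0
    with open(path, "rb") as f:
        for _ in bson.decode_file_iter(f):  # raises on malformed BSON
            count += 1
    return count

print(validate_dump("dump/mydb/users.bson"), "documents decoded OK")
```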

Original title and link: Quick links for how to backup different NoSQL databases (NoSQL database©myNoSQL)


Comparing NoSQL backup solutions

In a post introducing HyperDex backups, Robert Escriva compares the different backup solutions available in Cassandra, MongoDB, and Riak:

Cassandra: Cassandra’s backups are inconsistent, as they are taken at each server independently without coordination. Further, “Restoring from snapshots and incremental backups temporarily causes intensive CPU and I/O activity on the node being restored.”

MongoDB: MongoDB provides two backup strategies. The first strategy copies the data on backup, and re-inserts it on restore. This approach introduces high overhead because it copies the entire data set without opportunity for incremental backup.

The second approach is to use filesystem-provided snapshots to quickly backup the data of a mongod instance. This approach requires operating system support and will produce larger backup sizes.

Riak: Riak backups are inconsistent, as they are taken at each server independently without coordination, and require care when migrating between IP addresses. Further, Riak requires that each server be shut down before backing up LevelDB-powered backends.

Here is how HyperDex’s new backup is described:

The HyperDex backup/restore process is strongly consistent, doesn’t require shutting down servers, and enables incremental backup support. Further, the process is quite efficient; it completes quickly, and does not consume CPU or I/O for extended periods of time.

The caveat is that HyperDex puts the cluster in read-only mode while backing up. That’s a loss of availability. Considering that both Cassandra and Riak promise high availability, their choice was clear.

Update: This comment from Emin Gün Sirer makes me wonder if I missed something:

HyperDex quiesces the network, takes a snapshot, resumes. Whole operation takes sub-second latency.

The key point is that the system is online, available while the data copying is taking place.

Original title and link: Comparing NoSQL backup solutions (NoSQL database©myNoSQL)

via: http://hackingdistributed.com/2014/01/14/back-that-nosql-up/


MySQL is a great Open Source project. How about open source NoSQL databases?

In a post titled “Some myths on Open Source, the way I see it”, Anders Karlsson writes about MySQL:

As far as code, adoption and reaching out to create an SQL-based RDBMS that anyone can afford, MySQL / MariaDB has been immensely successful. But as an Open Source project, something being developed together with the community where everyone work on their end with their skills to create a great combined piece of work, MySQL has failed. This is sad, but on the other hand I’m not so sure that it would have as much influence and as wide adoption if the project would have been a “clean” Open Source project.

The article offers a very black-and-white perspective on open source versus commercial code. But that’s not why I’m linking to it.

The above paragraph made me think about how many of the most popular open source NoSQL databases would die without the companies (or people) that created them.

Here’s my list: MongoDB, Riak, Neo4j, Redis, Couchbase, etc. And I could continue for quite a while considering how many there are out there: RavenDB, RethinkDB, Voldemort, Tokyo, Titan.

Actually, if you reverse the question, the list gets extremely short: Cassandra, CouchDB (still struggling though), HBase. All of these were at some point driven by the community. Probably the only special case is LevelDB.

✚ As a follow-up to Anders Karlsson’s post, Robert Hodges posted The Scale-Out Blog: Why I Love Open Source.

Original title and link: MySQL is a great Open Source project. How about open source NoSQL databases? (NoSQL database©myNoSQL)

via: http://karlssonondatabases.blogspot.com/2014/01/some-myths-on-open-source-way-i-see-it.html


Look how fast it is… actually it’s not, but who cares

This is how it goes:

  1. someone declares a solution to be fast. It’s usually a micro-benchmark presented with almost no context.
  2. then someone else shows better numbers from a competing product. It’s a similar micro-benchmark performed on completely different hardware. An apples-to-oranges comparison.
  3. the first person revisits the topic and says that actually performance doesn’t matter.

What’s wrong with this?

  1. most readers will only see the first post. The attraction of numbers is irresistible.
  2. the very few who see the second kind of post have already picked sides and will dismiss the other camp’s results.

The bottom line is that we end up with two posts of irrelevant numbers that each group can use to claim theirs is bigger than the other’s. And very few people actually learn what is so (completely) wrong with both.

Original title and link: Look how fast it is… actually it’s not, but who cares (NoSQL database©myNoSQL)