


MongoDB: All content tagged as MongoDB in NoSQL databases and polyglot persistence

10gen changes name to MongoDB Inc

That’s all.

Well, except I couldn’t miss this one:

Original title and link: 10gen changes name to MongoDB Inc (NoSQL database©myNoSQL)

Top Five MongoDB Alerts

The 5 alerts 10gen recommends using with its MongoDB Management Service:

  • Host Recovering (All, but by definition Secondary)
  • Repl Lag (Secondary)
  • Connections (All mongos, mongod)
  • Lock % (Primary, Secondary)
  • Replica (Primary, Secondary)

  1. It’s great that MMS helps customers with these alerts;
  2. These also represent the top 5 problems you might have with a MongoDB deployment. And alerting is not going to help you fix them. So you’d better have a well-established and rehearsed plan for each.
  3. Or you could use one of those solutions, like this or this, that don’t wake you at night.
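
To make the first two alerts a bit more concrete, here is a minimal sketch of how "Host Recovering" and "Repl Lag" could be derived from the output of MongoDB's `replSetGetStatus` command. The field names (`members`, `name`, `stateStr`, `optimeDate`) are the ones that command returns; the sample document, the host names, and the 60-second threshold are my own assumptions, and this is not how MMS itself computes its alerts.

```python
from datetime import datetime, timedelta

def replication_lag(status):
    """Return {member name: lag in seconds} for every SECONDARY, measured against the PRIMARY."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

def alerts(status, max_lag_seconds=60):
    """Collect (member, reason) pairs; the threshold is an assumed value, not a recommendation."""
    found = []
    for m in status["members"]:
        if m["stateStr"] == "RECOVERING":
            found.append((m["name"], "host recovering"))
    for name, lag in replication_lag(status).items():
        if lag > max_lag_seconds:
            found.append((name, "repl lag %.0fs" % lag))
    return found

# Hand-written sample shaped like a replSetGetStatus reply (hypothetical hosts).
now = datetime(2013, 8, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=90)},
        {"name": "db3:27017", "stateStr": "RECOVERING", "optimeDate": now - timedelta(hours=2)},
    ]
}

for member, reason in alerts(sample_status):
    print(member, reason)
```

Which only reinforces point 2: the check is the easy part; the rehearsed plan is not.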

Original title and link: Top Five MongoDB Alerts (NoSQL database©myNoSQL)


What's really in it for MongoDB's 3rd parties?

Luca Olivari, Director of Business Development at 10gen:

With MongoDB you can cover 80% of the use cases of Relational plus NoSQL databases.

Leaving aside for a second the last part of this sentence as being obviously not accurate, let’s look at what the first part might mean:

  1. fewer than 20% of the use cases need strong transactional semantics
  2. fewer than 20% of the use cases need strong data integrity constraints
  3. fewer than 20% of the use cases require integration with other existing data processing tools that imply SQL access
  4. fewer than 20% of the use cases require one or more of the still unique to relational database features (triggers, materialized views, etc.)
  5. fewer than 20% of the use cases need to be always available.

I’d (probably) be OK with the idea that each of the above could be true on its own, but I don’t believe that all these cases added together make up only 20% of the use cases.

So, what’s another answer to the question:

If you were to choose a new technology, what would you choose? There’s a chance that you’ll pick the one that gives you more advantages in more cases.

It’s well known that adoption, and thus opportunity, is not always related to technological merit. Actually, most of the time a 3rd party business opportunity is directly connected with the complexity, incompleteness, or fragility of a technology.

If you were a business, wouldn’t you choose a market where there is a sizable opportunity, the competition (nb your competition, not the solution’s competition) is not that strong, and there’s a chance for recurring business? (I.e. a business that requires a client to call multiple times is definitely better than one where, once delivered, the product just works.)

Original title and link: What’s really in it for MongoDB’s 3rd parties? (NoSQL database©myNoSQL)


How to speed up MongoDB Map Reduce by 20x

Antoine Girbal:

Looking back, we’ve started at 1200s and ended at 60s for the same MR job, which represents a 20x improvement! This improvement should be available to most use cases, even if some of the tricks are not ideal (e.g. using multiple output dbs / collections). Nevertheless this can give people ideas on how to speed up their MR jobs and hopefully some of those features will be made easier to use in the future. The following ticket will make ‘splitVector’ command more available, and this ticket will improve multiple MR jobs on the same database.

Looking back at the article, it reads like a series of tricks to go around the limitations of MongoDB’s MapReduce implementation:

  1. single-threaded execution of MapReduce jobs
  2. lock contention
  3. BSON-to-JSON-and-back serializations
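
The way around the first limitation, as the quote hints with `splitVector` and multiple output collections, is to cut the input into key ranges, run one job per range in parallel, and merge the partial outputs. Below is a minimal in-memory sketch of that shape only. It does not talk to MongoDB; the `tag` field, the helper names, and the documents are made up for illustration, and in plain Python the threads show the structure of the trick rather than a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_ranges(items, parts):
    """Cut a sequence into at most `parts` contiguous (start, end) index ranges."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [(i, min(i + size, len(items))) for i in range(0, len(items), size)]

def map_reduce_range(docs, start, end):
    """One 'job': count documents per tag within a single range."""
    counts = Counter()
    for doc in docs[start:end]:
        counts[doc["tag"]] += 1
    return counts

def parallel_map_reduce(docs, parts=4):
    """Run one job per range in a thread pool, then merge the partial results."""
    ranges = split_ranges(docs, parts)
    total = Counter()
    with ThreadPoolExecutor(max_workers=parts) as pool:
        for partial in pool.map(lambda r: map_reduce_range(docs, *r), ranges):
            total.update(partial)
    return total

# Hypothetical documents, already sorted by _id.
docs = [{"_id": i, "tag": "even" if i % 2 == 0 else "odd"} for i in range(10)]
print(parallel_map_reduce(docs, parts=3))
```

The merge step at the end is the in-memory stand-in for the "multiple output dbs / collections" part that the author himself calls not ideal.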

Original title and link: How to speed up MongoDB Map Reduce by 20x (NoSQL database©myNoSQL)


TokuMX means for MongoDB the same as InnoDB to MySQL

Vadim Tkachenko (MySQL Performance blog) about TokuMX, the fractal tree-based storage for MongoDB:

Why is TokuMX interesting? A few reasons:

  • It comes with transactions, and all that good stuff that transactions provide: a concurrent access to documents (no more global write-lock in MongoDB); crash recovery; atomicity
  • Performance in IO-bound operations
  • A good compression rate, which is a money-saver if you use SSD/Flash
  • But it is also SSD/Flash life-time friendly, which is double money-saver

Some thoughts:

  1. TokuMX brings to the table some features that might not be top priorities for 10gen, or even features that 10gen doesn’t want in MongoDB.
    1. I seriously doubt 10gen engineering or sales are recommending TokuMX.
    2. While the advantages of the TokuMX engine are quite interesting, how is Tokutek closing sales (considering 10gen is not sharing its list of customers)?
  2. How would this mix of 10gen and Tokutek work at the business level? I don’t think Tokutek wants to sell or that 10gen is ready to acquire/merge with Tokutek.
  3. How would this work for customers? The InnoDB-MySQL and TokuMX-MongoDB parallel looks good on paper, but I cannot imagine how a user will interact with these 2 providers. Buy a license from Tokutek, then pay 10gen support for MongoDB and then call both?
  4. How will this integration work long term considering the complete control 10gen has over the core MongoDB? While 10gen could come up with a compatibility certification, I don’t think they’ll actually do it (see point 1).

Original title and link: TokuMX means for MongoDB the same as InnoDB to MySQL (NoSQL database©myNoSQL)


Top 5 Presentations from MongoNYC

Since I’ve posted the Cassandra Summit’s Bests, here are also the top 5 slides and videos from MongoNYC, which I found on the 10gen blog.

✚ You might take this as one of my biases, but the fact that a presentation titled “How to keep your data safe in MongoDB” is still in the top 5 after so many years of MongoDB makes me wonder how much some of the early decisions hurt its adoption; maybe even the adoption of NoSQL in general.

Original title and link: Top 5 Presentations from MongoNYC (NoSQL database©myNoSQL)

Integrating MongoDB and Hadoop at Groupon

After looking at the 2 default options, Groupon engineers came up with a custom solution that involves a complicated procedure for backing up MongoDB’s data files into a Hadoop cluster and then reading them with a custom InputFormat:

To solve this problem we backup raw Mongo data files to our Hadoop Distributed File System (HDFS), then read them directly using an InputFormat. This approach has the drawback of not reading the most current Mongo data for each MapReduce, but it means we have a backup of our data in HDFS and can map over an entire collection faster because of the throughput of our Hadoop cluster. Moving data from a sharded Mongo cluster into HDFS, however, has challenges of its own.

While I used “integrating” in the title, this looks more like patching the two to work together.

Original title and link: Integrating MongoDB and Hadoop at Groupon (NoSQL database©myNoSQL)


Understand MongoDB's request routing

A. Jesse Jiryu Davis put together a couple of scripts to explain how requests are routed in a MongoDB cluster:

In a sharded cluster of replica sets, which server or servers handle each of your queries? What about each insert, update, or command? […] Operations are routed according to the type of operation, your shard key, and your read preference.

I actually think there are more hops involved between the config servers, mongos, and mongod instances. But the basic rules are pretty simple:

  1. if the query contains the shard key then it’s routed to specific shards
  2. if the query doesn’t contain the shard key then the request is sent to all shards.
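
The two rules above can be sketched in a few lines. This is a toy model, not the actual mongos logic: the shard key name, the chunk boundaries, and the shard names are all hypothetical, and it only handles equality matches on the shard key.

```python
from bisect import bisect_right

# Hypothetical chunk map: shard i owns keys from its lower bound up to the next bound.
SHARD_KEY = "user_id"
CHUNK_LOWER_BOUNDS = [float("-inf"), 1000, 2000]
CHUNK_SHARDS = ["shard0", "shard1", "shard2"]

def target_shards(query):
    """Return the shards a query would be sent to under the two basic rules."""
    if SHARD_KEY in query:
        # Rule 1: the shard key is present, so only the owning shard is targeted.
        idx = bisect_right(CHUNK_LOWER_BOUNDS, query[SHARD_KEY]) - 1
        return [CHUNK_SHARDS[idx]]
    # Rule 2: no shard key, so the request is scattered to every shard.
    return list(CHUNK_SHARDS)

print(target_shards({"user_id": 1500}))
print(target_shards({"email": "someone@example.com"}))
```

Read preference, which the quote also mentions, decides which member of the targeted replica set answers; it is left out of this sketch.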

Original title and link: Understand MongoDB’s request routing (NoSQL database©myNoSQL)


IBM and 10gen are collaborating on a standard that would make it easier to write applications that can access data from both MongoDB and relational systems such as IBM DB2

The details are pretty confusing:

[…] the new standard — which encompasses the MongoDB API, data representation (BSON), query language and wire protocol — appears to be all about establishing a way for mobile and other next-generation applications to connect with enterprise database systems such as IBM’s popular DB2 database and its WebSphere eXtreme Scale data grid.

But the juicy part is in the comments; if you can ignore the pitches.

  1. if this is a new standard and it is all based on the already existing MongoDB API, BSON, and wire protocol, then 1) what’s new about it, and 2) what exactly will make it a standard?

Original title and link: IBM and 10gen are collaborating on a standard that would make it easier to write applications that can access data from both MongoDB and relational systems such as IBM DB2 (NoSQL database©myNoSQL)


What is TokuMX fractal tree-based storage?

A post on Tokutek’s blog explaining TokuMX, the fractal tree-based storage engine for MongoDB:

TokuMX has replaced ALL of the storage code in MongoDB with fractal trees. […]

TokuMX achieves high compression for the same reason TokuDB for MySQL does: fractal trees compress really well by ensuring they compress data in large chunks. TokuMX achieves high insertion rates on index-rich collections for the same reason TokuDB for MySQL performs so well on iiBench, fractal trees are a write-optimized data structure designed to maintain insertion performance on larger than memory workloads. TokuMX does not require constant compaction for the same reason that TokuDB for MySQL does not require users to constantly run “optimize table” to reorganize data, fractal trees don’t fragment. MongoDB and MySQL are very different products with very different user experiences, but the underlying data structure of their storage is the same: the B-Tree. Fractal trees are better.

The post has a lot of links to go through.

✚ Has Tokutek published any papers about the fractal tree engine? I remember reading that the technology was waiting to be patented, but I don’t think I’ve found any papers about it.

Original title and link: What is TokuMX fractal tree-based storage? (NoSQL database©myNoSQL)


MongoDB Indexes - I helped a customer optimize his MongoDB

Recently, I helped a customer optimize his database. Write lock on the database was running consistently at 95%. CPU was spiking consistently, and making for a poor experience.

How long until we’ll see profitable consulting businesses focused on optimizing MongoDB? Wait… we already have them.

Original title and link: MongoDB Indexes - I helped a customer optimize his MongoDB (NoSQL database©myNoSQL)


New Geo Features in MongoDB 2.4

The primary conceptual difference (though there are also many functional differences) between the 2d and 2dsphere indexes, is the type of coordinate system that they consider. Planar coordinate systems are useful for certain applications, and can serve as a simplifying approximation of spherical coordinates. As you consider larger geometries, or consider geometries near the meridians and poles however, the requirement to use proper spherical coordinates becomes important.

I don’t know anything about geo, so I’ll leave this up for experts to comment on.

✚ There’s actually something I like about this announcement: the fact that MongoDB decided to use an existing standard instead of coming up with its own custom solution.
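
Going back to the quoted paragraph, a rough way to see why the coordinate system matters is to measure the same 10 degrees of longitude at the equator and near a pole, once on a flat plane and once on a sphere. This is a plain-math sketch under my own assumptions (a spherical Earth with a 6371 km radius and the haversine formula); it is not a description of how MongoDB implements either index.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def planar_distance_km(lon1, lat1, lon2, lat2):
    """Treat degrees as flat x/y coordinates and scale by km per degree at the equator."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return math.hypot(lon2 - lon1, lat2 - lat1) * km_per_degree

def spherical_distance_km(lon1, lat1, lon2, lat2):
    """Great-circle distance using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Same 10 degrees of longitude, at latitude 0 and at latitude 80.
for lat in (0, 80):
    flat = planar_distance_km(0, lat, 10, lat)
    sphere = spherical_distance_km(0, lat, 10, lat)
    print("lat %2d: planar %.0f km, spherical %.0f km" % (lat, flat, sphere))
```

At the equator the two agree; at latitude 80 the planar number stays the same while the spherical one shrinks to under 200 km, which is the "near the poles" problem the quote is talking about.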

Original title and link: New Geo Features in MongoDB 2.4 (NoSQL database©myNoSQL)