All content tagged as “document database” in NoSQL databases and polyglot persistence

Top 5 Presentations from MongoNYC

Having posted the Cassandra Summit’s best presentations, I’ve also found the top 5 slides and videos from MongoNYC on the 10gen blog.

✚ You might take this as one of my biases, but the fact that a presentation titled “How to keep your data safe in MongoDB” is still in the top 5 after so many years of MongoDB makes me wonder how much some of the early design decisions hurt its adoption, and maybe even the adoption of NoSQL in general.

Original title and link: Top 5 Presentations from MongoNYC (NoSQL database©myNoSQL)


RavenDB document indexing process

Itamar Syn-Hershko explains the indexing process in RavenDB:

RavenDB has a background process that is handed new documents and document updates as they come in, right after they are stored in the Document Store, and it passes them in batches through all the indexes in the system. For write operations, the user gets an immediate confirmation of their transaction without waiting for indexing, even before the indexing process has started processing these updates, while being 100 percent certain the changes were recorded in the database. Queries do not wait for indexing either; they just use the indexes that exist at the time the query was issued. This ensures both smooth operation on all fronts and that no documents are left behind.

Asynchronous indexing is tricky. While it looks like it addresses the performance penalty on both reads and writes, it actually has a few drawbacks (the first is sketched in code after the list):

  1. immediate inconsistency: with asynchronous indexes, there are no read-your-writes guarantees; a query issued right after a write may not see the document yet.
  2. no unique indexes: with async indexes, by the time the index is updated it is too late to tell the client that the uniqueness constraint was violated.
  3. complicated crash recovery: the server must be able to resume the indexing process from where it left off; if that progress information is not persisted, a crash can lead to permanent inconsistencies between documents and indexes.
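
To make drawback 1 concrete, here’s a toy Python model of asynchronous indexing; everything in it (names, timings, the queue-based indexer) is made up for illustration and has nothing to do with RavenDB’s actual implementation:

```python
import queue
import threading
import time

# Toy model of asynchronous indexing: writes are acknowledged
# immediately, while a background thread updates the index later,
# so a query can miss a just-written document.

documents = {}             # the "document store"
index_by_status = {}       # the "index": status -> set of document ids
pending = queue.Queue()    # document ids waiting to be indexed

def indexer():
    while True:
        doc_id = pending.get()
        time.sleep(0.1)    # simulated indexing latency
        status = documents[doc_id]["status"]
        index_by_status.setdefault(status, set()).add(doc_id)

threading.Thread(target=indexer, daemon=True).start()

def store(doc_id, doc):
    documents[doc_id] = doc    # durable immediately...
    pending.put(doc_id)        # ...but only eventually indexed

store("order-1", {"status": "open"})

# Query the index right away: it hasn't caught up yet.
print(index_by_status.get("open", set()))    # -> set()
time.sleep(0.3)
print(index_by_status.get("open", set()))    # -> {'order-1'}
```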

Any other obvious ones I’ve missed?

Original title and link: RavenDB document indexing process (NoSQL database©myNoSQL)

via: http://www.alvinashcraft.com/2013/06/27/the-ravendb-indexing-process/


Integrating MongoDB and Hadoop at Groupon

After looking at the two default options, Groupon’s engineers came up with a custom solution: a fairly involved procedure for backing up MongoDB’s data files into a Hadoop cluster, plus a custom InputFormat to read them:

To solve this problem we backup raw Mongo data files to our Hadoop Distributed File System (HDFS), then read them directly using an InputFormat. This approach has the drawback of not reading the most current Mongo data for each MapReduce, but it means we have a backup of our data in HDFS and can map over an entire collection faster because of the throughput of our Hadoop cluster. Moving data from a sharded Mongo cluster into HDFS, however, has challenges of its own.
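
A minimal sketch of the backup half of such a procedure, assuming a dedicated backup member and a recent pymongo (hostnames and paths are made up; Groupon’s actual pipeline is certainly more involved):

```python
import subprocess
from pymongo import MongoClient

# Hypothetical hidden/backup member of the replica set.
client = MongoClient("mongodb://backup-node.example.com:27017")

# Flush all pending writes and block new ones so the on-disk
# data files are in a consistent state while we copy them.
client.admin.command("fsync", lock=True)
try:
    # Ship the raw data files into HDFS; a custom InputFormat
    # can then read them directly from there.
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "/var/lib/mongodb", "/backups/mongo"],
        check=True,
    )
finally:
    # fsyncUnlock is a server command in modern MongoDB; older
    # versions unlocked via db.fsyncUnlock() in the shell.
    client.admin.command("fsyncUnlock")
```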

While I used “integrating” in the title, this looks more like patching the two to work together.

Original title and link: Integrating MongoDB and Hadoop at Groupon (NoSQL database©myNoSQL)

via: https://engineering.groupon.com/2013/big-data/mongodb-mapreduce-with-hadoop/


Understand MongoDB's request routing

A. Jesse Jiryu Davis put together a couple of scripts to explain how requests are routed in a MongoDB cluster:

In a sharded cluster of replica sets, which server or servers handle each of your queries? What about each insert, update, or command? […] Operations are routed according to the type of operation, your shard key, and your read preference.

I actually think there are more hops involved between the config servers, mongos, and mongod instances, but the basic rules are pretty simple (see the sketch after the list):

  1. if the query contains the shard key, it is routed only to the shard (or shards) whose chunks can contain matching documents;
  2. if the query doesn’t contain the shard key, it is sent to all shards and the results are gathered back (scatter-gather).
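
A quick pymongo illustration of the two cases, assuming a hypothetical cluster sharded on user_id (host, database, and collection names are made up):

```python
from pymongo import MongoClient

# Connect to the mongos router of a cluster sharded on {"user_id": 1}.
client = MongoClient("mongodb://mongos.example.com:27017")
orders = client.shop.orders

# Query contains the shard key: mongos routes it only to the shard(s)
# owning chunks that can hold user_id=42.
targeted = list(orders.find({"user_id": 42, "status": "open"}))

# Query without the shard key: mongos scatters it to every shard
# and gathers the partial results.
scattered = list(orders.find({"status": "open"}))
```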

Original title and link: Understand MongoDB’s request routing (NoSQL database©myNoSQL)

via: http://blog.mongodb.org/post/53841037541/real-time-profiling-a-mongodb-cluster


IBM and 10gen are collaborating on a standard that would make it easier to write applications that can access data from both MongoDB and relational systems such as IBM DB2

The details are pretty confusing¹:

[…] the new standard — which encompasses the MongoDB API, data representation (BSON), query language and wire protocol — appears to be all about establishing a way for mobile and other next-generation applications to connect with enterprise database systems such as IBM’s popular DB2 database and its WebSphere eXtreme Scale data grid.

But the juicy part is in the comments; if you can ignore the pitches.


  1. if this is a new standard and it is all based on the already existing MongoDB API, BSON, and wire protocol, then 1) what’s new about it, and 2) what exactly will make it a standard?

Original title and link: IBM and 10gen are collaborating on a standard that would make it easier to write applications that can access data from both MongoDB and relational systems such as IBM DB2 (NoSQL database©myNoSQL)

via: http://gigaom.com/2013/06/04/ibm-throws-its-weight-behind-mongodb-for-mobile-apps/


Cloudant's phenomenal response time

James Mundy, writing about using Cloudant from his app deployed on Microsoft’s Azure cloud:

When I began implementing Cloudant’s CouchDB-based distributed database as a service (DBaaS) to replace our NoSQL Azure Table solution, I had some reservations about the time that calls from our Azure Web Roles to their separate data centre would add to response times.

Turns out that really wasn’t anything to worry about at all.

This is very interesting (even if James’s experiment is not really a benchmark). I assume the way Cloudant pulls this off is by offering their service only from top-notch, well-connected data centers, on top of making sure the service itself is correctly tuned.

Original title and link: Cloudant’s phenomenal response time (NoSQL database©myNoSQL)

via: http://mendez.quora.com/Cloudants-phenomenal-response-time?srid=3nu1&share=1


4 Good Things About CouchDB

Will Conant:

CouchDB has four features that really make it stand out:

  1. It has no read locks.
  2. You can back up a database with cp without shutting it down.
  3. Any record (row, document, whatever) can participate in any index any number of times.
  4. Replication is easy and can be bidirectional.
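
On point 4, here’s roughly what kicking off bidirectional replication looks like through CouchDB’s _replicate endpoint; a minimal sketch assuming a local server and two made-up databases:

```python
import requests

COUCH = "http://127.0.0.1:5984"  # hypothetical local server, auth omitted

# Two continuous replications, one in each direction, make the
# sync between db_a and db_b bidirectional.
for source, target in [("db_a", "db_b"), ("db_b", "db_a")]:
    r = requests.post(
        f"{COUCH}/_replicate",
        json={"source": source, "target": target, "continuous": True},
    )
    r.raise_for_status()
```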

I totally agree with the author. But when using a database, it’s not only about the features that stand out; it’s also about the unique features that fit the project, the missing features, and the pace at which those missing features get addressed. And I could go on for a while.

CouchDB’s bidirectional replication has always been its strongest, differentiating feature. But in my book, users had to fight too much with other parts of the database.

Original title and link: 4 Good Things About CouchDB (NoSQL database©myNoSQL)

via: http://willconant.com/posts/2013-06-02/4-good-things-about-couchdb


What is TokuMX fractal tree-based storage?

A post on Tokutek’s blog explaining TokuMX, the fractal tree-based storage engine for MongoDB:

TokuMX has replaced ALL of the storage code in MongoDB with fractal trees. […]

TokuMX achieves high compression for the same reason TokuDB for MySQL does: fractal trees compress really well by ensuring they compress data in large chunks. TokuMX achieves high insertion rates on index-rich collections for the same reason TokuDB for MySQL performs so well on iiBench: fractal trees are a write-optimized data structure designed to maintain insertion performance on larger-than-memory workloads. TokuMX does not require constant compaction for the same reason that TokuDB for MySQL does not require users to constantly run “optimize table” to reorganize data: fractal trees don’t fragment. MongoDB and MySQL are very different products with very different user experiences, but the underlying data structure of their storage is the same: the B-tree. Fractal trees are better.
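
For intuition, here’s a toy sketch of the message-buffering idea behind write-optimized trees. It illustrates the general technique only; it is not Tokutek’s engine, and all names and sizes are made up:

```python
# Inserts accumulate in a node's buffer and are pushed to children
# in batches, turning many small random writes into a few large
# sequential ones. A real engine does this per disk block.

BUFFER_LIMIT = 4  # tiny on purpose, so flushes are visible

class Node:
    def __init__(self, children=None, split_keys=None):
        self.children = children or []      # child nodes (empty => leaf)
        self.split_keys = split_keys or []  # routing keys between children
        self.buffer = {}                    # pending key -> value messages

    def insert(self, key, value):
        self.buffer[key] = value
        if self.children and len(self.buffer) >= BUFFER_LIMIT:
            self.flush()

    def flush(self):
        # One batched push per flush instead of one write per insert.
        for key, value in sorted(self.buffer.items()):
            self.child_for(key).insert(key, value)
        self.buffer.clear()

    def child_for(self, key):
        for i, split in enumerate(self.split_keys):
            if key < split:
                return self.children[i]
        return self.children[-1]

# Two leaves under one root, split at key "m".
root = Node(children=[Node(), Node()], split_keys=["m"])
for k in ["q", "a", "z", "b", "r"]:
    root.insert(k, k.upper())
print(root.buffer)              # still buffered at the root: {'r': 'R'}
print(root.children[0].buffer)  # batch pushed left: {'a': 'A', 'b': 'B'}
```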

The post has a lot of links to go through.

✚ Has Tokutek published any papers about the fractal tree engine? I remember reading that the technology was waiting to be patented, but I don’t think I’ve found any papers about it.

Original title and link: What is TokuMX fractal tree-based storage? (NoSQL database©myNoSQL)

via: http://www.tokutek.com/2013/06/tokumx-fractal-trees-with-mongodb/


MongoDB Indexes - I helped a customer optimize his MongoDB

Recently, I helped a customer optimize his database. Write lock on the database was running consistently at 95%. CPU was spiking consistently, making for a poor experience.
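
The usual first step in that kind of engagement is checking whether the hot queries are backed by indexes; a minimal pymongo sketch, with made-up collection and field names:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient()       # hypothetical instance
orders = client.shop.orders  # database/collection names are made up

# explain() shows whether a query walks an index or scans the
# entire collection (the usual culprit behind lock contention).
print(orders.find({"status": "open"}).explain())

# A supporting index turns the collection scan into a targeted lookup.
orders.create_index([("status", ASCENDING)])
```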

How long until we’ll see profitable consulting businesses focused on optimizing MongoDB? Wait… we already have them.

Original title and link: MongoDB Indexes - I helped a customer optimize his MongoDB (NoSQL database©myNoSQL)

via: http://blog.mongohq.com/mongodb-indexing-best-practices/


New Geo Features in MongoDB 2.4

The primary conceptual difference (though there are also many functional differences) between the 2d and 2dsphere indexes is the type of coordinate system that they consider. Planar coordinate systems are useful for certain applications and can serve as a simplifying approximation of spherical coordinates. As you consider larger geometries, or geometries near the meridians and poles, however, the requirement to use proper spherical coordinates becomes important.
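
For concreteness, here’s roughly what the 2dsphere/GeoJSON side looks like from pymongo; a minimal sketch with made-up names and coordinates:

```python
from pymongo import MongoClient

client = MongoClient()       # hypothetical instance running MongoDB >= 2.4
places = client.geo.places   # database/collection names are made up

# 2dsphere indexes work on GeoJSON, i.e. proper spherical coordinates.
places.create_index([("location", "2dsphere")])
places.insert_one({
    "name": "example",
    "location": {"type": "Point", "coordinates": [-73.97, 40.77]},  # lon, lat
})

# $near with $geometry computes distances on the sphere, in meters.
nearby = places.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-73.98, 40.76]},
            "$maxDistance": 5000,
        }
    }
})
print(list(nearby))
```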

I don’t know anything about geo, so I’ll leave this up for experts to comment on.

✚ There’s actually something I like about this announcement: the fact that MongoDB decided to use an existing standard instead of coming up with its own custom solution.

Original title and link: New Geo Features in MongoDB 2.4 (NoSQL database©myNoSQL)

via: http://blog.mongodb.org/post/50984169045/new-geo-features-in-mongodb-2-4


Memory-Mapped I/O in SQLite

Beginning with version 3.7.17, SQLite has the option of accessing disk content directly using memory-mapped I/O and the new xFetch() and xUnfetch() methods on sqlite3_io_methods.
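
Enabling it is a one-liner. A small sketch using Python’s built-in sqlite3 module (the database file name is made up; the pragma exists since SQLite 3.7.17):

```python
import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file

# Ask SQLite to memory-map up to 256 MiB of the database file.
# The pragma returns the limit actually in effect; 0 means mmap
# is disabled or unavailable in this build.
granted = conn.execute("PRAGMA mmap_size=268435456").fetchone()[0]
print("mmap_size in effect:", granted)
```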

As with the docs about atomic commits, this is one of the best, most succinct, and clearest docs you’ll read about memory-mapped files, their pros and cons, and how SQLite uses them.

If you are a MongoDB user, you should read this.

✚ Check out the HN thread to see how many people love SQLite.

Original title and link: Memory-Mapped I/O in SQLite (NoSQL database©myNoSQL)

via: http://www.sqlite.org/mmap.html


NoSQL and Full Text Indexing: Two Trends

On one side:

  1. DataStax with Solr
  2. MapR with LucidWorks Search (nb: Solr)

and on the other side:

  1. Riak Search: Solr-like, but a custom proprietary implementation
  2. MongoDB text search: a custom proprietary implementation

I’m not going to argue the pros and cons of each approach, but I’m sure you already know which one I’m in favor of.

Original title and link: NoSQL and Full Text Indexing: Two Trends (NoSQL database©myNoSQL)