


Powered by NoSQL: All content tagged as Powered by NoSQL in NoSQL databases and polyglot persistence

Powered by Hadoop and Hive: Budgeting for snow removal in your local community

I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!

Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.

Hadoop FTW!

Original title and link: Powered by Hadoop and Hive: Budgeting for snow removal in your local community (NoSQL database©myNoSQL)


MongoDB at Viber Media: The Platform Enabling Free Phone Calls and Text Messaging for Over 18 Million Active Users

Back in November there was quite a bit of buzz around MongoDB being behind Viber Media’s technology for free phone calls and text messaging. Understandably so, considering we are talking about a platform with more than 18 million active users talking for more than 11 million minutes every day, and these numbers have probably grown quite a bit over the holiday season.

The nice folks from Viber Media[1] have been kind enough to share more details about their platform and the way MongoDB is used. Here is the complete exchange:

Q: Could you briefly describe how your application works so we could better understand where MongoDB fits into your architecture?

Viber’s mobile clients connect to a central service that can route messages to other such clients. These messages can either be text messages or “signals” for establishing a phone call. These front-end servers use MongoDB as a common data-store. We store variable length documents that include dictionaries.

Q: What were the main reasons that led you to use MongoDB? Were there other solutions that you’ve been tempted to use for your architecture?

We started with proprietary code, but with the large increase in the number of new registrations per day, we realized that we needed a database that would be both scalable and redundant. At that time, MongoDB was the only database that looked like a good fit for both.

Q: The announcement mentioned that currently your clusters run on 130 nodes in the Amazon cloud. Could you describe the deployment and what components of the Amazon cloud are involved?

We have 65 MongoDB shards. Each shard consists of a master and a slave. A single EC2 instance is used for running arbiters for all shards. We are using a RAID5 (moving to RAID10) volume consisting of 6 EBS volumes for each MongoDB machine. All instances are m2.xlarge but we plan to migrate to larger instances.

More Amazon technology at work:

  • we are using ELB as a front-end for our proprietary load-balancers and for off-loading HTTPS processing
  • we are using S3 for storing pictures sent between users.
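The figures in this answer can be cross-checked with a little arithmetic. The following is only a sketch: the function names are hypothetical, the assumption that the 130 nodes count only the master and slave of each shard is mine, and capacity is expressed in whole EBS volumes because the interview does not give volume sizes.

```python
def deployment_size(shards, members_per_shard=2, ebs_per_machine=6):
    """Count MongoDB data nodes and the EBS volumes behind them."""
    data_nodes = shards * members_per_shard  # one master + one slave per shard
    return {
        "data_nodes": data_nodes,
        "ebs_volumes": data_nodes * ebs_per_machine,
    }


def usable_volumes(raid_level, volumes):
    """Usable capacity of an array, measured in single-volume units."""
    if raid_level == "raid5":
        return volumes - 1   # one volume's worth of space holds parity
    if raid_level == "raid10":
        return volumes // 2  # every volume is mirrored
    raise ValueError(f"unsupported RAID level: {raid_level}")


viber = deployment_size(shards=65)
print(viber["data_nodes"])          # matches the 130 nodes from the question
print(usable_volumes("raid5", 6))   # current layout
print(usable_volumes("raid10", 6))  # planned layout
```

Under these assumptions, the planned move from RAID5 to RAID10 trades usable space (5 volumes' worth down to 3 per machine) for mirroring.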

Q: How do you monitor your MongoDB cluster? Are there people in your team dedicated to managing the MongoDB cluster?

We have a small team to support our application and MongoDB cluster (we’re looking for MongoDB admins, BTW). We use our own monitoring server to monitor the whole cluster, and 10gen’s MMS (MongoDB Monitoring Service) solely to monitor MongoDB.

Q: Your platform has seen amazing growth, reaching 18 million active users in less than a year. What has this growth meant in terms of evolving and managing the MongoDB deployment?

Hard work :). MongoDB has been very useful for increasing our reach to active users. Our exact methods are proprietary and therefore cannot be disclosed.

Q: What were the most notable moments in the evolution of your MongoDB cluster? Has it seen any radical changes over the time? Did you have to migrate your cluster to newer versions of MongoDB, etc.?

We have migrated from version 1.7.6 to 1.8 and now to 2.0. We are still having a few problems with the latest version, but we keep improving all the time.

Q: Were there any (major) bumps in the road with MongoDB? Or differently put, are there areas in which you’d like to see MongoDB improving?

  1. The config server database does not recover on its own (it has no master-slave replication). Misunderstanding this cost us 24 hours of downtime with Viber at the beginning.
  2. The memory consumption of MongoDB is too high.

Thanks guys and good luck growing your platform!

  1. My thanks also to Meghan Gill and Darah Roslyn, who helped arrange this interview.


Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress

Drew Conway:

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API. […] But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages; including but not restricted to; curl, map/reduce, Javascript, and JSON. And that’s before you have even done any analysis.
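To give a feel for the map/reduce style of querying mentioned in the quote, here is a small in-memory sketch in Python. It is not the project's actual view code (CouchDB views are normally written in JavaScript), and the document fields `member` and `created_at` are hypothetical placeholders, not the real schema of the data set.

```python
from collections import defaultdict


def map_tweet(doc):
    """Emit one ((member, month), 1) pair per tweet, like a CouchDB map function."""
    month = doc["created_at"][:7]  # assumes ISO dates such as "2012-01-15"
    yield (doc["member"], month), 1


def reduce_counts(values):
    """Sum the emitted values for one key, like a CouchDB reduce function."""
    return sum(values)


def run_view(docs):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_tweet(doc):
            grouped[key].append(value)
    return {key: reduce_counts(values) for key, values in sorted(grouped.items())}


# Made-up placeholder documents, only to show the shape of the result.
docs = [
    {"member": "member_a", "created_at": "2012-01-03"},
    {"member": "member_a", "created_at": "2012-01-20"},
    {"member": "member_b", "created_at": "2012-01-11"},
    {"member": "member_a", "created_at": "2012-02-01"},
]
for key, count in run_view(docs).items():
    print(key, count)
```

The result is a monthly tweet count per member, which is the kind of aggregate the post title refers to.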



VoltDB for Real-Time Network Monitoring

From the announcement of VoltDB being used by the Japanese ISP, Sakura Internet, for their real-time Internet traffic monitoring and analysis platform for detecting and mitigating large-scale distributed denial of service (DDoS) attacks:

Tamihiro Yuzawa[1]: Our system needs to be capable of sifting through massive amounts of traffic flow data in real-time.  VoltDB was our choice from the beginning because it’s a super-fast datastore that supports SQL. 

Scott Jarr[2]: Sakura’s security infrastructure requires a datastore that can scale massively and on demand, without sacrificing data accuracy.

Mark these VoltDB keywords:

  1. fast (read in-memory)
  2. data consistency
  3. SQL

  1. Tamihiro Yuzawa: Systems Engineer at Sakura Internet  

  2. Scott Jarr: VoltDB CEO  


Why We Chose HBase for AppFirst APM

Its performance had a significant impact on our decision making as well. It sustains an enormous number of writes and the read cycle times were much better than we had anticipated. Further, it gives us the option to interact with the Hadoop Ecosystem, including HDFS, Mapreduce, and Zookeeper frameworks. Our enthusiasm for HBase skyrocketed when we discovered how to create map-reduce apps to do a number of management tasks. While Cassandra also has these capabilities, its data model was fundamentally more complex.

What if the whole post had simply said: we chose HBase because

  1. it integrates seamlessly with the Hadoop ecosystem
  2. the scalable time series database OpenTSDB is built on top of HBase?



MongoDB in Numbers: Foursquare, Wordnik, Disney

Derrick Harris:

If you’re wondering what kind of performance and scalability requirements forced these companies to MongoDB, and then to customize it so heavily, here are some statistics:

  • Foursquare:
    • 15 million users;
    • 8 production MongoDB clusters;
    • 8 shards of user data;
    • 12 shards of check-in data;
    • ~250 updates per second on user database, with maximum output of 46 MBps;
    • ~80 check-ins per second on check-in database, with maximum output of 45 MBps;
    • up to 2,500 HTTP queries per second.
  • Wordnik:
    • Tens of billions of documents with more always being added;
    • more than 20 million REST API calls per day;
    • mapping layer supports 35,000 records per second.
  • Disney:
    • More than 1,400 MongoDB instances (although “your eyes start watering after 30,” Stevens said);
    • adding new instances every day, via a custom-built self-service portal, to test, stage and host new games.

Add to these Viber Media numbers:

  • 30 million plus registered mobile users
  • 18 million active users talking 11 million minutes every day

I have an exclusive interview with the Viber Media people queued up for the coming days.



Cassandra, Zookeeper, Scribe, and Node.js Powering Rackspace Cloud Monitoring

Paul Querna describes the original architecture of Cloudkick and the one that powers the recently announced Rackspace Cloud Monitoring service:

Development framework: from Twisted Python and Django to Node.js

Cloudkick was primarily written in Python. Most backend services were written in Twisted Python. The API endpoints and web server were written in Django, and used mod_wsgi. […] Cloud Monitoring is primarily written in Node.js.

Storage: from master-slave MySQL to Cassandra

Cloudkick was reliant upon a MySQL master and slaves for most of its configuration storage. This severely limited both scalability, performance and multi-region durability. These issues aren’t necessarily a property of MySQL, but Cloudkick’s use of the Django ORM made it very difficult to use MySQL radically differently. The use of MySQL was not continued in Cloud Monitoring, where metadata is stored in Apache Cassandra.

Even more Cassandra:

Cloudkick used Apache Cassandra primarily for metrics storage. This was a key element in keeping up with metrics processing, and providing a high quality user experience, with fast loading graphs. Cassandra’s role was expanded in Cloud Monitoring to include both configuration data and metrics storage.

Event processing: from RabbitMQ to Zookeeper and a bit more Cassandra

RabbitMQ is not used by Cloud Monitoring. Its use cases are being filled by a combination of Apache Zookeeper, point to point REST or Thrift APIs, state storage in Cassandra and changes in architecture.

And finally Scribe:

Cloudkick used an internal fork of Facebook’s Scribe for transporting certain types of high volume messages and data. Scribe’s simple configuration model and API made it easy to extend for our bulk messaging needs. Cloudkick extended Scribe to include a write ahead journal and other features to improve durability. Cloud Monitoring continues to use Scribe for some of our event processing flows.



Neo4j and Spring Data for Configuration Management Database

Willie Wheeler describing the challenges of a configuration management database:

My experience has been that the data persistence layer is the one that’s most challenging to change. Besides the actual schema changes, we have to write data migration scripts, we have to make corresponding changes to our integration test data scripts, we have to make sure Hibernate’s eager- and lazy-loading are doing the right things, sometimes we have to change the domain object APIs and associated Hibernate queries, etc. Certainly doable, but there’s generally a good deal of planning, discussion and testing involved.

Then the reasons for using Neo4j and Spring Data to build it:

  • There are many entities and relationships.
  • We need schema agility to experiment with different CMDB approaches.
  • We need schema agility to accommodate continuing innovations in infrastructure.
  • We need schema flexibility to accommodate the needs of different organizations.
  • But we still need structure.
  • A schemaless backend makes zero-downtime deployments easier.
  • We want to support intuitive querying.

Solving the same problem, Puppet is using CouchDB for configuration management.



Redis Bitmaps for Real-Time Metrics at Spool

Chandra Patni:

Traditionally, metrics are performed by a batch job (running hourly, daily, etc.). Redis backed bitmaps allow us to perform such calculations in realtime and are extremely space efficient. In a simulation of 128 million users, a typical metric such as “daily unique users” takes less than 50 ms on a MacBook Pro and only takes 16 MB of memory. Spool doesn’t have 128 million users yet but it’s nice to know our approach will scale. We thought we’d share how we do it, in case other startups find our approach useful.

This is quite different from the classical approach of logging historical events and computing metrics from the logs later.
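The bitmap idea from the quote can be sketched without a Redis server. The class below is an in-memory stand-in, not Spool's implementation: each user ID maps to one bit, marking a user active is the same idea as Redis `SETBIT`, counting set bits corresponds to `BITCOUNT`, and OR-ing several days together corresponds to `BITOP OR`. The class and function names are hypothetical.

```python
def bytes_needed(max_users):
    """Bytes required to hold one bit per user."""
    return (max_users + 7) // 8


class DailyBitmap:
    """One bit per user ID; a set bit means the user was active that day."""

    def __init__(self, max_users):
        self.bits = bytearray(bytes_needed(max_users))

    def mark_active(self, user_id):
        # Same idea as: SETBIT <key> <user_id> 1
        self.bits[user_id // 8] |= 1 << (user_id % 8)

    def is_active(self, user_id):
        return bool(self.bits[user_id // 8] & (1 << (user_id % 8)))

    def count(self):
        # Same idea as: BITCOUNT <key>
        return bin(int.from_bytes(self.bits, "little")).count("1")


def unique_across(days):
    """OR equally sized bitmaps together, like BITOP OR over several keys."""
    merged = DailyBitmap(0)
    merged.bits = bytearray(len(days[0].bits))
    for day in days:
        for i, byte in enumerate(day.bits):
            merged.bits[i] |= byte
    return merged


monday, tuesday = DailyBitmap(1000), DailyBitmap(1000)
for uid in (3, 3, 42, 999):  # marking the same user twice changes nothing
    monday.mark_active(uid)
for uid in (42, 500):
    tuesday.mark_active(uid)

print(monday.count())                            # daily unique users
print(unique_across([monday, tuesday]).count())  # unique users over both days
print(bytes_needed(128_000_000))                 # bytes for 128 million users
```

The last line shows where the 16 MB figure in the quote comes from: 128 million bits divided by 8 is 16 million bytes.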



Muscula Architecture: Node.js, MongoDB, and CDN

Allan Ebdrup:

The frontend interfaces with the backend entirely through JSONP calls and the only thing passed through those JSONP calls is pure data as JSON strings. There is no html-markup whatsoever on the backend. The separation between frontend and backend is logical as well as physical. The frontend is hosted on entirely different servers than the backend and on a different domain. […] The backend is built with Node.JS and MongoDB. The backend is also layered in tiers. There is a layer for security, a layer for returning uniform error messages and status codes and other layers.

Among the listed advantages of such an architecture:

  • one programming language in the entire technology stack: JavaScript.
  • scalable
  • very fast load time for the users of the application
  • schemaless database
  • no ORM

And obviously this is a much better solution than having the front-end directly access the data store.
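For readers unfamiliar with JSONP, the exchange described in the quote boils down to the backend wrapping a JSON payload in a call to a function named by the frontend. The sketch below illustrates that in Python, although Muscula's backend is Node.js; the function name, the callback name, and the payload are all hypothetical.

```python
import json
import re

# Accept only plain JavaScript identifiers as callback names, so the
# response cannot be used to inject arbitrary script.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def jsonp_response(callback, data):
    """Wrap JSON-serializable data in a call to the client's callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({json.dumps(data)});"


# The frontend would load this through a script tag pointing at the
# backend's domain; the browser then runs the callback with pure data.
print(jsonp_response("handleResult", {"status": "ok"}))
```

Validating the callback name is a design choice rather than part of JSONP itself, but it is a common precaution when the name comes from a query string.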
