Powered by NoSQL: All content tagged as Powered by NoSQL in NoSQL databases and polyglot persistence

Polyglot Persistence Architecture at Socialize: Splunk for MapReduce & Big Data Analysis

A very informative post on the Socialize blog about their data flow and the data analysis stack used to process it. The post is missing the architecture diagram, so I took the time to reconstruct it based on the details in the article:

Socialize polyglot persistence architecture


The traditional solution is to use aggregate functions in the RDBMS such as count() to get the aggregate results but this presents a few problems at a large scale:

  1. Aggregating rows in a database creates unneeded load on the server
  2. Data could be stored in multiple sharded databases and the aggregated results would be inaccurate.
  3. Data could be stored in other datastore like a NoSQL datastore or even flat log files.
  4. Data is stored in an uncommon format across many sources.
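The problems listed above point toward computing aggregates outside the database, map-reduce style: each source produces partial counts, and the partials are merged afterwards. Here is a minimal Python sketch of that idea. The record shape, the log line format, and all function names are assumptions made for illustration; none of them come from the Socialize post, and in-memory lists stand in for the shards and log files.

```python
from collections import Counter


def count_events(records, key):
    """Map step: partial counts for a single source."""
    return Counter(record[key] for record in records)


def parse_log_line(line):
    """Normalize a flat log line ('user action') into the common record shape."""
    user, action = line.split()
    return {"user": user, "action": action}


def merge_counts(partials):
    """Reduce step: sum the partial counts from every source."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sources: two database shards and one flat log file.
shard_a = [{"user": "u1", "action": "like"}, {"user": "u2", "action": "share"}]
shard_b = [{"user": "u3", "action": "like"}]
log_lines = ["u4 like", "u5 comment"]

partials = [
    count_events(shard_a, "action"),
    count_events(shard_b, "action"),
    count_events(map(parse_log_line, log_lines), "action"),
]
totals = merge_counts(partials)
```

No single database server has to run the aggregate here, and a source in a different format only needs its own small normalizer before the map step.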

Original title and link: Polyglot Persistence Architecture at Socialize: Splunk for MapReduce & Big Data Analysis (NoSQL database©myNoSQL)

LinkedIn NoSQL Paper: Serving Large-Scale Batch Computed Data With Project Voldemort

The abstract of a new paper from a team at LinkedIn (Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, Sam Shah):

Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. The performance degradation is largely due to index creation and modification as CPU and memory resources are shared with request serving. We have extended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon’s Dynamo, to support bulk loading terabytes of read-only data. This extension constructs the index offline, by leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deployment pipeline scales to twice the request throughput while maintaining sub 5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than 2 years and serves many of the data-intensive social features on the site.

Read or download the paper after the break.

Cassandra as the Central Nervous System of Your Distributed Systems with Joe Stein - Powered by NoSQL

In the 4th week of DataStax’s Cassandra NYC 2011 video series, we have Joe Stein from Medialets talking about the architecture.

Before diving into the video here are some interesting data points:

  • Medialets serves rich media ads
    • they handle 3-4TB of daily data
    • microsecond-level response times
  • Cassandra is used for time series and aggregate metrics
  • all MapReduce jobs written in Python. This reminded me of the recent post about the performance impact of operations in Hadoop Map phase
  • Medialets architecture:

    Medialets architecture

  • Major components of the Medialets architecture:

    • Kafka
    • MySQL
    • Cassandra: 6 node cluster, 100k requests, single DC
    • Hadoop
    • ZooKeeper: coordinates all the services on the platform
  • some of the data in MySQL is replicated in Cassandra (and coordinated with ZooKeeper)
  • data is fed back to MySQL
  • Kafka for collecting analytics data:
    • aggregates go into Cassandra
    • events in Hadoop
  • GROUP BY with Cassandra
  • for real-time systems aggregations must be done upfront
  • the way data is segmented is critical
  • aggregation leads to data explosion
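The last four points fit together: without a GROUP BY at read time, every aggregate a report needs has to be incremented when the event is written, and the number of counters grows with every combination of dimensions. What follows is a rough Python sketch of that trade-off, not Medialets' actual schema. A plain dictionary stands in for Cassandra counter columns, and the dimension names (`app`, `ad`, `country`) and the hourly bucket are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone
from itertools import combinations

# Stand-in for counter columns; a real deployment would increment counters in Cassandra.
counters = defaultdict(int)

# Hypothetical dimensions by which the data is segmented.
DIMENSIONS = ("app", "ad", "country")


def hour_bucket(ts):
    """Time-series bucket for an event timestamp (hourly granularity)."""
    return ts.strftime("%Y-%m-%dT%H")


def record_event(event, ts):
    """Increment one counter per non-empty subset of dimensions.

    Returns how many counters a single event touched.
    """
    touched = 0
    hour = hour_bucket(ts)
    for size in range(1, len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            key = (hour,) + tuple((d, event[d]) for d in dims)
            counters[key] += 1
            touched += 1
    return touched


def read_count(hour, **filters):
    """Read a precomputed aggregate: a single key lookup, no scan or GROUP BY."""
    key = (hour,) + tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return counters[key]


ts = datetime(2011, 12, 6, 14, 30, tzinfo=timezone.utc)
touched = record_event({"app": "a1", "ad": "x", "country": "US"}, ts)
record_event({"app": "a1", "ad": "y", "country": "US"}, ts)

by_app_country = read_count("2011-12-06T14", app="a1", country="US")
by_ad = read_count("2011-12-06T14", ad="x")
```

With three dimensions a single event already updates seven counters, which is the data explosion the talk warns about. Choosing which segments to precompute is the critical design decision.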

Cassandra at Clearspring with Chris Burroughs - Powered by NoSQL

For today’s Powered by Cassandra video from the Cassandra NYC 2011 event organized by DataStax, I chose Chris Burroughs’s presentation about Clearspring’s usage of Cassandra. In case you wonder what Clearspring does, the sharing buttons you see here on myNoSQL are powered by the AddThis product from Clearspring.

Cassandra 101 for System Administrators with Nathan Milford - Powered by NoSQL

While today was supposed to bring a new educational video from the Cassandra NYC 2011 video series, I thought that learning from the lessons of operating Cassandra at Outbrain to serve over 30 billion impressions monthly would be quite educational.

Polyglot persistence at Pinterest: Redis, Membase, MySQL

Pinterest architecture

I’ve created the diagram above based on this very brief answer on Quora:

We use python + heavily-modified Django at the application layer.  Tornado and (very selectively) node.js as web-servers.  Memcached and membase / redis for object- and logical-caching, respectively.  RabbitMQ as a message queue.  Nginx, HAproxy and Varnish for static-delivery and load-balancing.  Persistent data storage using MySQL.  MrJob on EMR for map-reduce.

Data from October 2011 showed Pinterest having over 3 million users generating 400+ million pageviews. There are plenty of questions to be answered though:

  1. what is node.js used for? what is RabbitMQ used for?

    Note: the whole section in the diagram about node.js and RabbitMQ is speculative.

  2. is Amazon Elastic MapReduce used for clickstream analysis only (log based analysis) or more than that?

  3. how is data loaded in the Amazon cloud?

    Note: if Amazon Elastic MapReduce is used only for analyzing logs, these are probably uploaded regularly on Amazon S3.

  4. why the need for both Redis and Membase?


Scaling Video Analytics with Cassandra by Ilya Maykov - Powered by NoSQL

To keep with last week’s model (an educational video about Cassandra, followed by a Cassandra case study), today’s video in the Cassandra NYC 2011 video series from DataStax has Ilya Maykov describing how Cassandra is used at Ooyala for computing multi-dimensional video analytics reports for 100M+ monthly unique users in near-real-time.

The Design of 99designs - A Clean Tens of Millions Pageviews Architecture

By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I’ve linked and commented on earlier.


99designs: Powered by Amazon RDS, Redis, MongoDB, and Memcached

While the authoritative storage is Amazon RDS, 99designs is using Redis, MongoDB, and Memcached for transient data:

We log errors and statistics to capped collections in MongoDB, providing us with more insight into our system’s performance. Redis captures per-user information about which features are enabled at any given time; it supports our development strategy around dark launches, soft launches and incremental feature rollouts.
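To make the two transient-data patterns in that quote concrete, here is a small Python sketch. It is only an illustration under stated assumptions: in-memory structures stand in for Redis and MongoDB, and the class names, the feature name, and the capacity are hypothetical rather than taken from the 99designs post. A set per user models the per-user feature flags, and a bounded deque models a capped collection, which discards the oldest entries once it is full.

```python
from collections import defaultdict, deque


class FeatureFlags:
    """Per-user feature flags; a dict of sets stands in for one Redis set per user."""

    def __init__(self):
        self._enabled = defaultdict(set)

    def enable(self, user_id, feature):
        self._enabled[user_id].add(feature)

    def disable(self, user_id, feature):
        self._enabled[user_id].discard(feature)

    def is_enabled(self, user_id, feature):
        return feature in self._enabled[user_id]


class CappedLog:
    """Fixed-size log; a bounded deque stands in for a MongoDB capped collection."""

    def __init__(self, max_entries):
        self._entries = deque(maxlen=max_entries)

    def append(self, entry):
        # Once the log is full, the oldest entry is dropped automatically.
        self._entries.append(entry)

    def entries(self):
        return list(self._entries)


flags = FeatureFlags()
flags.enable("user-42", "new-checkout")  # dark launch for a single user

errors = CappedLog(max_entries=3)
for n in range(5):
    errors.append({"seq": n, "msg": "timeout"})

kept = [entry["seq"] for entry in errors.entries()]
```

Both stores hold data that can be lost without harm, which is why it can live outside the authoritative database.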

It’s also worth noting the nice things they say about using Amazon RDS:

An RDS instance configured to use multiple availability zones provides master-master replication, providing crucial redundancy for our DB layer. This feature has already saved our bacon multiple times: the fail over has been smooth enough that by the time we realised anything was wrong, another master was correctly serving requests. Its rolling backups provide a means of disaster recovery. We load-balance reads across multiple slaves as a means of maintaining performance as the load on our database increases.
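The read load-balancing mentioned at the end of that quote is usually handled in the application or a proxy: writes go to the master, reads rotate across the replicas. Below is a minimal Python sketch of that routing. The class and the server names are hypothetical, and the 99designs post does not say how they implement it.

```python
from itertools import cycle


class ReadWriteRouter:
    """Send writes to the master and spread reads round-robin over replicas."""

    def __init__(self, master, replicas):
        self.master = master
        # With no replicas configured, reads fall back to the master.
        self._replicas = cycle(replicas) if replicas else None

    def for_write(self):
        return self.master

    def for_read(self):
        if self._replicas is None:
            return self.master
        return next(self._replicas)


router = ReadWriteRouter("master", ["replica-1", "replica-2"])
reads = [router.for_read() for _ in range(4)]
write_target = router.for_write()
```

Round-robin is the simplest policy. A production router would also need to account for replication lag and replica health.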



Cassandra at SocialFlow with Drew Robb - Powered by NoSQL

To alternate a bit after yesterday’s educational CQL: SQL for Cassandra in the Cassandra NYC 2011 video series from DataStax, today’s video is Drew Robb covering Cassandra usage at SocialFlow for capturing real-time data from Twitter and