ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

scalability: All content tagged as scalability in NoSQL databases and polyglot persistence

Videos from Surge Conference

Even if not focused on NoSQL, the videos from the Surge conference are covering very interesting aspects related to scalability. Here are a couple of examples:

  • Theo Schlossnagle: Scalable Design Patterns
  • Justin Sheehy: Embracing Concurrency at Scale
  • Ronald Bradford: The most common MySQL scalability mistakes, and how to avoid them
  • Ruslan Belkin: Going 0 to 60: Scaling LinkedIn
  • Robert Treat: Database Scalability Patterns
  • Artur Bergman: Scaling and Loadbalancing Wikia Across The World
  • Mike Malone: Working with Dimensional Data in a Distributed Hash Table
  • Gavin M. Roy: Scaling myYearbook.com - Lessons Learned From Rapid Growth
  • Benjamin Black: Go with the flow - Meditations on network infrastructure analysis
  • John Allspaw: The “Go or No-Go”: Operability and Contingency at Etsy
  • Rod Cope: Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud

Last but not least there’s also a “SQL vs NoSQL” panel featuring Geir Magnusson Jr (Moderator), Robert Treat, Baron Schwartz, Mike Malone and Justin Sheehy.

Enjoy!

Original title and link: Videos from Surge Conference (NoSQL databases © myNoSQL)


Citrix sprinkles apps magic on SQL, NoSQL data

The Register:

Native support for SQL and NoSQL has been added to Citrix Systems’ NetScaler, which until now had specialized in high-availability only for applications. […] The change means NetScaler works with Microsoft’s SQL Server, MySQL and NoSQL data stores and Yahoo!’s number-crunching platform Hadoop and Google’s MapReduce, Citrix said. […] NetScaler sits in front of the web server to mange and balance the traffic. With the addition of SQL and NoSQL, NetScaler will now sit in front of relational and non relational data stores and clusters such as Hadoop and architectures like MapReduce and serve up photos, Tweets, status updates, e-transactions, or enterprise sales reports.

What if you don’t believe in magic and ask:

  • what NoSQL databases are supported?
  • how can NetScaler be more than a load balancer?

When you read about Twitter’s Gizzard — the library for creating distributed datastores — there’s no mention of magic.

Original title and link: Citrix sprinkles apps magic on SQL, NoSQL data (NoSQL databases © myNoSQL)

via: http://www.theregister.co.uk/2011/03/29/netscaper_citric_data/


MongoDB, NoSQL, and Web Scale with 10gen CTO Eliot Horowitz

From thechangelog podcast:

  • MongoDB roadmap: MongoDB 2.0 will be focusing on concurrency, aggregation, online compaction, and TTL temporal collections
  • MongoDB single server durability as the most awaited feature of the recent MongoDB 1.8 release
  • MongoDB compared with Riak and CouchDB
  • MongoDB geo support

And it is cool that Eliot and 10gen think the Web Scale meme is all in good fun.

Original title and link: MongoDB, NoSQL, and Web Scale with 10gen CTO Eliot Horowitz (NoSQL databases © myNoSQL)


How Scalable is VoltDB?

Percona guys[1] have run, analyzed, and concluded about VoltDB scalability:

VoltDB is very scalable; it should scale to 120 partitions, 39 servers, and 1.6 million complex transactions per second at over 300 CPU cores

Considering the definition: “A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.”, the conclusion should be slightly updated:

VoltDB can scale up to 120 partitions on 39 servers with 300 CPU cores and 1.6 million TPS.

Bottom line:

  • if you can fit your data into 40 servers’ memory
  • you need ACID and SQL
  • you are OK precompiled Java based stored procedures
  • you don’t need multi data center deployments

now you can estimate how far you can go with VoltDB.


  1. The company specialized on MySQL services and behind the MySQL Performance Blog  

Original title and link: How Scalable is VoltDB? (NoSQL databases © myNoSQL)

via: http://www.mysqlperformanceblog.com/2011/02/28/is-voltdb-really-as-scalable-as-they-claim/


Scaling Graph Databases: Sharding and Consistent Routing

The subject of scaling graph databases is popping up every now and then proving that sharding highly connected data is still an unresolved problem.

Jim Webber[1] has published a series of posts — here, here, and here — discussing generic graph sharding solutions, the route Neo4j is taking for addressing this problem, and, in the last post, a simple strategy — suggested by Mark Harwood — for deciding a graph database scaling approach:

  1. Dataset size: Many tens of Gigabytes
    Strategy: Fill a single machine with RAM

  2. Dataset size: Many hundreds of Gigabytes
    Strategy:Cache sharding

  3. Dataset size: Terabytes and above
    Strategy: Domain-specific sharding

The cache sharding approach — described here — suggests replacing the problem of graph sharding with “the simpler problem of consistent routing”. But I’m not sure how this solution works effectively:

  1. if there is a way to provide consistent routing isn’t that equivalent with having an solution for sharding the graph?
  2. considering graph databases are chatty — in the sense that most of the time there is a complex traversal happening — how smart should the client and the router be to work effectively and efficiently?

For now I feel that being able to solve the consistent routing in a graph implies having a strategy for sharding a graph. The reverse applies too: having a sharding strategy would provide a routing solution. Thus scaling graph databases remains a subject open for research.


  1. Jim Webber: Chief Scientist with Neo Technology, @jimwebber  

Original title and link: Scaling Graph Databases: Sharding and Consistent Routing (NoSQL databases © myNoSQL)


The Story of Scaling Union Station

You think you are ready to scale:

Even before the launch, our software architecture was designed to be redundant (no single point of failure) and to scale horizontally across multiple machines if necessary. The most important components are the web app, MySQL, MongoDB (NoSQL database), RabbitMQ (queuing server) and our background workers.

Union Station Architecture

And then the reality check:

Unfortunately MongoDB’s shard rebalancing system proved to be slower than we hoped it would be. During rebalancing there was so much disk I/O we could process neither read nor write requests in reasonable time. We were left with no choice but to take the website and the background workers offline during the rebalancing.

There are a couple of lessons to be learned here:

  • there’s no silverbullet solution for scaling
  • if you haven’t tested a specific scenario, you might have surprises when you’ll have do to it in the production environment

Original title and link: The Story of Scaling Union Station (NoSQL databases © myNoSQL)

via: http://blog.phusion.nl/2011/03/04/union-station-is-back-online-and-heres-what-we-have-been-up-to/


9 Things to Acknowledge about NoSQL Databases

Excellent list:

  1. Understand how ACID compares with BASE (Basically Available, Soft-state, Eventually Consistent)
  2. Understand persistence vs non-persistence, i.e., some NoSQL technologies are entirely in-memory data stores
  3. Recognize there are entirely different data models from traditional normalized tabular formats: Columnar (Cassandra) vs key/value (Memcached) vs document-oriented (CouchDB) vs graph oriented (Neo4j)
  4. Be ready to deal with no standard interface like JDBC/ODBC or standarized query language like SQL; every NoSQL tool has a different interface
  5. Architects: rewire your brain to the fact that web-scale/large-scale NoSQL systems are distributed across dozens to hundreds of servers and networks as opposed to a shared database system
  6. Get used to the possibly uncomfortable realization that you won’t know where data lives (most of the time)
  7. Get used to the fact that data may not always be consistent; ‘eventually consistent’ is one of the key elements of the BASE model
  8. Get used to the fact that data may not always be available
  9. Understand that some solutions are partition-tolerant and some are not

Print it out and distribute it among your colleagues.

Original title and link: 9 Things to Acknowledge about NoSQL Databases (NoSQL databases © myNoSQL)

via: http://www.evidentsoftware.com/nosql-basics-for-the-rdbms-savvy/


MongoDB Auto-Sharding

I know this will sound like bashing MongoDB. But I’ve already said it a couple of times: MongoDB scaling looks complicated.

People coming to MongoDB think their scalability issues are of no concern as MongoDB supports auto-sharding and replica sets. I would advise everyone to check the MongoDB mailing list before relaxing:

  • “Huge collection is not balanced across shards” here
  • “Questions on Enabling Sharding on a Database (pre-sharded configuration)” here
  • “mongodb crash when new shard is added” here
  • “Shards Unevenly balanced” here
  • “Imbalance on auto-sharding” here
  • “Sharding woes - All chuncks stuck on one shard”
  • “sharding makes reads/writes slower” here

It takes time to perfect complicated solutions and it is your job to verify the current status. Hopefully the next version, MongoDB 1.8.0 will address many of these issues. As it’ll also do with single server durability.

Original title and link: MongoDB Auto-Sharding (NoSQL databases © myNoSQL)


4 Database Technologies for Large Scale Data

Park Kieun (CUBRID Cluster Architect) gives an introduction to 4 large scale database technologies:

  • Massively Parallel Processing (MPP) or parallel DBMS – A system that parallelizes the query execution of a DBMS, and splits queries and allocates them to multiple DBMS nodes in order to process massive amounts of data concurrently.

Examples: EBay DW, Yahoo! Everest Architecture, Greenplum, AsterData

  • Column-oriented database – A system that stores the values in the same field as a column, as opposed to the conventional ow method that stores them as individual records.

Examples: Vertica, Sybase IQ, MonetDB

  • Streaming processing (ESP or CEP) – A system that processes a constant data (or events) stream, or a concept in which the content of a database is continuously changing over time.

Examples: Truviso

  • Key-value storage (with MapReduce programming model) – A storage system that focuses on enhancing the performance when reading a single record by adopting the key-value data model, which is simpler than the relational data model.

Examples: many of the NoSQL databases covered here.

Even if I came up with the same 5 categories for scalable storage solutions, Park’s list is better documented. However we both left out distributed filesystems (sorry Jeff).

Original title and link: 4 Database Technologies for Large Scale Data (NoSQL databases © myNoSQL)

via: http://blog.cubrid.org/web-2-0/database-technology-for-large-scale-data/


Scaling with Cassandra

Peter Schuller’s Scaling with Apache Cassadra recorded at Oredev:

I watched only the first couple of minutes, so comments and feedback are welcome.

Original title and link: Scaling with Cassandra (NoSQL databases © myNoSQL)


Clustrix Sierra Clustered Database Engine

In the light of publicly announcing customers, I wanted to read a bit about Clustrix Clustered Database Systems.

The company homepage is describing the product:

  • scalable database appliaces for Internet-scale work loads
  • Linearly Scalable: fully distributed, parallel architecture provides unlimited scale
  • SQL functionality: full SQL relational and data consistency (ACID) functionality
  • Fault-Tolerant: highly available providing fail-over, recovery, and self-healing
  • MySQL Compatible: seamless deployment without application changes.

All these sounded pretty (too) good. And I’ve seen a very similar presentation for Xeround: Elastic, Always-on Storage Engine for MySQL.

So, I’ve continued my reading with the Sierra Clustered Database Engine whitepaper (PDF).

Here are my notes:

  • Sierra is composed of:
    • database personality module: translates queries into internal representation
    • distributed query planner and compiler
    • distributed shared-nothing execution engine
    • persistent storage
    • NVRAM transactional storage for journal changes
    • inter-node Infiniband
  • queries are decomposed into query fragments which are the unit of work. Query fragments are sent for execution to nodes containing the data.
  • query fragments are atomic operations that can:
    • insert, read, update data
    • execute functions and modify control flow
    • perform synchronization
    • send data to other nodes
    • format output
  • query fragments can be executed in parallel
  • query fragments can be cached with parameterized constants at the node level
  • determining where to sent the query fragments for execution is done using either range-based rules or hash function
  • tables are partitioned into slices, each slice having redundancy replicas
    • size of slices can be automatically determined or configured
    • adding new nodes to the cluster results in rebalancing slices
    • slices contained on a failed device are reconstructed using their replicas
  • one of the slices is considered primary
  • writes go to all replicas and are transactional
  • all reads fo the the slice primary

The paper also exemplifies the execution of 4 different queries:

SELECT * FROM T1 WHERE uid=10

SELECT uid, name FROM T1 JOIN T2 on T1.gid = T2.gid WHERE uid=10

SELECT * FROM T1 WHERE uid<100 and gid>10 ORDER BY uid LIMIT 5

INSERT INTO T1 VALUES (10,20)

Questions:

  • who is coordinating transactions that may be executed on different nodes?
  • who is maintains the topology of the slices? In case of a node failure, you’d need to determine:
    1. what slices where on the failing node
    2. where are the replicas for each of these slices
    3. where new replicas will be created
    4. when will new replicas become available for writes
  • who elects the slice primary?

Clustrix Sierra Clustered Database Engine

Original title and link: Clustrix Sierra Clustered Database Engine (NoSQL databases © myNoSQL)


Scaling with MongoDB Video

Kristina Chodorow’s giving the serious version of MongoDB is web scale in an O’Reilly webcast: scaling with MongoDB: