MySQL: All content tagged as MySQL in NoSQL databases and polyglot persistence

Benchmarking graph databases... with unexpected results

A team from MIT CSAIL set out to benchmark a graph database against 3 relational databases with different models: row-based (MySQL), in-memory (VoltDB), and column-based (Vertica). The results are interesting, to say the least:

We can see that relational databases outperform Neo4j on PageRank by up to two orders of magnitude. This is because PageRank involves full scanning and joining of the nodes and edges table, something that relational databases are very good at doing. Finding Shortest Paths involves starting from a source node and successively exploring its outgoing edges, a very different access pattern from PageRank. Still, we see from Figure 1(b) that relational databases match or outperform Neo4j in most cases. In fact, Vertica is more than twice faster than Neo4j. The only exception is VoltDB over Twitter dataset.

Being beaten at your own game is not a good thing. I hope this is just a fluke in the benchmark (misconfiguration) or a result particular to those data sets.
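
To make the quoted explanation concrete, here is a minimal sketch of PageRank written as repeated scans and joins over a nodes table and an edges table, using Python's built-in sqlite3 module. It is not the benchmark's code: the table layout, the function name `pagerank_sql`, and the damping factor of 0.85 are assumptions for illustration, and the sketch ignores nodes without outgoing edges.

```python
import sqlite3


def pagerank_sql(edges, iterations=20, damping=0.85):
    """PageRank as a repeated full scan and join of nodes and edges.

    edges: list of (src, dst) integer pairs. Every node is assumed to have
    at least one outgoing edge (dangling nodes are not redistributed).
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE edges (src INTEGER, dst INTEGER);
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, rank REAL, out_degree INTEGER);
    """)
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

    ids = sorted({node for edge in edges for node in edge})
    n = len(ids)
    conn.executemany("INSERT INTO nodes VALUES (?, ?, 0)",
                     [(node_id, 1.0 / n) for node_id in ids])
    conn.execute("""
        UPDATE nodes
        SET out_degree = (SELECT COUNT(*) FROM edges WHERE edges.src = nodes.id)
    """)

    for _ in range(iterations):
        # One iteration: join every edge to its source node, then sum the
        # rank shares flowing into each target node.
        rows = conn.execute("""
            SELECT n.id, ? + ? * COALESCE(SUM(s.rank / s.out_degree), 0.0)
            FROM nodes AS n
            LEFT JOIN edges AS e ON e.dst = n.id
            LEFT JOIN nodes AS s ON s.id = e.src
            GROUP BY n.id
        """, ((1.0 - damping) / n, damping)).fetchall()
        # Write the new ranks only after the whole pass has been read.
        conn.executemany("UPDATE nodes SET rank = ? WHERE id = ?",
                         [(rank, node_id) for node_id, rank in rows])

    ranks = dict(conn.execute("SELECT id, rank FROM nodes"))
    conn.close()
    return ranks
```

Each iteration touches every row of both tables exactly once, which is the bulk access pattern the paper says relational engines handle well. Shortest-path search, by contrast, follows edges from one node at a time.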

Original title and link: Benchmarking graph databases… with unexpected results (NoSQL database©myNoSQL)


Google moves from MySQL to MariaDB

Jack Clark for The Register, quoting from Google senior systems engineer Jeremy Cole’s talk at XLDB:

“We’re running primarily on [MySQL] 5.1 which is a little outdated, and so we’re moving to MariaDB 10.0 at the moment,”

I’m wondering how much of this decision is technical and how much is political. While Jack Clark points to the previous “disagreements” between Google and Oracle, when I say political decisions I mean more than this: access to the various bits of the code (e.g. tests, security issues), control over the future of the product, etc.

Original title and link: Google moves from MySQL to MariaDB (NoSQL database©myNoSQL)


Benchmarking the performance impact of Foreign Keys in MySQL Cluster 7.3

FOREIGN KEYs in MySQL Cluster is a big step forward. […] It is implemented natively at the Data Node level, where NDB stores its data. It is well known that FOREIGN KEYs come with an overhead. E.g., when writing a record into a child table, the existence must be checked in the parent table. Since data is distributed across multiple Data Nodes, the child record and parent record may be on different nodes or shards (Node Groups). Hence there is extra work to be done in terms of internal triggers and network communication, the latter being the more costly. The performance impact must be taken into account when doing capacity planning of the cluster. The question is how much the impact is, and that is what we will look at next.

These micro-benchmark numbers are looking good. But here are a couple of questions I couldn’t answer after reading the post:

  1. how was the data distributed inside the cluster? Basically those results could have been achieved with most of the foreign keys actually living on the same machine
  2. how does the impact on performance vary with the size of the cluster? (differently put, how effective is the routing of FK checks and what’s the impact on the cluster network traffic and locks)
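
Both questions come down to how often a child row and its parent row land on different node groups. The sketch below is a rough way to reason about that, not NDB code: it assumes each table is hash-partitioned on its own primary key, uses MD5 as a stand-in for the real partitioning function, and the helper names `node_group` and `remote_fk_check_fraction` are made up for illustration.

```python
import hashlib


def node_group(key, num_groups):
    """Map a key to a node group with a stable hash.

    Assumption: a plain MD5-based hash stands in for the cluster's real
    partitioning function.
    """
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


def remote_fk_check_fraction(child_rows, num_groups):
    """Fraction of child inserts whose parent row lives on another node group.

    child_rows: iterable of (child_pk, parent_pk) pairs.
    """
    total = 0
    remote = 0
    for child_pk, parent_pk in child_rows:
        total += 1
        child_group = node_group(("child", child_pk), num_groups)
        parent_group = node_group(("parent", parent_pk), num_groups)
        if child_group != parent_group:
            remote += 1
    return remote / total if total else 0.0


if __name__ == "__main__":
    rows = [(i, i % 1000) for i in range(20000)]
    for groups in (1, 2, 4, 8):
        print(groups, round(remote_fk_check_fraction(rows, groups), 3))
```

Under these assumptions the remote fraction approaches 1 - 1/N for N node groups, so a benchmark on a very small cluster, or one where parents and children happen to be co-located, would understate the network cost of the checks.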

Original title and link: Benchmarking the performance impact of Foreign Keys in MySQL Cluster 7.3 (NoSQL database©myNoSQL)


Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL

Nokia’s big data ecosystem consists of a centralized, petabyte-scale Hadoop cluster that is interconnected with a 100-TB Teradata enterprise data warehouse (EDW), numerous Oracle and MySQL data marts, and visualization technologies that allow Nokia’s 60,000+ users around the world tap into the massive data store. Multi-structured data is constantly being streamed into Hadoop from the relational systems, and hundreds of thousands of Scribe processes run every day to move data from, for example, servers in Singapore to a Hadoop cluster in the UK. Nokia is also a big user of Apache Sqoop and Apache HBase.

In the coming years you’ll hear more and more stories (sales pitches, really) about single unified platforms solving all these problems at once. But the platforms that will survive and thrive are those that accomplish two things:

  1. keep the data gates open: in and out.
  2. work with other platforms to make this efficient for users

Original title and link: Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL (NoSQL database©myNoSQL)


MySQL 5.6, InnoDB and fast storage: 240k QPS

Mark Callaghan runs some benchmarks against MySQL 5.6.11:

Using MySQL 5.6.11 and InnoDB with a few hacks the peak throughput was about 240,000 QPS and 210,000 block reads/second. The test server has 32 cores (16 physical cores, 32 logical cores with HT enabled). This is a great result that can probably be even better. Contention on fil_system->mutex was the bottleneck and I think that can be improved (see feature request #69276). I wonder if 400,000 block reads/second is possible?

Original title and link: MySQL 5.6, InnoDB and fast storage: 240k QPS (NoSQL database©myNoSQL)


Wikipedia Adopts MariaDB

The technical details of Wikipedia’s migration from MySQL to MariaDB:

As a read-heavy site, Wikipedia aggressively uses edge caching. Approximately 90% of pageviews are served entirely from the edge while at the application layer, we utilize both memcached and redis in addition to MySQL. Despite that, the MySQL databases serving English Wikipedia alone reach a daily peak of ~50k queries/second. Most are read queries served by load-balanced slaves, depending on consistency requirements. 80% of the English Wikipedia query load (up to 40k qps) are typically handled by just two database servers at any given time. Our most common query type (40% of all) has a median execution time of ~0.2ms and a 95th percentile time of ~50ms. To successfully use MariaDB in production, we need it to keep up with the level of performance obtained from Facebook’s MySQL fork, and to behave consistently as traffic patterns change.

As you can see in this post, the only “political” point made is tucked in among the technical reasons:

Equally important, as supporters of the free culture movement, the Wikimedia Foundation strongly prefers free software projects; that includes a preference for projects without bifurcated code bases between differently licensed free and enterprise editions. We welcome and support the MariaDB Foundation as a not-for-profit steward of the free and open MySQL related database community.

Slightly different from the story told in “Wikipedia Migrates to MariaDB”.

Original title and link: Wikipedia Adopts MariaDB (NoSQL database©myNoSQL)


MySQL in the Cloud: Discontinuing of Xeround Cloud Database Public Service

Cloud and MySQL related:

We are deeply sorry to announce that Xeround’s public cloud offering will be discontinued soon. All Xeround FREE database instances will be terminated on May 8th, and the paid plans terminated on May 15th.

This was announced on May 1st.

✚ This only means more for Amazon RDS.

Original title and link: MySQL in the Cloud: Discontinuing of Xeround Cloud Database Public Service (NoSQL database©myNoSQL)


Wikipedia Migrates to MariaDB... but facts are facts

Jon Buys:

There was, and continues to be, concern over Oracle’s treatment of the open source competitor to their own Oracle database. I personally have wondered what motivation, if any, Oracle has to maintain MySQL. They may simply be milking the revenue stream created by MySQL AB until the well goes dry. Since MariaDB is surpassing MySQL in performance and community goodwill, that day may come sooner rather than later.

A couple of little known things:

  1. Oracle has been home to InnoDB since 2005. InnoDB was and continues to be the default, recommended engine for MySQL, both before and after Oracle acquired MySQL through Sun Microsystems.
  2. Oracle has been home to Sleepycat’s Berkeley DB since 2006. Those products are definitely not dead, though community-wise Oracle may not have put much effort into extending them.

Facts are facts.

Original title and link: Wikipedia Migrates to MariaDB… but facts are facts (NoSQL database©myNoSQL)


Amazon Web Services Annual Revenue Estimation

Over the weekend, Christopher Mims published an article in which he derives a figure for Amazon Web Services’s annual revenue: $2.4 billion.

Amazon is famously reticent about sales figures, dribbling out clues without revealing actual numbers. But it appears the company has left enough hints to, finally, discern how much revenue it makes on its cloud computing business, known as Amazon Web Services, which provides the backbone for a growing portion of the internet: about $2.4 billion a year.

There’s no way to decompose this number into the revenue of each AWS service. For the data space I’d be interested in:

  1. S3 revenues. This is the space Basho’s Riak CS competes in.

    After writing my first post about Riak CS, I’ve learned that in Japan, where Yahoo!’s new cloud storage runs on Riak CS, Gemini Mobile Technologies has been offering local ISPs a similar S3-like service built on top of Cassandra.

  2. Redshift is pretty new and, while I’m not aware of immediate competitors (what am I missing?), I don’t think it accounts for a significant part of this revenue, even if some of the early users, like Airbnb, report getting very good performance and costs from it.

    Redshift is powered by ParAccel, which was acquired by Actian over the weekend.

  3. Amazon Elastic MapReduce. This is another interesting space from which Microsoft wants a share with its Azure HDInsight developed in collaboration with Hortonworks.

    In this space there’s also the MapR and Google Compute combination, which seems to be extremely performant.

  4. Interestingly, Amazon also makes money from some of the competitors of its Amazon DynamoDB and RDS services. That is the advantage of owning the infrastructure.

Original title and link: Amazon Web Services Annual Revenue Estimation (NoSQL database©myNoSQL)

Using Redis to Optimize MySQL Queries

I somehow missed this post from the Flickr team describing their use of (app-enforced) capped sorted sets in Redis as a kind of reduced, optimized secondary index for MySQL:

[…] the bottleneck was not in generating the list of photos for your most recently active contact, it was just in finding who your most recently active contact was (specifically if you have thousands or tens of thousands of contacts). What if, instead of fully denormalizing, we just maintain a list of your recently active contacts? That would allow us to optimize the slow query, much like a native MySQL index would; instead of needing to look through a list of 20,000 contacts to see which one has uploaded a photo recently, we only need to look at your most recent 5 or 10 (regardless of your total contacts count)!

This is the first time I’m encountering this approach, where a NoSQL database is used not to serve the final data directly (usually in a denormalized format), but rather to optimize access to the master data. Basically this is a metadata layer optimizer. Neat!
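
As a rough illustration of the idea (not Flickr’s code), the sketch below models one app-enforced capped sorted set in plain Python, with the Redis commands each step would map to noted in comments. The class name `CappedSortedSet`, the cap of 3, and the `photos` table with its columns are all hypothetical. In Redis there would be one such set per user, scored by upload timestamp.

```python
class CappedSortedSet:
    """In-memory stand-in for one Redis sorted set, capped by the application."""

    def __init__(self, cap):
        self.cap = cap
        self._scores = {}  # member -> score (e.g. contact id -> last upload time)

    def add(self, member, score):
        # Redis equivalent: ZADD key score member
        self._scores[member] = score
        if len(self._scores) > self.cap:
            # Redis equivalent: ZREMRANGEBYRANK key 0 -(cap + 1)
            # i.e. drop everything except the `cap` highest-scored members.
            newest_first = sorted(self._scores.items(),
                                  key=lambda item: item[1], reverse=True)
            self._scores = dict(newest_first[:self.cap])

    def most_recent(self, count):
        # Redis equivalent: ZREVRANGE key 0 (count - 1)
        newest_first = sorted(self._scores.items(),
                              key=lambda item: item[1], reverse=True)
        return [member for member, _ in newest_first[:count]]


def photos_query(contact_ids):
    """Build a parameterized query against a hypothetical `photos` table."""
    placeholders = ", ".join(["%s"] * len(contact_ids))
    return ("SELECT id FROM photos WHERE owner_id IN (" + placeholders + ") "
            "ORDER BY uploaded_at DESC LIMIT 20")


if __name__ == "__main__":
    recent = CappedSortedSet(cap=3)
    for contact, uploaded_at in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5)]:
        recent.add(contact, uploaded_at)
    contacts = recent.most_recent(2)
    print(contacts, photos_query(contacts))
```

The MySQL query then only has to consider the handful of contacts returned by the set, rather than scanning the user’s full contact list, which is the “reduced secondary index” effect described in the quote.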

Original title and link: Using Redis to Optimize MySQL Queries (NoSQL database©myNoSQL)


Scaling Big Data Mining Infrastructure at Twitter

I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:

DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”

and then the reality check:

  1. Your boss says something vague
  2. You think very hard on how to move the needle
  3. Where’s the data?
  4. What’s in this dataset?
  5. What’s all the f#$#$ crap in the data?
  6. Clean the data
  7. Run some off-the-shelf data mining algorithm
  8. Productionize, act on the insight
  9. Rinse, repeat


Memcached vs InnoDB Memcached in MySQL 5.6

Some numbers from comparing Memcached with InnoDB Memcached in MySQL 5.6:

Keep in mind that the entire data set fits into the buffer pool, so there are no reads from disk. However, there is write activity stemming from the fact that this is using InnoDB under the hood (redo logs, etc).

There is a significant impact on speed, so deciding which solution to use comes down to weighing the cost and complexity of maintaining another tool and the cost of Memcached warmup against the performance drop of using InnoDB Memcached.

Original title and link: Memcached vs InnoDB Memcached in MySQL 5.6 (NoSQL database©myNoSQL)