Is There Anything You Are For?

Great start for the month from geek & poke:

[Image: geek-poke-nosql]

Original title and link: Is There Anything You Are For? (NoSQL database©myNoSQL)


Making MySQL Accept Connections Faster

Mark Callaghan (Facebook) posted two graphs showing the improvements Facebook achieved by optimizing how quickly MySQL accepts connections[1].

My first thought was that persistent connections are always fast and that you should always use a connection pool. Even if Facebook is using PHP, there should be a way to use a connection pool with MySQL. But maybe this is a problem that occurs only at their scale and in specific scenarios.
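To make the connection pool idea concrete, here is a minimal sketch in Python. It is not tied to any MySQL driver: `connect` stands for whatever function opens a connection in your client library, and the `ConnectionPool` class and its method names are hypothetical, chosen only for illustration.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and reuses them."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable that returns a connection
        # object (an assumption; plug in your driver's connect function).
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free, so the server never sees more
        # than `size` connections from this process.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back instead of closing it.
            self._idle.put(conn)
```

With a pool like this the server pays the accept cost only when the pool is created, not on every request. A real pool would also need to detect and replace broken connections, which this sketch leaves out.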

Only when I was preparing to ask Mark for more details did I notice the link to Domas Mituzas’s post, in which he profiles the MySQL connection-accepting code and also presents a scenario that reveals this issue:

Sometimes connection avalanches come unexpected, and even if MySQL would have no trouble dealing with queries, it will have problems letting clients in.

From link to link, I then arrived at the MySQL documentation page describing how MySQL uses threads for client connections. If you are a MySQL user and haven’t seen this page, I’d suggest reading it, but here are the interesting parts:

By default, connection manager threads associate each client connection with a thread dedicated to it that handles authentication and request processing for that connection. Manager threads create a new thread when necessary but try to avoid doing so by consulting the thread cache first to see whether it contains a thread that can be used for the connection. When a connection ends, its thread is returned to the thread cache if the cache is not full.

The thread cache has a size determined by the thread_cache_size system variable. The default value is 0 (no caching), which causes a thread to be set up for each new connection and disposed of when the connection terminates[2]. Set thread_cache_size to N to enable N inactive connection threads to be cached.

I guess it’s time to connect to your MySQL server, check these settings, and update them accordingly.
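A rough way to check whether the thread cache is doing its job is to compare two server status counters, `Threads_created` and `Connections`. The sketch below assumes you have already read both values from the server (the statements are shown as comments); the helper name `thread_cache_miss_rate` and the example value of 16 are mine, not something from Mark's or Domas's posts.

```python
# Statements to run against the server (standard MySQL syntax):
#   SHOW VARIABLES LIKE 'thread_cache_size';
#   SHOW GLOBAL STATUS LIKE 'Threads_created';
#   SHOW GLOBAL STATUS LIKE 'Connections';
#   SET GLOBAL thread_cache_size = 16;  -- 16 is an illustrative value only


def thread_cache_miss_rate(threads_created, connections):
    """Fraction of connection attempts that needed a brand new thread.

    threads_created: value of the Threads_created status counter
    connections:     value of the Connections status counter
    """
    if connections <= 0:
        return 0.0
    return threads_created / connections


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    rate = thread_cache_miss_rate(threads_created=50, connections=1000)
    print(f"thread cache miss rate: {rate:.1%}")
```

A miss rate that stays high as connections accumulate suggests the cache is too small (or disabled) for the connection churn the server is seeing.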


  1. In case you are wondering, Facebook will release the code. 

  2. My emphasis. 

Original title and link: Making MySQL Accept Connections Faster (NoSQL database©myNoSQL)


An Overview of Neo4j.rb 2.0

Andreas Ronge writing about using Neo4j in embedded mode with JRuby:

The advantage of the embedded Neo4j is better performance due to the direct use of the Java API. This means you can write queries in plain Ruby! Another advantage of the embedded Neo4j is that since it’s an embedded database there is one less piece of infrastructure (the database server) to install. The embedded database is running in the same process as your (Rails) application. Since JRuby has real threads there is no need to start up several instances of the database or of the Ruby runtime since JRuby can utilize all available cores on the CPU. There is actually even no need to start the database at all as it will be started automatically when needed. Notice it’s still possible to use the REST protocol or the web admin interface from an embedded Neo4j, see the neo4j-admin gem.

So which should I choose ? Well, if you can’t use JRuby or you don’t need an Active Model compliant Neo4j binding then the Neo4j Server is a good choice, otherwise I would suggest using the embedded Neo4j.rb gem (but I’m a bit biased)

As also shown by the earlier post on migrating data from Oracle to MongoDB with JRuby, JRuby proves to be an interesting beast for handling data. I’m more on the side of Python, but Jython is not (yet?) as up-to-date as JRuby.

Original title and link: An Overview of Neo4j.rb 2.0 (NoSQL database©myNoSQL)

via: http://blog.jayway.com/2012/05/07/neo4j-rb-2-0-an-overview/


Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby

A homegrown ETL process for migrating data from Oracle to MongoDB, built on JRuby’s chameleonic capabilities (a Ruby implementation that integrates well into a Java environment):

Rather than having to re-map one database or the other in the other persistence technology to facilitate the ETL process (not DRY), JRuby allowed the two persistence technologies to interoperate. By utilizing JRuby’s powerful embedding capabilities, we were able to read data out of Oracle via Hibernate and write data to MongoDB via MongoMapper.

Original title and link: Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby (NoSQL database©myNoSQL)

via: http://blog.jruby.org/2012/05/bridging-the-gap-with-jruby/


13 R Online Resources for Big Data and Parallel Computing

A list of articles, papers, and tutorials for R put together by Yanchang Zhao.

Original title and link: 13 R Online Resources for Big Data and Parallel Computing (NoSQL database©myNoSQL)

via: http://rdatamining.wordpress.com/2012/05/06/online-resources-for-handling-big-data-and-parallel-computing-in-r/


PuppetDB: Configuration Management Database for Puppet

PuppetDB is replacing CouchDB for managing Puppet configurations and is a service layer written in Clojure with a PostgreSQL back-end. Not a graph database:

PuppetDB is a key component of the Puppet Data Library, and brings that to bear in its query API. Resources, facts, nodes, and metrics can all be queried over HTTP. For resources and nodes, there is a simple query language which can be used to form arbitrarily complex requests. The public API is the same one that Puppet uses to make storeconfigs queries (using the «||» operator) of PuppetDB, but provides a superset of the functionality provided by storeconfigs.

PuppetDB is faster, smarter, and has more complete data than ever before. […] PuppetDB offers great power over and insight into your infrastructure, and it’s only going to get bigger and better.

Original title and link: PuppetDB: Configuration Management Database for Puppet (NoSQL database©myNoSQL)

via: http://puppetlabs.com/blog/introducing-puppetdb-put-your-data-to-work/


Short Intro to Graph Databases, Manipulating and Traversing With Gremlin

A slide deck by Pierre De Wilde with a short theoretical intro to property graphs and graph databases and an extensive set of examples of manipulating and traversing graph data with Gremlin. Good reference material.


Hadoop Is the Best Thing Since Sliced Bread, Even if Doomed

I’m not the only one confused by Michael Stonebraker’s “Hadoop is dead” theme. Edward Capriolo:

Let me tell you a story of how I got into hadoop and hive. I was following advice like Stonebreaker’s that said Parallel DBs are the way to go. But I quickly found out Parallel Database are too rich for my blood.  Now, I am not telling you or anyone else that you should not spend money on Parallel DBs, because maybe you have the money, or maybe you need some of those things the parallel database provides. But for things I need to do:

  • store tons of data
  • processed it reasonably fast
  • be LOW on the cost scale

Hadoop and hive work fine for me.

Original title and link: Hadoop Is the Best Thing Since Sliced Bread, Even if Doomed (NoSQL database©myNoSQL)

via: http://www.edwardcapriolo.com/roller/edwardcapriolo/date/20120504


Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone

Even though I’ve been using Berkeley DB for over 6 years, I have very rarely heard stories about it. This presentation from Yammer tells the story of taking Berkeley DB a long way:

In early 2011 Yammer set out to replace an 11 billion row PostgreSQL message delivery database with something a bit more scale-ready. They reached for several databases with which they were familiar, but none proved to be a fit for various reasons. Following in the footsteps of so few before them, they took the wheel of the SS Berkeley DB Java Edition and piloted it into the uncharted waters of horizontal scalability.

In this talk, Ryan will cover Yammer’s journey through log cleaner infested waters, being hijacked on the high seas by the BDB B-tree cache, and their eventual flotilla of a 45 node, 256 partition BDB cluster.


The Myth of Auto Scaling as a Capacity Planning Approach

An older but very instructive post by James Golick dissecting the mythical extra server capacity:

There’s this idea floating around that we can scale out our data services “just in time”. Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It’s a great sounding story, but sadly, things don’t quite work that way.

This is the Mythical Man-Month of the IT department.

John Allspaw

Original title and link: The Myth of Auto Scaling as a Capacity Planning Approach (NoSQL database©myNoSQL)

via: http://jamesgolick.com/2010/10/27/we-are-experiencing-too-much-load-lets-add-a-new-server..html


Quick Guide to Riak HTTP API and Using Riak as Cache Service

A two-part article by Simon Buckle introducing the Riak HTTP API and using it with Riak’s pluggable Memory back-end as a caching service for a web application. Somehow I missed that Riak has a pluggable in-memory (non-persistent) storage back-end. The only missing piece for making it a better caching solution would be the option to set a per-key expiry/time-to-live (TTL) value. It might be interesting to experiment with using the Cache-Control and Last-Modified HTTP headers to simulate this behavior. Has anyone tried it?
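Here is a rough sketch of how the Last-Modified idea could work on the client side. It assumes Riak's HTTP interface is reachable at a base URL such as `http://127.0.0.1:8098`, that objects live under `/buckets/<bucket>/keys/<key>`, and that responses carry a `Last-Modified` header; the function names `is_fresh` and `cache_get` are hypothetical. I haven't tried this against a live cluster.

```python
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_fresh(last_modified, ttl_seconds, now=None):
    """True if an entry with this Last-Modified header is younger than the TTL."""
    modified_at = parsedate_to_datetime(last_modified)
    now = now or datetime.now(timezone.utc)
    return now - modified_at < timedelta(seconds=ttl_seconds)


def cache_get(base_url, bucket, key, ttl_seconds):
    """Fetch a cached value, treating entries older than the TTL as misses."""
    # URL layout is an assumption about the Riak HTTP API version in use.
    url = "{}/buckets/{}/keys/{}".format(
        base_url, urllib.parse.quote(bucket, safe=""), urllib.parse.quote(key, safe="")
    )
    try:
        with urllib.request.urlopen(url) as resp:
            last_modified = resp.headers.get("Last-Modified")
            body = resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # key not present: a plain cache miss
        raise
    if last_modified is None or not is_fresh(last_modified, ttl_seconds):
        return None  # present but too old: report a miss to the caller
    return body
```

Note that this only expires entries on read. Stale values would still occupy memory until something overwrites or deletes them, so it simulates a TTL rather than implementing one.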

Original title and link: Quick Guide to Riak HTTP API and Using Riak as Cache Service (NoSQL database©myNoSQL)


MySQL Is Done. NoSQL Is Done. It's the Postgres Age

Jeff Dickey enumerates some of the new features available in PostgreSQL (schema-less data, array columns, queuing, full-text searching, geo-spatial indexing), concluding that PostgreSQL now has everything an application needs:

Postgres has taken the features out of all of these tools and integrate it right inside the platform. Now you don’t need to spin up a mongo cluster for non-rel data, rabbitmq cluster for queueing, solr box for searching. You can just have a single postgres server. That saves a huge ops headache since each of those clusters/boxes have to be durable, replicated, and scalable.

Sounds a bit too optimistic? As we’ve learned from the NoSQL space there are no silver bullets:

Now obviously, there’s a glaring downside with this approach: you get one box. Maybe a read slave or something, but really, you can’t scale it.

As you can imagine, I disagree with most of the points. The only exception is that it is great to see so many useful features packaged with PostgreSQL; these are definitely going to make life easier for some developers.

But when talking about MySQL and NoSQL being done:

  1. MySQL is done, except it has a huge community, there are tons of developers very familiar with it, and last but not least MySQL powers massive deployments. This last part matters a lot.
  2. NoSQL is done, except many NoSQL solutions tackle different problem spaces and provide optimal solutions for them by staying focused. Neither Oracle, nor MongoDB, nor PostgreSQL will be able to solve all problems. The wider the range of problems they cover, the less optimal their solutions become for corner cases or extreme scenarios.

Original title and link: MySQL Is Done. NoSQL Is Done. It’s the Postgres Age (NoSQL database©myNoSQL)

via: http://dickey.xxx/mysql-is-done-it-s-the-postgres-age