Friday, 1 June 2012
Is There Anything You Are For?
Great start for the month from geek & poke:
Original title and link: Is There Anything You Are For? (©myNoSQL)
Thursday, 31 May 2012
Making MySQL Accept Connections Faster
Mark Callaghan (Facebook) posted two graphs showing the improvements Facebook got by optimizing the speed of accepting connections in MySQL.
My first thought was that persistent connections are always fast and that you should always use a connection pool. Even if Facebook is using PHP, there should be a way to use a connection pool with MySQL. But maybe this is a problem that occurs only at their scale and in specific scenarios.
It was only when I was preparing to ask Mark for more details that I noticed the link to Domas Mituzas’s post, in which he not only profiles the MySQL connection-accepting code but also presents a scenario that reveals this issue:
Sometimes connection avalanches come unexpected, and even if MySQL would have no trouble dealing with queries, it will have problems letting clients in.
From link to link, I then arrived at the MySQL documentation page describing how MySQL uses threads for client connections. If you are a MySQL user and haven’t seen this page, I’d suggest reading it, but here are the interesting parts:
By default, connection manager threads associate each client connection with a thread dedicated to it that handles authentication and request processing for that connection. Manager threads create a new thread when necessary but try to avoid doing so by consulting the thread cache first to see whether it contains a thread that can be used for the connection. When a connection ends, its thread is returned to the thread cache if the cache is not full.
The thread cache has a size determined by the thread_cache_size system variable. The default value is 0 (no caching), which causes a thread to be set up for each new connection and disposed of when the connection terminates. Set thread_cache_size to N to enable N inactive connection threads to be cached.
I guess it’s time to connect to your MySQL server, check these settings, and update them accordingly.
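A quick way to judge whether the thread cache is doing its job is to compare two status counters the server exposes, Threads_created and Connections; the MySQL documentation describes their ratio as the cache miss rate. The sketch below is a minimal Python illustration, not a tuning tool: the function name and the sample numbers are made up, and the counter values are assumed to have been read beforehand from the server.

```python
# Statements you would run against the server (kept as strings only;
# this sketch does not open a connection).
CHECK_SETTING = "SHOW GLOBAL VARIABLES LIKE 'thread_cache_size'"
CHECK_COUNTERS = (
    "SHOW GLOBAL STATUS WHERE Variable_name IN "
    "('Threads_created', 'Connections')"
)


def thread_cache_miss_rate(threads_created: int, connections: int) -> float:
    """Fraction of connection attempts that required creating a new thread."""
    if connections <= 0:
        return 0.0
    return threads_created / connections


# Hypothetical counter values, for illustration only.
sample = {"Threads_created": 9500, "Connections": 10000}
rate = thread_cache_miss_rate(sample["Threads_created"], sample["Connections"])
print(f"thread cache miss rate: {rate:.0%}")
```

A rate close to 1 means nearly every new connection paid for thread creation. Since thread_cache_size is a dynamic variable, it can be raised with SET GLOBAL thread_cache_size = N without restarting the server.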
Wednesday, 30 May 2012
An Overview of Neo4j.rb 2.0
Andreas Ronge writing about using Neo4j in embedded mode with JRuby:
The advantage of the embedded Neo4j is better performance due to the direct use of the Java API. This means you can write queries in plain Ruby! Another advantage of the embedded Neo4j is that since it’s an embedded database there is one less piece of infrastructure (the database server) to install. The embedded database is running in the same process as your (Rails) application. Since JRuby has real threads there is no need to start up several instances of the database or of the Ruby runtime since JRuby can utilize all available cores on the CPU. There is actually even no need to start the database at all as it will be started automatically when needed. Notice it’s still possible to use the REST protocol or the web admin interface from an embedded Neo4j, see the neo4j-admin gem.
So which should I choose? Well, if you can’t use JRuby or you don’t need an Active Model compliant Neo4j binding then the Neo4j Server is a good choice, otherwise I would suggest using the embedded Neo4j.rb gem (but I’m a bit biased)
As also shown by the earlier post on migrating data from Oracle to MongoDB with JRuby, JRuby proves to be an interesting beast for handling data. I’m more on the side of Python, but Jython is not (yet?) as up-to-date as JRuby.
via: http://blog.jayway.com/2012/05/07/neo4j-rb-2-0-an-overview/
Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby
A homegrown ETL process for migrating data from Oracle to MongoDB, built on JRuby’s chameleonic capabilities as a Ruby implementation that integrates well in a Java environment:
Rather than having to re-map one database or the other in the other persistence technology to facilitate the ETL process (not DRY), JRuby allowed the two persistence technologies to interoperate. By utilizing JRuby’s powerful embedding capabilities, we were able to read data out of Oracle via Hibernate and write data to MongoDB via MongoMapper.
via: http://blog.jruby.org/2012/05/bridging-the-gap-with-jruby/
Tuesday, 29 May 2012
13 R Online Resources for Big Data and Parallel Computing
A list of articles, papers, and tutorials for R put together by Yanchang Zhao.
PuppetDB: Configuration Management Database for Puppet
PuppetDB, which is replacing CouchDB for managing Puppet configurations, is a service layer written in Clojure with a PostgreSQL back-end. It is not a graph database:
PuppetDB is a key component of the Puppet Data Library, and brings that to bear in its query API. Resources, facts, nodes, and metrics can all be queried over HTTP. For resources and nodes, there is a simple query language which can be used to form arbitrarily complex requests. The public API is the same one that Puppet uses to make storeconfigs queries (using the «||» operator) of PuppetDB, but provides a superset of the functionality provided by storeconfigs.
PuppetDB is faster, smarter, and has more complete data than ever before. […] PuppetDB offers great power over and insight into your infrastructure, and it’s only going to get bigger and better.
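To give a feel for what such a query looks like: PuppetDB queries are written as nested arrays in prefix notation, serialized as JSON and sent as a query parameter. The Python sketch below only builds the URL and makes no request; the host name, port, and /resources path are assumptions for illustration and should be checked against the API documentation for your PuppetDB version.

```python
import json
from urllib.parse import urlencode

# Prefix-notation query: resources of type File whose title is /etc/passwd.
query = ["and",
         ["=", "type", "File"],
         ["=", "title", "/etc/passwd"]]


def build_query_url(base_url: str, query: list) -> str:
    """Return base_url with the JSON-encoded query attached as ?query=..."""
    return base_url + "?" + urlencode({"query": json.dumps(query)})


# Hypothetical host, port, and endpoint path.
url = build_query_url("http://puppetdb.example.com:8080/resources", query)
print(url)
```

Nesting more "and"/"or" arrays inside the outer one is how arbitrarily complex requests are composed.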
via: http://puppetlabs.com/blog/introducing-puppetdb-put-your-data-to-work/
Monday, 28 May 2012
Short Intro to Graph Databases, Manipulating and Traversing With Gremlin
A slide deck by Pierre De Wilde with a short theoretical intro to property graphs and graph databases and an extensive set of examples of manipulating and traversing graph data with Gremlin. Good reference material.
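For readers new to the model, a property graph is just vertices and directed, labeled edges, each carrying key/value properties. The Python sketch below is not Gremlin and is not taken from the slides; it is a toy in-memory structure with a single hypothetical helper, out(), that mimics the idea of following outgoing edges by label.

```python
# Vertices and edges both carry arbitrary key/value properties.
vertices = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    3: {"name": "carol"},
}
edges = [
    {"out": 1, "in": 2, "label": "knows", "since": 2010},
    {"out": 1, "in": 3, "label": "knows", "since": 2012},
    {"out": 2, "in": 3, "label": "follows"},
]


def out(vertex_id, label):
    """Ids of vertices reached from vertex_id over outgoing edges with this label."""
    return [e["in"] for e in edges
            if e["out"] == vertex_id and e["label"] == label]


# Names of everyone vertex 1 "knows".
names = [vertices[v]["name"] for v in out(1, "knows")]
print(names)
```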
Hadoop Is the Best Thing Since Sliced Bread, Even if Doomed
I’m not the only one confused by Michael Stonebraker’s Hadoop is dead theme. Edward Capriolo:
Let me tell you a story of how I got into hadoop and hive. I was following advice like Stonebraker’s that said Parallel DBs are the way to go. But I quickly found out Parallel Database are too rich for my blood. Now, I am not telling you or anyone else that you should not spend money on Parallel DBs, because maybe you have the money, or maybe you need some of those things the parallel database provides. But for things I need to do:
- store tons of data
- process it reasonably fast
- be LOW on the cost scale
Hadoop and hive work fine for me.
via: http://www.edwardcapriolo.com/roller/edwardcapriolo/date/20120504
Saturday, 26 May 2012
Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone
Even though I’ve been using Berkeley DB for over 6 years, I have very rarely heard stories about it. This presentation from Yammer tells the story of taking Berkeley DB a long way:
In early 2011 Yammer set out to replace an 11 billion row PostgreSQL message delivery database with something a bit more scale-ready. They reached for several databases with which they were familiar, but none proved to be a fit for various reasons. Following in the footsteps of so few before them, they took the wheel of the SS Berkeley DB Java Edition and piloted it into the uncharted waters of horizontal scalability.
In this talk, Ryan will cover Yammer’s journey through log cleaner infested waters, being hijacked on the high seas by the BDB B-tree cache, and their eventual flotilla of a 45 node, 256 partition BDB cluster.
Friday, 25 May 2012
The Myth of Auto Scaling as a Capacity Planning Approach
An older but very instructive post by James Golick dissecting the mythical extra server capacity:
There’s this idea floating around that we can scale out our data services “just in time”. Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It’s a great sounding story, but sadly, things don’t quite work that way.
This is the Mythical Man-Month of the IT department.
via: http://jamesgolick.com/2010/10/27/we-are-experiencing-too-much-load-lets-add-a-new-server..html
Quick Guide to Riak HTTP API and Using Riak as Cache Service
A two-part article by Simon Buckle introducing the Riak HTTP API and using it, together with Riak’s pluggable Memory back-end, as a caching service for a web application. Somehow I missed that Riak has a pluggable memory (non-persistent) storage back-end. The only missing piece for making it a better caching solution would be the option to set a per-key expiry/time-to-live (TTL) value. It might be interesting to experiment with using the Cache-Control and Last-Modified HTTP headers to simulate this behavior. Has anyone tried it?
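One way to try the idea is to enforce the TTL on the client: read the Last-Modified header from the response and treat the entry as a cache miss once it is older than the TTL. The Python sketch below covers only that freshness check; is_fresh and the header value are illustrative, and nothing here makes Riak itself expire or delete the key.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def is_fresh(last_modified: str, ttl_seconds: int,
             now: Optional[datetime] = None) -> bool:
    """True if an entry stamped with this Last-Modified value is within its TTL."""
    stored_at = parsedate_to_datetime(last_modified)
    if stored_at.tzinfo is None:
        # Treat a date without an offset as UTC.
        stored_at = stored_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - stored_at <= timedelta(seconds=ttl_seconds)


# Hypothetical header value from a cached object, checked 30 seconds later.
header = "Wed, 30 May 2012 10:00:00 GMT"
checked_at = datetime(2012, 5, 30, 10, 0, 30, tzinfo=timezone.utc)
print(is_fresh(header, ttl_seconds=60, now=checked_at))
```

With this approach stale entries still occupy memory until they are overwritten or explicitly deleted, so it simulates expiry for readers rather than reclaiming space.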
MySQL Is Done. NoSQL Is Done. It's the Postgres Age
Jeff Dickey enumerates some of the new features available in PostgreSQL (schema-less data, array columns, queuing, full-text search, geo-spatial indexing), concluding that PostgreSQL now has everything an application needs:
Postgres has taken the features out of all of these tools and integrate it right inside the platform. Now you don’t need to spin up a mongo cluster for non-rel data, rabbitmq cluster for queueing, solr box for searching. You can just have a single postgres server. That saves a huge ops headache since each of those clusters/boxes have to be durable, replicated, and scalable.
Sounds a bit too optimistic? As we’ve learned from the NoSQL space there are no silver bullets:
Now obviously, there’s a glaring downside with this approach: you get one box. Maybe a read slave or something, but really, you can’t scale it.
As you can imagine, I disagree with most of the points. The only exception is that it is great to see so many useful features packaged with PostgreSQL: these are definitely going to make life easier for some developers.
But when talking about MySQL and NoSQL being done:
- MySQL is done, except it has a huge community, there are tons of developers very familiar with it, and last but not least MySQL powers massive deployments. This last part matters a lot.
- NoSQL is done, except many NoSQL solutions tackle different problem spaces, providing optimal solutions for these by staying focused. Neither Oracle, nor MongoDB, nor PostgreSQL will be able to solve all problems. The wider the range of problems they cover, the less optimal the solutions they provide for corner cases or extreme scenarios.
