Twitter: All content tagged as Twitter in NoSQL databases and polyglot persistence

Hadoop at Twitter: An Interview with Kevin Weil, Twitter Analytics Lead

Kevin Weil[1] in an interview about Twitter’s usage of Hadoop:

Hadoop is our data warehouse; every piece of data we store is archived in HDFS. We use HBase for data that sees updates frequently, or data we occasionally need low-latency access to. Every node in our cluster runs HBase. We use Java MapReduce for simple jobs, or jobs which have tight performance requirements. We use Pig for most of our analysis jobs, because its flexibility helps us iterate rapidly to arrive at the right way of looking at the data.

Our Hadoop use is also evolving: initially it was primarily used as an analysis tool to help us better understand the Twitter ecosystem, and that’s not going to change. But it’s increasingly used to build parts of products you use on the site every day such as People Search, the data for which is built with Hadoop. There are many more products like this in development.

Undeniably, Twitter is (deep) into NoSQL.


  1. Kevin Weil: Twitter Analytics Lead, @kevinweil

Original title and link: Hadoop at Twitter: An Interview with Kevin Weil, Twitter Analytics Lead (NoSQL databases © myNoSQL)

via: http://www.cloudera.com/blog/2010/09/twitter-analytics-lead-kevin-weil-and-a-presenter-at-hadoop-world-interviewed/


Couchappspora: Sort of status.net compared to Twitter

Via Mark Essel:

The couchapp allows anyone to register and begin posting so try it out. There’s no federation yet, but additional couch databases can be continuously replicated. The distributed nature of the application spawns by setting up two way replication or synchronization between multiple nodes. This is one of the benefits of couchDB that makes it a great system for media sharing, and why I’ve decided to build on it.

☞ Couchappspora has more or less nothing to do with Diaspora; it is rather like ☞ status.net (ex Identi.ca) compared with ☞ Twitter.com, a.k.a. federated vs. centralized.
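For the curious, here is a minimal sketch of the two-way continuous replication the quote describes, using CouchDB's _replicate API from Node-style JavaScript; the host names and database name are hypothetical:

// Two-way continuous replication between two CouchDB nodes (hypothetical
// hosts and db name), so writes on either node eventually show up on both.
const nodes = ["http://couch-a:5984", "http://couch-b:5984"];
const db = "couchappspora";

async function replicate(source, target) {
  const res = await fetch(`${source}/_replicate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ source: db, target: `${target}/${db}`, continuous: true })
  });
  console.log(source, "->", target, res.status, await res.json());
}

await replicate(nodes[0], nodes[1]);   // A -> B
await replicate(nodes[1], nodes[0]);   // B -> A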

Original title and link: Couchappspora: Sort of status.net compared to Twitter (NoSQL databases © myNoSQL)

via: http://www.victusspiritus.com/2010/10/05/distributed-social-app-with-couchdb-brilliant/


MongoDB and Auto Increment

Chris Shiflett shares a solution for emulating MySQL's auto_increment in MongoDB. You should read his post, but in short the proposed solution is:

db.seq.findAndModify({
    query: {"_id": "users"},
    update: {$inc: {"seq": 1}},
    new: true
});

☞ shiflett.org
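As a quick usage sketch, assuming a hypothetical users collection and a counter document seeded once in seq, the value returned by findAndModify becomes the new document's _id:

// Seed the counter once (collection and field names are illustrative)
db.seq.insert({_id: "users", seq: 0});

// Atomically claim the next value, then use it as the _id of the new document
var next = db.seq.findAndModify({
    query: {_id: "users"},
    update: {$inc: {seq: 1}},
    new: true
});
db.users.insert({_id: next.seq, name: "alice"});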

Kenny Gorman warns that the above solution is not exactly optimal:

When using this technique each insert would require both the insert as well as the findAndModify command which is a query plus an update. So now you have to perform 3 operations where it used to be one. Not only that, but there are 3 more logical I/O’s due to the query, and those might be physical I/O’s. This pattern is easily seen with the mongostat utility.

☞ www.kennygorman.com

I remember that Twitter tried to solve a similar problem, probably at a different and much larger scale, and came up with Snowflake:

We needed something that could generate tens of thousands of ids per second in a highly available manner. This naturally led us to choose an uncoordinated approach. These ids need to be roughly sortable, meaning that if tweets A and B are posted around the same time, they should have ids in close proximity to one another since this is how we and most Twitter clients sort tweets.

☞ engineering.twitter.com
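For comparison, here is a minimal sketch of the Snowflake idea (my own illustration, not Twitter's Scala code): pack a millisecond timestamp, a worker id, and a per-worker sequence into a single 64-bit integer, so ids are roughly time-sortable and can be generated without coordination. The bit layout and custom epoch below are assumptions for illustration.

// Snowflake-style id generator (illustrative only; assumed layout: 41 bits of
// milliseconds since a custom epoch, 10 bits of worker id, 12 bits of sequence).
const EPOCH = 1288834974657n;   // assumed custom epoch, in ms
let lastMs = -1n;
let sequence = 0n;

function nextId(workerId) {
  let now = BigInt(Date.now());
  if (now === lastMs) {
    sequence = (sequence + 1n) & 0xFFFn;            // stay within 12 bits
    if (sequence === 0n) {                          // sequence exhausted this millisecond
      while (BigInt(Date.now()) <= now) { /* wait for the next millisecond */ }
      now = BigInt(Date.now());
    }
  } else {
    sequence = 0n;
  }
  lastMs = now;
  return ((now - EPOCH) << 22n) | (BigInt(workerId) << 12n) | sequence;
}

// Later ids compare greater, so sorting by id roughly sorts by creation time.
console.log(nextId(1).toString());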

If you have other ideas (even if not directly related to MongoDB) I’d love to hear them!

Original title and link for this post: MongoDB and Auto Increment (published on the NoSQL blog: myNoSQL)


MongoDB: Twitter Stream as Input Data

A bit old, but a cool way to get test data into MongoDB:

curl http://stream.twitter.com/1/statuses/sample.json -u<user>:<pass> | mongoimport -c twitter_live
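Once the import is running you can poke at the data from the mongo shell; the collection should land in the default test database unless mongoimport is given -d, and the field names below assume the sample-stream status objects of the time:

// A few sanity checks on the imported stream (field names are assumptions
// based on the sample-stream JSON of the time).
db.twitter_live.count();                               // how many statuses arrived so far
db.twitter_live.find({"user.lang": "en"}).limit(3);    // a handful of English-language tweets
db.twitter_live.find().sort({_id: -1}).limit(1);       // the most recently inserted document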

Original title and link for this post: MongoDB: Twitter Stream as Input Data (published on the NoSQL blog: myNoSQL)

via: http://eliothorowitz.com/post/459890033/streaming-twitter-into-mongodb


CouchDB: A Perfect Fit for Twitter Apps

In the past I've extensively covered NoSQL-based Twitter apps (since that post I've added some more: here, here and here), but interestingly enough not many of these were using CouchDB, the NoSQL database for the web that recently released its 1.0 version.

That could definitely change after reading Mark Headd's article explaining what makes CouchDB a perfect fit for Twitter applications:

  • You interact with a CouchDB instance the same way that you interact with the Twitter API — by making HTTP calls. This can help keep the code for your application clean and simple, and provides lots of opportunities for code reuse within your application.
  • The structure of documents in CouchDB is JSON, which is one of the formats returned by the Twitter API when searching for Tweets (or “status objects” in Twitter parlance).
  • Documents in a CouchDB database are assigned a globally unique ID — it’s how documents are distinguished from one another. Twitter also uses unique identifiers for status objects, so using the ID of a Twitter status object as the ID for a document in CouchDB makes life pretty easy for a Twitter app developer.
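To make the last point concrete, here is a minimal sketch of storing a (trimmed) status object in CouchDB with the Twitter status id reused as the document id; the host, database name, and sample status are hypothetical:

// PUT a tweet into CouchDB under its own status id (hypothetical host, db, and status).
const status = {
  id_str: "26679498001",                          // illustrative status id
  text: "Just setting up my couch",
  user: { screen_name: "example" },
  created_at: "Thu Oct 07 12:00:00 +0000 2010"
};

const res = await fetch(`http://localhost:5984/tweets/${status.id_str}`, {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(status)
});
console.log(res.status, await res.json());        // 201 and {ok: true, id: ..., rev: ...} on success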

While not directly related to this, there's also another connection between CouchDB and Twitter: the Gizzard framework can be used for scaling CouchDB.

via: http://blog.couch.io/post/836208779/guest-blogpost-by-mark-headd-creator-of-tweetmy311


A Longer Version of the News on Cassandra Usage at Twitter

HighScalability.com has a (much) longer version of our Cassandra status inside Facebook, Twitter, Digg and of Friday's updates on Cassandra usage at Twitter:

Twitter may not want to make the same mistake. Does this mean Cassandra and NoSQL suck? No, I think it’s just smart project planning. It’s actually OK to have multiple platforms for different purposes.

That's just a different way of saying use the right tool for the job, the motto I'm trying to associate with the NoSQL ecosystem.

via: http://highscalability.com/blog/2010/7/11/so-why-is-twitter-really-not-using-cassandra-to-store-tweets.html


Updates on Cassandra Usage at Twitter

Just two days after my Cassandra status update, the Twitter engineering blog has published an article sharing more details about Cassandra usage at Twitter.

So, how is Twitter using Cassandra today?

  • Cassandra as the database of places of interest used by the geo team[1]
  • Cassandra as storage for the data mining research team
  • Cassandra as an upcoming storage solution for real time analytics

In case you're wondering what changed: Twitter will not migrate tweet storage to Cassandra and will continue to save and serve tweets from the existing MySQL cluster:

We believe that this isn’t the time to make large scale migration to a new technology. We will focus our Cassandra work on new projects that we wouldn’t be able to ship without a large-scale data store.


  1. This is probably similar to how SimpleGeo is using Cassandra

via: http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html


Cassandra Status Inside Facebook, Twitter, Digg, and More

As everyone probably knows by now, Cassandra originated at Facebook as a solution for inbox search and was then open sourced under the ASF umbrella and an Apache license. Since then, Twitter, Digg, Reddit, and quite a few others have started using it, but not much has been heard from Facebook.

So, in case you are wondering ☞ what's up with Cassandra, here's a very concise update:

  1. Twitter and Digg are not planning to fork the project. In fact, there are clear plans to contribute their work on Cassandra back (see this for more details)
  2. Facebook is still using Cassandra internally for inbox search, but they are running their own version
  3. Even if, apart from the initial code share, Facebook has stopped contributing to the Cassandra project, the community at the ASF is doing well (read: growing)
  4. Riptano, the company founded by Cassandra project lead Jonathan Ellis and Matt Pfeil, is offering technical support, professional services, and training for Cassandra

Update: an interesting ☞ note (dated July 7th) from Twitter engineer Nick Kallen:

Twitter no longer intends to use Cassandra for any critical data-stores in the near term future.


MemcacheDB History at Reddit

Steve Huffman (co-founder and programmer of Reddit) speaking at ☞ FOWA Miami 2010 (around minute 18:30)[1]:

And then there is another software that is really handy MemcacheDB, which is like memcached but is persistent. […] It’s very very fast, super-handy, we store far more data in MemcacheDB than we do in Postgres

Then bam! MemcacheDB started blocking on writes under bursts, leading Reddit to switch to Cassandra as their friends at Digg and Twitter did.

Lesson learned: take such pieces of advice with a grain of salt and always test your scenario.


  1. It looks like Steve was no longer working at Reddit at the time the presentation was made, so he might not have been aware of the problems related to MemcacheDB.

Twitter, Redis and Gizzard

Just one day after mentioning Gizzard as a possible solution for partitioning data in CouchDB, I'm hearing that Twitter is experimenting with Redis and uses ☞ Gizzard for data partitioning. Code for this experiment is available on ☞ GitHub under the name Haplocheirus[1].

References

  1. Just in case you are wondering what Haplocheirus means, it is a ☞ dinosaur.

Scaling CouchDB

One of the most frequently heard questions about CouchDB is: how do I scale CouchDB? While dealing with Google-size data is not a problem many are facing, more and more engineers are looking for viable ways to have fewer SPOFs (single points of failure) in their systems.

We already know that, starting with CouchDB 0.11.0, its replication support became smart enough to let us think about a future of CouchDB-backed distributed web data.

But replication is just one part of scaling. The other part, more related to the size of the data, is data partitioning or sharding, and I think that is what the question about how to scale CouchDB mostly refers to. So, let's see what options are out there for sharding CouchDB.

Right now, probably the best known solution for addressing CouchDB sharding is ☞ Lounge, a solution developed by the Meebo.com team:

The Lounge is a proxy-based partitioning/clustering framework for CouchDB.

There are two main pieces:

  • dumbproxy: An NGINX module to handle simple gets/puts for everything that isn’t a view.
  • smartproxy: A python/twisted daemon that handles mapping/reducing view requests to all shards in the cluster.
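To illustrate the proxy idea in the abstract (this is my own toy sketch, not Lounge's actual code or hashing scheme), a document request can be routed by hashing its id onto a fixed set of shard URLs, while view requests are fanned out to every shard and merged:

// Toy shard router: hash the document id and pick one of a fixed set of
// CouchDB shard URLs (illustrative only; Lounge's real scheme differs).
const shards = [
  "http://couch1:5984/db_shard0",
  "http://couch2:5984/db_shard1",
  "http://couch3:5984/db_shard2"
];

function hash(id) {
  let h = 0;
  for (const ch of id) h = (h * 31 + ch.charCodeAt(0)) >>> 0;   // simple 32-bit string hash
  return h;
}

function shardFor(docId) {
  return shards[hash(docId) % shards.length];
}

// A get/put for a document would be proxied to shardFor(docId) + "/" + docId,
// whereas a view request would be sent to all shards and the results merged.
console.log(shardFor("26679498001"));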

Another solution in this space is ☞ Pillow. While ☞ still young, it shows a lot of promise:

While manually repartitioning a CouchDB database is doable, I’d rather have an automatic way of doing it since I don’t want to make mistakes. […]

Now I’ve released version 0.3 of Pillow. This version supports automatic resharding, routing of requests to the right shard and views. Reducers need to be written in erlang, but a summing reducer is in place and mappers without reducers are supported out of the box. As such, this version of Pillow has all the functionality I set out to develop, but it does not support the full CouchDB API.

A more generic solution for partitioning data can be built using Twitter's ☞ Gizzard framework for creating distributed datastores:

[Diagram: Gizzard architecture]

As per the above diagram, Gizzard would become the “smart” middleware that would handle data distribution between the different stores. There are quite a few interesting features available in Gizzard that can make it a good candidate for dealing with the data partitioning problem: programmable support for partitioning strategies, support for replication trees, fault-tolerance, winged migrations, just to name a few.

Last but not least, during the Berlin Buzzwords conference I also heard about another solution that is in the works and will be released soon.

So, with quite a few possible solutions already available, scaling CouchDB should not be considered an issue anymore.

Important update: CouchDB has a new scaling solution through BigCouch

Original title and link for this post: Scaling CouchDB (myNoSQL © NoSQL databases)


Learning NoSQL from Twitter’s Experience

Leaving aside the tons of NoSQL Twitter applications (and if that is not enough, here are more NoSQL-based Twitter apps and even more), Twitter seems to be having a lot of fun (nb: read work and innovation) in the NoSQL space.

It all started with the problem of handling big data in real time. Nick Kallen's (@nk) slides below explain the problems faced and the way Twitter tackled them:

Then it was time to consider Cassandra at Twitter:

We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating. We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available

and to scale Twitter with Cassandra (Ryan King's (@rk) presentation):

But storing data is not enough, and Twitter had to put the NoSQL data to work. For that, Twitter is using Hadoop, Pig, and HBase, as “Cassandra is OLTP and HBase is OLAP”. Kevin Weil's (@kevinweil) slides, presented at nosql:eu, and Dmitriy Ryaboy's (@squarecog) give a lot of details about the HBase, Hadoop, and Pig usage:

That’s a ton to learn from NoSQL at Twitter!