twitter: All content tagged as twitter in NoSQL databases and polyglot persistence
Wednesday, 11 January 2012
Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress
Drew Conway:
Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API. […] But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages; including but not restricted to; curl, map/reduce, Javascript, and JSON. And that’s before you have even done any analysis.
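To give a rough sense of the map/reduce step mentioned above, here is a minimal sketch in plain Python of the kind of aggregation a CouchDB view performs: a map function emits a key per document, and a reduce function combines the values for each key. The document fields (`member`, `created_at`) and the sample tweets are hypothetical and not taken from the actual data set. Real CouchDB views are written in JavaScript and queried over HTTP, so treat this only as an illustration of the idea.

```python
from collections import defaultdict

# Hypothetical tweet documents; the real schema may differ.
TWEETS = [
    {"member": "rep_a", "created_at": "2011-12-03", "text": "..."},
    {"member": "rep_a", "created_at": "2011-12-19", "text": "..."},
    {"member": "rep_b", "created_at": "2011-12-07", "text": "..."},
    {"member": "rep_a", "created_at": "2012-01-02", "text": "..."},
]


def map_doc(doc):
    """Emit ((member, 'YYYY-MM'), 1) for one document, like a view's map step."""
    month = doc["created_at"][:7]
    yield (doc["member"], month), 1


def reduce_values(values):
    """Combine all values emitted for one key, like a view's reduce step."""
    return sum(values)


def monthly_activity(docs):
    """Run map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


if __name__ == "__main__":
    for key, count in sorted(monthly_activity(TWEETS).items()):
        print(key, count)
```

Grouping by a `(member, month)` key is what makes "monthly activity per member" a single view query rather than a per-member loop.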
Original title and link: Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress (©myNoSQL)
Tuesday, 10 January 2012
MySQL at Twitter: Storing 250mil Tweets Daily
Todd Hoff took the time to dissect Jeremy Cole’s talk, Big and Small Data at @Twitter, from the O’Reilly MySQL conference and extract the interesting bits in a post:
- MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
- MySQL doesn’t work for ID generation and graph storage.
- MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
- Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.
In my summary of the talk I’ve noted:
- Use MySQL when it works, something else when not - fortunately MySQL often does work
- MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
- MySQL doesn’t work well for graphs or auto_increment, and replication lag is a problem
- MySQL replication improvements like a crash-safe, multi-threaded slave are what they need
But Twitter is also one of the most prominent use cases of polyglot persistence. While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.
The following other data solutions get mentioned in Jeremy’s talk:
- Cassandra is used for high velocity writes, and lower velocity reads. The advantage is Cassandra can run on cheaper hardware than MySQL, it can expand easier, and they like schemaless design.
- Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
- Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs.
Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers and they also experimented with using Gizzard for Redis. After buying BackType, Twitter got and then open sourced Storm, a Hadoop-like real-time data processing tool. And who knows what’s in the Twitter labs right now.
I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:
Tuesday, 15 November 2011
Twitter's Real-Time URL Fetcher Using Cassandra and Memcached
Twitter’s real-time URL fetcher, code named SpiderDuck, is an excellent example of how NoSQL databases fit in the architecture of today’s systems:
Metadata Store: This is a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as fetch status for every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata.
SpiderDuck is also using memcached:
Memcached: This is a distributed cache used by the fetchers to temporarily store robots.txt files.
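The caching role described above can be sketched as a cache-aside lookup: check a temporary cache for a host's robots.txt first, and only fetch it on a miss. This is an illustration rather than SpiderDuck's actual code; `TTLCache`, `robots_for`, and the `fetch_robots_txt` callable are hypothetical names, and an in-process dictionary plays the role of memcached.

```python
import time


class TTLCache:
    """In-process stand-in for memcached: values expire after ttl seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)


def robots_for(host, cache, fetch_robots_txt):
    """Cache-aside lookup: try the cache first, fetch and store on a miss."""
    cached = cache.get(host)
    if cached is not None:
        return cached
    body = fetch_robots_txt(host)
    cache.set(host, body)
    return body
```

Because the entries expire, a changed robots.txt is picked up again after the TTL without any explicit invalidation.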

Original title and link: Twitter’s Real-Time URL Fetcher Using Cassandra and Memcached (©myNoSQL)
via: http://engineering.twitter.com/2011/11/spiderduck-twitters-real-time-url.html
Thursday, 4 August 2011
Twitter Open Sourcing Storm at Strange Loop
Ask and you’ll be answered. Nathan Marz announces that Twitter will open source Storm, the Hadoop-like real-time data processing tool developed at BackType:
I’m pleased to announce that I will be releasing Storm at Strange Loop on September 19th!
Here’s a recap of the three broad use cases for Storm:
- Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
- Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
- Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
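The "continuous computation" use case above can be illustrated with a toy, single-process pipeline: one stage extracts hashtags from a stream of tweets, and a second stage keeps running counts and emits the current top topics after every update. This does not use Storm's API and has none of its fault tolerance or distribution; the function names and the sample tweets are made up for the sketch.

```python
from collections import Counter


def hashtag_stream(tweets):
    """First stage: turn a stream of tweet texts into a stream of hashtags."""
    for text in tweets:
        for word in text.split():
            if word.startswith("#") and len(word) > 1:
                yield word.lower()


def trending(hashtags, top_n=3):
    """Second stage: after every hashtag, yield the current top_n counts."""
    counts = Counter()
    for tag in hashtags:
        counts[tag] += 1
        yield counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["Loving #nosql", "#NoSQL and #hadoop", "more #hadoop #hadoop"]
    for snapshot in trending(hashtag_stream(sample)):
        print(snapshot)
```

Both stages are generators, so results flow out as input arrives instead of after a batch completes, which is the property the use case relies on.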
Original title and link: Twitter Open Sourcing Storm at Strange Loop (©myNoSQL)
via: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Tuesday, 5 July 2011
ElephantDB and Storm Join the Twitter Flock
That’s to say BackType, creators of Cascalog, ElephantDB, and Storm, has been acquired by Twitter (which, in case you didn’t know, names most of its open source libraries and storage solutions after birds).
The announcement is here. Looking forward to seeing Storm open sourced.
Original title and link: ElephantDB and Storm Join the Twitter Flock (©myNoSQL)
Sunday, 17 April 2011
Big and Small Data at Twitter: MySQL CE 2011
Jeremy Cole, DBA Lead at Twitter, on MySQL at Twitter, from MySQL CE 2011:
Roland Bouman had some interesting notes (nb: actually tweets) from the talk:
- 115 mln tweets a day, 1 bln tweets a week, about 50.000 new accounts / day
- random server uptime 212d, 127 bln questions (6943/s) rows read: 1.36 mln/s
- Use MySQL when it works, something else when not - fortunately MySQL often does work
- MySQL is used by twitter because it’s robust, replication works and it’s easy to use and run
- MySQL doesn’t work good for graphs, auto_increment, replication lag is a problem
- MySQL replication improvements like crash safe multi-threaded slave exactly what they need
- Twitter open sourced snowflake (id generation system) and Gizzard distributed data storage
- Use soft launches: new code is launched in a disabled state, turn up slowly, back down if needed
- Gizzard builds in MySQL/InnoDB handles sharding, replication, job scheduling
- Twitter uses Cassandra too for some projects. high velocity writes, schemaless design
- Twitter uses Hadoop for analyzing extremely large datasets: 10 to 100 blns rows (http logs)
- Twitter also uses vertica for analysis, 100M - 10Blns of rows. Runs 100x faster than MySQL
- MySQL’s happy place: <= 1.5 TB datasets, archive store for larger sets.
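The notes above mention that auto_increment is a poor fit and that Twitter open sourced snowflake for ID generation. As a rough sketch of how such a scheme can work without a central database counter, the generator below packs a millisecond timestamp, a worker id, and a per-millisecond sequence into one integer. The 41/10/12 bit layout and the epoch value are assumptions modelled on public descriptions of Snowflake, not details from the talk notes, and `IdGenerator` is a hypothetical name.

```python
import threading
import time

# Assumed layout: 41 bits of milliseconds, 10 bits of worker id,
# 12 bits of per-millisecond sequence. The epoch is an assumed constant.
EPOCH_MS = 1_288_834_974_657
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    def __init__(self, worker_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock_ms = clock_ms
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self.clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self.clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Each worker can hand out IDs independently, and because the timestamp occupies the high bits, IDs from one worker sort roughly by creation time.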
Original title and link: Big and Small Data at Twitter: MySQL CE 2011 (NoSQL databases © myNoSQL)
Monday, 14 March 2011
Hadoop and NoSQL Databases at Twitter
Three presentations covering the various NoSQL usages at Twitter:
- Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ
- Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ
- Dmitriy Ryaboy on Hadoop from Devoxx 2010:
By looking at the powered by NoSQL page and my records, Twitter seems to be the largest adopter of NoSQL solutions. Here is an updated version of who is using Cassandra and HBase:
- Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
- Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
- Netflix: Amazon SimpleDB, Cassandra
- Digg: Cassandra
- SimpleGeo: Cassandra
- StumbleUpon: HBase, OpenTSDB
- Yahoo!: Hadoop, HBase, PNUTS
- Rackspace: Cassandra
And probably many more missing from the list. But that could change if you leave a comment.
Original title and link: Hadoop and NoSQL Databases at Twitter (NoSQL databases © myNoSQL)
Wednesday, 23 February 2011
Rewriting the Redis Twitter Clone
The Redis Twitter clone app is showing its age:
I’m looking at the Twitter Clone and noticed a N + 1 -like “get” in the code […] The above code seems rather suboptimal, if my understanding is correct.
At least three better approaches have been suggested, so who is up for experimenting with Redis and rewriting this app to use latest Redis features?
- use pipelining to get all the posts in one server roundtrip (won’t change the code much and will be much faster)
- use SORT…GET semantics to get all the post data at once from the list of ids (should be somewhat faster than the first option)
- use MGET to get all the post data at once
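To make the difference concrete, the sketch below contrasts the N+1 pattern with the MGET approach from the list above. So that it runs without a server, it uses a tiny in-memory `FakeRedis` stand-in (hypothetical) that counts simulated round trips; its `get` and `mget` methods mirror the calls of the same name on a redis-py client. The `post:<id>` key naming is an assumption about how the clone stores posts.

```python
class FakeRedis:
    """Tiny in-memory stand-in that counts simulated server round trips."""

    def __init__(self, data):
        self.data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.data.get(key)

    def mget(self, keys):
        self.round_trips += 1
        return [self.data.get(key) for key in keys]


def posts_one_by_one(client, post_ids):
    """The N+1 pattern: one GET, and one round trip, per post id."""
    return [client.get(f"post:{post_id}") for post_id in post_ids]


def posts_with_mget(client, post_ids):
    """Fetch every post in a single MGET call."""
    if not post_ids:
        return []  # avoid sending an MGET with no keys
    return client.mget([f"post:{post_id}" for post_id in post_ids])
```

Both functions return the posts in the same order as the ids; the only difference is the number of round trips, which grows with the timeline length in the first version and stays at one in the second.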
Original title and link: Rewriting the Redis Twitter Clone (NoSQL databases © myNoSQL)
via: http://groups.google.com/group/redis-db/browse_thread/thread/a67aae56aca2bc84
Saturday, 5 February 2011
Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution
Kevin Weil[1] presented Twitter’s ZooKeeper and Cassandra based solution for realtime analytics named Rainbird at Strata 2011:
Until recently, counters were a unique feature of HBase. While the latest version of Cassandra does not include distributed counters, this feature is available in Cassandra’s trunk.
- Kevin Weil: Product Lead for Revenue, Twitter, @kevinweil ↩
Original title and link: Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution (NoSQL databases © myNoSQL)
Thursday, 16 December 2010
CouchDB Usecase: Decentralizing Twitter
J. Chris Anderson in an interview with ReadWriteWeb:
Klint Finley: Let’s start at the top: what exactly is Twebz? It’s described as a “decentralized Twitter client.” What exactly does that mean?
J Chris Anderson: The aim is to allow you to interact with Twitter when Twitter is up and you are online. But if Twitter is down for maintenance or you are in the middle of nowhere, you can still tweet. And when you can reach Twitter again, it will go through.
If lots of folks are using it, then they can see each other’s tweets come in even when Twitter is down.
Mostly the goal was to show the way on how to integrate CouchDB with web services and APIs.
A classic example of CouchDB’s powerful P2P replication capabilities. Dave Winer would probably be its ☞ biggest fan.
Original title and link: CouchDB Usecase: Decentralizing Twitter (NoSQL databases © myNoSQL)
via: http://www.readwriteweb.com/hack/2010/12/j-chris-anderson-interview.php
Saturday, 13 November 2010
Videos from Hadoop World
There was one NoSQL conference that I missed, and I was really pissed off about it: Hadoop World. Even though I followed and curated the Twitter feed, resulting in Hadoop World in tweets, the feeling of not being there made me really sad. But now, thanks to Cloudera, I’ll be able to watch most of the presentations. Many of them have already been published and the complete list can be found ☞ here.
Based on the twitter activity on that day, I’ve selected below the ones that seemed to have generated most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, AOL. And there are quite a few more …
Sunday, 31 October 2010
Hadoop at Twitter: An Interview with Kevin Weil, Twitter Analytics Lead
Kevin Weil[1] in an interview about Twitter’s usage of Hadoop:
Hadoop is our data warehouse; every piece of data we store is archived in HDFS. We use HBase for data that sees updates frequently, or data we occasionally need low-latency access to. Every node in our cluster runs HBase. We use Java MapReduce for simple jobs, or jobs which have tight performance requirements. We use Pig for most of our analysis jobs, because its flexibility helps us iterate rapidly to arrive at the right way of looking at the data.
Our Hadoop use is also evolving: initially it was primarily used as an analysis tool to help us better understand the Twitter ecosystem, and that’s not going to change. But it’s increasingly used to build parts of products you use on the site every day such as People Search, the data for which is built with Hadoop. There are many more products like this in development.
Undeniably, Twitter is (deep) into NoSQL.
- Kevin Weil: Twitter Analytics Lead, @kevinweil (↩)
Original title and link: Hadoop at Twitter: An Interview with Kevin Weil, Twitter Analytics Lead (NoSQL databases © myNoSQL)
