twitter: All content tagged as twitter in NoSQL databases and polyglot persistence
Tuesday, 17 January 2012
Flake: An Erlang Decentralized, K-Ordered Unique ID Generator by Boundary
After posting about the proposal of adding a 128-bit K-Order unique id generator to Redis, I’ve realized I haven’t linked to Boundary’s post about Flake, their decentralized k-ordered unique id generator project:
All that is required to construct such an id is a monotonically increasing clock and a location 3. K-ordering dictates that the most-significant bits of the id be the timestamp. UUID-1 contains this information, but arranges the pieces in such a way that k-ordering is lost. Still other schemes offer k-ordering with either a questionable representation of ‘location’ or one that requires coordination among nodes.
Original title and link: Flake: An Erlang Decentralized, K-Ordered Unique ID Generator by Boundary (©myNoSQL)
Monday, 16 January 2012
Redis: Adding a 128-Bit K-Ordered Unique Id Generator
Interesting pull request submitted to Redis by Pierre-Yves Ritschard:
This is similar to what snowflake and the recent boundary solution do, but it makes sense to use redis for that type of use cases for people wanting a simple way to get incremental ids in distributed systems without an additional daemon requirement.
Snoflake is the network service for generating unique ID numbers at high scale used (and open sourced) by Twitter. Flake is an Erlang decentralized, k-ordered id generation service open sourced by Boundary.
Original title and link: Redis: Adding a 128-Bit K-Ordered Unique Id Generator (©myNoSQL)
Wednesday, 11 January 2012
Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress
Drew Conway:
Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API. […] But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages; including but not restricted to; curl, map/reduce, Javascript, and JSON. And that’s before you have even done any analysis.
Original title and link: Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress (©myNoSQL)
Tuesday, 10 January 2012
MySQL at Twitter: Storing 250mil Tweets Daily
Todd Hoff took the time to disect and extract in a post the interesting bits from Jeremy Cole’s talk[1]Big and Small Data at @Twitter from the O’Reilly MySQL conference:
- MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
- MySQL doesn’t work for ID generation and graph storage.
- MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
- Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.
In my summary of the talk I’ve noted:
- Use MySQL when it works, something else when not - fortunately MySQL often does work
- MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
- MySQL doesn’t work good for graphs, auto_increment, replication lag is a problem
- MySQL replication improvements like crash safe multi-threaded slave is what they need
But Twitter is also one of the most prominent use cases of polyglot persistence.While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.
The following other data solutions get mentioned in Jeremy’s talk:
- Cassandra is used for high velocity writes, and lower velocity reads. The advantage is Cassandra can run on cheaper hardware than MySQL, it can expand easier, and they like schemaless design.
- Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
- Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs.
Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers and they also experimented with using Gizzard for Redis. After buying BackType, Twitter got and then open sourced Storm, a Hadoop-like real-time data processing tool. And who knows what’s in the Twitter labs right now.
I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:
Tuesday, 15 November 2011
Twitter's Real-Time URL Fetcher Using Cassandra and Memcached
Twitter’s real-time URL fetcher, code named SpiderDuck, is an excellent example of how NoSQL databases fit in the architecture of today’s systems:
Metadata Store: This is a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as fetch status for every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata.
SpiderDuck is also using memcached:
Memcached: This is a distributed cache used by the fetchers to temporarily store robots.txt files.

Original title and link: Twitter’s Real-Time URL Fetcher Using Cassandra and Memcached (©myNoSQL)
via: http://engineering.twitter.com/2011/11/spiderduck-twitters-real-time-url.html
Friday, 5 August 2011
Twitter Open Sourcing Storm at Strange Loop
Ask and you’ll be answered. Nathan Marz announces that Twitter will open source Storm, the Hadoop-like real-time data processing tool developed at BackType:
I’m pleased to announce that I will be releasing Storm at Strange Loop on September 19th!
Here’s a recap of the three broad use cases for Storm:
- Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
- Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
- Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
Original title and link: Twitter Open Sourcing Storm at Strange Loop (©myNoSQL)
via: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Tuesday, 5 July 2011
ElephantDB and Storm Join the Twitter Flock
That’s to say BackType, creators of Cascalog, ElephantDB, and Storm , has been acquired by Twitter (which in case you didn’t know names most of their open source libraries and storage solutions using bird names).
The announcement is here . Looking forward to seeing Storm open sourced.
Original title and link: ElephantDB and Storm Join the Twitter Flock (©myNoSQL)
Sunday, 17 April 2011
Big and Small Data at Twitter: MySQL CE 2011
Twitter DBA Lead at Twitter, Jeremy Cole‘s talk about MySQL at Twitter from MySQL CE 2011:
Roland Bouman had some interesting notes (nb: actually tweets) from the talk:
-
115 mln tweets a day, 1 bln tweets a week, about 50.000 new accounts / day
-
random server uptime 212d, 127 bln questions (6943/s) rows read: 1.36 mln/s
-
Use MySQL when it works, something else when not - fortunately MySQL often does work
-
MySQL is used by twitter because it’s robust, replication works and it’s easy to use and run
-
MySQL doesn’t work good for graphs, auto_increment, replication lag is a problem
-
MySQL replication improvements like crash safe multi-threaded slave exactly what they need
-
Twitter open sourced snowflake (id generation system) and Gizzard distributed data storage
-
Use soft launches: new code is launched in a disabled state, turn up slowly, back down if needed
-
Gizzard builds in MySQL/InnoDB handles sharding, replication, job scheduling
-
Twitter uses Cassandra too for some projects. high velocity writes, schemaless design
-
Twitter uses Hadoop for analyzing extremely large datasets: 10 to 100 blns rows (http logs)
-
Twitter also uses vertica for analysis, 100M - 10Blns of rows. Runs 100x faster than MySQL
-
MySQL’s happy place: <= 1.5 TB datasets, archive store for larger sets.
Original title and link: Big and Small Data at Twitter: MySQL CE 2011 (NoSQL databases © myNoSQL)
Monday, 14 March 2011
Hadoop and NoSQL Databases at Twitter
Three presentations covering the various NoSQL usages at Twitter:
-
Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ
-
Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ
-
Dmitriy Ryaboy on Hadoop from Devoxx 2010:
By looking at the powered by NoSQL page and my records, Twitter seems to be the largest adopter of NoSQL solutions. Here is an updated version of who is using Cassandra and HBase
- Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
- Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
- Netflix: Amazon SimpleDB, Cassandra
- Digg: Cassandra
- SimpleGeo: Cassandra
- StumbleUpon: HBase, OpenTSDB
- Yahoo!: Hadoop, HBase, PNUTS
- Rackspace: Cassandra
And probably many more missing from the list. But that could change if you leave a comment.
Original title and link: Hadoop and NoSQL Databases at Twitter (NoSQL databases © myNoSQL)
Wednesday, 23 February 2011
Rewriting the Redis Twitter Clone
The Redis Twitter clone app is showing its age:
I’m looking at the Twitter Clone and noticed a N + 1 -like “get” in the code […] The above code seems rather suboptimal, if my understanding is correct.
At least three better approaches have been suggested, so who is up for experimenting with Redis and rewriting this app to use latest Redis features?
- use pipelining to get all the posts in one server roundtrip (won’t change the code much and be much faster)
- use
SORT…GETsemantics to get all the post data at once from the list of ids (should be somewhat faster than 1) - Use
MGETto get all the post data at once.
Original title and link: Rewriting the Redis Twitter Clone (NoSQL databases © myNoSQL)
via: http://groups.google.com/group/redis-db/browse_thread/thread/a67aae56aca2bc84
Sunday, 6 February 2011
Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution
Kevin Weil[1] presented Twitter’s ZooKeeper and Cassandra based solution for realtime analytics named Rainbird at Strata 2011:
Until recently, counters where a unique feature of HBase. While the latest version of Cassandra does not include distributed counters, this feature is available in Cassandra’s trunk.
-
Kevin Weil: Product Lead for Revenue, Twitter, @kevinweil ↩
Original title and link: Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution (NoSQL databases © myNoSQL)
Thursday, 16 December 2010
CouchDB Usecase: Decentralizing Twitter
J.Chris Anderson in an interview over ReadWriteWeb:
Klint Finley: Let’s start at the top: what exactly is Twebz? It’s described as a “decentralized Twitter client.” What exactly does that mean?
J Chris Anderson: The aim is to allow you to interact with Twitter when Twitter is up and you are online. But if Twitter is down for maintenance or you are in the middle of nowhere, you can still tweet. And when you can reach Twitter again, it will go through.
If lots of folks are using it, then they can see each other’s tweets come in even when Twitter is down.
Mostly the goal was to show the way on how to integrate CouchDB with web services and APIs.
A classical example of CouchDB powerful P2P replication capabilities. Dave Winer would probably be its ☞ biggest fan.
Original title and link: CouchDB Usecase: Decentralizing Twitter (NoSQL databases © myNoSQL)
via: http://www.readwriteweb.com/hack/2010/12/j-chris-anderson-interview.php
Most Popular Articles
- Translate SQL to MongoDB MapReduce
- Tutorial: Getting Started With Cassandra
- CouchDB vs MongoDB: An attempt for a More Informed Comparison
- Cassandra @ Twitter: An Interview with Ryan King
- A Couple of Nice GUI Tools for MongoDB
- NoSQL benchmarks and performance evaluations
- Ehcache: Distributed Cache or NoSQL Store?
- Document Databases Compared: CouchDB, MongoDB, RavenDB
- Quick Review of Existing Graph Databases
- NoSQL Data Modeling
