Since the release of Kyoto Cabinet 1.0, which can be considered the successor to Tokyo Cabinet, I haven’t heard much from the Tokyo Cabinet/Tyrant world (except some political news and a few furniture-related announcements on craigslist, but those are not really of interest to the NoSQL community).
Some time ago I had a chance to talk with Florent Solt (@florentsolt), Chief Architect at ☞ Netvibes, about their use of the Tokyo family (Tokyo Cabinet and Tokyo Tyrant). While I don’t have enough details about the Tokyo market, I’d be ready to speculate that Netvibes is probably one of the biggest users of the Tokyo product family.
To give you a quick overview of the Netvibes system, here are some interesting points in random order:
- Netvibes uses Tokyo Tyrant, never Tokyo Cabinet directly
- The Netvibes architecture is master-slave (due to weird things in master-master)
- Netvibes is using its own sharding method
- Netvibes makes use of Tokyo Cabinet hash, btree and table storage
- only feed-related information is in Tokyo databases (feeds, items, read/unread, …)
- other information is still in a MySQL database (accounts, tabs, pages, widgets, …)
- to schedule crawling events, a queue has been implemented with a Tokyo Tyrant server and Lua
- Netvibes is using a custom transparent proxy (Ruby + EventMachine) to move/migrate data between servers
And now the Q&A part:
nosql: It sounds like initially all data lived inside MySQL. What made you look to alternative storage solutions?
Florent: Exactly. We started looking at an alternative when we reached MySQL limits. It was mostly disk space fragmentation issues (with blobs) and raw speed for insert.
nosql: How did you choose Tokyo Cabinet and Tokyo Tyrant?
Florent: We did some research, but 1.5 years ago there were fewer solutions than now.
So we did some benchmarks, based on our own data (very important) and our architecture. We tried: Hadoop, CouchDB, Tokyo Tyrant, file system only (just to have a raw comparison with what is, IMHO, one of the simplest ways to store data) and MySQL.
In terms of budget, responsiveness and knowledge gap, Tokyo was the winner.
nosql: What data has been moved to Tokyo?
Florent: We are using Tokyo for our feeds backend. Everything related to feeds such as feed items, enclosures, read/unread flags are stored in Tokyo. Same goes for the data structures we need to crawl all these feeds, such as a queue.
nosql: What criteria have you used to make this separation?
Florent: The separation was not clearly related to Tokyo; it was a product decision. We wanted to implement this feed backend as a standalone module. We only interact with it through an API.
nosql: How have you migrated existing data?
Florent: Indeed, initially the feed data was in MySQL tables.
The migration was simple in terms of logic, but long and difficult to achieve. The main idea was that when data unknown to the new backend was requested, a fallback query asked MySQL for it, and the result was then saved in Tokyo. It sounds easy, but in reality there were many specific cases and strange issues.
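The fallback logic Florent describes can be sketched roughly as follows. This is only an illustration, not the Netvibes code: the class name `MigratingBackend` is made up, and plain Python dicts stand in for both the Tokyo Tyrant store and the MySQL tables.

```python
class MigratingBackend:
    """Serve reads from the new store, falling back to the legacy one on a miss."""

    def __init__(self, new_store, legacy_store):
        self.new_store = new_store        # stands in for Tokyo Tyrant (assumption)
        self.legacy_store = legacy_store  # stands in for MySQL (assumption)

    def get(self, key):
        value = self.new_store.get(key)
        if value is not None:
            return value
        # Miss in the new store: ask the legacy store, then copy the record over
        # so that the next request is served by the new store directly.
        value = self.legacy_store.get(key)
        if value is not None:
            self.new_store[key] = value
        return value


legacy = {"feed:1": {"title": "Example feed"}}
new = {}
backend = MigratingBackend(new, legacy)
first = backend.get("feed:1")    # found in the legacy store, then copied
second = backend.get("feed:1")   # now found in the new store
```

The "specific cases and strange issues" mentioned in the answer (concurrent writes, partially migrated records and so on) are exactly what such a small sketch leaves out.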
nosql: You are using Tokyo hash, btree and tables. Would you mind giving some examples for what kind of data lives in each of them and how have you decided that is the best option?
Florent: When you really understand each structure, it’s pretty easy to pick the best choice. For example:
- When we need only raw speed, we use a hash.
- When we need complex key strategies (based on prefix), we use btree.
- When we need conditional queries, we use tables.
For example, feeds (url, title, author, …) are stored in a Table. Same goes for the feed items and enclosures.
The queue is a Hash, to keep the focus on speed. The first implementation was based on a BTree, but we improved our algorithms to use guessable keys only and avoid key scanning. There are also some Lua functions attached, to hide the implementation and keep the whole thing fast too.
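One way to picture a queue with "guessable keys" on a hash store is to keep two counters and derive every item key from them, so no key scan is ever needed. The sketch below is an assumption about the general idea, not the Netvibes algorithm: the key names (`queue:head`, `queue:tail`, `queue:item:N`) are invented, and a Python dict stands in for the Tokyo hash database.

```python
class HashQueue:
    """FIFO queue on a plain key-value map, using predictable keys only."""

    def __init__(self, store):
        self.store = store
        self.store.setdefault("queue:head", 0)  # position of the next item to read
        self.store.setdefault("queue:tail", 0)  # position of the next item to write

    def push(self, item):
        tail = self.store["queue:tail"]
        self.store[f"queue:item:{tail}"] = item
        self.store["queue:tail"] = tail + 1

    def pop(self):
        head = self.store["queue:head"]
        if head == self.store["queue:tail"]:
            return None  # queue is empty
        # The key is computed from the counter, never discovered by scanning.
        item = self.store.pop(f"queue:item:{head}")
        self.store["queue:head"] = head + 1
        return item


queue = HashQueue({})
queue.push("crawl feed 1")
queue.push("crawl feed 2")
popped = [queue.pop(), queue.pop(), queue.pop()]
```

With several clients, each push and pop would have to run atomically; doing that on the server side is presumably one reason for the Lua functions mentioned above.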
Flags (where we store read/unread data) are stored in a BTree with a Lua extension, because we scan keys a lot.
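To illustrate why an ordered structure suits prefix-based key scanning, here is a minimal sketch in which a sorted key list plays the role of the BTree. The key layout (`user:<id>:item:<id>`) and the `OrderedStore` class are hypothetical and chosen only for the example.

```python
import bisect


class OrderedStore:
    """Tiny stand-in for an ordered (BTree-like) key-value store."""

    def __init__(self):
        self._keys = []     # kept sorted at all times
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def prefix_scan(self, prefix):
        # Jump to the first key >= prefix, then walk forward while keys match.
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._values[key]))
        return result


flags = OrderedStore()
flags.put("user:42:item:1001", "read")
flags.put("user:42:item:1002", "unread")
flags.put("user:420:item:5", "read")
flags.put("user:7:item:1001", "read")
user_42_flags = flags.prefix_scan("user:42:")
```

Because keys sharing a prefix sit next to each other, the scan touches only the matching keys; a hash gives no such ordering, which is why it fits point lookups instead.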
nosql: Can you speak a bit more about the in-house sharding solution you are using?
Florent: Sure. Tokyo does not come with a sharding or dynamic partitioning implementation, so we built our own solution. It’s feed or user centric. For example, we know that the feed table will always fit on one dedicated server, whatever the number of feeds. So, for each feed, we store where its items are (the id of the shard server).
For the flags, the logic is the same: for a given user, we know where his flags are. This makes it easy to add new shards, because it’s one line in a configuration file. And we have created all the scripts we need to move data from one shard to another (migration, auto-balance, …).
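A directory-style lookup of this kind might look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the host names, the `ShardDirectory` class, and the "least loaded shard" placement rule are not from Netvibes.

```python
# Hypothetical shard list, standing in for the configuration file:
# adding a shard means adding one entry here.
SHARDS = {
    1: "tyrant-items-1.example:1978",
    2: "tyrant-items-2.example:1978",
}


class ShardDirectory:
    """Remember, per feed or user, which shard holds its data."""

    def __init__(self, shards):
        self.shards = shards
        self.location = {}  # entity id -> shard id

    def assign(self, entity_id):
        """Place a new entity on the shard holding the fewest entities (assumed policy)."""
        if entity_id not in self.location:
            counts = {shard_id: 0 for shard_id in self.shards}
            for shard_id in self.location.values():
                counts[shard_id] += 1
            # Ties are broken by the lowest shard id.
            self.location[entity_id] = min(counts, key=lambda s: (counts[s], s))
        return self.location[entity_id]

    def server_for(self, entity_id):
        return self.shards[self.location[entity_id]]

    def move(self, entity_id, shard_id):
        """Record a migration; copying the data itself is out of scope here."""
        if shard_id not in self.shards:
            raise KeyError(shard_id)
        self.location[entity_id] = shard_id


directory = ShardDirectory(dict(SHARDS))
placements = [directory.assign(f"feed:{n}") for n in (1, 2, 3)]
directory.shards[3] = "tyrant-items-3.example:1978"  # the "one line" for a new shard
new_placement = directory.assign("feed:4")
```

Since the location is stored explicitly rather than computed from a hash of the key, adding a shard does not force existing data to move; rebalancing becomes a separate step, such as the migration scripts mentioned in the answer.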
nosql: What lessons have you learned that you’d have liked to know before using Tokyo?
Florent: Very difficult to say as we have learned so much with this project.
Maybe the most important point would be to know how Tokyo Tyrant servers manage load, and what the best practices are for preventing common speed issues. That was what we learned the hard way.
nosql: Any numbers about Netvibes Tokyo deployment you can share with us?
Florent: About numbers, you already know that it’s sensitive information :-). I can’t say more than the numbers in my slides.
nosql: Fair enough. Thank you so much Florent!