Not only is Mozilla celebrating the release of Firefox 4, but they took the time to set up a nice visualization for downloads.
glow.mozilla.org is powered by tailing logs and streaming data into HBase:
- The various load balancing clusters that host download.mozilla.org are configured to log download requests to a remote syslog server.
- The remote server is running rsyslog and has a config that specifically filters those remote syslog events into a dedicated file that rolls over hourly.
- SQLStream is installed on that server and it is tailing those log files as they appear.
- The SQLStream pipeline does the following for each request:
- filters out anything other than valid download requests
- uses MaxMind GeoIP to get a geographic location from the IP address
- uses a streaming GROUP BY to aggregate the number of downloads by product, location, and timestamp
- every 10 seconds, sends a stream of counter increments to HBase for the timestamp row, with the column qualifiers being each distinct location that had downloads in that time interval
- The glow backend is a Python app that pulls the data out of HBase using the Python Thrift interface and writes a file containing a JSON representation of the data every minute.
- That JSON file can be cached on the front end forever, since each minute of data has a distinct filename.
- The glow website pulls down that data and plays back the downloads, or lets you browse the geographic totals in the arc chart view.
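The counter schema and the per-minute JSON export described above can be sketched in a few lines of Python. A real deployment would use the HBase Thrift client (e.g. happybase's `Table.counter_inc`); here an in-memory dict stands in for the HBase table so the row/column layout is easy to see. The table layout, names, and timestamps are illustrative assumptions, not Mozilla's actual schema.

```python
# Sketch of the pipeline's storage step: row key = 10-second timestamp
# bucket, column qualifier = location, cell = download counter.
# An in-memory dict stands in for the HBase table (assumption).
import json
from collections import defaultdict

# row key (timestamp bucket) -> {column qualifier (location) -> counter}
hbase_table = defaultdict(lambda: defaultdict(int))

def counter_inc(row, column, value=1):
    """Stand-in for HBase's atomic counter increment."""
    hbase_table[row][column] += value

def record_downloads(bucket_ts, location_counts):
    """One increment per distinct location in a 10-second interval."""
    for location, n in location_counts.items():
        counter_inc(bucket_ts, location, n)

def export_minute(minute_ts):
    """Collect the six 10-second buckets of one minute into a JSON blob.

    Each minute gets a distinct filename, which is what lets the
    front end cache the file forever.
    """
    data = {str(b): dict(hbase_table[b])
            for b in range(minute_ts, minute_ts + 60, 10)}
    return 'downloads-%d.json' % minute_ts, json.dumps(data)

# e.g. two 10-second intervals of downloads (made-up numbers)
record_downloads(1300000000, {'DE': 3, 'US': 5})
record_downloads(1300000010, {'US': 2})
name, blob = export_minute(1300000000)
```

Because counter increments are atomic on the server side, many SQLStream workers could push increments for the same timestamp row concurrently without coordination.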
This sounds a lot like what Facebook is doing for their new Real-Time Analytics system. The missing parts are Scribe and ptail.
Original title and link: Firefox Downloads Visualization Powered by HBase (NoSQL databases © myNoSQL)
There was one NoSQL conference that I missed, and it really pissed me off: Hadoop World. Even though I followed and curated the Twitter feed, resulting in Hadoop World in tweets, not being there made me really sad. But now, thanks to Cloudera, I'll be able to watch most of the presentations. Many of them have already been published, and the complete list can be found ☞ here.
Based on the Twitter activity that day, I've selected below the ones that seemed to generate the most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, and AOL. And there are quite a few more…
Mozilla has previously published their detailed plan and the extensive investigation into Cassandra, HBase, and Riak that led to choosing Riak. This time they are publishing some extensive Riak benchmark results (against both Riak 0.10 and Riak 0.11 running Bitcask), using Riak's own benchmarking code, which is included in the list of correct NoSQL benchmarks and performance evaluations. The results, their analysis, and their interpretation are all fascinating.
Our goal in running these studies was, simply put, no surprises. That meant we needed to run studies that profiled:
- Stability, especially for long running tests
- Performance when we introduced variable object sizes
- Performance when we introduced pre-commit hooks to evaluate incoming data
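A study along the second dimension above can be mimicked with a tiny load-generation loop in the style of Riak's benchmarking tool: a key generator, a value generator that varies object size, and per-operation latency measurement. This is a hedged sketch, not Mozilla's harness; the in-memory `store` dict is a stand-in for a Riak client.

```python
# Minimal sketch of a variable-object-size benchmark: generators for
# keys and variably sized values, plus a loop recording per-op latency.
# The in-memory dict is a stand-in for a real Riak client (assumption).
import random
import time

def uniform_key_gen(max_key):
    """Pick keys uniformly from a fixed keyspace."""
    return lambda: 'key%d' % random.randint(0, max_key)

def variable_size_value_gen(min_bytes, max_bytes):
    """Produce opaque blobs of varying size, the study's variable."""
    return lambda: b'x' * random.randint(min_bytes, max_bytes)

def run_study(ops, key_gen, value_gen):
    store = {}          # stand-in for a Riak client
    latencies = []
    for _ in range(ops):
        t0 = time.perf_counter()
        store[key_gen()] = value_gen()   # a PUT against the store
        latencies.append(time.perf_counter() - t0)
    return store, latencies

store, latencies = run_study(1000,
                             uniform_key_gen(100),
                             variable_size_value_gen(100, 10_000))
```

The long-running-stability study is the same loop with a wall-clock duration instead of a fixed op count, watching whether the latency distribution drifts over hours.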
I guess Mozilla Test Pilot is one of Riak's most interesting case studies.
Original title and link for this post: Extensive Riak Benchmarking at Mozilla Test Pilot (published on the NoSQL blog: myNoSQL)
Mozilla shows us the right way of choosing a storage solution (as opposed to this completely incorrect way):
- list as many requirements and details as you have
- identify critical features
- install, experiment, and compare against your checklist
- analyze and document missing features, nice-to-haves, etc.
Not only that, but the post goes on to explain how Cassandra, HBase, and Riak each answer the following requirements:
- Scalability — Deliver a solution that can handle the expected starting load and that can easily scale out as that load goes up.
- Elasticity — Because the peak traffic periods are relatively short and the non-peak hours are almost idle, it is important to consider ways to ensure the allocated hardware is not sitting idle, and that you aren’t starved for resources during the peak traffic periods.
- Reliability — Stability and high availability are important. They aren't as critical as they might be in certain other projects, but if we were down for several hours during a peak traffic period, the client layer needs to be able to retain the data and resubmit it at a later date.
- Storage — Need enough room to store active experiments and also recent experiments that are being analyzed. It is expected that data will become stale over time and can be archived off of the active cluster.
- Analysis — What do we have to put together to provide a friendly system to the analysts?
- Cost — Actual cost of the additional hardware needed to deploy the initial solution and to scale through at least the end of the year.
- Manpower — How much time and effort will it take us to deliver the first critical stage of the project and the subsequent stages? Also consider ongoing maintenance and ownership of the code.
- Security — Because we will be accepting data from an outside, untrusted source, we need to consider what steps are necessary to ensure the health of the system and the privacy of users.
- Extensibility — Delivering a platform that can readily evolve to meet the future needs of the project, and hopefully of other projects as well.
- Disaster Recovery / Migration — If the original system fails to meet the requirements after going live, what options do we have to recover from that situation? If we decide to switch to another technology, how do we move the data?
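One way to turn a requirements list like this into a side-by-side comparison is a small weighted score matrix. To be clear, every weight and score below is a made-up placeholder for illustration, not a number from Mozilla's evaluation.

```python
# Hypothetical weighted scoring of candidates against a requirements
# checklist. All weights and scores are made-up placeholders, NOT
# numbers from Mozilla's actual evaluation.
weights = {'scalability': 3, 'elasticity': 2, 'reliability': 3, 'cost': 1}

scores = {   # 1 (poor) .. 5 (excellent), per requirement, per candidate
    'Cassandra': {'scalability': 5, 'elasticity': 4, 'reliability': 4, 'cost': 3},
    'HBase':     {'scalability': 5, 'elasticity': 3, 'reliability': 3, 'cost': 3},
    'Riak':      {'scalability': 4, 'elasticity': 5, 'reliability': 5, 'cost': 4},
}

def rank(weights, scores):
    """Weighted total per candidate, sorted best-first."""
    totals = {name: sum(weights[req] * s[req] for req in weights)
              for name, s in scores.items()}
    return sorted(totals.items(), key=lambda kv: -kv[1])

ranking = rank(weights, scores)
```

A matrix like this doesn't replace the install-and-experiment step, but it forces the weights (what actually matters for this project) to be written down before the results come in.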
While they are not the only ones doing such extensive investigative work — see also Cassandra at Twitter and HBase at Adobe — there are many things to be learned from their experience. Thanks Mozilla for sharing it with us!
Also available: a comparison of Cassandra, HBase, and PNUTS, as well as Cassandra and HBase compared.