February 2012
81 posts
2 tags
Possible 100-fold increase in data storage speed →
European researchers may have found a way to speed up data storage 100-fold, breaking one barrier holding back how fast data can be transferred. […] The researchers at York University in the U.K. and Nijmegen University in the Netherlands accomplished the feat by heating a magnetic material with laser bursts that alter what is called the magnetic spin of the material at the atomic level,...
Feb 15th
4 notes
3 tags
Everything is Big Data Now... But Don't let...
Peter Collingridge for Jenn Webb in Book marketing is broken. Big data can fix it on O’Reilly Radar: But when you’re in a much faster-paced world, with the industry moving toward being consumer- rather than trade-facing, and with a fragmented retail and media landscape, you need to make decisions based on fact: What is the ROI on a £50,000 marketing campaign? Where do my banner ads have...
Feb 15th
5 tags
Polyglot persistence at Pinterest: Redis, Membase,...
I’ve created the diagram above based on a brief answer on Quora: We use python + heavily-modified Django at the application layer.  Tornado and (very selectively) node.js as web-servers.  Memcached and membase / redis for object- and logical-caching, respectively.  RabbitMQ as a message queue.  Nginx, HAproxy and Varnish for static-delivery and load-balancing.  Persistent data storage using...
Feb 15th
5 notes
4 tags
Cassandra and MongoDB with Gigaspaces Cloudify
There are two reasons I’m writing about Gigaspaces’s Cloudify (PR announcement): Besides MySQL, Cloudify recipes include Cassandra and MongoDB. Also a bit of vintage claim chowder: if you remember Mike Gualtieri’s (Forrester) NoSQL wants to be elastic caching when it grows up, this should be clear proof he was wrong. Gigaspaces is starting to realize that it’s not really necessary to...
Feb 15th
3 notes
6 tags
Hosted and Managed NoSQL: Cassandra, Redis,...
In the last few days I’ve read about some new NoSQL hosting solutions: Cassandra: managed hardware & software hosting: Per node: Intel Dual Quad-core (8 cpu’s), 16gb of memory, 2tb primary storage + 500gb commitlog drive 5 public ip addresses, 1000Mbps private network port. Debian, CentOS, RedHat or FreeBSD Cassandra setup, configuration and ongoing maintenance (repairs, cleanups,...
Feb 15th
3 notes
3 tags
High Performance Rails Caching with Redis and... →
here at rapidrabbit we deliver many 1,000 requests per second. doing this while only using a handful of servers and ruby on rails we employ very clever caching using redis and nginx. once the cache is written, it is directly accessed by nginx via a module, which makes it around 500-2,000 times faster than any rails controller. Bypassing the slowest component in your stack by using caching...
Feb 15th
2 notes
1 tag
ZooKeeper 3.4.3 First Beta Quality in the 3.4...
After 3 alpha versions in the 3.4 series, ZooKeeper 3.4.3, announced the other day, is the first to be considered a beta release: it’s not production ready, but more serious bugs have been fixed. Because ZooKeeper is such a critical component of the Hadoop ecosystem, it is essential for it to go through extensive testing before being declared production ready. Besides the release notes, both Hortonworks and...
Feb 15th
1 tag
Trivialization of the Big Data term
Most teams in small businesses are required to manage a never-ending stream of changing schedules, shifting priorities, and adjustments to resource allocation, which result in a massive amount of updates to their project portfolio. How can smaller organizations leverage big data to gain visibility, control and predictability over their work? Seriously? Original title and link: Trivialization...
Feb 14th
3 notes
3 tags
Step-by-Step Guide to Amazon DynamoDB for .NET... →
This tutorial is meant for the .NET developers to get started with Amazon DynamoDB. I will show you how to create a Table and perform CRUD operations on it. Amazon DynamoDB provides a low-level API and an Object Persistence API for the .NET developers. In this tutorial, we will see how to use the Object Persistence API to talk to Amazon DynamoDB. We will model a component that represents a DVD...
Feb 13th
3 notes
3 tags
Arya a MongoDB based Search Engine →
The system is currently hard coded with one tokenizer and one analyzer. This can easily be changed. The searcher returns the document and the score it received but not where the term is, or any information on how to ‘highlight’ the result. This is doable by adding in the required information into the match embedded document and processing it out in the Map Reduce phase. There is no query caching...
Feb 13th
2 notes
1 tag
It's a revolution - The Impact of Big Data in the... →
Gary King, director of Harvard’s Institute for Quantitative Social Science for The New York Times: “It’s a revolution. We’re really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.”
Feb 13th
1 note
12 tags
The components and their functions in the Hadoop...
Edd Dumbill enumerates the various components of the Hadoop ecosystem: My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, though it lacks Ambari and HCatalog, which were released later.
Feb 13th
1 note
2 tags
What types of applications might a graph database...
Found this list of use cases for graph databases in a follow-up to a Neo4j webinar: Social networks; Collaboration programs; Configuration Management; Geo-Spatial applications; Impact Analysis; Master Data Management; Network Management; Product Line Management; Recommendation Engines. The more generic answer would be that graph databases can be a great fit for problems handling highly connected...
Feb 13th
3 notes
5 tags
How Web giants store big data →
A not very technical ArsTechnica overview of the storage engines developed and used by Google (Google File System, BigTable), Amazon (Dynamo), Microsoft (Azure DFS), plus the Hadoop Distributed File System (HDFS).
Feb 13th
1 note
1 tag
The document is the single source of truth →
Paul Hammant: When it comes to data storage the obvious conclusion is that the backend should save something pretty close to the document that the client presents, mutates, and sends back to the server for posterity. […] Use a document store instead. When would you use a normalized DB design today? The answer to that is: only when you have other processes reading and writing to your database. ...
Feb 12th
7 tags
Scaling Video Analytics with Cassandra by Ilya...
To keep with last week’s model (an educational video about Cassandra, followed by a Cassandra case study), today’s video in the Cassandra NYC 2011 video series from DataStax is Ilya Maykov describing how Cassandra is used at Ooyala for computing multi-dimensional video analytics reports for 100M+ monthly unique users in near-real-time. Scaling Video Analytics with Cassandra with Ilya Maykov ...
Feb 12th
5 notes
7 tags
Cassandra Data Modeling Examples with Matthew F....
Continuing the Cassandra NYC 2011 video series, made available by the folks from DataStax, this week we have Matthew F. Dennis, who covers a couple of different Cassandra data modeling use cases. Cassandra Data Modeling Examples with Matthew F. Dennis To watch more videos from this event, follow the Cassandra NYC 2011 tag.
Feb 11th
4 notes
4 tags
Big Data Search: Perfect Search
Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that: offers a unique architectural approach that significantly reduces the total computations required to query; creates terms and pattern indexes (basically combinations of terms at indexing time); uses jump tables and bloom filters; heavily optimizes disk I/O; doesn’t require indexes in memory; “can often do same query...
Feb 11th
4 notes
1 tag
Hadoop Versions Take 3... Can you follow it?
I’ve just read Hortonworks’s post about the improvements in Hadoop .Next, jumped up and screamed “Super!”: Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode. NextGen MapReduce (aka YARN) –...
Feb 10th
1 tag
Oracle NoSQL Database in Review →
Daniel Abadi in probably the most detailed high level review of the Oracle NoSQL database: Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to tradeoff consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead...
Feb 10th
1 note
2 tags
The Future is Polyglot Persistence
Martin Fowler and Pramod Sadalage in an infographic promoting their upcoming book (PDF): Polyglot persistence will occur over the enterprise as different applications use different data storage technologies. It will also occur within a single application as different parts of an application’s data store have different access characteristics. It has been over 2 years since I’ve begun...
Feb 9th
4 notes
1 tag
MongoDB in Review →
A high-level review of MongoDB by Andrew Glover, with bullet-point pros and cons and a MongoDB scorecard: I’ve spent some time trying to figure out what’s behind these scores, but I’ve had to give up.
Feb 9th
2 notes
3 tags
Tropo and CouchDB: SMS Voting App in 10 Minutes →
Mark Headd: By pairing Tropo with CouchDB and a CouchApp running in IrisCouch, you can have an SMS and phone voting app running entirely in the cloud in about 10 minutes. It should actually take you longer to write up the categories for your voting app than it should to deploy this solution. Code available on GitHub.
Feb 9th
2 notes
1 tag
MapReduce Patterns, Algorithms, and Use Cases →
Ilya Katsov’s post enumerates an extensive set of patterns and algorithms, each accompanied by use cases and pseudocode: counting and summing (log analysis, data querying); collating (inverted indexes, ETL); filtering, parsing, validation (log analysis, data querying, ETL, data validation); distributed task execution (physical and engineering simulations, numerical analysis, performance testing) ...
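To make the first pattern in that list concrete, here is a minimal in-memory sketch of counting and summing, assuming whitespace-separated log records. It is not taken from Katsov's post: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are hypothetical, and a real MapReduce framework would distribute each phase across machines rather than run it in one process.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (term, 1) pair for every whitespace-separated term in a record."""
    for term in record.split():
        yield term, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum all the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce sequentially over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical log lines, used only for illustration.
logs = ["GET /home", "GET /about", "POST /home"]
totals = run_job(logs)
print(totals)
```

The same skeleton covers summing: emit a numeric field (for example, bytes served) instead of the constant 1.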
Feb 9th
5 notes
7 tags
The Couchbase Genealogy
Looks like Matthew Aslett (the451group) had his own version of the Couchbase genealogy: Credit Matt Aslett.
Feb 8th
1 note
4 tags
What other popular paradigms/architectures can...
Interesting answers on Quora, mostly expanding on Krishna Sankar’s short answer: There are two ways one can address large scale computational problems: Task Parallelism: This is where MPI and so forth fit in. Data Parallelism: This is the sweet spot for map/reduce.
Feb 8th
1 note
2 tags
PigEditor: Eclipse plugin for Apache Pig
PigEditor: syntax/errors highlighting; check alias name existence; auto complete keywords, UDF names; outline…
Feb 8th
2 notes
4 tags
Connection Management in MongoDB and CongoMongo →
Are connections pooled or not? Konrad Garus digs to find the answer: Easy. Too easy and comfortable. Coming from the old good and heavy JDBC/SQL I felt uneasy with the connection management. How does it work? Does it just open a connection and leave it dangling in the air the whole time? Might be good for a quick spike in REPL, but not for a real application which needs concurrency, is supposed...
Feb 8th
1 note
2 tags
Redis Pipelining Explained with Ruby Code →
Albert Callarisa Roca demos pipelining in Redis using some basic Ruby code. Remember that saving round trips equals reduced latency => happier users.
Feb 8th
3 notes
5 tags
Hadoop, HBase and R: Will Open Source Software... →
Harish Kotadia: Predictive Analytics has been billed as the next big thing for almost fifteen years, but hasn’t gained mass acceptance so far the way ERP and CRM solutions have. One of the main reason for this is the high upfront investment required in Software, Hardware and Talent for implementing a Predictive Analytics solution. Well, this is about to change – […] Using R, HBase and Hadoop,...
Feb 8th
1 note
1 tag
A Plan for Apache CouchDB: Putting the Apache Back... →
Dave Cottlehuber has a great plan for the Apache CouchDB community to restore confidence and increase mind-share in order to fulfil a great goal: I’d like to see CouchDB as being the enabler in open data, in breaking open the web for joe & jane user, and enabling interoperability of large data sets especially in research and for government. And an independent replication protocol, with a...
Feb 8th
4 tags
The Outer Limits of Data Warehouse Technology →
The story of adopting Hadoop (through Zettaset) at Zions Bancorporation: The quest for a solution began in 2009 with an investigation of Zion’s existing Microsoft and Oracle technologies, as well as other technologies within the firm and new solutions on the market, Wood relates. After developing a list of six potential vendors, he says, he and his team quickly focused on two Hadoop-based...
Feb 8th
3 tags
5 Top Misconceptions about Big Data and Hadoop →
The MapR team analyzes the top 5 misconceptions in the Big Data/Hadoop market: Big Data is not simply about massive amounts of data — petabytes and beyond. Big Data represents a paradigm shift. Since Hadoop is a funny name and somewhat new to people they assume it must be risky. Another misconception about Hadoop, is that it is a batch process. Perhaps the biggest misconception is that Hadoop...
Feb 8th
5 notes
6 tags
Visualizing Hadoop data with Tableau Software and... →
Put together one of the most impressive visualization tools, Tableau Software, with one of the best solutions for big data, Hadoop, and you’ll probably get some astonishing results. Credit Cloudera. While Tableau Software works with structured data only, with this connector it gets access to Hive through HiveQL.
Feb 8th
1 note
4 tags
Hypertable Revival. Still the wrong strategy
After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of revival in the Hypertable space: there are new packages of (commercial) services (PR announcement): Uptime support subscription, Training and certification, Commercial license; it seems like Hypertable has a customer in Rediff.com (India); it...
Feb 8th
2 notes
10 tags
Fulltext search your CouchDB in Ruby →
When having to choose what library to use for full text indexing of CouchDB data for a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian and decided to go with Xapian with Xapit. Besides the fact that Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations: Xapit is still under active...
Feb 8th
4 tags
LevelDB: SSTable and Log Structured Storage →
Ilya Grigorik digs into LevelDB’s SSTable and log structured storage: If Protocol Buffers is the lingua franca of individual data record at Google, then the Sorted String Table (SSTable) is one of the most popular outputs for storing, processing, and exchanging datasets. As the name itself implies, an SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while...
Feb 6th
3 notes
3 tags
The Design of 99designs - A Clean Tens of Millions...
By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I’ve linked and commented on earlier.
Feb 6th
1 note
7 tags
99designs: Powered by Amazon RDS, Redis, MongoDB,... →
While the authoritative storage is Amazon RDS, 99designs is using Redis, MongoDB, and Memcached for transient data: We log errors and statistics to capped collections in MongoDB, providing us with more insight into our system’s performance. Redis captures per-user information about which features are enabled at any given time; it supports our development strategy around dark launches, soft...
Feb 6th
2 notes
3 tags
Data Grid or NoSQL? What are the common points?...
A great post by Olivier Mallassi on a topic that comes up very often: how do data grids and NoSQL databases compare? Data Grids enable you controlling the way data is stored. They all have default implementation (Gigaspaces offers RDBMS by default, Gemfire offers file and disk based storage by default….) but in all cases, you can choose the one that fits your needs: do you need to store data,...
Feb 6th
1 note
3 tags
Redis and Python: Building a Markov-chain IRC bot →
Charles Leifer: As an IRC bot enthusiast and tinkerer, I would like to describe the most enduring and popular bot I’ve written, a markov-chain bot. Markov chains can be used to generate realistic text, and so are great fodder for IRC bots. Redis acts, in many ways, like a big python dictionary that can store several types of useful data structures. For our purposes, we will use the set data...
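The idea in the excerpt can be sketched without a Redis server. This is not Leifer's code: it assumes a first-order chain keyed on a single word, and it uses an in-memory dictionary of Python sets as a stand-in for Redis sets. The helper names `train` and `generate` and the training sentence are hypothetical.

```python
import random
from collections import defaultdict


def train(chain, text):
    """Record, for each word, the set of words seen immediately after it."""
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].add(following)


def generate(chain, start, max_words=10, rng=None):
    """Walk the chain from `start`, picking a random recorded follower each step."""
    rng = rng or random.Random()
    word, out = start, [start]
    # chain.get() avoids creating empty entries for words with no followers.
    while len(out) < max_words and chain.get(word):
        word = rng.choice(sorted(chain[word]))
        out.append(word)
    return " ".join(out)


# Stand-in for Redis: each key maps to a set of follower words.
chain = defaultdict(set)
train(chain, "redis is fast and redis is simple")
sentence = generate(chain, "redis", max_words=6, rng=random.Random(0))
print(sentence)
```

Storing followers in a set drops repeat counts, so every recorded follower is equally likely. Keeping duplicates (a list, or a counter per follower) would weight common transitions more heavily.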
Feb 6th
1 note
6 tags
Calculating a Graph's Degree Distribution Using R... →
Marko Rodriguez is experimenting with R on Hadoop and one of his exercises is calculating a graph’s degree distribution. I confess I had to use Wikipedia to remind myself of the definition of a node degree: The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes. The degree distribution P(k) of a...
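For a small worked instance of that definition, the sketch below computes node degrees and then P(k), taken here as the fraction of nodes with degree k. It is plain single-machine Python rather than the R-on-Hadoop approach in Rodriguez's post, and the function name `degree_distribution` and the edge list are made up for illustration. It assumes an undirected graph given as edge pairs, so isolated nodes do not appear.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as (u, v) edge pairs."""
    degree = Counter()
    for u, v in edges:
        # Each edge contributes one to the degree of both endpoints.
        degree[u] += 1
        degree[v] += 1
    node_count = len(degree)
    nodes_per_degree = Counter(degree.values())
    return {k: count / node_count for k, count in sorted(nodes_per_degree.items())}


# Hypothetical 4-node graph: a has degree 3, b and c have degree 2, d has degree 1.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
dist = degree_distribution(edges)
print(dist)  # {1: 0.25, 2: 0.5, 3: 0.25}
```

The two steps map naturally onto the counting pattern used in MapReduce jobs: first count edges per node, then count nodes per degree.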
Feb 6th
3 notes
5 tags
MongoDB vs MySQL: A DevOps point of view
Pierre Bailet and Mathieu Poumeyrol of fotopedia (a French photo site) share their experience of operating a small MongoDB cluster since Sep. 2009 compared to a MySQL cluster. Some details about fotopedia: fotopedia is 100% on AWS; Amazon RDS for MySQL; a 4-node MongoDB cluster; 150mil. photo views. MongoDB advantages: no alter table; background index creation; data backup & restoration...
Feb 6th
1 note
4 tags
Whirr and Hadoop Quickstart Guide: Automating a... →
Even if most of the examples show Whirr in action on the Amazon cloud, Whirr is cloud-neutral. Bob Gourley uses Whirr to fire up a CDH (Cloudera Distribution of Hadoop) cluster on Rackspace.
Feb 6th
1 note
Using Twitter Storm to analyze the Twitter Stream →
Francisco Jordano briefly introduces the concepts and provides some good resources for learning about Twitter Storm, just to present his experiment of using Twitter Storm for analyzing the Twitter stream (nb: the project is on GitHub): That’s how the information will flow, and the kind of tasks that we will execute. Yes it’s more effective to group some of those tasks, but remember, we just wanted...
Feb 6th
9 notes
5 tags
Research in the MapReduce Space
Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first of them comes from a team at Microsoft and describes TiMR, a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great...
Feb 5th
4 notes
4 tags
Paper: Tenzing A SQL Implementation on the...
This recent paper from a team at Google presents details about Tenzing, a system that is currently in use at Google: Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability,...
Feb 5th
3 notes
8 tags
Paper: TiMR is a Time-oriented data processing...
From the “Temporal Analytics on Big Data for Web Advertising” paper: TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS. Users express time-oriented analytics using a temporal (DSMS) query language such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than...
Feb 5th
3 notes
4 tags
Hadoop and NoSQL in a Big Data Environment with...
Ron Bodkin, interviewed by Michael Floyd over at InfoQ, describes the growing Hadoop addiction: People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses I just mentioned, where the goal is to take some of the pressure off the data warehouse, start to be able to process less structured data more effectively and to be able to do...
Feb 5th
3 notes
7 tags
Cassandra at SocialFlow with Drew Robb - Powered...
To alternate a bit after yesterday’s educational CQL: SQL for Cassandra in the Cassandra NYC 2011 video series from DataStax, today’s video is Drew Robb covering Cassandra usage at SocialFlow for capturing real-time data from Twitter and Bit.ly. To watch more videos from this event, follow the Cassandra NYC 2011 tag.
Feb 5th
1 note