February 2012
81 posts
2 tags
Possible 100-fold increase in data storage speed →
European researchers may have found a way to speed up data storage 100-fold, breaking one barrier holding back how fast data can be transferred. […] The researchers at York University in the U.K. and Nijmegen University in the Netherlands accomplished the feat by heating a magnetic material with laser bursts that alter what is called the magnetic spin of the material at the atomic level,...
3 tags
Everything is Big Data Now... But Don't let...
Peter Collingridge for Jenn Webb in Book marketing is broken. Big data can fix it on O’Reilly Radar:
But when you’re in a much faster-paced world, with the industry moving toward being consumer- rather than trade-facing, and with a fragmented retail and media landscape, you need to make decisions based on fact: What is the ROI on a £50,000 marketing campaign? Where do my banner ads have...
5 tags
Polyglot persistence at Pinterest: Redis, Membase,...
I’ve created the diagram above based on a brief answer on Quora:
We use python + heavily-modified Django at the application layer. Tornado and (very selectively) node.js as web-servers. Memcached and membase / redis for object- and logical-caching, respectively. RabbitMQ as a message queue. Nginx, HAproxy and Varnish for static-delivery and load-balancing. Persistent data storage using...
4 tags
Cassandra and MongoDB with Gigaspaces Cloudify
There are two reasons I’m writing about Gigaspaces’s Cloudify (PR announcement):
Besides MySQL, Cloudify recipes include Cassandra and MongoDB.
Also a bit of vintage claim chowder: if you remember Mike Gualtieri’s (Forrester) NoSQL wants to be elastic caching when it grows up, this should be clear proof that he was wrong.
Gigaspaces is starting to realize that it’s not really necessary to...
6 tags
Hosted and Managed NoSQL: Cassandra, Redis,...
In the last few days I’ve read about some new NoSQL hosting solutions:
Cassandra: managed hardware & software hosting:
Per node:
Intel Dual Quad-core (8 cpu’s), 16gb of memory, 2tb primary storage + 500gb commitlog drive
5 public ip addresses, 1000Mbps private network port.
Debian, CentOS, RedHat or FreeBSD
Cassandra setup, configuration and ongoing maintenance (repairs, cleanups,...
3 tags
High Performance Rails Caching with Redis and... →
here at rapidrabbit we deliver many 1,000 requests per second. doing this while only using a handful of servers and ruby on rails we employ very clever caching using redis and nginx. once the cache is written, it is directly accessed by nginx via a module, which makes it around 500-2,000 times faster than any rails controller.
Bypassing the slowest component in your stack by using caching...
1 tag
ZooKeeper 3.4.3 First Beta Quality in the 3.4...
After 3 alpha versions in the 3.4 series, ZooKeeper 3.4.3, announced the other day, is the first to be considered a beta release—it’s not production ready, but more serious bugs have been fixed. Being such a critical component of the Hadoop ecosystem it is essential for ZooKeeper to go through extensive testing before being declared production ready.
Besides the release notes, both Hortonworks and...
1 tag
Trivialization of the Big Data term
Most teams in small businesses are required to manage a never-ending stream of changing schedules, shifting priorities, and adjustments to resource allocation, which result in a massive amount of updates to their project portfolio. How can smaller organizations leverage big data to gain visibility, control and predictability over their work?
Seriously?
Original title and link: Trivialization...
3 tags
Step-by-Step Guide to Amazon DynamoDB for .NET... →
This tutorial is meant for the .NET developers to get started with Amazon DynamoDB. I will show you how to create a Table and perform CRUD operations on it. Amazon DynamoDB provides a low-level API and an Object Persistence API for the .NET developers. In this tutorial, we will see how to use the Object Persistence API to talk to Amazon DynamoDB. We will model a component that represents a DVD...
3 tags
Arya a MongoDB based Search Engine →
The system is currently hard coded with one tokenizer and one analyzer. This can easily be changed. The searcher returns the document and the score it received but not where the term is, or any information on how to ‘highlight’ the result. This is doable by adding in the required information into the match embedded document and processing it out in the Map Reduce phase. There is no query caching...
1 tag
It's a revolution - The Impact of Big Data in the... →
Gary King, director of Harvard’s Institute for Quantitative Social Science for The New York Times:
“It’s a revolution. We’re really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.”
Original title and link: It’s a revolution - The...
12 tags
The components and their functions in the Hadoop...
Edd Dumbill enumerates the various components of the Hadoop ecosystem:
My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, with the exception of Ambari and HCatalog, which were released later.
Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)
2 tags
What types of applications might a graph database...
Found this list of use cases for graph databases in a follow-up to a Neo4j webinar:
Social networks
Collaboration programs
Configuration Management
Geo-Spatial applications
Impact Analysis
Master Data Management
Network Management
Product Line Management
Recommendation Engines
The more generic answer would be that graph databases can be a great fit for problems handling highly connected...
5 tags
How Web giants store big data →
A not very technical Ars Technica overview of the storage engines developed and used by Google (Google File System, BigTable), Amazon (Dynamo), Microsoft (Azure DFS), plus the Hadoop Distributed File System (HDFS).
1 tag
The document is the single source of truth →
Paul Hammant:
When it comes to data storage the obvious conclusion is that the backend should save something pretty close to the document that the client presents, mutates, and sends back to the server for posterity. […] Use a document store instead. When would you use a normalized DB design today? The answer to that is: only when you have other processes reading and writing to your database.
...
7 tags
Scaling Video Analytics with Cassandra by Ilya...
To keep with last week’s model (an educational video about Cassandra, followed by a Cassandra case study), today’s video in the Cassandra NYC 2011 video series from DataStax has Ilya Maykov describing how Cassandra is used at Ooyala for computing multi-dimensional video analytics reports for 100M+ monthly unique users in near-real-time.
Scaling Video Analytics with Cassandra with Ilya Maykov
...
7 tags
Cassandra Data Modeling Examples with Matthew F....
Continuing the Cassandra NYC 2011 video series, made available by the folks from DataStax, this week we have Matthew F. Dennis, who covers a couple of different Cassandra data modeling use cases.
Cassandra Data Modeling Examples with Matthew F. Dennis
To watch more videos from this event, follow the Cassandra NYC 2011 tag.
4 tags
Big Data Search: Perfect Search
Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:
offers a unique architectural approach that significantly reduces the total computations required to query
creates terms and pattern indexes (basically combinations of terms at indexing time)
uses jump tables and bloom filters
heavily optimizes disk I/O
doesn’t require indexes in memory
“can often do same query...
1 tag
Hadoop Versions Take 3... Can you follow it?
I’ve just read Hortonworks’s post about the improvements in Hadoop .Next, jumped up and screamed “Super!”:
Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
NextGen MapReduce (aka YARN) –...
1 tag
Oracle NoSQL Database in Review →
Daniel Abadi in probably the most detailed high level review of the Oracle NoSQL database:
Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to tradeoff consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead...
2 tags
The Future is Polyglot Persistence
Martin Fowler and Pramod Sadalage in an infographic promoting their upcoming book (PDF):
Polyglot persistence will occur over the enterprise as different applications use different data storage technologies. It will also occur within a single application as different parts of an application’s data store have different access characteristics.
It has been over 2 years since I’ve begun...
1 tag
MongoDB in Review →
A high level review of MongoDB by Andrew Glover with a bullet point pros and cons
and a MongoDB scorecard:
I’ve spent some time trying to figure out what’s behind these scores, but I’ve had to give up.
3 tags
Tropo and CouchDB: SMS Voting App in 10 Minutes →
Mark Headd:
By pairing Tropo with CouchDB and a CouchApp running in IrisCouch, you can have an SMS and phone voting app running entirely in the cloud in about 10 minutes. It should actually take you longer to write up the categories for your voting app than it should to deploy this solution.
Code available on GitHub.
1 tag
MapReduce Patterns, Algorithms, and Use Cases →
Ilya Katsov’s post enumerates an extensive set of patterns and algorithms each accompanied by use cases and pseudocode:
counting and summing (log analysis, data querying)
collating (inverted indexes, ETL)
filtering, parsing, validation (log analysis, data querying, ETL, data validation)
distributed task execution (physical and engineering simulations, numerical analysis, performance testing)
...
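The first pattern in the list, counting and summing, can be sketched in plain Python with no Hadoop dependency. This is only an illustration under my own assumptions: the function names (`map_phase`, `shuffle`, `reduce_phase`), the `path status` log format, and the sample lines are made up here and are not taken from Ilya Katsov’s post.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (status, 1) pair for one log line (hypothetical 'path status' format)."""
    _path, status = line.split()
    yield status, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum all the counts emitted for one key."""
    return key, sum(values)


def count_by_status(lines):
    """Run map, shuffle, and reduce in-process over a list of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up sample input
LOG_LINES = ["/index 200", "/about 200", "/missing 404", "/index 200"]
```

Calling `count_by_status(LOG_LINES)` groups the emitted pairs by status code and sums them. In a real cluster the map and reduce functions would run on different machines and the shuffle would be done by the framework.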
7 tags
The Couchbase Genealogy
Looks like Matthew Aslett (the451group) had his own version of the Couchbase genealogy:
Credit Matt Aslett.
4 tags
What other popular paradigms/architectures can...
Interesting answers on Quora mostly expanding on Krishna Sankar’s short answer:
There are two ways one can address large scale computational problems:
Task Parallelism : This is where MPI and so forth fit in
Data Parallelism : This is the sweet spot for map/reduce
2 tags
PigEditor: Eclipse plugin for Apache Pig
PigEditor:
syntax/errors highlighting
check alias name existence
auto complete keywords, UDF names
outline…
4 tags
Connection Management in MongoDB and CongoMongo →
Are connections pooled or not? Konrad Garus digs to find the answer:
Easy. Too easy and comfortable. Coming from the old good and heavy JDBC/SQL I felt uneasy with the connection management. How does it work? Does it just open a connection and leave it dangling in the air the whole time? Might be good for a quick spike in REPL, but not for a real application which needs concurrency, is supposed...
2 tags
Redis Pipelining Explained with Ruby Code →
Albert Callarisa Roca demos pipelining in Redis using some basic Ruby code. Remember that saving round trips means reduced latency, which means happier users.
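To make the round-trip argument concrete, here is a minimal sketch in Python rather than the Ruby of the linked post. It does not talk to a real Redis server: `FakeRedisConnection` is a hypothetical in-memory stand-in that only counts how many times the client "goes over the network", and it understands just `SET` and `GET`.

```python
class FakeRedisConnection:
    """Toy in-memory stand-in for a Redis connection (hypothetical, not a real client).

    Each call to send() models one network round trip carrying one or more commands.
    """

    def __init__(self):
        self.store = {}
        self.round_trips = 0

    def send(self, commands):
        self.round_trips += 1
        replies = []
        for name, key, *args in commands:
            if name == "SET":
                self.store[key] = args[0]
                replies.append("OK")
            elif name == "GET":
                replies.append(self.store.get(key))
            else:
                raise ValueError(f"unsupported command: {name}")
        return replies


def run_unpipelined(conn, commands):
    """Send each command on its own: one round trip per command."""
    return [conn.send([command])[0] for command in commands]


def run_pipelined(conn, commands):
    """Buffer all commands and send them together: a single round trip."""
    return conn.send(list(commands))


COMMANDS = [("SET", "a", "1"), ("SET", "b", "2"), ("GET", "a"), ("GET", "b")]
```

Both functions return the same replies, but the unpipelined version pays one round trip per command while the pipelined one pays a single round trip for the whole batch. Real clients usually expose the same idea as a pipeline object, for example `pipeline()` and `execute()` in redis-py.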
5 tags
Hadoop, HBase and R: Will Open Source Software... →
Harish Kotadia:
Predictive Analytics has been billed as the next big thing for almost fifteen years, but hasn’t gained mass acceptance so far the way ERP and CRM solutions have. One of the main reason for this is the high upfront investment required in Software, Hardware and Talent for implementing a Predictive Analytics solution.
Well, this is about to change – […] Using R, HBase and Hadoop,...
1 tag
A Plan for Apache CouchDB: Putting the Apache Back... →
Dave Cottlehuber has a great plan for the Apache CouchDB community to restore confidence and increase mind-share in order to fulfil a great goal:
I’d like to see CouchDB as being the enabler in open data, in breaking open the web for joe & jane user, and enabling interoperability of large data sets especially in research and for government. And an independent replication protocol, with a...
4 tags
The Outer Limits of Data Warehouse Technology →
The story of adopting Hadoop (through Zettaset) at Zions Bancorporation:
The quest for a solution began in 2009 with an investigation of Zion’s existing Microsoft and Oracle technologies, as well as other technologies within the firm and new solutions on the market, Wood relates. After developing a list of six potential vendors, he says, he and his team quickly focused on two Hadoop-based...
3 tags
5 Top Misconceptions about Big Data and Hadoop →
The MapR team analyzes the top 5 misconceptions in the Big Data/Hadoop market:
Big Data is not simply about massive amounts of data — petabytes and beyond. Big Data represents a paradigm shift.
Since Hadoop is a funny name and somewhat new to people they assume it must be risky.
Another misconception about Hadoop, is that it is a batch process.
Perhaps the biggest misconception is that Hadoop...
6 tags
Visualizing Hadoop data with Tableau Software and... →
Put together one of the most impressive visualization tools, Tableau Software, with one of the best solutions for big data, Hadoop, and you’ll probably get some astonishing results.
Credit Cloudera.
While Tableau Software works with structured data only, with this connector it gets access to Hive through HiveQL.
4 tags
Hypertable Revival. Still the wrong strategy
After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:
there are new packages of (commercial) services (PR announcement):
Uptime support subscription
Training and certification
Commercial license
it seems like Hypertable has a customer in Rediff.com (India)
it...
10 tags
Fulltext search your CouchDB in Ruby →
When having to choose what library to use for full text indexing of CouchDB data for a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian and decided to go with Xapian with Xapit. Besides the fact that Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations:
Xapit is still under active...
4 tags
LevelDB: SSTable and Log Structured Storage →
Ilya Grigorik digs into LevelDB’s SSTable and log structured storage:
If Protocol Buffers is the lingua franca of individual data record at Google, then the Sorted String Table (SSTable) is one of the most popular outputs for storing, processing, and exchanging datasets. As the name itself implies, an SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while...
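The core idea of a sorted table of key-value pairs can be sketched in a few lines of Python. This is a toy, in-memory illustration under my own assumptions, not LevelDB’s on-disk format: the class name `SSTable` and its `get` and `scan` methods are invented here, and a real SSTable is an immutable file with its own index rather than two Python lists.

```python
from bisect import bisect_left


class SSTable:
    """Immutable, in-memory toy version of a sorted string table."""

    def __init__(self, items):
        # Sort once at build time; the table is never modified afterwards.
        pairs = sorted(items.items())
        self._keys = [key for key, _ in pairs]
        self._values = [value for _, value in pairs]

    def get(self, key):
        """Point lookup by binary search over the sorted keys."""
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1
```

Because the keys are kept sorted, a point lookup is a binary search and a range scan is a sequential read from the first matching position.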
3 tags
The Design of 99designs - A Clean Tens of Millions...
By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I’ve linked and commented on earlier.
7 tags
99designs: Powered by Amazon RDS, Redis, MongoDB,... →
While the authoritative storage is Amazon RDS, 99designs is using Redis, MongoDB, and Memcached for transient data:
We log errors and statistics to capped collections in MongoDB, providing us with more insight into our system’s performance. Redis captures per-user information about which features are enabled at any given time; it supports our development strategy around dark launches, soft...
3 tags
Data Grid or NoSQL? What are the common points?...
A great post by Olivier Mallassi on a topic that comes up very often: how do data grids and NoSQL databases compare?
Data Grids enable you controlling the way data is stored. They all have default implementation (Gigaspaces offers RDBMS by default, Gemfire offers file and disk based storage by default….) but in all cases, you can choose the one that fits your needs: do you need to store data,...
3 tags
Redis and Python: Building a Markov-chain IRC bot →
Charles Leifer:
As an IRC bot enthusiast and tinkerer, I would like to describe the most enduring and popular bot I’ve written, a markov-chain bot. Markov chains can be used to generate realistic text, and so are great fodder for IRC bots.
Redis acts, in many ways, like a big python dictionary that can store several types of useful data structures. For our purposes, we will use the set data...
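As a rough illustration of the idea, the sketch below builds a Markov chain in plain Python, using a dictionary of sets in place of Redis. It is not Charles Leifer’s code: the function names `train` and `generate`, the chain order of 2, and the sample sentence are my own assumptions.

```python
import random
from collections import defaultdict


def train(chain, text, order=2):
    """Record, for every run of `order` words, the set of words seen right after it."""
    words = text.split()
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].add(words[i + order])


def generate(chain, seed, max_words=20, rng=random):
    """Extend `seed` by repeatedly picking a random follower of the last words."""
    out = list(seed)
    order = len(seed)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break  # dead end: this word sequence was never followed by anything
        out.append(rng.choice(sorted(followers)))
    return " ".join(out)


# A plain dict of sets stands in for a Redis instance holding one set per key.
chain = defaultdict(set)
train(chain, "the cat sat on the mat")
```

With more training text, each key accumulates several possible followers and the generated sentences start to vary from run to run.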
6 tags
Calculating a Graph's Degree Distribution Using R... →
Marko Rodriguez is experimenting with R on Hadoop and one of his exercises is calculating a graph’s degree distribution. I confess I had to use Wikipedia to remind myself of the definition of a node’s degree:
The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes. The degree distribution P(k) of a...
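For a small graph the computation fits in a few lines of Python. This sketch is not Marko Rodriguez’s R on Hadoop code. It assumes an undirected graph given as a list of edge pairs, uses the usual definition of P(k) as the fraction of nodes with degree k, and the function name and sample edges are made up.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as (u, v) edge pairs.

    Nodes that appear in no edge are not counted, since they cannot be
    discovered from an edge list alone.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    histogram = Counter(degree.values())  # degree k -> number of nodes with that degree
    node_count = len(degree)
    return {k: count / node_count for k, count in sorted(histogram.items())}


# Made-up sample graph: a has degree 3, b and c have degree 2, d has degree 1
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
```

The first loop (count edges per node) and the second step (count nodes per degree) are both counting passes, which is why the computation maps naturally onto two MapReduce jobs on a large graph.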
5 tags
MongoDB vs MySQL: A DevOps point of view
Pierre Bailet and Mathieu Poumeyrol of fotopedia (a French photo site) share their experience of operating a small MongoDB cluster since Sep. 2009 compared to a MySQL cluster.
Some details about fotopedia:
fotopedia is 100% on AWS
Amazon RDS for MySQL
4 nodes MongoDB cluster
150mil. photo views
MongoDB advantages:
no alter table
background index creation
data backup & restoration...
4 tags
Whirr and Hadoop Quickstart Guide: Automating a... →
Even if most of the examples show Whirr in action on the Amazon cloud, Whirr is cloud-neutral. Bob Gourley uses Whirr to fire up a CDH (Cloudera Distribution of Hadoop) cluster on Rackspace.
Using Twitter Storm to analyze the Twitter Stream →
Francisco Jordano briefly introduces the concepts and provides some good resources for learning about Twitter Storm, then presents his experiment of using Twitter Storm for analyzing the Twitter stream (nb: the project is on GitHub):
That’s how the information will flow, and the kind of tasks that we will execute. Yes it’s more effective to group some of those tasks, but remember, we just wanted...
5 tags
Research in the MapReduce Space
Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first comes from a team at Microsoft and describes TiMR, a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great...
4 tags
Paper: Tenzing A SQL Implementation on the...
This recent paper from a team at Google presents details about Tenzing, a system that is currently in use at Google:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability,...
8 tags
Paper: TiMR is a Time-oriented data processing...
From the “Temporal Analytics on Big Data for Web Advertising” paper:
TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS. Users express time-oriented analytics using a temporal (DSMS) query language such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than...
4 tags
Hadoop and NoSQL in a Big Data Environment with...
Ron Bodkin, interviewed by Michael Floyd for InfoQ, describes the growing Hadoop addiction:
People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses I just mentioned, where the goal is to take some of the pressure off the data warehouse, start to be able to process less structured data more effectively and to be able to do...
7 tags
Cassandra at SocialFlow with Drew Robb - Powered...
To alternate a bit after yesterday’s educational CQL: SQL for Cassandra in the Cassandra NYC 2011 video series from DataStax, today’s video is Drew Robb covering Cassandra usage at SocialFlow for capturing real-time data from Twitter and Bit.ly.
To watch more videos from this event, follow the Cassandra NYC 2011 tag.