


Big Data: All content tagged as Big Data in NoSQL databases and polyglot persistence

4 Database Technologies for Large Scale Data

Park Kieun (CUBRID Cluster Architect) gives an introduction to 4 large scale database technologies:

  • Massively Parallel Processing (MPP) or parallel DBMS – A system that parallelizes the query execution of a DBMS, and splits queries and allocates them to multiple DBMS nodes in order to process massive amounts of data concurrently.

Examples: EBay DW, Yahoo! Everest Architecture, Greenplum, AsterData
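The MPP idea above can be sketched as a scatter-gather: a coordinator fans the same aggregate out to every partition and merges the partial results. This is only an illustration (the data, shard layout, and function names are made up, not any vendor's implementation):

```python
# Illustrative sketch of MPP-style query execution: each "node" computes
# a partial aggregate over its own shard, and a coordinator merges them.

def node_partial_sum(partition):
    """Each node computes a partial aggregate over its own data shard."""
    return sum(row["amount"] for row in partition)

def coordinator_sum(shards):
    """The coordinator fans the query out and merges the partial results."""
    partials = [node_partial_sum(shard) for shard in shards]
    return sum(partials)

# Rows hash-partitioned across three hypothetical nodes.
shards = [
    [{"amount": 10}, {"amount": 20}],
    [{"amount": 5}],
    [{"amount": 7}, {"amount": 8}],
]
print(coordinator_sum(shards))  # 50
```

In a real MPP system the partial aggregation runs in parallel on separate machines; the point here is only the split-then-merge shape of the query plan.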

  • Column-oriented database – A system that stores the values of the same field together as a column, as opposed to the conventional row-oriented method that stores them as individual records.

Examples: Vertica, Sybase IQ, MonetDB
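To make the row-versus-column distinction concrete, here is a toy sketch (the table and field names are hypothetical) showing how a column layout lets an aggregate touch only the one field it needs:

```python
# Illustrative sketch: the same table in row-oriented and column-oriented
# layouts. Scanning one field reads one contiguous column instead of
# touching every record.

rows = [
    {"id": 1, "city": "Seoul", "sales": 100},
    {"id": 2, "city": "Tokyo", "sales": 250},
    {"id": 3, "city": "Seoul", "sales": 175},
]

# Column-oriented: one list per field.
columns = {
    "id":    [r["id"] for r in rows],
    "city":  [r["city"] for r in rows],
    "sales": [r["sales"] for r in rows],
}

# Aggregating "sales" only needs the "sales" column.
print(sum(columns["sales"]))  # 525
```

Real column stores add compression and vectorized execution on top of this layout, but the I/O advantage for analytic scans comes from exactly this separation of fields.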

  • Streaming processing (ESP or CEP) – A system that processes a continuous stream of data (or events), or a concept in which the content of a database is continuously changing over time.

Examples: Truviso
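A minimal sketch of the streaming idea (not Truviso's actual model): instead of querying data at rest, a continuous query keeps a sliding window over the most recent events and re-evaluates as each one arrives.

```python
from collections import deque

# Illustrative sketch of stream processing: only the window is kept in
# memory, not the whole stream, and the "query result" updates per event.

def windowed_averages(events, window=3):
    buf = deque(maxlen=window)  # bounded buffer: old events fall off
    for value in events:
        buf.append(value)
        yield sum(buf) / len(buf)  # continuous query result

print(list(windowed_averages([10, 20, 30, 40], window=3)))
# [10.0, 15.0, 20.0, 30.0]
```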

  • Key-value storage (with MapReduce programming model) – A storage system that focuses on enhancing the performance when reading a single record by adopting the key-value data model, which is simpler than the relational data model.

Examples: many of the NoSQL databases covered here.
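The pairing of the key-value model with MapReduce can be sketched in a few lines (a toy word count, with hypothetical data): the store itself is just get/put by key, and MapReduce layers batch computation over those records in three phases.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce programming model over key-value
# records: map (emit key/value pairs), shuffle (group by key), reduce
# (fold each group).

store = {"doc1": "big data big", "doc2": "data store"}

def map_phase(records):
    for _, text in records.items():
        for word in text.split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

print(reduce_phase(shuffle(map_phase(store))))
# {'big': 2, 'data': 2, 'store': 1}
```

In a real system like Hadoop, the map and reduce phases run distributed across the cluster and the shuffle moves data over the network; the data flow, though, is exactly this.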

Even though I came up with the same 5 categories for scalable storage solutions, Park’s list is better documented. However, we both left out distributed filesystems (sorry, Jeff).

Original title and link: 4 Database Technologies for Large Scale Data (NoSQL databases © myNoSQL)


Data Analysis Tools Survey Results

I’ve always wondered what tools data scientists use to dig useful information out of big data and create beauty out of it.

Szilard Pafka[1] has run a survey about the tools used by data scientists, and he presents an overview of the results in the video embedded below.


As I’ve learned over time, the R language is the preferred data analysis tool, and the survey confirms it. But what surprised me was seeing Excel come in second place. Python and Unix shell tools follow SAS to complete the top five.


  1. Szilard Pafka: founder and organizer of the Los Angeles R user group  

Original title and link: Data Analysis Tools Survey Results (NoSQL databases © myNoSQL)

Linked Open Data Star Scheme

While writing quite a bit lately about big data marketplaces, I thought it would be worth mentioning Tim Berners-Lee’s 5-star deployment scheme for Linked Open Data:

  1. make your stuff available on the Web (whatever format) under an open license
  2. make it available as structured data (e.g., Excel instead of image scan of a table)
  3. use non-proprietary formats (e.g., CSV instead of Excel)
  4. use URIs to identify things, so that people can point at your stuff
  5. link your data to other data to provide context
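As a small sketch of steps 3 and 4 (the dataset and domain are hypothetical), publishing in a non-proprietary format like CSV while identifying each thing with a URI already makes the data linkable from elsewhere:

```python
import csv
import io

# Illustrative sketch: 3-star data (CSV, non-proprietary) upgraded toward
# 4 stars by giving each row a URI identifier others can point at.

records = [
    ("http://example.org/city/seoul", "Seoul", 9700000),
    ("http://example.org/city/tokyo", "Tokyo", 13900000),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["uri", "name", "population"])  # URIs make rows linkable
writer.writerows(records)
print(buf.getvalue())
```

The fifth star would then come from those URIs referencing (and being referenced by) other datasets, which is what turns open data into *linked* open data.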



See Tim Berners-Lee talking about the star scheme at the Gov 2.0 Expo:

Who Needs Big Data Marketplaces?

Who could start benefiting right away from big data marketplaces:

Scientists are wasting much of the data they are creating. Worldwide computing capacity grew at 58 percent every year from 1986 to 2007, and people sent almost two quadrillion megabytes of data to one another, according to a study published on Thursday in Science. But scientists are losing a lot of the data, say researchers in a wide range of disciplines.

It’s also kind of scary to know that we cannot find enough funding to address such fundamental needs that can change the face of humanity.

Original title and link: Who Needs Big Data Marketplaces? (NoSQL databases © myNoSQL)


Current and Future Big Data Warehouse

  • Custom-built big data frameworks, like Teradata and VLDB implementations from Oracle, that are proprietary frameworks designed to deal with large datasets. These frameworks are still very relational in orientation and are not designed to work with unstructured data sets.
  • Data Warehouse Appliances like Oracle’s Exadata. This introduces the concept of DW-in-a-box where the entire framework needed for a typical DW implementation (the Hardware, Software Framework in terms of data store and Advanced Analytical tools) are all vertically integrated and provided by the same vendor as a packaged solution.
  • Open Source NoSQL-oriented Big Data Frameworks such as Hadoop and Cassandra. These frameworks implement advanced analytical and mining algorithms such as Map/Reduce and are designed to be installed on commodity hardware for an MPP architecture with huge Master/Slave clusters. They are very good at dealing with vast amounts of unstructured, text-oriented information.
  • Commercial Big Data Frameworks like AsterData and GreenPlum, which follow the same paradigm of MPP infrastructures but have implemented their own add-ons such as SQL-MR and other optimizations for faster analytics.

A good list augmenting and detailing the 5 approaches to scalable storage solutions covered before.

Original title and link: Current and Future Big Data Warehouse (NoSQL databases © myNoSQL)


What Does Big Data Mean to Infrastructure Professionals?

  1. Big data means the amount of data you’re working with today will look trivial within five years.
  2. Huge amounts of data will be kept longer and have way more value than today’s archived data.
  3. Business people will covet a new breed of alpha geeks. You will need new skills around data science, new types of programming, more math and statistics skills and data hackers…lots of data hackers.
  4. You are going to have to develop new techniques to access, secure, move, analyze, process, visualize and enhance data; in near real time.
  5. You will be minimizing data movement wherever possible by moving function to the data instead of data to function. You will be leveraging or inventing specialized capabilities to do certain types of processing, e.g. early recognition of images or content types, so you can do some processing close to the head.
  6. The cloud will become the compute and storage platform for big data which will be populated by mobile devices and social networks.
  7. Metadata management will become increasingly important.
  8. You will have opportunities to separate data from applications and create new data products.
  9. You will need orders of magnitude cheaper infrastructure that emphasizes bandwidth (not IOPS), data movement, and efficient metadata management.
  10. You will realize sooner or later that data and your ability to exploit it is going to change your business, social and personal life; permanently.

Make sure you also check the 10 big data realities in the same post.

Original title and link: What Does Big Data Mean to Infrastructure Professionals? (NoSQL databases © myNoSQL)


Two Definitions for Big Data

I’m not sure I get the rest of the post, but I really liked these two definitions of big data:

Big data means nothing. It’s a well meaning term for (literally) big piles of data, sitting in various massive balls of infrastructure, randomly scattered around our enterprise. More common terms include data warehouses or decision support systems, etc.


Big data is created by copying transactional data and sticking it on another system. We copy ALL our transactional data and stick it on these systems. Over time, those systems become supersets of our transactional systems. We make lots of copies and put them in lots of big data systems.

Original title and link: Two Definitions for Big Data (NoSQL databases © myNoSQL)

Big Data Flow in Large Hadron Collider Grid

Summarizing the attempt to trace pieces of data in the Large Hadron Collider Grid:

While it’s true that files are given dates and times, figuring out where a packet of data came from is a lot more difficult than just reading the file name. Data are copied, split, copied again and deleted. Bits and pieces travel all over the Grid, often without any input from a human user.

Sounds in a way similar to what I’ve described as an ideal big data processing engine.

Original title and link: Big Data Flow in Large Hadron Collider Grid (NoSQL databases © myNoSQL)


Big Data Marketplaces and Data Privacy

To clarify: our goal was to map the nodes in the training dataset to the real identities in the social network that was used to create the data. […]

We were able to deanonymize about 80% of the nodes, including the vast majority of the high-degree nodes (both in- and out-degree.) We’re not sure what the overall error rate is, but for the high-degree nodes it is essentially zero.
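The actual attack is far more sophisticated than this, but the intuition behind matching "high-degree nodes" can be sketched: structural features like a node's in-degree and out-degree act as fingerprints, and nodes whose signature is unique in both graphs can be matched directly (toy graphs, hypothetical names):

```python
# Illustrative sketch only, not the researchers' method: match anonymized
# nodes to real identities via unique (in-degree, out-degree) signatures.

def degree_signatures(edges):
    """Map each node to its (in-degree, out-degree) pair."""
    sig = {}
    for src, dst in edges:
        sig.setdefault(src, [0, 0])[1] += 1  # out-degree of src
        sig.setdefault(dst, [0, 0])[0] += 1  # in-degree of dst
    return {node: tuple(d) for node, d in sig.items()}

def match_unique(anon_edges, real_edges):
    anon = degree_signatures(anon_edges)
    real = degree_signatures(real_edges)
    # Invert the real graph's signatures, keeping only unambiguous ones.
    by_sig = {}
    for node, s in real.items():
        by_sig.setdefault(s, []).append(node)
    unique = {s: nodes[0] for s, nodes in by_sig.items() if len(nodes) == 1}
    return {a: unique[s] for a, s in anon.items() if s in unique}

anon_edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]
real_edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
print(match_unique(anon_edges, real_edges))
# {'n1': 'alice', 'n2': 'bob', 'n3': 'carol'}
```

High-degree nodes are especially vulnerable because their signatures are almost always distinctive, which is why the error rate for them was "essentially zero."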

We can go back to my questions: who will decide, regulate, and guarantee the level of privacy for data sets traded on the big data market?


Original title and link: Big Data Marketplaces and Data Privacy (NoSQL databases © myNoSQL)


HBase with Trillions of Rows

Interesting question and answer on HBase mailing list:

[…] is it feasible to use HBase table in “read-mostly” mode with trillions of rows, each contains small structured record (~200 bytes, ~15 fields). Does anybody know a successful case when tables with such number of rows are used with HBase?

My follow up questions:

  • where is that data currently stored?
  • how will you migrate it?
  • if this is just what you estimate you’ll get, how soon will you reach these numbers?
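A back-of-envelope sizing helps frame the question. The record size comes from the thread; the HDFS replication factor is an assumption (the usual default), not a figure from the discussion:

```python
# Rough sizing for "trillions of rows, ~200 bytes each" in HBase/HDFS.
# Replication factor 3 is the HDFS default, assumed here.

rows = 1_000_000_000_000   # one trillion rows (lower bound of "trillions")
record_bytes = 200         # ~200 bytes per record, per the question
hdfs_replication = 3       # assumed HDFS default replication factor

raw_tb = rows * record_bytes / 10**12
on_disk_tb = raw_tb * hdfs_replication

print(f"raw data: {raw_tb:.0f} TB")                      # 200 TB
print(f"on disk (x{hdfs_replication} replication): {on_disk_tb:.0f} TB")  # 600 TB
```

So even at the single-trillion mark, and before HBase's own per-cell overhead, this is a multi-hundred-terabyte cluster, which is why the migration and growth-rate questions above matter.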

Original title and link: HBase with Trillions of Rows (NoSQL databases © myNoSQL)

Big Data Analysis at BackType

RWW has a nice post diving into the data flow and the tools used by BackType, a company with only 3 engineers, to manage and analyze large amounts of data.

They’ve invented their own language, Cascalog, to make analysis easy, and their own database, ElephantDB, to simplify delivering the results of their analysis to users. They’ve even written a system to update traditional batch processing of massive data sets with new information in near real-time.

Some highlights:

  • 25 terabytes of compressed binary data, over 100 billion individual records
  • all services and data storage are on Amazon S3 and EC2
  • between 60 and 150 EC2 instances servicing an average of 400 requests/s
  • Clojure and Python as platform languages
  • Hadoop, Cascading and Cascalog are central pieces of BackType’s platform
  • Cascalog, a Clojure-based query language for Hadoop, was created and open sourced by BackType’s engineer Nathan Marz
  • ElephantDB, the storage solution, is a read-only cluster built on top of BerkeleyDB files
  • Crawlers place data in Gearman queues for processing and storing

BackType data flow is presented in the following diagram:

BackType data flow

Included below is an interview with Nathan about Cascalog:

Via @pharkmillups.

Original title and link: Big Data Analysis at BackType (NoSQL databases © myNoSQL)


Machine and Human Generated Data

Volumes aside[1], why is this classification important?

Judging also by the definitions Curt and Daniel came up with, I still think it’s a useless classification.

  1. I don’t think YouTube’s data (human generated data) comes anywhere near what a single Boeing can produce in a shorter period  

Original title and link: Machine and Human Generated Data (NoSQL databases © myNoSQL)