Big Data: All content tagged as Big Data in NoSQL databases and polyglot persistence

Linked Open Data Star Scheme

Having written quite a bit lately about Big Data marketplaces, I thought it would be worth mentioning Tim Berners-Lee's 5-star deployment scheme for Linked Open Data:

  1. make your stuff available on the Web (whatever format) under an open license
  2. make it available as structured data (e.g., Excel instead of image scan of a table)
  3. use non-proprietary formats (e.g., CSV instead of Excel)
  4. use URIs to identify things, so that people can point at your stuff
  5. link your data to other data to provide context

Linked Open Data Star Scheme

Credit lab.linkeddata.deri.ie

See Tim Berners-Lee talking about the star scheme at the Gov 2.0 Expo:


Who Needs Big Data Marketplaces?

Who could start benefiting right away from big data marketplaces:

Scientists are wasting much of the data they are creating. Worldwide computing capacity grew at 58 percent every year from 1986 to 2007, and people sent almost two quadrillion megabytes of data to one another, according to a study published on Thursday in Science. But scientists are losing a lot of the data, say researchers in a wide range of disciplines.

It’s also kind of scary to know that we cannot find enough funding to address such fundamental needs that can change the face of humanity.

Original title and link: Who Needs Big Data Marketplaces? (NoSQL databases © myNoSQL)

via: http://chronicle.com/article/Dumped-On-by-Data-Scientists/126324/


Current and Future Big Data Warehouse

  • Custom build BigData frameworks like Teradata and VLDB implementations from Oracle that are proprietary frameworks designed to deal with large datasets. These frameworks are still very relational in orientation and are not designed to work with unstructured data sets.
  • Data Warehouse Appliances like Oracle’s Exadata. This introduces the concept of DW-in-a-box where the entire framework needed for a typical DW implementation (the Hardware, Software Framework in terms of data store and Advanced Analytical tools) are all vertically integrated and provided by the same vendor as a packaged solution.
  • Open Source NoSQL-oriented Big Data Frameworks such as Hadoop and Cassandra. These frameworks implement advanced analytical and mining algorithms such as Map/Reduce and are designed to be installed on commodity hardware for an MPP architecture with huge Master/Slave clusters. They are very good at dealing with vast amounts of unstructured, text-oriented information.
  • Commercial Big Data Frameworks like AsterData and GreenPlum, which follow the same paradigm of MPP infrastructures but have implemented their own add-ons such as SQL-MR and other optimizations for faster analytics.

A good list to augment/detail these 5 approaches to scalable storage solutions.
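
For readers who have not met Map/Reduce, here is a minimal single-process sketch of the map, group, and reduce phases mentioned in the list. It is not Hadoop's API: the `map_reduce`, `tokenize`, and `count` names and the sample lines are all made up for illustration, and a real framework would spread each phase across a cluster.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run an in-memory map -> group-by-key -> reduce pass (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: each key's values are collapsed into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def tokenize(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count(key, values):
    """Hypothetical reducer: sum the ones emitted for a word."""
    return sum(values)


# Made-up input standing in for unstructured, text-oriented data.
lines = ["big data big clusters", "data on commodity hardware"]
word_counts = map_reduce(lines, tokenize, count)
```

The grouping step in the middle is what a distributed framework turns into the shuffle between mapper and reducer nodes.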

Original title and link: Current and Future Big Data Warehouse (NoSQL databases © myNoSQL)

via: http://www.infogain.com/company/perspective-big-data.jsp


What Does Big Data Mean to Infrastructure Professionals?

  1. Big data means the amount of data you’re working with today will look trivial within five years.
  2. Huge amounts of data will be kept longer and have way more value than today’s archived data.
  3. Business people will covet a new breed of alpha geeks. You will need new skills around data science, new types of programming, more math and statistics skills and data hackers…lots of data hackers.
  4. You are going to have to develop new techniques to access, secure, move, analyze, process, visualize and enhance data; in near real time.
  5. You will be minimizing data movement wherever possible by moving function to the data instead of data to function. You will be leveraging or inventing specialized capabilities to do certain types of processing- e.g. early recognition of images or content types – so you can do some processing close to the head.
  6. The cloud will become the compute and storage platform for big data which will be populated by mobile devices and social networks.
  7. Metadata management will become increasingly important.
  8. You will have opportunities to separate data from applications and create new data products.
  9. You will need orders of magnitude cheaper infrastructure that emphasizes bandwidth, not iops and data movement and efficient metadata management.
  10. You will realize sooner or later that data and your ability to exploit it is going to change your business, social and personal life; permanently.

Make sure you also check the 10 big data realities in the same post.

Original title and link: What Does Big Data Mean to Infrastructure Professionals? (NoSQL databases © myNoSQL)

via: http://wikibon.org/blog/ten-?big-data?-realities-and-what-they-mean-to-you/


Two Definitions for Big Data

Not sure I understood the rest of the post, but I really liked these two definitions of big data:

Big data means nothing. It’s a well meaning term for (literally) big piles of data, sitting in various massive balls of infrastructure, randomly scattered around our enterprise. More common terms include data warehouses or decision support systems, etc.

and

Big data is created by copying transactional data and sticking it on another system. We copy ALL our transactional data and stick it on these systems. Over time, those systems become supersets of our transactional systems. We make lots of copies and put them in lots of big data systems.

Original title and link: Two Definitions for Big Data (NoSQL databases © myNoSQL)


Big Data Flow in Large Hadron Collider Grid

Summarizing the attempt to trace pieces of data in the Large Hadron Collider Grid:

While it’s true that files are given dates and times, figuring out where a packet of data came from is a lot more difficult than just reading the file name. Data are copied, split, copied again and deleted. Bits and pieces travel all over the Grid, often without any input from a human user.

Sounds in a way similar to what I’ve described as an ideal big data processing engine.

Original title and link: Big Data Flow in Large Hadron Collider Grid (NoSQL databases © myNoSQL)

via: http://blogs.nature.com/news/thegreatbeyond/2011/01/travelling_the_petabyte_highwa_1.html


Big Data Marketplaces and Data Privacy

To clarify: our goal was to map the nodes in the training dataset to the real identities in the social network that was used to create the data. […]

We were able to deanonymize about 80% of the nodes, including the vast majority of the high-degree nodes (both in- and out-degree.) We’re not sure what the overall error rate is, but for the high-degree nodes it is essentially zero.

We can go back to my questions: who will decide, regulate, and guarantee the level of privacy for data sets traded on the big data market?

@billynewport.

Original title and link: Big Data Marketplaces and Data Privacy (NoSQL databases © myNoSQL)

via: http://www.kaggle.com/blog/2011/01/15/how-we-did-it-the-winners-of-the-ijcnn-social-network-challenge/


HBase with Trillions Rows

Interesting question and answer on the HBase mailing list:

[…] is it feasible to use HBase table in “read-mostly” mode with trillions of rows, each contains small structured record (~200 bytes, ~15 fields). Does anybody know a successful case when tables with such number of rows are used with HBase?

My follow-up questions:

  • where is that data currently stored?
  • how will you migrate it?
  • if this is just what you estimate you’ll get, how soon will you reach these numbers?
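
To put the numbers from the question in perspective, here is a back-of-the-envelope sizing sketch. It assumes exactly one trillion rows as the lower bound of "trillions", the ~200 bytes per record quoted above, and the HDFS default replication factor of 3. It ignores compression and HBase's per-cell key overhead, so treat the result as a rough order of magnitude rather than a capacity plan.

```python
ROWS = 10**12          # assumption: one trillion rows, the lower bound of "trillions"
BYTES_PER_ROW = 200    # "~200 bytes" per record, taken from the question
REPLICATION = 3        # assumption: HDFS default replication factor


def raw_terabytes(rows, bytes_per_row):
    """Payload size in decimal terabytes (10**12 bytes), before replication."""
    return rows * bytes_per_row / 10**12


raw_tb = raw_terabytes(ROWS, BYTES_PER_ROW)   # payload only
replicated_tb = raw_tb * REPLICATION          # what the cluster's disks would hold

print(f"raw: {raw_tb:.0f} TB, replicated: {replicated_tb:.0f} TB")
```

Under these assumptions that is 200 TB of payload and 600 TB on disk. Since HBase stores the row key, column name, and timestamp with every cell, spreading the ~15 fields over separate columns would push the real footprint higher.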

Original title and link: HBase with Trillions Rows (NoSQL databases © myNoSQL)


Big Data Analysis at BackType

RWW has a nice post diving into the data flow and the tools used by BackType, a company with only 3 engineers, to deal with and analyze large amounts of data.

They’ve invented their own language, Cascalog, to make analysis easy, and their own database, ElephantDB, to simplify delivering the results of their analysis to users. They’ve even written a system to update traditional batch processing of massive data sets with new information in near real-time.

Some highlights:

  • 25 terabytes of compressed binary data, over 100 billion individual records
  • all services and data storage are on Amazon S3 and EC2
  • between 60 and 150 EC2 instances servicing an average of 400 requests/s
  • Clojure and Python as platform languages
  • Hadoop, Cascading and Cascalog are central pieces of BackType’s platform
  • Cascalog, a Clojure-based query language for Hadoop, was created and open sourced by BackType’s engineer Nathan Marz
  • ElephantDB, the storage solution, is a read-only cluster built on top of Berkeley DB files
  • Crawlers place data in Gearman queues for processing and storing

BackType data flow is presented in the following diagram:

BackType data flow

Included below is an interview with Nathan about Cascalog:

@pharkmillups.

Original title and link: Big Data Analysis at BackType (NoSQL databases © myNoSQL)

via: http://www.readwriteweb.com/hack/2011/01/secrets-of-backtypes-data-engineers.php


Machine and Human Generated Data

Volumes aside [1], why is this classification important?

Judging also by the definitions Curt and Daniel came up with, I still think it’s a useless classification.


[1] I don’t think YouTube’s data (human generated data) comes anywhere near what a single Boeing can produce in a shorter period

Original title and link: Machine and Human Generated Data (NoSQL databases © myNoSQL)


IA Ventures: We’re All About Big Data

IA Ventures was founded on the belief that managing and extracting value from massive, occasionally unstructured, often real-time data sets is a competitive advantage.

[…]

We invest in talented early stage teams fueling this revolution with the development of innovative tools, technologies and analytics for managing and extracting value from big-data—both structured and unstructured.

iaventures.com/focus

50 million waiting to be invested in Big Data companies.

Original title and link: IA Ventures: We’re All About Big Data (NoSQL databases © myNoSQL)


What is Big Data Used for

Philipp Janert [1]:

It falls into one of two camps. The first is reporting. […].

The other camp is what I consider “generalized search.” These are scenarios like: If User A likes movies B, C, and D, what other specific movie might User A want? That’s a form of searching because you’re not actually trying to create a conceptual model of user behavior. You’re comparing individual data points; you’re trying to find the movie that has the greatest similarity to a very specific other set of predefined movies. For this kind of generalized, exhaustive search, you need a lot of data because you look for the individual data points. But that’s not really analysis as I understand it, either.

I guess the ☞ Netflix competition was a bit more than generalized search, as it required both inductive and deductive research.
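
The "generalized search" Janert describes can be sketched as a nearest-neighbour lookup: score every candidate movie by its similarity to the movies User A already likes and return the best one. The sketch below is my own illustration, not anything from the interview. It assumes Jaccard similarity over the sets of users who liked each movie, and the `fans_by_movie` data is invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked, fans_by_movie):
    """Return the unseen movie most similar, in total, to the liked movies."""
    best_movie, best_score = None, -1.0
    for movie in sorted(fans_by_movie):      # sorted for a deterministic tie-break
        if movie in liked:
            continue
        # Compare individual data points; no model of user behaviour is built.
        score = sum(jaccard(fans_by_movie[movie], fans_by_movie[m]) for m in liked)
        if score > best_score:
            best_movie, best_score = movie, score
    return best_movie


# Invented data: which (hypothetical) users liked which movie.
fans_by_movie = {
    "B": {"u1", "u2", "u3"},
    "C": {"u1", "u2"},
    "D": {"u2", "u3"},
    "E": {"u1", "u2", "u3", "u4"},
    "F": {"u5"},
}
suggestion = recommend({"B", "C", "D"}, fans_by_movie)  # "E" for this data
```

Every candidate is compared against every liked movie, which is why this kind of exhaustive search needs a lot of data and a lot of compute, yet never produces a conceptual model of why User A likes what they like.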


[1] Philipp Janert: author of ☞ Data Analysis with Open Source Tools

Original title and link: What is Big Data Used for (NoSQL databases © myNoSQL)

via: http://radar.oreilly.com/2010/11/the-data-analysis-path-curiosi.html