Big data: All content tagged as Big data in NoSQL databases and polyglot persistence

Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance

The announcement of the Oracle Big Data Appliance has been out for only a couple of hours and has already hit all the media sites. Before looking at the details, let’s try to understand what the announcement means for the parties involved.

What does it mean for Oracle?

  • Oracle enters a very busy Hadoop market in association with the best-known company in the Hadoop ecosystem
  • With this partnership, Oracle didn’t have to make a huge investment in software development or services
  • Not having to build its own distribution of Hadoop, Oracle could focus on developing the Oracle Big Data Connectors
  • Oracle will delegate everything Hadoop to Cloudera thus it won’t have to deal with a very fast evolving open source project that might see some interesting events due to the
  • Oracle seems to have changed the message about Hadoop being used only for basic ETL.

What does it mean for Cloudera?

  • Cloudera gets access to a pool of customers (many of them possibly very large customers)
  • Cloudera will not need a big sales force to reach these potential customers. Even if Cloudera already knew about them, Oracle’s sales force will do the job
  • If Oracle spells Cloudera’s name in every sales pitch, Cloudera will see a huge publicity bump that will sooner or later lead to more customers

Truth is I was expecting yet another distribution of Hadoop. And even if Oracle’s Big Data Appliance doesn’t feature the official Apache Hadoop distribution, I think that by choosing an existing distribution, Oracle did the right thing. For them and for their customers.

Original title and link: Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance (NoSQL database©myNoSQL)


Hadoop, Big Data Apps, Data Science Tools, Cloud Collision: Wikibon Big Data Predictions for 2012

Jeff Kelly for Wikibon Blog:

  1. 2012 Will Be the Year of Big Data Applications.
  2. Analytic Platform Vendors Add Improved Functionality, Social Capabilities for Data Scientists.
  3. The Cloud and Big Data Collide.
  4. Big Data Appliances Gain Steam.
  5. Industry Responds to Big Data Skills Gap with Training and Education Resources.
  6. The Big Data Privacy Discussion Begins in Earnest.

According to TRIGG that’s 6 Ts out of 6.

Original title and link: Hadoop, Big Data Apps, Data Science Tools, Cloud Collision: Wikibon Big Data Predictions for 2012 (NoSQL database©myNoSQL)

via: http://wikibon.org/blog/big-data-in-2012-hadoop-big-data-apps-data-science-tools-cloud-collision-and-more/


NoSQL Databases and Big Data Market: A Quick Look at Technology vs Funding Status

What are your first thoughts if you overlay the following graphics:

Hype Cycle for Cloud Computing 2011

Original title and link: NoSQL Databases and Big Data Market: A Quick Look at Technology vs Funding Status (NoSQL database©myNoSQL)


Data Jujitsu and Data Karate

David F. Carr in an article about DJ Patil and his work on Big Data at LinkedIn:

That is what he means by data jujitsu, where jujitsu is the art of using an opponent’s leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution—without investing disproportionate resources at the early experimental stage. That’s as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.

Original title and link: Data Jujitsu and Data Karate (NoSQL database©myNoSQL)

via: http://www.informationweek.com/thebrainyard/news/strategy/231900611/web-20-expo-linkedins-big-data-lessons-learned


Make Data Available - Open Data Manual

From the Open Data Manual:

Open data needs to be ‘technically’ open as well as legally open. Specifically, the data needs to be:

  1. Available — at no more than a reasonable cost of reproduction, preferably for free download on the Internet. Summary: publish your information on the Internet wherever possible.
  2. In bulk. The data should be available as a whole (a web API or service may also be very useful but is not a substitute for bulk access)
  3. In an open, machine-readable format. Machine-readability is important because it facilitates reuse, for example, tables of figures in a PDF can be read easily by humans but are very hard for a computer to use which greatly limits the ability to reuse that data.
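To make the machine-readability point concrete, here is a minimal sketch of how little code it takes to reuse figures published as CSV. The dataset, its column names, and its values are hypothetical and exist only for illustration; they do not come from the Open Data Manual.

```python
import csv
import io

# Hypothetical published dataset; in practice this would be a downloaded CSV file.
published = io.StringIO(
    "region,year,population\n"
    "North,2010,1200\n"
    "South,2010,800\n"
)

# A machine-readable format parses straight into records a program can use.
records = list(csv.DictReader(published))

# Reuse is then a one-liner, e.g. summing a column.
total = sum(int(record["population"]) for record in records)
```

The same table embedded in a PDF would first need text extraction and layout guessing before any of this could run.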

Sir Tim Berners-Lee’s linked open data star scheme provides an unambiguous way to categorize open data. And while I’m at open data there’s also the Open Data Protocol which is meant to enable the creation of HTTP-based data services.

Original title and link: Make Data Available - Open Data Manual (NoSQL databases © myNoSQL)


Strategies for Exploiting Large-scale Data

In a guest post on the Cloudera blog, Bob Gourley[1] enumerates the characteristics of working with Big Data from the perspective of federal agencies.

I think these can be generalized to all businesses and problems that require big data:

Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.

For a long time each business worked in its own silo.

Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.

federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities

If you are not confronted with this problem, it is only because you haven’t realized it yet. If you think single sources of data are good enough, your business might be at risk.

Large-scale distributed analysis over large data sets is often expected to return results almost instantly.

Name a single manager or a business or a problem solver that wouldn’t like to get immediate answers.

  • Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.

  • increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores

Ditto

considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models

Werner Vogels mentioned in his Strata talk using Amazon Mechanical Turk for adding human-based processing for data control, data validation and correction, and data enrichment.


  1. Bob Gourley: editor of CTOvision.com and a former Defense Intelligence Agency (DIA) CTO, @bobgourley  

Original title and link: Strategies for Exploiting Large-scale Data (NoSQL databases © myNoSQL)


Data Privacy and Data Marketplaces Future

Data gathered and sold by RapLeaf can be very specific. According to documents reviewed by the Journal, RapLeaf’s segments recently included a person’s household income range, age range, political leaning, and gender and age of children in the household, as well as interests in topics including religion, the Bible, gambling, tobacco, adult entertainment and “get rich quick” offers. In all, RapLeaf segmented people into more than 400 categories, the documents indicated.

Obscure data ownership + cryptic TOS + unregulated data marketplaces = 1984

Original title and link: Data Privacy and Data Marketplaces Future (NoSQL databases © myNoSQL)

via: http://online.wsj.com/article/SB10001424052702304410504575560243259416072.html


Everything Drives Storage

James Governor about storage and EMC:

It seems like every computing revolution drives storage volumes […]. But everything drives storage. Virtualisation drives storage (which helps explain both the rationalisation, and the huge success, of EMC’s VMware acquisition). The cloud drives storage. Big Data drives storage (obviously). Data Center consolidation drives storage. The Web drives storage.

… and they don’t believe in memory.

Original title and link: Everything Drives Storage (NoSQL databases © myNoSQL)

via: http://www.redmonk.com/jgovernor/2011/01/28/emc-summit-on-cloud-storage-big-data-and-developers/


MapReducing Big Data with Riak and Luwak

The recording of Basho’s webinar on Riak Map/Reduce and Luwak:

Before watching, you should read “Baseball Batting Average, Using Riak Map/Reduce” and “Fixing the count”.

Original title and link: MapReducing Big Data with Riak and Luwak (NoSQL databases © myNoSQL)


4 Database Technologies for Large Scale Data

Park Kieun (CUBRID Cluster Architect) gives an introduction to 4 large scale database technologies:

  • Massively Parallel Processing (MPP) or parallel DBMS – A system that parallelizes the query execution of a DBMS, and splits queries and allocates them to multiple DBMS nodes in order to process massive amounts of data concurrently.

Examples: eBay DW, Yahoo! Everest Architecture, Greenplum, AsterData

  • Column-oriented database – A system that stores the values in the same field as a column, as opposed to the conventional row method that stores them as individual records.

Examples: Vertica, Sybase IQ, MonetDB

  • Streaming processing (ESP or CEP) – A system that processes a constant data (or events) stream, or a concept in which the content of a database is continuously changing over time.

Examples: Truviso

  • Key-value storage (with MapReduce programming model) – A storage system that focuses on enhancing the performance when reading a single record by adopting the key-value data model, which is simpler than the relational data model.

Examples: many of the NoSQL databases covered here.
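The difference between the row and column layouts described above can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: the records and field names are made up, and real column stores such as the ones listed add compression, on-disk formats, and query planning that are not shown here.

```python
# Hypothetical records, made up for illustration only.
rows = [
    {"id": 1, "city": "Seoul", "sales": 120},
    {"id": 2, "city": "Busan", "sales": 80},
    {"id": 3, "city": "Seoul", "sales": 200},
]


def to_columns(records):
    """Pivot a list of row records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: an aggregate over one field still walks every whole record.
total_from_rows = sum(record["sales"] for record in rows)

# Column layout: the same aggregate reads a single list of values.
total_from_columns = sum(columns["sales"])
```

Both totals are the same; what changes is how much unrelated data has to be touched to compute them, which is why column stores tend to suit analytic queries over a few fields.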

Even if I came up with the same 5 categories for scalable storage solutions, Park’s list is better documented. However we both left out distributed filesystems (sorry Jeff).

Original title and link: 4 Database Technologies for Large Scale Data (NoSQL databases © myNoSQL)

via: http://blog.cubrid.org/web-2-0/database-technology-for-large-scale-data/


Data Analysis Tools Survey Results

I’ve always wondered what tools are used by data scientists to dig useful information out of the big data and create beauty out of it.

Szilard Pafka[1] has run a survey about the tools used by data scientists, and he presents an overview of the results in the video embedded below.

As I’ve learned over time, the R language is the preferred data analysis tool, and the survey confirms it. What surprised me was seeing Excel come in second place. SAS, Python, and Unix shell tools complete the top five.


  1. Szilard Pafka: founder and organizer of the Los Angeles R user group  

Original title and link: Data Analysis Tools Survey Results (NoSQL databases © myNoSQL)