Big Data: All content tagged as Big Data in NoSQL databases and polyglot persistence
Tuesday, 10 January 2012
Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance
The announcement of the Oracle Big Data Appliance had been out for only a couple of hours and had already hit all the media sites. Before looking at the details of the announcement, let’s try to understand what this announcement means for the parties involved.
What does it mean for Oracle?
- Oracle enters a very busy Hadoop market associated with the best known company in the Hadoop ecosystem
- With this partnership, Oracle didn’t have to make a huge investment in software development or services
- Not having to build its own distribution of Hadoop, Oracle could focus on developing the Oracle Big Data Connectors
- Oracle will delegate everything Hadoop to Cloudera thus it won’t have to deal with a very fast evolving open source project that might see some interesting events due to the
- Oracle seems to have changed the message about Hadoop being used only for basic ETL.
What does it mean for Cloudera?
- Cloudera gets access to a pool of customers (many of them possibly very large customers)
- Cloudera will not need a big sales force to reach these potential customers. Even if Cloudera knew about them, Oracle’s sales force will do the job
- If Oracle spells Cloudera’s name in every sales pitch, Cloudera will see a huge publicity bump that will sooner or later lead to more customers
Truth is I was expecting yet another distribution of Hadoop. And even if Oracle’s Big Data Appliance doesn’t feature the official Apache Hadoop distribution, I think that by choosing an existing distribution, Oracle did the right thing. For them and for their customers.
Original title and link: Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance (©myNoSQL)
Thursday, 5 January 2012
Hadoop, Big Data Apps, Data Science Tools, Cloud Collision: Wikibon Big Data Predictions for 2012
Jeff Kelly for Wikibon Blog:
- 2012 Will Be the Year of Big Data Applications.
- Analytic Platform Vendors Add Improved Functionality, Social Capabilities for Data Scientists.
- The Cloud and Big Data Collide.
- Big Data Appliances Gain Steam.
- Industry Responds to Big Data Skills Gap with Training and Education Resources.
- The Big Data Privacy Discussion Begins in Earnest.
According to TRIGG that’s 6 Ts out of 6.
Original title and link: Hadoop, Big Data Apps, Data Science Tools, Cloud Collision: Wikibon Big Data Predictions for 2012 (©myNoSQL)
Monday, 28 November 2011
NoSQL Databases and Big Data Market: A Quick Look at Technology vs Funding Status
What are your first thoughts if you overlay the following graphics:
Original title and link: NoSQL Databases and Big Data Market: A Quick Look at Technology vs Funding Status (©myNoSQL)
Thursday, 3 November 2011
Data Jujitsu and Data Karate
David F. Carr in an article about DJ Patil and his work on Big Data at LinkedIn:
That is what he means by data jujitsu, where jujitsu is the art of using an opponent’s leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution—without investing disproportionate resources at the early experimental stage. That’s as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.
Original title and link: Data Jujitsu and Data Karate (©myNoSQL)
Sunday, 20 March 2011
Make Data Available - Open Data Manual
From the Open Data Manual:
Open data needs to be ‘technically’ open as well as legally open. Specifically the data needs to be:
- Available — at no more than a reasonable cost of reproduction, preferably for free download on the Internet. Summary: publish your information on the Internet wherever possible.
- In bulk. The data should be available as a whole (a web API or service may also be very useful but is not a substitute for bulk access)
- In an open, machine-readable format. Machine-readability is important because it facilitates reuse, for example, tables of figures in a PDF can be read easily by humans but are very hard for a computer to use which greatly limits the ability to reuse that data.
Sir Tim Berners-Lee’s linked open data star scheme provides an unambiguous way to categorize open data. And while on the topic of open data, there’s also the Open Data Protocol, which is meant to enable the creation of HTTP-based data services.
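The machine-readability point in the quote above is easy to see with a small sketch. Assuming a dataset published in bulk as CSV (the column names and figures below are made up for illustration), a few lines of standard-library Python are enough to reuse it, which is not the case for the same table locked inside a PDF:

```python
import csv
import io

# Hypothetical sample standing in for a published bulk download;
# the columns and values are assumptions made for this sketch.
CSV_TEXT = """region,year,spending
North,2010,1200
South,2010,800
North,2011,1250
"""

def total_by_region(text):
    """Sum the spending column per region from CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        region = row["region"]
        totals[region] = totals.get(region, 0) + int(row["spending"])
    return totals

if __name__ == "__main__":
    print(total_by_region(CSV_TEXT))
```

The same function would work unchanged on a file handle opened from a real download, since `csv.DictReader` accepts any iterable of lines.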
Original title and link: Make Data Available - Open Data Manual (NoSQL databases © myNoSQL)
Wednesday, 16 March 2011
Strategies for Exploiting Large-scale Data
In a guest post on the Cloudera blog, Bob Gourley[1] enumerates the characteristics of working with Big Data from the perspective of federal agencies.
I think these can be generalized to all businesses and problems that require big data:
Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.
For a long time each business worked in its own silo.
Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.
federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities
If you are not confronted with this problem it is just because you didn’t realize it. If you think single sources of data are good enough, your business might be at risk.
Large-scale distributed analysis over large data sets is often expected to return results almost instantly.
Name a single manager or a business or a problem solver that wouldn’t like to get immediate answers.
Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.
increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores
Ditto
considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models
Werner Vogels mentioned in his Strata talk using Amazon Mechanical Turk for adding human-based processing for data control, data validation and correction, and data enrichment.
[1] Bob Gourley: editor of CTOvision.com and a former Defense Intelligence Agency (DIA) CTO, @bobgourley
Original title and link: Strategies for Exploiting Large-scale Data (NoSQL databases © myNoSQL)
Tuesday, 15 March 2011
Data Privacy and Data Marketplaces Future
Data gathered and sold by RapLeaf can be very specific. According to documents reviewed by the Journal, RapLeaf’s segments recently included a person’s household income range, age range, political leaning, and gender and age of children in the household, as well as interests in topics including religion, the Bible, gambling, tobacco, adult entertainment and “get rich quick” offers. In all, RapLeaf segmented people into more than 400 categories, the documents indicated.
Obscure data ownership + cryptic TOS + unregulated data marketplaces = 1984
Original title and link: Data Privacy and Data Marketplaces Future (NoSQL databases © myNoSQL)
via: http://online.wsj.com/article/SB10001424052702304410504575560243259416072.html
Wednesday, 9 March 2011
Everything Drives Storage
James Governor about storage and EMC:
It seems like every computing revolution drives storage volumes […]. But everything drives storage. Virtualisation drives storage (which helps explain both the rationalisation, and the huge success, of EMC’s VMware acquisition). The cloud drives storage. Big Data drives storage (obviously). Data Center consolidation drives storage. The Web drives storage.
… and they don’t believe in memory.
Original title and link: Everything Drives Storage (NoSQL databases © myNoSQL)
Saturday, 26 February 2011
MapReducing Big Data with Riak and Luwak
The recording of Basho’s webinar on Riak Map/Reduce and Luwak:
Before watching, you should read Baseball Batting Average, Using Riak Map/Reduce and Fixing the count.
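For readers who have not seen the batting average example, here is a minimal sketch of the idea in plain Python. It does not use Riak's API; the record fields, function names, and data are all assumptions made for illustration. The map step emits (hits, at-bats) pairs and the reduce step sums them, so the reduce output has the same shape as its input and can safely be reduced again:

```python
# Hypothetical per-game records; the field names are assumptions.
GAMES = [
    {"player": "a", "hits": 2, "at_bats": 4},
    {"player": "a", "hits": 0, "at_bats": 3},
    {"player": "b", "hits": 1, "at_bats": 5},
    {"player": "a", "hits": 3, "at_bats": 3},
]

def map_game(game, player):
    """Map step: emit one (hits, at_bats) pair for games by the given player."""
    if game["player"] == player:
        return [(game["hits"], game["at_bats"])]
    return []

def reduce_counts(pairs):
    """Reduce step: sum pairs component-wise.

    The result is a list of one pair, the same shape as the input,
    so partial results can be fed back into this function.
    """
    hits = sum(h for h, _ in pairs)
    at_bats = sum(ab for _, ab in pairs)
    return [(hits, at_bats)]

def batting_average(games, player):
    mapped = [pair for game in games for pair in map_game(game, player)]
    # Reduce two batches separately, then reduce the partial results,
    # mimicking a reduce phase that runs more than once.
    half = len(mapped) // 2
    partial = reduce_counts(mapped[:half]) + reduce_counts(mapped[half:])
    (hits, at_bats), = reduce_counts(partial)
    return hits / at_bats if at_bats else 0.0

if __name__ == "__main__":
    print(batting_average(GAMES, "a"))
```

Carrying the raw counts through the reduce step, rather than averaging per-game averages, is what keeps the final number correct however the input is batched.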
Original title and link: MapReducing Big Data with Riak and Luwak (NoSQL databases © myNoSQL)
Monday, 21 February 2011
4 Database Technologies for Large Scale Data
Park Kieun (CUBRID Cluster Architect) gives an introduction to 4 large scale database technologies:
- Massively Parallel Processing (MPP) or parallel DBMS – A system that parallelizes the query execution of a DBMS, and splits queries and allocates them to multiple DBMS nodes in order to process massive amounts of data concurrently.
Examples: eBay DW, Yahoo! Everest Architecture, Greenplum, AsterData
- Column-oriented database – A system that stores the values in the same field as a column, as opposed to the conventional row method that stores them as individual records.
Examples: Vertica, Sybase IQ, MonetDB
- Streaming processing (ESP or CEP) – A system that processes a constant data (or events) stream, or a concept in which the content of a database is continuously changing over time.
Examples: Truviso
- Key-value storage (with MapReduce programming model) – A storage system that focuses on enhancing the performance when reading a single record by adopting the key-value data model, which is simpler than the relational data model.
Examples: many of the NoSQL databases covered here.
Even if I came up with the same 5 categories for scalable storage solutions, Park’s list is better documented. However we both left out distributed filesystems (sorry Jeff).
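To make the column-oriented entry in the list above a bit more concrete, here is a toy sketch contrasting the two layouts in plain Python. It is not how Vertica, Sybase IQ, or MonetDB are implemented; the records and function names are hypothetical. The point is only that an aggregate over one field reads a single contiguous list in the columnar layout, while the row layout has to visit every record:

```python
# Hypothetical records in the conventional row layout.
ROWS = [
    {"id": 1, "city": "Oslo", "amount": 10},
    {"id": 2, "city": "Rome", "amount": 25},
    {"id": 3, "city": "Oslo", "amount": 5},
]

def to_columns(rows):
    """Pivot a list of row records into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

COLUMNS = to_columns(ROWS)

def total_amount_rows(rows):
    """Row layout: every record is visited to pick out one field."""
    return sum(row["amount"] for row in rows)

def total_amount_columns(columns):
    """Column layout: only the 'amount' column is read."""
    return sum(columns["amount"])

if __name__ == "__main__":
    print(COLUMNS["amount"])
    print(total_amount_rows(ROWS), total_amount_columns(COLUMNS))
```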
Original title and link: 4 Database Technologies for Large Scale Data (NoSQL databases © myNoSQL)
via: http://blog.cubrid.org/web-2-0/database-technology-for-large-scale-data/
Sunday, 20 February 2011
Data Analysis Tools Survey Results
I’ve always wondered what tools data scientists use to dig useful information out of big data and create beauty out of it.
Szilard Pafka[1] has run a survey about the tools used by data scientists and presents an overview of the results in the video embedded below.
As I’ve learned over time, the R language is the preferred data analysis tool, and the survey confirms it. But what surprised me was to see Excel coming in second place. Python and Unix shell tools come after SAS to complete the top five.
[1] Szilard Pafka: founder and organizer of the Los Angeles R user group
Original title and link: Data Analysis Tools Survey Results (NoSQL databases © myNoSQL)