The announcement of the Oracle Big Data Appliance has been out for only a couple of hours and has already hit all the major media sites. Before looking at the details of the announcement, let’s try to understand what it means for the parties involved.
What does it mean for Oracle?
- Oracle enters a very crowded Hadoop market in partnership with the best-known company in the Hadoop ecosystem
- With this partnership, Oracle didn’t have to make a huge investment in software development or services
- By not building its own Hadoop distribution, Oracle could focus on developing the Oracle Big Data Connectors
- Oracle will delegate everything Hadoop-related to Cloudera, so it won’t have to deal with a very fast-evolving open source project that might yet see some interesting developments
- Oracle seems to have changed its message about Hadoop being useful only for basic ETL
What does it mean for Cloudera?
- Cloudera gets access to a pool of customers, many of them possibly very large
- Cloudera will not need a big sales force to reach these prospective customers; even for the ones Cloudera already knew about, Oracle’s sales force will do the job
- If Oracle mentions Cloudera’s name in every sales pitch, Cloudera will see a huge publicity bump that will sooner or later bring in more customers
Truth is, I was expecting yet another distribution of Hadoop. And even if Oracle’s Big Data Appliance doesn’t feature the official Apache Hadoop distribution, I think that by choosing an existing distribution Oracle did the right thing, both for itself and for its customers.
Original title and link: Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance (©myNoSQL)
What are your first thoughts if you overlay the following graphics:
Original title and link: NoSQL Databases and Big Data Market: A Quick Look at Technology vs Funding Status (©myNoSQL)
From the Open Data Manual:
Open data needs to be ‘technically’ open as well as legally open. Specifically, the data needs to be:
- Available — at no more than a reasonable cost of reproduction, preferably as a free download on the Internet. Summary: publish your information on the Internet wherever possible.
- In bulk. The data should be available as a whole (a web API or service may also be very useful, but it is not a substitute for bulk access).
- In an open, machine-readable format. Machine-readability is important because it facilitates reuse; for example, tables of figures in a PDF can be read easily by humans but are very hard for a computer to use, which greatly limits the ability to reuse that data.
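To make the machine-readability point concrete, here is a minimal Python sketch (the file names and fields are invented for illustration) that publishes the same records as bulk CSV and JSON instead of locking them in a PDF:

```python
import csv
import json

# Illustrative records; in practice these would come from the source database.
records = [
    {"year": 2010, "agency": "DOT", "spend_usd": 1250000},
    {"year": 2011, "agency": "DOT", "spend_usd": 1410000},
]

# Bulk CSV: trivially parseable by virtually any tool.
with open("spend.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["year", "agency", "spend_usd"])
    writer.writeheader()
    writer.writerows(records)

# Bulk JSON: convenient for programmatic consumers.
with open("spend.json", "w") as f:
    json.dump(records, f, indent=2)
```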
Sir Tim Berners-Lee’s linked open data star scheme provides an unambiguous way to categorize open data. And while I’m on the subject of open data, there’s also the Open Data Protocol, which is meant to enable the creation of HTTP-based data services.
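On the consuming side, an Open Data Protocol service is just parameterized HTTP. A minimal sketch, assuming a hypothetical endpoint at example.com; the $filter, $top, and $format system query options are part of the OData spec:

```python
import requests

# Hypothetical endpoint, used purely for illustration.
url = "https://example.com/odata/Datasets"

# OData system query options: filter server-side, cap the result size,
# and ask for a JSON representation.
response = requests.get(
    url,
    params={"$filter": "Category eq 'census'", "$top": 10, "$format": "json"},
    timeout=10,
)
response.raise_for_status()

# OData v4 JSON responses wrap the result set in a "value" array.
for entry in response.json().get("value", []):
    print(entry)
```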
I think these can be generalized to all businesses and problems that require big data:
Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.
For a long time each business worked in its own silo.
Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.
federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities
If you are not confronted with this problem, it is only because you haven’t realized it yet. If you think single sources of data are good enough, your business might be at risk.
Large-scale distributed analysis over large data sets is often expected to return results almost instantly.
Name a single manager, business, or problem solver who wouldn’t like to get immediate answers.
Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.
increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores
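To make the commodity-hardware, open-source approach concrete, here is a generic Hadoop Streaming word count in Python; this is a textbook illustration, not anything from the report itself:

```python
#!/usr/bin/env python3
# Generic Hadoop Streaming example, mapper and reducer in one file. Run with:
#   hadoop jar hadoop-streaming.jar -file wordcount.py \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -input <in> -output <out>
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop guarantees reducer input is sorted by key, so we can
    # aggregate counts with a single pass and a running key.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```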
considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models
In his Strata talk, Werner Vogels mentioned using Amazon Mechanical Turk to add human-based processing for data control, data validation and correction, and data enrichment.
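As a sketch of what such a human-in-the-loop step could look like today (boto3 and its MTurk client postdate that talk, and the question XML here is abbreviated), a HIT for validating a single record might be created like this:

```python
import boto3

# MTurk sandbox endpoint, so experimenting doesn't spend real money.
client = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A minimal QuestionForm asking a worker to validate one record;
# in a real pipeline the record under review would be interpolated in.
question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>validate_record</QuestionIdentifier>
    <QuestionContent><Text>Is this address valid? 123 Main St, Springfield</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = client.create_hit(
    Title="Validate a data record",
    Description="Confirm whether the shown record is accurate.",
    Keywords="data validation, quality",
    Reward="0.05",
    MaxAssignments=3,  # ask several workers, then vote on their answers
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```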
Szilard Pafka has run a survey about the tools used by data scientists, and he presents an overview of the results in the video embedded below.
As I’ve learned over time, R is the preferred data analysis tool, and the survey confirms it. What surprised me, though, was seeing Excel come in second place. Python and Unix shell tools come after SAS to complete the top five.
Szilard Pafka: founder and organizer of the Los Angeles R user group