Big Data: All content tagged as Big Data in NoSQL databases and polyglot persistence
Aaron Cordova created a great decision flowchart to help answer the question: what tools should I use to process my data?
Original title and link: SQL or Hadoop: What Tools Should I Use to Process My Data? ( ©myNoSQL)
Oracle Big Data Appliance hardware specification
18 Oracle Sun servers with a total of:
- 864 GB main memory;
- 216 CPU cores;
- 648 TB of raw disk storage;
- 40 Gb/s InfiniBand connectivity between nodes and other Oracle engineered systems; and,
- 10 Gb/s Ethernet data center connectivity.
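To put the rack totals in perspective, here is a quick back-of-the-envelope sketch of what each server would carry. It assumes the resources are spread evenly across the 18 servers, which the specification above does not state explicitly, and the function and variable names are mine.

```python
# Rack-wide totals quoted in the hardware specification above.
SERVERS = 18
TOTAL_MEMORY_GB = 864
TOTAL_CPU_CORES = 216
TOTAL_RAW_DISK_TB = 648


def per_node(total, nodes=SERVERS):
    """Divide a rack-wide total evenly across the nodes (an assumption)."""
    return total / nodes


memory_per_node_gb = per_node(TOTAL_MEMORY_GB)
cores_per_node = per_node(TOTAL_CPU_CORES)
disk_per_node_tb = per_node(TOTAL_RAW_DISK_TB)

print(
    f"{memory_per_node_gb:g} GB RAM, {cores_per_node:g} cores, "
    f"{disk_per_node_tb:g} TB raw disk per node"
)
```

Under that assumption each node works out to 48 GB of memory, 12 cores, and 36 TB of raw disk.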
The package includes 40Gb/s InfiniBand connectivity among the nodes, a rarity among Hadoop deployments, many of which use Ethernet to connect the nodes. Lumpkin said InfiniBand would speed data transfers within the system. Multiple racks can be tethered together in a cluster configuration. There is no theoretical limit to how many racks can be clustered together, though configurations of more than eight racks would require additional switches, Lumpkin said.
Oracle Big Data Appliance software specification
- Cloudera’s Distribution including Apache Hadoop
- Cloudera Manager
- Open source distribution of R
- Oracle NoSQL Database Community Edition
- Oracle Big Data Connectors
- Oracle Linux
Along with the release, Oracle also released Oracle Big Data Connectors, a set of drivers for exchanging data between the Big Data Appliance and other Oracle products, such as the Oracle Database 11g, the Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud and Oracle Exalytics In-Memory Machine.
However, Oracle isn’t blind to the fact that not everyone will be gung ho about buying an appliance. Its custom-built Big Data Connectors are available as separate products for those customers wanting to connect existing Hadoop clusters to Oracle database environments or R statistical-analysis environments.
According to Oracle’s announcement, “The integrated Oracle and Cloudera architecture has been fully tested and validated by Oracle, who will also collaborate with Cloudera to provide support for Oracle Big Data Appliance.”
Oracle Big Data Appliance Services
George Lumpkin, Oracle’s vice president of data warehousing product management:
Oracle will provide first-line support for the appliance and all software (including the Hadoop distribution and Cloudera Manager) through its case-tracking support infrastructure. But when particularly tough support cases arise, Oracle will tap Cloudera’s expertise.
What’s more, Oracle will refer customers to Cloudera for Hadoop training and consulting engagements.
Oracle Big Data Appliance Positioning
George Lumpkin, Oracle’s vice president of data warehousing product management:
We are positioning this as something that runs alongside other Oracle-based systems. Big data is more than just a cluster of hardware running Hadoop. It is an overall information architecture for enabling companies to analyze data and make decisions.
Oracle highlighted the Big Data Appliance as a complement to a growing family of “engineered systems” that now includes Exadata, Exalogic, and the Exalytics In-Memory Machine.
But what’s more remarkable is the fact that Oracle is finally looking beyond its core database. Oracle’s TimesTen and Essbase databases, which were recently upgraded for use in the Exalytics appliance, and BerkeleyDB, which was Oracle’s development starting point for the new NoSQL database, are examples of that shift.
Oracle is suddenly beginning to act as a data-management portfolio company, not just a company with a big brother and a bunch of starving siblings.
Oracle is positioning the appliance for managing and analyzing large sets of data that may be too large, or otherwise unsuitable for keeping in databases, such as telemetry data, click-stream data or other log data. “You may not want to keep the data in a database, but you do want to store it and analyze it,” Lumpkin said. The appliance is intended for those organizations that want to undertake Big Data-style analysis but may not have the in-house expertise to assemble large Hadoop or NoSQL-based systems.
Kurt Dunn, Cloudera’s chief operating officer, told InformationWeek:
Oracle has put together a very comprehensive product that is priced very well.
The cost of the Big Data Appliance is what will really stand out. At $500,000, this may not seem like a bargain, but in reality it is. Typically, commoditized Hadoop systems run at about $4,000 a node. To get this much data storage capacity and power, you would need about 385 nodes… which puts the price tag at around $1.54 million—three times the price of Oracle’s Cloudera-based offering (which, I should add, excludes things like support costs and power).
The hardware and software combined will sell for $450,000, with an annual support fee for both hardware and software of 12%. That’s highly competitive, working out to less than $700 per terabyte and being in line with the low costs big data practitioners expect from deployments built on commodity hardware.
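The numbers in the two pricing excerpts above are easy to check. The sketch below only reproduces the quoted arithmetic; note that the two sources cite different appliance prices ($500,000 and $450,000), and that I am assuming the per-terabyte figure is based on the 648 TB of raw disk from the hardware specification. The variable names are mine.

```python
# Figures quoted in the two pricing excerpts above.
COMMODITY_COST_PER_NODE = 4_000   # USD, "about $4,000 a node"
COMMODITY_NODES = 385             # nodes said to match the appliance
APPLIANCE_PRICE_FIRST = 500_000   # USD, price cited in the first excerpt
APPLIANCE_PRICE_SECOND = 450_000  # USD, price cited in the second excerpt
SUPPORT_PERCENT = 12              # annual support fee, percent of price
RAW_DISK_TB = 648                 # assumption: raw capacity is the basis

# Commodity cluster cost and how it compares to the first quoted price.
commodity_total = COMMODITY_COST_PER_NODE * COMMODITY_NODES
price_ratio = commodity_total / APPLIANCE_PRICE_FIRST

# Per-terabyte cost and yearly support based on the second quoted price.
price_per_tb = APPLIANCE_PRICE_SECOND / RAW_DISK_TB
annual_support = APPLIANCE_PRICE_SECOND * SUPPORT_PERCENT // 100

print(f"Commodity cluster: ${commodity_total:,} ({price_ratio:.2f}x the appliance)")
print(f"Appliance: ${price_per_tb:,.2f} per raw TB, ${annual_support:,} support per year")
```

Both claims hold up: the commodity estimate comes to $1.54 million, roughly three times the first quoted price, and the second quoted price works out to a bit under $700 per raw terabyte.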
Oracle - Cloudera Partnership
I wrote earlier about what this partnership means for both Oracle and Cloudera.
But by releasing the product early in the year in partnership with Cloudera, which has more customers and years in the market than any other Hadoop software and services provider, Oracle has made it clear that it is wasting no time and taking no chances with unproven technology.
“Cloudera brings us a couple of very important missing pieces, including its management software and assistance for a deeper second- and third-tier level of support,” said George Lumpkin, Oracle’s vice president of product management, data warehousing.
Speculations about the future of the Oracle - Cloudera partnership
Students of Linux history will well remember that’s exactly what happened when Oracle partnered with Red Hat to introduce commoditized Oracle offerings… and then Larry Ellison and crew decided to roll their own Oracle Enterprise Linux in 2006 when they decided to cut Red Hat out of the stack.
This is strong historical evidence that Oracle will do the same with Cloudera, because frankly the big data market is too big for Oracle not to want to own. Big Data Appliance customers should note this, and be very prepared that future versions may not be tied to Cloudera at all, but rather Oracle’s version of Hadoop.
A few people suggested on Twitter that this partnership is a sign of a possible acquisition of Cloudera by Oracle. TechCrunch’s Leena Rao links to an old post by Matt Asay suggesting this acquisition.
Media coverage of Oracle Big Data Appliance
- Oracle Press Release: Oracle Selects Cloudera to Provide Apache Hadoop Distribution and Tools for Oracle Big Data Appliance
- Jean-Pierre Dijcks on Oracle blogs: Big Data Appliance and Big Data Connectors are now Generally Available
- myNoSQL: Oracle Big Data Appliance Roundup: What, Why, How
- myNoSQL: Current and Future Big Data Warehouse
- ServicesANGLE: Oracle Releases Big Data Appliance with Cloudera Distribution for Hadoop
- PCWorld Business Center: Oracle Partners With Cloudera for Hadoop Appliance
- GigaOm: Cloudera puts the Hadoop in Oracle’s Big Data Appliance
- ITWorld: Big data: Oracle, Cloudera about to make it rain
- InformationWeek: Oracle Makes Big Data Appliance Move With Cloudera
- TechCrunch: Oracle Taps Cloudera For Hadoop Distribution Of Big Data Appliance
Original title and link: Oracle Big Data Appliance Released Features Cloudera Distribution of Hadoop: What You Need to Know ( ©myNoSQL)
The announcement of the Oracle Big Data Appliance had been out for only a couple of hours and had already hit all the media sites. Before looking at the details of the announcement, let’s try to understand what it means for the parties involved.
What does it mean for Oracle?
- Oracle enters a very busy Hadoop market associated with the best known company in the Hadoop ecosystem
- With this partnership, Oracle didn’t have to make a huge investment in software development or services
- Not having to build its own distribution of Hadoop, Oracle could focus on developing the Oracle Big Data Connectors
- Oracle will delegate everything Hadoop to Cloudera thus it won’t have to deal with a very fast evolving open source project that might see some interesting events due to the
- Oracle seems to have changed the message about Hadoop being used only for basic ETL.
What does it mean for Cloudera?
- Cloudera gets access to a pool of customers (many of them possibly very large customers)
- Cloudera will not need a big sales force to reach these potential customers. Even if Cloudera knew about them, Oracle’s sales force will do the job
- If Oracle mentions Cloudera’s name in every sales pitch, Cloudera will see a huge publicity bump that will sooner or later lead to more customers
Truth is I was expecting yet another distribution of Hadoop. And even if Oracle’s Big Data Appliance doesn’t feature the official Apache Hadoop distribution, I think that by choosing an existing distribution, Oracle did the right thing. For them and for their customers.
Original title and link: Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance ( ©myNoSQL)
What are your first thoughts if you overlay the following graphics:
Original title and link: NoSQL Databases and Big Data Market: A Quick Look at Technology vs Funding Status ( ©myNoSQL)
From the Open Data Manual:
Open data needs to be ‘technically’ open as well as legally open. Specifically the data needs to be:
- Available — at no more than a reasonable cost of reproduction, preferably for free download on the Internet. Summary: publish your information on the Internet wherever possible.
- In bulk. The data should be available as a whole (a web API or service may also be very useful but is not a substitute for bulk access)
- In an open, machine-readable format. Machine-readability is important because it facilitates reuse, for example, tables of figures in a PDF can be read easily by humans but are very hard for a computer to use which greatly limits the ability to reuse that data.
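To illustrate why machine-readability matters for reuse, here is a minimal sketch of how little code it takes to work with a table published as CSV. The sample data, the column names, and the `total_by_year` helper are all hypothetical stand-ins, not something from the Open Data Manual.

```python
import csv
import io

# Hypothetical sample standing in for a published open data file.
SAMPLE_CSV = """region,year,population
North,2010,1200
South,2010,800
North,2011,1250
South,2011,790
"""


def total_by_year(csv_text):
    """Sum the population column per year from CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row["year"])
        totals[year] = totals.get(year, 0) + int(row["population"])
    return totals


totals = total_by_year(SAMPLE_CSV)
print(totals)
```

Getting the same table out of a PDF would mean reconstructing rows and columns from page layout before any of this could run, which is the kind of friction the excerpt above warns about.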
Sir Tim Berners-Lee’s linked open data star scheme provides an unambiguous way to categorize open data. And while I’m on the subject of open data, there’s also the Open Data Protocol, which is meant to enable the creation of HTTP-based data services.
I think these can be generalized to all businesses and problems that require big data:
Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.
For a long time each business worked in its own silo.
Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.
federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities
If you are not confronted with this problem, it is only because you haven’t realized it yet. If you think single sources of data are good enough, your business might be at risk.
Large-scale distributed analysis over large data sets is often expected to return results almost instantly.
Name a single manager, business, or problem solver who wouldn’t like to get immediate answers.
Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.
increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores
considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models
Werner Vogels mentioned in his Strata talk using Amazon Mechanical Turk for adding human-based processing for data control, data validation and correction, and data enrichment.