Big data: All content tagged as Big data in NoSQL databases and polyglot persistence
Quick and Dirty (Incomplete) List of Interesting, Mostly Recent Data Warehousing and Big Data Papers by Peter Bailis
A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and “big data” database systems, with an eye towards real-world deployments. I figured I’d share the list. It’s biased and rather incomplete, but maybe of interest to someone. While many are obvious choices (I’ve omitted several, like MapReduce), I think there are a few underappreciated gems.
Original title and link: Quick and Dirty (Incomplete) List of Interesting, Mostly Recent Data Warehousing and Big Data Papers by Peter Bailis ( ©myNoSQL)
Cloudant has received an undisclosed investment from Samsung Ventures
- Cloudant PR: Samsung Ventures Adds Cloudant to its Portfolio of Leading Mobile Service Solution and Data Network Companies
- TNW: Samsung Ventures continues its investment offensive with DBaaS company Cloudant
- GigaOm: Samsung Ventures backs Cloudant with undisclosed investment
- WSJ: The Daily Startup: Cloudant Accepts Samsung Backing and ‘Long-Term Vision’
- TechCrunch: Samsung Invests In Cloudant, A CIA-Backed, YC Alum That Specializes In Database-As-A-Service Technologies
- DataCenterKnowledge: Samsung Invests in Cloudant, Prepping for ‘Internet of Things’
Think Big Analytics, a Big Data consulting company, raised $3 million from former Cisco executive Dan Scheinman and WI Harper Group
Hortonworks announces Certification Program for Apache Hadoop
[…]today announced the launch of the Hortonworks Certified Technology Program, designed to help customers choose leading enterprise software that has been tested to integrate with Hortonworks Data Platform (HDP), the only 100-percent open source Apache Hadoop distribution. By certifying technologies, Hortonworks is taking the risk out of the technology selection, thereby accelerating and simplifying customers’ big data projects. The Program strengthens and expands the Apache Hadoop ecosystem, while helping to increase the enterprise capabilities of Apache Hadoop.
I assume the model here is that vendors pay Hortonworks for this certification and they can use the Hortonworks stamp when talking to customers.
DataStax’s Next Great Data Developer Contest
Last, but not necessarily money-related:
MySQL 5.6 Released
I’m still reading about what’s new in MySQL 5.6, but what caught my eye while skimming the docs is support for online DDL.
Original title and link: NoSQL and Big Data Money News ( ©myNoSQL)
A lot of people like to make predictions. I don’t. But I love filing them for later reference.
Here’s a roundup of predictions for 2013. Most of them are about the Big Data market, very few mentioning NoSQL databases. Why?
- It’s all so… pink
- Back to Planet Earth
- Existing solutions. Do you mean old solutions?
- No Hadoop?
- We’re going up… I mean vertical
- Too much Hadoop
- What about NoSQL databases?
- Show me the money
To frame the context of these predictions, let’s start with the forecast of the Big Data market from Gartner Research. According to their reports, Big Data accounted for $96 billion of global IT spending in 2012. This will rise to $120 billion in 2013 and up to $232 billion by 2016.
It’s all so… pink
Stefan Groschupf from Datameer: Big Data – Crossing the Chasm in 2013!:
We think 2013 is the year that Big Data will cross the chasm.
Mike Gualtieri for Forrester: Big Data Predictions For 2013:
My prediction: Time magazine will name big data its 2013 person of the year.
Derrick Harris for GigaOM: What we’ll see in 2013 in data:
- Get ready for Hadoop as you’ve never seen it before
- The Google-Ray-Kurzweil singularity: If Google and Kurzweil can find a way to work symbiotically as employer and employee, who knows what they’ll be able to pull off. Maybe it will be an even crazier batch of ideas with which to dazzle the public, but it might also be some legitimate progress on Google’s current batch of ideas (including those hidden away inside Google X) that have promise today but need some old-school engineering know-how.
- Data for the people: What I’d like to see in 2013 is a combination of applications, data and devices that makes it easy for average consumers to learn about themselves in some meaningful ways.
If it’s about what I’d like for 2013, one of the top positions would be the “Freedom of Data Act”. The non-legalese text could simply read: “If you have permission to collect and process my data, I do have permission to get it back and use it however I like”.
Going back to 2013, to prepare for the new year, Derrick Harris writes A programmer’s guide to big data: 12 tools to know—none of these were on my list though.
[…] if your job revolves around writing code rather than data flows, you might need a little help. Here are 12 tools (listed alphabetically) that aim to help. As usual with this type of list, it’s very possible I left out some good options, so please note any omissions in the comments.
Back to Planet Earth
Reading more like Planet Food, The Red Hat Storage Team writes in Red Hat Predicts Significant Trends in Scale-out Open Hybrid Cloud Storage in 2013:
- Prediction #2 — Storage Software will Eat Storage Hardware for Lunch!
- Prediction #3 — Open Source Storage Software will Eat Proprietary Storage Software for Dinner!
- Prediction #5 — Big Data and Small Storage is the Perfect Recipe for Success!
Richard McDougall (VMware Application Infrastructure CTO): 2013 Predictions for Big Data:
- Prediction #4: “Delete” will become a forbidden word
- Prediction #3: There will be a mad dash for software-defined storage
- Prediction #2: The default infrastructure for Big Data will change
- Prediction #1: The focus on big data use cases will shift heavily towards real-time
That’s 2 for software-defined storage.
Nick Kolakowski for Slashdot: Hadoop, Mobile, and Other Big Data Trends in 2013:
Build Your Own Massive Data System: While other organizations don’t have Facebook’s resources, they do have a need to wrangle increasing amounts of data. That could drive many of them, over the next year or so, to opt for custom-built solutions over “off the shelf” platforms.
The emphasis should be on: even if you have the talent and budget, do not create yet another clone of an existing solution.
Elliot Bentley and Chris Mayer for JAXenter: Reasons to be excited about Big Data in 2013:
- Hadoop’s next real-time move: Hadoop has reached maturity but its main hindrance has been the inability of gleaning analysis at the speed which enterprises demand. 2013 could be the year where we see this change and a new direction for data-centric products.
- Jumping in is easier than ever: As the Hadoop platform solidifies, it is forming the foundation for clever startups like Precog and Continuuity which are abstracting away existing barriers to entry, and we’re likely to see even more of this within the coming year.
Indeed, engineers have always been known for jumping in head first.
Existing solutions. Do you mean old solutions?
Jeff Bertolucci for InformationWeek: 5 Big Data Predictions For 2013:
Data warehouses will go the way of the dinosaur. Pervasive Software, a data management and analytics company, foresees gloom and doom for existing data warehouses.
“The ‘Big Data Revolution’ is exposing how technically obsolete the existing data warehousing infrastructure really is. Relational technology is not well suited for large-scale analytical workloads. Big data analytics demand a completely modern technology infrastructure, such as Hadoop and its ecosystem,” […]
If throwing out old solutions is not your thing, Maarten Ectors writes on his blog: Big Data 2013 Predictions:
If you just invested a lot of money in a Big Data solution from any of the traditional BI vendors (Teradata, IBM, Oracle, SAS, EMC, HP, etc.) then you are likely to see sub-optimal ROI in 2013.
Yves for the Talend blog: Predicts 2013: Hadoop Becomes Enterprise-Acceptable, Transitions from Experimental to Mainstream:
In 2013, no longer an experimental platform, Hadoop will become a major player in the overall IT environment.
Herb Cunitz for the Hortonworks blog: Apache Hadoop: Seven Predictions for 2013:
Prediction #2: Emergence of vertically aligned Apache Hadoop “solutions”: […] As more and more companies gain success we will see patterns and solutions arise that are custom-fit for a challenge found in a particular industry. As the system integrators and consultants become more and more expert on Apache Hadoop, they will wrap solutions in packages and we will see the emergence of these vertical solutions
Prediction #6: The big data ecosystem expands. Related to prediction number four, existing application vendors will all clamor to make their products Hadoop-compatible. Led by Teradata and Microsoft and many others, application vendors are waking up to the reality that their applications must run on Hadoop. Already, it seems everyone is building reference architectures which incorporate Hadoop and HDP to leverage all the goodness they already provide around data lifecycle management, data governance, security, etc. Meanwhile the Hadoop community is doing everything it can to foster adoption by the ISVs. In 2013, nearly everyone will be speaking big data.
We’re going up… I mean vertical
Christophe from Wibidata: Welcome to 2013!:
We believe that the cutting edge trend in 2013 will be about building Big Data Applications, which means a greater focus on real-time serving technologies such as HBase and Kiji as well as emerging real-time query engines like Impala and Apache Drill.
If you ask yourself what Big Data Applications are, Christophe has an answer:
The differentiating factor between established applications and those that use Big Data is the ability of an application to dynamically adapt based on new data. This includes the ability to rescore models as sensor data fluctuates, incorporate external factors – such as weather and social media – that become relevant and modify the next best action each time end user behavior changes. Most applications make decisions using a bevy of rules and relying on select fractions of data. Products that claim real-time decisions or contextualized results largely operate in silos, using just the data that someone thought to include when the application was first deployed, not the most relevant and important data.
Staying with the application space, Jim Kaskade for Infochimps: Intelligent Applications: The Big Data Theme for 2013:
My prediction for 2013 is that competitive advantage will translate into enterprises using sophisticated Big Data analytics to create a new breed of applications - Intelligent Applications.
Too much Hadoop
Andrew Brust for ZDNet: Big Data 2013: Industry Players’ Forecasts:
My take on where Big Data technology is going comes down to two themes: a lessening dependency on MapReduce and a pushing down of Hadoop deeper into the enterprise software stack.
By the lessening dependency on MapReduce, I mean to say that products like Cloudera’s Impala, and Microsoft’s PolyBase, which bypass MapReduce and work directly against data stored in Hadoop’s Distributed File System (HDFS) will gain momentum. MapR’s prediction about the continued rise of SQL-based tools aligns with this, as does another prediction from Pervasive that “YARN changes the Hadoop game”.
And what do I mean by my prediction that Hadoop will be pushed deeper into the software stack? Simply that (a) Hadoop has gained such significant adoption that it has in effect become an industry standard and that (b) standards tend to become the foundation of higher-valued software tools, rather than tools in their own right.
James Kobielus (IBM Big Data Evangelist) for The Big Data Hub: Koby’s Big Data Predictions for 2013:
- Hybrid big-data deployments will become the standard
- Cross-scale data architectures will predominate
- Governance will become a prime focus of maturing big-data deployments
- Data science centers of excellence will spring up everywhere
- Next-best-action deployments will become more cross-application
No word about Hadoop. No word about IBM products. But reading between the lines makes me feel there’s an IBM product for every bullet point.
What about NoSQL databases?
Gazzang’s predictions for 2013 contain one of the few references to NoSQL databases in their 2013 The Year Big Data Goes Big-Time:
- A damaging big data breach will cause the market to question holes and vulnerabilities in NoSQL infrastructure.
- Vertical line of business applications on top of big data will start to explode, with some early examples already starting to emerge in retail, financial services and oil and gas.
- The first significant big data company acquisitions will happen, signaling a shift in focus from proof-of-concept projects to high-business-value implementations/rollouts.
Not really the best mention of NoSQL databases. Somewhat in the same vein, Armel Nene writes in his post Big Data, Bigger Myths:
NoSQL is the way forward and Hadoop is the Holy Grail: This is a funny one. NoSQL started as death to traditional RDBMS. Startup companies started to jump on the buzz wagon. There were NoSQL evangelists at every street corner, ok maybe not but you get the point. And the early adopters started to see problems in the movement. Experienced data admins from the SQL world started converting then they stopped, why?
Show me the money
John Bantleman (CEO of RainStor) for Wired: Big Data: Business or Technology Challenge?:
- Prediction 1: Enterprise Big Data Initiatives Move out of the Sandbox and Define a Clear Set of Business and Technology Requirements
- Prediction 2: Companies will Look to New Technology Combinations, other than Hadoop, when Managing Big Data
- Prediction 3: Budget Limitations will Pose one of the Biggest Hurdles to Solving Big Data Challenges
- Prediction 4: Big Data Tools Must Satisfy both Business and Technical Users
- Prediction 5: Heavyweights, such as Oracle and IBM, will Make Acquisitions in the Big Data Market
Coming from the CEO of a company active in the Big Data market, some of these predictions could be interpreted in different ways.
No prediction list is complete without looking at IPOs and from the Big Data market, only one company made David Zielenziger’s list for International Business Times, 5 Tech IPOs For 2013 From Cloud Events To Ultrafast Chips: Cloudera. Why? The IPO of 2013.
The list of predictions could go on and on for a while. So I’ll finish here with a conversation I had on Twitter:
Kontra: If the future of ‘big data’ is Hadoop, we’re royally screwed. We’re in dark ages with regards to data, multi-DC transactions/reliability/etc.
Alex: It very much depends on what we define as “future”. IMO it’s a building block, but there’s a lot to be built on top.
Kontra: Hadoop, currently, is unusable for majority of use cases often used by the majority of big(ish) data users without huge resources.
Alex: True. But other tools in the space are unusable to the majority of companies that cannot afford multi-million single tool investments
Kontra: That’s the point: we are in the dark ages when it comes to data, with or without Hadoop. It’s painful.
Alex: well, I think and hope that we are in the early renaissance days.
- Big Data — Crossing the Chasm in 2013!
- Big Data Predictions For 2013
- What we’ll see in 2013 in data
- A programmer’s guide to big data: 12 tools to know
- Red Hat Predicts Significant Trends in Scale-out Open Hybrid Cloud Storage in 2013
- 2013 Predictions for Big Data
- Hadoop, Mobile, and Other Big Data Trends in 2013
- Reasons to be excited about Big Data in 2013
- 5 Big Data Predictions For 2013
- Big Data 2013 Predictions
- Predicts 2013: Hadoop Becomes Enterprise-Acceptable, Transitions from Experimental to Mainstream
- Apache Hadoop: Seven Predictions for 2013
- Welcome to 2013!
- Intelligent Applications: The Big Data Theme for 2013
- Big Data 2013: Industry Players’ Forecasts
- Koby’s Big Data Predictions for 2013
- 2013 The Year Big Data Goes Big-Time
- Big Data, Bigger Myths
- Big Data: Business or Technology Challenge?
- 5 Tech IPOs For 2013 From Cloud Events To Ultrafast Chips
Original title and link: Issue #1: Quo Vadis, Big Data? ( ©myNoSQL)
Jay Kreps [1] had a very interesting follow-up to GigaOM’s article Why big data might be more about automation than insights:
That article reminded me how immature people’s thinking about the use of data is. They are still thinking about “reports”. Reports indicate that part of your business algorithm that is executed by a human. When you understand it well enough, whatever you are doing looking at a report a computer can do better and faster. But the real advantage is that computers can disaggregate decisions humans make into many many individual cases and be far more accurate.
The algorithm is:
- add instrumentation
- visualize data
- turn visualization into a report
- automate reaction to report
- Wash, rinse, repeat.
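The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything from Jay Kreps or LinkedIn: the function names (`instrument`, `build_report`, `automate`), the event names, and the threshold are all hypothetical, and the visualization step is collapsed into a plain dictionary that stands in for the report.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def instrument(events: Iterable[str]) -> Counter:
    """Step 1: add instrumentation, here by counting raw events by name."""
    return Counter(events)


def build_report(counts: Counter, threshold: int) -> Dict[str, bool]:
    """Steps 2-3: condense the counts into a report that flags events above a threshold."""
    return {name: count > threshold for name, count in counts.items()}


def automate(report: Dict[str, bool], react: Callable[[str], None]) -> List[str]:
    """Step 4: react to every flagged entry without a human reading the report."""
    flagged = sorted(name for name, over in report.items() if over)
    for name in flagged:
        react(name)
    return flagged


# Hypothetical event stream; in practice this would come from real instrumentation.
events = ["checkout_error", "login", "checkout_error", "checkout_error", "login"]
counts = instrument(events)
report = build_report(counts, threshold=2)

actions: List[str] = []  # stands in for an automated reaction such as paging or throttling
flagged = automate(report, actions.append)
```

Once the reaction is a function call rather than a person reading a chart, the same loop can be run per event type, per customer, or per machine, which is the disaggregation the quote is pointing at.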
[1] Jay Kreps is working at LinkedIn in the SNA team.
Original title and link: Reports Indicate That Part of Your Business Algorithm Is Executed by Humans ( ©myNoSQL)
Jay Jarrell, the President and CEO of Objectivity, in a PR announcement:
We have been solving the Big Data problem for decades.
Original title and link: Objectivity CEO: We Have Been Solving the Big Data Problem ( ©myNoSQL)
Great decision flowchart created by Aaron Cordova to help answer the question: what tools should I use to process my data:
Original title and link: SQL or Hadoop: What Tools Should I Use to Process My Data? ( ©myNoSQL)
Oracle Big Data Appliance hardware specification
18 Oracle Sun servers with a total of:
- 864 GB main memory;
- 216 CPU cores;
- 648 TB of raw disk storage;
- 40 Gb/s InfiniBand connectivity between nodes and other Oracle engineered systems; and,
- 10 Gb/s Ethernet data center connectivity.
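For a rough sense of what each server contributes, the totals above can be divided by the server count. The sketch below assumes the resources are spread evenly across the 18 servers, which the specification quoted here does not state explicitly; the variable names are only illustrative.

```python
# Totals taken from the hardware specification listed above.
SERVERS = 18
TOTAL_MEMORY_GB = 864
TOTAL_CPU_CORES = 216
TOTAL_RAW_DISK_TB = 648

# Assumption: every server is configured identically, so a simple division applies.
memory_per_server_gb = TOTAL_MEMORY_GB / SERVERS
cores_per_server = TOTAL_CPU_CORES / SERVERS
disk_per_server_tb = TOTAL_RAW_DISK_TB / SERVERS

print(f"{memory_per_server_gb:.0f} GB RAM, "
      f"{cores_per_server:.0f} cores, "
      f"{disk_per_server_tb:.0f} TB raw disk per server")
```

Under that assumption each node comes out at 48 GB of memory, 12 cores, and 36 TB of raw disk.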
The package includes 40Gb/s InfiniBand connectivity among the nodes, a rarity among Hadoop deployments, many of which use Ethernet to connect the nodes. Lumpkin said InfiniBand would speed data transfers within the system. Multiple racks can be tethered together in a cluster configuration. There is no theoretical limit to how many racks can be clustered together, though configurations of more than eight racks would require additional switches, Lumpkin said.
Oracle Big Data Appliance software specification
- Cloudera’s Distribution including Apache Hadoop
- Cloudera Manager
- Open source distribution of R
- Oracle NoSQL Database Community Edition
- Oracle Big Data Connectors
- Oracle Linux
Along with the release, Oracle also released Oracle Big Data Connectors, a set of drivers for exchanging data between the Big Data Appliance and other Oracle products, such as the Oracle Database 11g, the Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud and Oracle Exalytics In-Memory Machine.
However, Oracle isn’t blind to the fact that not everyone will be gung ho about buying an appliance. Its custom-built Big Data Connectors are available as separate products for those customers wanting to connect existing Hadoop clusters to Oracle database environments or R statistical-analysis environments.
According to Oracle’s announcement “The integrated Oracle and Cloudera architecture has been fully tested and validated by Oracle, who will also collaborate with Cloudera to provide support for Oracle Big Data Appliance.”
Oracle Big Data Appliance Services
George Lumpkin, Oracle’s vice president of data warehousing product management:
Oracle will provide first-line support for the appliance and all software (including the Hadoop distribution and Cloudera Manager) through its case-tracking support infrastructure. But when particularly tough support cases arise, Oracle will tap Cloudera’s expertise.
What’s more, Oracle will refer customers to Cloudera for Hadoop training and consulting engagements.
Oracle Big Data Appliance Positioning
George Lumpkin, Oracle’s vice president of data warehousing product management:
We are positioning this as something that runs alongside other Oracle-based systems. Big data is more than just a cluster of hardware running Hadoop. It is an overall information architecture for enabling companies to analyze data and make decisions.
Oracle highlighted the Big Data Appliance as a complement to a growing family of “engineered systems” that now includes Exadata, Exalogic, and the Exalytics In-Memory Machine.
But what’s more remarkable is the fact that Oracle is finally looking beyond its core database. Oracle’s TimesTen and Essbase databases, which were recently upgraded for use in the Exalytics appliance, and BerkeleyDB, which was Oracle’s development starting point for the new NoSQL database, are examples of that shift.
Oracle is suddenly beginning to act as a data-management portfolio company, not just a company with a big brother and a bunch of starving siblings.
Oracle is positioning the appliance for managing and analyzing large sets of data that may be too large, or otherwise unsuitable for keeping in databases, such as telemetry data, click-stream data or other log data. “You may not want to keep the data in a database, but you do want to store it and analyze it,” Lumpkin said. The appliance is intended for those organizations that want to undertake Big Data-style analysis but may not have the in-house expertise to assemble large Hadoop or NoSQL-based systems.
Kurt Dunn, Cloudera’s chief operating officer, told InformationWeek:
Oracle has put together a very comprehensive product that is priced very well.
The cost of the Big Data Appliance is what will really stand out. At $500,000, this may not seem like a bargain, but in reality it is. Typically, commoditized Hadoop systems run at about $4,000 a node. To get this much data storage capacity and power, you would need about 385 nodes… which puts the price tag at around $1.54 million—three times the price of Oracle’s Cloudera-based offering (which, I should add, excludes things like support costs and power).
The hardware and software combined will sell for $450,000, with an annual support fee for both hardware and software of 12%. That’s highly competitive, working out to less than $700 per terabyte and being in line with the low costs big data practitioners expect from deployments built on commodity hardware.
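The arithmetic behind the two quotes above can be checked directly. Note that the quotes use different list prices ($500,000 in the first, $450,000 in the second); the sketch keeps each figure with the claim it belongs to, and takes the 648 TB of raw disk from the hardware specification earlier in this post. Variable names are illustrative only.

```python
# First quote: commodity Hadoop cluster versus the appliance.
COMMODITY_NODE_COST_USD = 4_000
COMMODITY_NODES = 385
APPLIANCE_PRICE_FIRST_QUOTE_USD = 500_000

commodity_total_usd = COMMODITY_NODE_COST_USD * COMMODITY_NODES
price_ratio = commodity_total_usd / APPLIANCE_PRICE_FIRST_QUOTE_USD

# Second quote: price per terabyte and annual support.
APPLIANCE_PRICE_SECOND_QUOTE_USD = 450_000
RAW_DISK_TB = 648
SUPPORT_PERCENT = 12

cost_per_tb_usd = APPLIANCE_PRICE_SECOND_QUOTE_USD / RAW_DISK_TB
annual_support_usd = APPLIANCE_PRICE_SECOND_QUOTE_USD * SUPPORT_PERCENT // 100

print(f"commodity cluster: ${commodity_total_usd:,} ({price_ratio:.2f}x the appliance)")
print(f"appliance: ${cost_per_tb_usd:,.2f} per raw TB, ${annual_support_usd:,} support per year")
```

The numbers agree with the quoted claims: 385 nodes at $4,000 is $1.54 million, roughly three times $500,000, and $450,000 over 648 TB of raw disk is about $694 per terabyte, before the 12% annual support fee.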
Oracle - Cloudera Partnership
I wrote earlier about what this partnership means to both Oracle and Cloudera.
But by releasing the product early in the year in partnership with Cloudera, which has more customers and years in the market than any other Hadoop software and services provider, Oracle has made it clear that it is wasting no time and taking no chances with unproven technology.
“Cloudera brings us a couple of very important missing pieces, including its management software and assistance for a deeper second- and third-tier level of support,” said George Lumpkin, Oracle’s vice president of product management, data warehousing.
Speculations about the future of the Oracle - Cloudera partnership
Students of Linux history will well remember that’s exactly what happened when Oracle partnered with Red Hat to introduce commoditized Oracle offerings… and then Larry Ellison and crew decided to roll their own Oracle Enterprise Linux in 2006 when they decided to cut Red Hat out of the stack.
This is strong historical evidence that Oracle will do the same with Cloudera, because frankly the big data market is too big for Oracle not to want to own. Big Data Appliance customers should note this, and be very prepared that future versions may not be tied to Cloudera at all, but rather Oracle’s version of Hadoop.
A few people suggested on Twitter that this partnership is a sign of a possible acquisition of Cloudera by Oracle. TechCrunch’s Leena Rao links to an old post by Matt Asay suggesting this acquisition.
Media coverage of Oracle Big Data Appliance
- Oracle Press Release: Oracle Selects Cloudera to Provide Apache Hadoop Distribution and Tools for Oracle Big Data Appliance
- Jean-Pierre Dijcks on Oracle blogs: Big Data Appliance and Big Data Connectors are now Generally Available
- myNoSQL: Oracle Big Data Appliance Roundup: What, Why, How
- myNoSQL: Current and Future Big Data Warehouse
- ServicesANGLE: Oracle Releases Big Data Appliance with Cloudera Distribution for Hadoop
- PCWorld Business Center: Oracle Partners With Cloudera for Hadoop Appliance
- GigaOm: Cloudera puts the Hadoop in Oracle’s Big Data Appliance
- ITWorld: Big data: Oracle, Cloudera about to make it rain
- InformationWeek: Oracle Makes Big Data Appliance Move With Cloudera
- TechCrunch: Oracle Taps Cloudera For Hadoop Distribution Of Big Data Appliance
Original title and link: Oracle Big Data Appliance Released Features Cloudera Distribution of Hadoop: What You Need to Know ( ©myNoSQL)