bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence
Tuesday, 30 April 2013
Hadoop Drives Down Costs
Darryl K. Taft reporting the experience of using Hadoop at UC Irvine Medical Center:
Because they were bleeding money, the team wanted a cost-effective solution. “Our target was $500 per terabyte. We were at $100,000 per terabyte with the old system,” Peterson said. “With our Hadoop cluster, we’re now at $900 per terabyte.”
How are these costs calculated?
- Fixed costs: hardware, any one-time licenses
- Recurring costs: hardware replacement, energy, HR
Is this all?
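One way to make the question concrete is a back-of-the-envelope model. The sketch below is only an illustration, not how UC Irvine arrived at its numbers: the function name `cost_per_terabyte`, the cost categories, and every dollar figure are hypothetical assumptions. It adds one-time costs to recurring costs over a number of years and divides by usable capacity.

```python
def cost_per_terabyte(fixed_costs, yearly_recurring_costs, years, usable_terabytes):
    """Total cost of ownership per usable terabyte over `years`.

    All inputs are hypothetical; this is not the formula used in the article.
    """
    if years <= 0 or usable_terabytes <= 0:
        raise ValueError("years and usable_terabytes must be positive")
    total = sum(fixed_costs.values()) + years * sum(yearly_recurring_costs.values())
    return total / usable_terabytes


# Made-up numbers, for illustration only.
fixed = {"hardware": 120_000, "one_time_licenses": 0}
recurring = {"hardware_replacement": 10_000, "energy": 8_000, "staff": 42_000}

print(cost_per_terabyte(fixed, recurring, years=3, usable_terabytes=300))
```

The result depends heavily on what counts as "usable": HDFS replicates each block three times by default, so raw disk capacity overstates what you can actually store.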
Original title and link: Hadoop Drives Down Costs (©myNoSQL)
Impala 1.0 - That was fast
Cloudera announces Impala 1.0 GA release.
That was fast. I guess this is one of the (small) advantages of having Hortonworks working on Stinger, Pivotal on HAWQ, and Qubole offering Hive, Pig, and Sqoop as a service.
Original title and link: Impala 1.0 - That was fast (©myNoSQL)
Hadoop Virtualization
Roberto V. Zicari interviewing Joe Russell¹ about Hadoop virtualization with Serengeti:
A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static. Additionally, it creates true multi-tenancy within more layers of the Hadoop stack, not just the name node.
I’m not 100% sure I get this. The only way I can make sense of it is that HDFS lives directly on the physical hardware and only the compute part is virtualized. Is that what he means?
¹ Joe Russell is Product Line Marketing Manager at VMware.
Original title and link: Hadoop Virtualization (©myNoSQL)
via: http://www.odbms.org/blog/2013/04/on-virtualize-hadoop-interview-with-joe-russell/
A Value Definition of Big Data
Jim Walker:
Last year, Shaun Connolly, Hortonworks VP of Corporate Strategy came up with this definition…
Big Data = Transactions + Interactions + Observations.
Well, give me an example of any data system that doesn’t satisfy this definition.
Here’s my proposal for yet another definition of Big Data: a buzzword that will never have a real definition, so we’d be better off moving on.
Original title and link: A Value Definition of Big Data (©myNoSQL)
via: http://hortonworks.com/blog/big-data-defined-part-deux-value-definition/
Monday, 29 April 2013
Project Savanna: Hadoop and OpenStack
Timothy Prickett Morgan for The Register about Project Savanna, a collaboration between Mirantis, Hortonworks, and Red Hat:
Batman and Robin. Peanut butter and chocolate. OpenStack and Hadoop. These are things that go together, with the latter pairing being something that commercial OpenStack distie Mirantis, commercial Hadoop distie Hortonworks, and commercial KVM and Linux distie (and soon to be OpenStack commercializer) Red Hat are putting together under a new OpenStack effort dubbed Project Savanna.
Hadoop is at the age where everyone tries to package it and claim they’ll be the Red Hat of the Hadoop ecosystem. I cannot really dot the i’s and cross the t’s, but my gut feeling is that right now all these efforts are more similar to the attempts at bringing Linux to the desktop.
We know how successful these have been so far.
Original title and link: Project Savanna: Hadoop and OpenStack (©myNoSQL)
via: http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/
Boundary for Splunk app for correlating alerts
Alex Williams for TechCrunch:
Boundary’s application performance monitoring technology is now integrated into Splunk’s enterprise platform, providing a window into apps that increasingly are distributed across cloud and on-premise virtualized environments.
At first I thought this meant Boundary would use Splunk as the backend for its data. But Boundary is a service, so that’s not the case. Plus, Splunk can already be used for network management and monitoring.
According to the post, “Splunk real-time alerts are tagged as annotations in Boundary’s time-series graphs. Customers can then correlate alerts against application flow and performance data.” So basically this is monitoring your monitoring system, right?
Original title and link: Boundary for Splunk app for correlating alerts (©myNoSQL)
Thursday, 25 April 2013
Project Falcon: Tackling Hadoop Data Lifecycle Management
Venkatesh Seetharam announcing a new Apache incubating project in the Hadoop ecosystem open sourced by InMobi and Hortonworks:
Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.
I think this diagram describes Project Falcon best:
✚ Was there any other project addressing this space?
Original title and link: Project Falcon: Tackling Hadoop Data Lifecycle Management (©myNoSQL)
3 Big Data Use Cases in Banking
An article on Sys-Con about three high-level, generic use cases of Big Data in banking:
- Customer experience
- Risk management
- Operations optimization
The first and the third are common across multiple fields. Risk management is critical to banks’ core business and I assume this is the domain where most of the technology investment happens.
Original title and link: 3 Big Data Use Cases in Banking (©myNoSQL)
Wednesday, 24 April 2013
Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!
Andy Feng wrote a post on the YDN blog about the data processing architecture at Yahoo! for delivering personalized content by analyzing billions of events for 700 million users and 2.2 billion content pieces every day, using a combination of batch processing (Hadoop) and stream processing (Storm):
Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.
✚ I don’t think I’ve seen the term micro-batch processing used before. Any idea why it’s used as an alternative to the well-established term stream processing?
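For what it’s worth, one common reading of the distinction is that a stream processor handles each event as it arrives, while a micro-batch processor groups events into small batches and handles each batch as a unit. The sketch below illustrates that reading in plain Python. It does not use Storm’s API, and the names `process_stream`, `process_micro_batches`, and `batch_size` are hypothetical.

```python
from itertools import islice


def process_stream(events, handle):
    """Stream style: handle every event individually, as it arrives."""
    return [handle(event) for event in events]


def process_micro_batches(events, handle_batch, batch_size):
    """Micro-batch style: group events into small batches, handle each batch once."""
    iterator = iter(events)
    results = []
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        results.append(handle_batch(batch))
    return results


clicks = [1, 2, 3, 4, 5, 6, 7]
per_event = process_stream(clicks, lambda e: e * 2)
per_batch = process_micro_batches(clicks, sum, batch_size=3)
print(per_event)
print(per_batch)
```

Real systems usually cut batches by time window rather than by count; a count is used here only to keep the example deterministic.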
Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (©myNoSQL)
Monday, 22 April 2013
Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility
Ofer Mendelevitch for Hortonworks blog:
Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.
Most often when speaking about Hadoop, people refer to costs (commodity servers), parallelism and scalability. I do not remember how many times I’ve written that the main difference between Hadoop and traditional data warehouses is in the agility it offers.
One Hadoop tagline could be: “Collect data today. Analyse it when and how you want.”
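To make “schema on read” a bit more tangible, here is a minimal sketch in plain Python rather than on Hadoop itself. The sample records, the `read_with_schema` helper, and the two schemas are all hypothetical. Raw records are kept exactly as written, and each application decides which fields and types it wants only when it reads them.

```python
import json

# Raw events are stored as-is; nothing is validated or dropped at write time.
raw_lines = [
    '{"user": "ann", "amount": "12.50", "country": "US"}',
    '{"user": "bob", "amount": "7", "referrer": "newsletter"}',
]


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) while reading the raw data."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two applications read the same raw data with different schemas.
billing = list(read_with_schema(raw_lines, {"user": str, "amount": float}))
marketing = list(read_with_schema(raw_lines, {"user": str, "referrer": str}))
print(billing)
print(marketing)
```

The `referrer` field did not have to be anticipated when the first record was written. With schema on write, adding it would typically mean changing the table definition first.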
Original title and link: Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility (©myNoSQL)
Monday, 15 April 2013
A Walk Down the Memory Lane of Big Data: Inside IBM’s SAGE, the Largest Computer Ever Built
Fascinating data about the Semi-Automatic Ground Environment system built by IBM in 1957:
SAGE consisted of 20 or so Direction Centers, each of which was a windowless, one-acre-large concrete cube (see below). Inside each DC were two CPUs, each one measuring 7,500 sq ft and consisting of 60,000 vacuum tubes, 175,000 diodes, 13,000 newfangled transistors, and 256KB of magnetic core RAM, consuming a total of 3MW of power and weighing in at 250 tons. Each CPU — only one operated at a time; the other was kept as a hot spare to minimize downtime — was capable of executing 75,000 instructions per second, which was enough to spit out tons of radar data to 150 CRT consoles.
Original title and link: A Walk Down the Memory Lane of Big Data: Inside IBM’s SAGE, the Largest Computer Ever Built (©myNoSQL)
Saturday, 13 April 2013
Some Interesting Talks and a Panel With People From Cloudera, Platfora, Greylock and Think Big Analytics
Recorded at the New York Data Business Meetup, put together by Matt Turck, and featuring, separately and together in a panel:
- Jeff Hammerbacher from Cloudera
- DJ Patil from Greylock
- Ron Bodkin from Think Big Analytics
- Ben Werther from Platfora.
