datawarehouse: All content tagged as datawarehouse in NoSQL databases and polyglot persistence

Scaling the Facebook data warehouse to 300 PB

Fascinating read, raising interesting observations on different levels:

  1. At Facebook, data warehouse means Hadoop and Hive.

    Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

  2. I don’t see how the market for in-memory solutions like HANA will expand.

    In the Enterprise Data Warehouses and the first Hadoop squeeze, Rob Klopp predicted a squeeze of the EDW market under the pressure of in-memory DBMS and Hadoop. I still think that in-memory will become just a custom engine in the Hadoop toolkit and existing EDW products.

    As for the often-repeated argument that “not everybody is Facebook”, what gets swept under the rug is that today’s data size is the smallest you’ll ever have.

    In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure.

  3. At Facebook’s scale, balancing availability and costs is again a challenge. But there’s no mention of network attached storage.

    There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS.

  4. For the nuts and bolts of effectively optimizing compression, read the rest of the post, which covers the optimizations Facebook brought to the ORCFile format.

    There seem to be two competing formats at play: ORCFile (with support from Hortonworks and Facebook) and Parquet (with support from Twitter and Cloudera). Unfortunately I don’t have any good comparison of the two. And I couldn’t find one (why?).
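The figures quoted in points 1 to 3 can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not anything from the Facebook post: it assumes decimal units (1 PB = 1000 TB), a 365-day year, and a constant daily ingest rate. The two storage overheads at the end (3.0 for plain 3-way replication, 1.4 for an erasure-coded layout with 10 data and 4 parity blocks) are hypothetical values picked for illustration, and treating the 300 PB as logical size is also an assumption.

```python
# Back-of-the-envelope check of the quoted warehouse figures.
# Assumptions (not from the source): decimal units, a 365-day year,
# and a constant daily ingest rate.

TB_PER_PB = 1000


def yearly_ingest_pb(daily_tb: float, days: int = 365) -> float:
    """Data arriving in one year at a constant daily rate, in PB."""
    return daily_tb * days / TB_PER_PB


def size_a_year_ago_pb(current_pb: float, growth_factor: float) -> float:
    """Warehouse size one year earlier, given the yearly growth factor."""
    return current_pb / growth_factor


def raw_footprint_pb(logical_pb: float, overhead: float) -> float:
    """Raw disk needed to hold a logical size at a given storage overhead."""
    return logical_pb * overhead


if __name__ == "__main__":
    current_pb = 300.0  # "upwards of 300 PB" (quoted)
    daily_tb = 600.0    # "about 600 TB" per day (quoted)
    growth = 3.0        # "3x growth" over the last year (quoted)

    ingest = yearly_ingest_pb(daily_tb)
    net_growth = current_pb - size_a_year_ago_pb(current_pb, growth)
    print(f"yearly ingest at today's rate: {ingest:.0f} PB")
    print(f"net growth implied by 3x:      {net_growth:.0f} PB")

    # Hypothetical overheads, chosen only to show why lowering the
    # replication ratio matters at this scale.
    layouts = (("3-way replication", 3.0), ("10+4 erasure coding", 14 / 10))
    for label, overhead in layouts:
        raw = raw_footprint_pb(current_pb, overhead)
        print(f"{label}: {raw:.0f} PB raw")
```

Under these assumptions the quoted numbers roughly agree with each other: 600 TB a day adds up to about 219 PB a year, while going from a third of 300 PB to 300 PB is a net gain of 200 PB.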

Original title and link: Scaling the Facebook data warehouse to 300 PB (NoSQL database©myNoSQL)

via: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/


Three opinions about the future of Hadoop and Data Warehouse

Building on the same Gartner data and the same Hadoop Summit talk, Matt Asay[1] and Timo Elliott[2] place Hadoop on the data warehouse map.

Matt Asay writes in the ReadWrite article that Hadoop is not replacing existing data warehouses, but it’s taking all new projects:

Hadoop (and its kissing cousin, the NoSQL database) isn’t replacing legacy technology so much as it’s usurping its place in modern workloads. This means enterprises will end up supporting both legacy technology and Hadoop/NoSQL to manage both existing and new workloads […]

Of course, given “the effective price of core Hadoop distribution software and support services is nearly zero” at this point, as Jeff Kelly highlights, more and more workloads will gravitate to Hadoop. So while data warehouse vendors aren’t dead—they’re not even gasping for breath—they risk being left behind for modern data workloads if they don’t quickly embrace Hadoop and other 21st Century data infrastructure.

On his blog, Timo Elliott makes sure that there’s some SAP in that future picture and uses their Hadoop partner, Hortonworks, to depict it:

No. Ignoring the many advantages of Hadoop would be dumb. But it would be just as dumb to ignore the other revolutionary technology breakthroughs in the DW space. In particular, new in-memory processing opportunities have created a brand-new category that Gartner calls “hybrid transactional/analytic platforms” (HTAP)

[Image: Hadoop modern architecture diagram]

The future I’d like to see is the one where:

  1. there is an integrated data platform. Note that in this ideal world, integrated does not mean any form of ETL
  2. it supports and runs in isolation different workloads from online transactions and bulk upload to various forms of analytics
  3. data is stored on dedicated mediums (spinning disks, flash, memory) depending on the workloads that touch it
  4. data would move between these storage mediums automatically, but the platform would allow fine tuning for maintaining the SLAs of the different components

  [1] Matt Asay is VP of business development and corporate strategy at MongoDB.

  [2] Timo Elliott is an Innovation Evangelist for SAP.


Does Hadoop replace or augment the enterprise data warehouse?

Wayne Eckerson:

For Cloudera, the first vendor to offer a Hadoop distribution, the answer is an unequivocal yes. Last November, Cloudera finally exposed its true sentiments by introducing the Enterprise Data Hub in which Hadoop replaces the data warehouse, among other things, as the center of an organization’s data management strategy. In contrast, Hortonworks takes a hybrid approach, partnering with leading commercial data management and analytics vendors to create a data environment that blends the best of Hadoop and commercial software. In short, Cloudera offers revolution, Hortonworks evolution.

You know what? Both are right. To replace existing enterprise data warehouses, the first step is cohabiting with them.


Picking the Right Platform: Big Data or Traditional Warehouse?

Stephen Swoyer (TDWI) summarizes Richard Winter’s research into the cost efficiency of Hadoop vs. data warehouses:

“Under what circumstances, in fact, does Hadoop save you a lot of money, and under what circumstances does a data warehouse save you a lot of money?”

The conversation happened at a Teradata event, so you might already guess some of the findings. Anyway, without seeing the data, it’s difficult to agree or disagree:

In fact, he argued that misusing Hadoop for some types of decision support workloads could cost up to 2.8x more than a data warehouse.

via: http://tdwi.org/Articles/2013/12/17/Picking-Right-DW-Platform.aspx?Page=1&p=1


How Hadoop wraps the data warehouse in a savory big data sandwich

Nancy Kopp for IBM Data magazine:

Why is Hadoop the metaphorical bun for the big data burger? Well, as Hadoop moved further into production environments, two very prominent use cases emerged. We at IBM refer to the first use case as the landing zone. It is an area of the architecture where organizations are building out the capability to land all data—both structured and unstructured. […] The other prominent use case is leveraging Hadoop for archiving and offloading the data warehouse.

That’s a pretty tasty comparison. But there’s something missing from this sandwich: the gravy. How is data moving between these layers? I’d bet that in most cases that’s still Hadoop.

via: http://ibmdatamag.com/2013/08/relishing-the-big-data-burger/