


data warehouse: All content tagged as data warehouse in NoSQL databases and polyglot persistence

Is Big Data the inevitable future for data professionals?

TDWI interviewing Jonas Olsson, CEO of Graz:

Q: Is big data the inevitable future for data professionals?

Jonas Olsson: Of course not. Big data and data warehousing are two different technologies solving two different sets of challenges. Big data focuses on volume and unstructured data; data warehousing focuses on structured data and traceability. Which technology will suit your organization best depends on many factors. Using traditional data warehouse technology to analyze sensor-generated data is probably not a good idea because of the high volume of data, just as using big data technology to perform regulatory reporting is not a good idea due to poor traceability.

Wrong question. Very wrong answer.

Original title and link: Is Big Data the inevitable future for data professionals? (NoSQL database©myNoSQL)


Teradata Deployments: Apple, Walmart, eBay, Verizon, AT&T, BoA

Impressive roster for Teradata. I’d also love to see a list of deployments where Teradata and Hadoop are meeting.

Original title and link: Teradata Deployments: Apple, Walmart, eBay, Verizon, AT&T, BoA (NoSQL database©myNoSQL)


Quick and Dirty (Incomplete) List of Interesting, Mostly Recent Data Warehousing and Big Data Papers by Peter Bailis

Peter Bailis:

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and “big data” database systems, with an eye towards real-world deployments. I figured I’d share the list. It’s biased and rather incomplete, but maybe of interest to someone. While many are obvious choices (I’ve omitted several, like MapReduce), I think there are a few underappreciated gems.

Original title and link: Quick and Dirty (Incomplete) List of Interesting, Mostly Recent Data Warehousing and Big Data Papers by Peter Bailis (NoSQL database©myNoSQL)

Big Data Implications for IT Architecture and Infrastructure

Teradata’s Martin Willcox:

From an IT architecture / infrastructure perspective, I think that the key thing to understand about all of this is that, at least for the foreseeable future, we’ll need at least two different types of “database” technology to efficiently manage and exploit the relational and non-relational data, respectively: an integrated data warehouse, built on a Massively Parallel Processing (MPP) DBMS platform for the relational data, and the relational meta-data that we generate by processing the non-relational data (for example, that a call was made at this date and time, by this customer, and that they were assessed as being stressed and agitated); and another platform for the processing of the non-relational data, that enables us to parallelise complex algorithms - and so bring them to bear on large data-sets - using the MapReduce programming model. Since the value of these data are much greater in combination than in isolation – and because we may be shipping very large volumes of data between the different platforms - considerations of how best to connect and integrate these two repositories become very important.

One of the few corporate blog posts that do not try to position Hadoop (and implicitly MapReduce) in a corner.

This sane perspective could be a validation of my thoughts about the Teradata and Hortonworks partnership.
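To make the split a bit more concrete, here is a minimal single-process sketch of the MapReduce programming model applied to the call example from the quote: raw call transcripts go in, and small relational rows come out that could then be loaded into the warehouse. The record layout, the `STRESS_WORDS` list, and every function name are assumptions made up for illustration; this is not Teradata's or Hadoop's API, and a real job would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw record: (customer_id, call_timestamp, transcript)
CallRecord = Tuple[str, str, str]

# Illustrative assumption: a crude keyword test stands in for real stress detection.
STRESS_WORDS = {"angry", "cancel", "unacceptable"}


def map_call(record: CallRecord) -> Iterator[Tuple[str, Tuple[str, bool]]]:
    """Map step: emit (customer_id, (timestamp, stressed?)) for one call."""
    customer_id, timestamp, transcript = record
    words = set(transcript.lower().split())
    yield customer_id, (timestamp, bool(words & STRESS_WORDS))


def shuffle(pairs: Iterable[Tuple[str, Tuple[str, bool]]]) -> Dict[str, List[Tuple[str, bool]]]:
    """Group mapped values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[Tuple[str, bool]]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_calls(customer_id: str, values: List[Tuple[str, bool]]) -> Tuple[str, int, int]:
    """Reduce step: one relational row per customer (id, total calls, stressed calls)."""
    stressed = sum(1 for _, is_stressed in values if is_stressed)
    return customer_id, len(values), stressed


def run(records: Iterable[CallRecord]) -> List[Tuple[str, int, int]]:
    """Run map, shuffle, and reduce in sequence and return rows sorted by customer."""
    pairs = (pair for record in records for pair in map_call(record))
    return sorted(reduce_calls(key, values) for key, values in shuffle(pairs).items())
```

The rows returned by `run` are the kind of relational meta-data the quote describes: tiny compared with the transcripts, and a natural fit for the MPP side.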

Original title and link: Big Data Implications for IT Architecture and Infrastructure (NoSQL database©myNoSQL)


Hadoop and NoSQL in a Big Data Environment with Ron Bodkin

Ron Bodkin, interviewed by Michael Floyd for InfoQ, describes the growing Hadoop addiction:

People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses I just mentioned, where the goal is to take some of the pressure off the data warehouse, start to be able to process less structured data more effectively and to be able to do transformations and build summaries and aggregates, but not have to have all that data loaded to the data warehouse. But then the next thing that happens is once people have started doing that level of processing they realize there is a power of being able to ask questions they never thought of before the data, they can store all the data in small samples and they can go back and have a powerful query engine, a cluster of commodity machines that lets them dig into that raw data and analyze it in new ways ultimately leading to data science being able to do machine learning and being able to discover patterns in data and keep them improving and refining the data.

The interview is only 16 minutes long and you have the full transcript.

Original title and link: Hadoop and NoSQL in a Big Data Environment with Ron Bodkin (NoSQL database©myNoSQL)

Hadoop and IBM Netezza: Compete or Co-Exist?

I assume people on both sides of data warehouses (users and providers) are asking the same question. IBM Netezza and Cloudera seem to agree on the answer:

IBM Netezza had worked with Cloudera to put together a compelling demo to highlight the value of our combined solution of CDH/Hadoop and Netezza. Through an interesting use case, the demo showed how businesses could have their “hot” data (most recent data) residing in Netezza, “warm” data (longer time range data) residing in HDFS, while leveraging the Cloudera Connector for Netezza and Oozie (workflow engine part of CDH) to provide deeper insights to business executives.

I would have liked to know more details about the use case though. Just categorizing data into “hot” and “warm” is not enough to understand the advantages of each piece.

Original title and link: Hadoop and IBM Netezza: Compete or Co-Exist? (NoSQL database©myNoSQL)


Infobright Rough Query: Approximating Query Results

Very interesting idea in the latest Infobright release:

The most interesting of the group might be Rough Query, which speeds the process of finding the needle in a multi-terabyte haystack by quickly pointing users to a relevant range of data, at which point they can drill down with more-complex queries. So, in theory, a query that might have taken 20 minutes before might now take just a few minutes because Rough Query works in seconds by using only the in-memory data and the subsequent search is against a much smaller data set.

Curt Monash provides more context about Rough Queries in his post:

To understand Infobright Rough Query, recall the essence of Infobright’s architecture:

Infobright’s core technical idea is to chop columns of data into 64K chunks, called data packs, and then store concise information about what’s in the packs. The more basic information is stored in data pack nodes,* one per data pack. If you’re familiar with Netezza zone maps, data pack nodes sound like zone maps on steroids. They store maximum values, minimum values, and (where meaningful) aggregates, and also encode information as to which intervals between the min and max values do or don’t contain actual data values.

I.e., a concise, imprecise representation of the database is always kept in RAM, in something Infobright calls the “Knowledge Grid.” Rough Query estimates query results based solely on the information in the Knowledge Grid — i.e., Rough Query always executes against information that’s already in RAM.

Rough Query is not meant for BI or reporting, but rather for initial investigations data scientists would perform against BigData.
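A rough sketch of the idea, assuming nothing beyond what the quotes above describe: keep a min, max, and row count per data pack in memory, then answer a range query with bounds computed from those summaries alone. The names `PackNode`, `build_pack_nodes`, and `rough_count` are hypothetical, and Infobright's actual Knowledge Grid stores more than this (aggregates and interval information, for a start).

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 65536  # 64K values per data pack, as in the description above


@dataclass
class PackNode:
    """In-memory summary of one data pack (hypothetical, simplified structure)."""
    minimum: int
    maximum: int
    count: int


def build_pack_nodes(column: List[int], pack_size: int = PACK_SIZE) -> List[PackNode]:
    """Chop a column into packs and keep only min, max, and row count per pack."""
    nodes = []
    for start in range(0, len(column), pack_size):
        pack = column[start:start + pack_size]
        nodes.append(PackNode(min(pack), max(pack), len(pack)))
    return nodes


def rough_count(nodes: List[PackNode], low: int, high: int) -> Tuple[int, int]:
    """Bound the number of rows with low <= value <= high without reading any pack.

    Returns (at_least, at_most): packs entirely inside the range count toward
    both bounds, packs that merely overlap it count only toward the upper bound,
    and packs entirely outside it are skipped.
    """
    at_least = 0
    at_most = 0
    for node in nodes:
        if node.maximum < low or node.minimum > high:
            continue  # pack cannot contain a match
        if low <= node.minimum and node.maximum <= high:
            at_least += node.count  # every value in the pack matches
        at_most += node.count  # pack may contain matches
    return at_least, at_most
```

A drill-down query would then only need to open the packs that overlap the range without being fully inside it, which is where the smaller follow-up data set comes from.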

Original title and link: Infobright Rough Query: Approximating Query Results (NoSQL database©myNoSQL)


Data Warehouse 2011 Market According to Gartner

Donald Feinberg, vice president and distinguished analyst at Gartner:

In 2011, we are seeing data-warehouse platforms evolve from an information store supporting traditional business intelligence platforms to a broader analytics infrastructure supporting operational analytics, corporate performance management and other new applications and uses, such as operational BI and performance management.

Do all those words mean anything? Cause all I’m reading is: “in return for the pile of money they pay, clients want faster, closer to real-time results”.

Update: Jason Porter suggested another interpretation:

Original title and link: Data Warehouse 2011 Market According to Gartner (NoSQL databases © myNoSQL)


Current and Future Big Data Warehouse

  • Custom-built BigData frameworks like Teradata and VLDB implementations from Oracle that are proprietary frameworks designed to deal with large datasets. These frameworks are still very relational in orientation and are not designed to work with unstructured data sets.
  • Data Warehouse Appliances like Oracle’s Exadata. This introduces the concept of DW-in-a-box where the entire framework needed for a typical DW implementation (the Hardware, Software Framework in terms of data store and Advanced Analytical tools) are all vertically integrated and provided by the same vendor as a packaged solution.
  • Open Source NoSQL-oriented Big Data Frameworks such as Hadoop and Cassandra. These frameworks implement advanced analytical and mining algorithms such as Map/Reduce and are designed to be installed on commodity hardware for an MPP architecture with huge Master/Slave clusters. They are very good at dealing with vast amounts of unstructured, text-oriented information.
  • Commercial Big Data Frameworks like AsterData and GreenPlum, which follow the same paradigm of MPP infrastructures but have implemented their own add-ons such as SQL-MR and other optimizations for faster analytics.

A good list to augment/detail these 5 approaches to scalable storage solutions.

Original title and link: Current and Future Big Data Warehouse (NoSQL databases © myNoSQL)


No SQL and Big Data from a Business Intelligence & Data Warehousing Perspective

“No SQL” and Big data appearing in Rick Sherman’s list of overhyped trends in BI and Data warehousing:

7. No SQL: Pundits are confusing the complexity of integrating data with the use of SQL. Enterprise business data is complex because business processes and their relationships are complex. Relational databases are a symptom of that complexity, not the cause. Sorry, but the world of data is complex and no amount of wishful thinking (or avoiding SQL) is going to change that.

9. BIG Data: What is this? Ask 10 vendors and you get 10 answers (based on what they are selling). This can be a trend if somebody can define it.

Original title and link: No SQL and Big Data from a Business Intelligence & Data Warehousing Perspective (NoSQL databases © myNoSQL)


6 Trends Driving Data Warehousing and Business Intelligence

Philip Russom here and Curt Monash here:

most drivers of change in BI and DW concern four Mega-Trends:

  • size
  • speed
  • interoperability
  • economics
  • new kinds of data
  • increased analytic sophistication

I guess what’s new is the impact of the new kinds of data — I’d probably include here social data, sensor data, the continuously increasing size and new analytic approaches.

Original title and link: 6 Trends Driving Data Warehousing and Business Intelligence (NoSQL databases © myNoSQL)

The Near Future of NoSQL databases

Bradford Stephens[1] on the future of data storage solutions, including NoSQL databases:

Q: Do you foresee any consolidation in the near time?

Bradford: I see actually a proliferation of the open source tools.

We’ve got a ton of key-value stores out there, like Cassandra, Voldemort. I have some feeling that people have very specific requirements that they are going to cook up and open source.

In the document databases world I don’t see anything more than MongoDB, CouchDB, and a few of the others.

I do see consolidation happening in the commercial space, because there’s a lot of vendors out there doing very similar things, especially in the commercial data warehousing space.

And I see a ton of growth in areas like geo data — there’s no stack out there for geo data — and managing time series and other data like that.

Complete interview with O’Reilly’s David Sims below:

  1. Bradford Stephens: founder of Drawn to Scale, @LusciousPear  

Original title and link: The Near Future of NoSQL databases (NoSQL databases © myNoSQL)