


bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence

Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. The post also includes a nice visualization of this prediction.


Here’s an alternative, though. As shown in the visualization, the expansion of in-memory databases depends heavily on the evolution of memory prices. It’s hard to argue against price predictions or Moore’s law, but accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unpredicted drop in SSD prices), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions: products slightly slower, but also less expensive, than their competitors’ in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And, the Cloudera Enterprise 5 Beta release was made available to the world.) Furthermore, as always, a ton of bug fixes and new features went upstream, with the features notably but not exclusively HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not that subtle stab at Hortonworks’ series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala. SQL and low latency querying. A big win for the project, but not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)


Big Data 2014: Powering Up the Curve

Quentin Gallivan1:

  1. The big data ‘power curve’ in 2014 will be shaped by business users’ demand for data blending
  2. Big data needs to play well with others!
  3. You will see even more rapid innovation from the big data open source community
  4. You can’t prepare for tomorrow with yesterday’s tools

Just four truisms. Also, data blending? Really?

  1. Quentin Gallivan: CEO Pentaho 

Original title and link: Big Data 2014: Powering Up the Curve (NoSQL database©myNoSQL)


Big Data analytics predictions for 2014

Michele Chambers1:

In 2014, data analysts will be empowered through easy-to-use tools that leverage the insights of data scientists, by providing real-time forecasts and recommendations in their day-to-day business tools. Better analytics will make data analysis more effective, while automation frees up data scientists to focus on strategic initiatives and unlocking further value in corporate data stores.

Ease of use has been a mega trend of the last few years. Those ignoring it try to compensate through other means. But as proven over and over again, usability wins even over usefulness and correctness.

  1. Michele Chambers: Chief Strategy Officer and VP Product Management at Revolution Analytics, the company behind R. 

Original title and link: Big Data analytics predictions for 2014 (NoSQL database©myNoSQL)


Big Data Top Ten predictions for 2014 by Jim Kaskade

I couldn’t start the new year without a final round of predictions. Not mine, though.

Jim Kaskade:

  1. Consolidation of NoSQLs begins
  2. The Hadoop Clone wars end
  3. Open source business model is acknowledged by Wall Street
  4. Big Data and Cloud really means private cloud
  5. 2014 starts the era of analytic applications
  6. Search-based business intelligence tools will become the norm with Big Data
  7. Real-time in-memory analytics, complex event processing, and ETL combine
  8. Prescriptive analytics become more mainstream
  9. MDM will provide the dimensions for big data facts
  10. Security in Big Data won’t be a big issue

4.5 out of 10.

Original title and link: Big Data Top Ten predictions for 2014 by Jim Kaskade (NoSQL database©myNoSQL)


Big Data predictions for 2014

EMC’s Bill Schmarzo concluding his Big Data centric 8 predictions for 2014:

Whether these predictions are entirely true or only partially true, 2014 will be an even bigger year for Big Data as technology innovations enable organizations to leverage data to optimize key business processes and uncover new monetization opportunities. To be successful, organizations don’t need a Big Data strategy as much as they need a business strategy that incorporates Big Data.

One of the few prediction lists that are realistic.

Original title and link: Big Data predictions for 2014 (NoSQL database©myNoSQL)


Big Data 2.0: the next generation of Big Data

Gagan Mehra1 from VentureBeat:

The big data ecosystem has now reached a tipping point where the basic infrastructural capabilities for supporting big data challenges and opportunities are easily available. Now we are entering what I would call the next generation of big data — big data 2.0 — where the focus is on three key areas: Speed, Data Quality, Applications.

While I’m not convinced we can call the first cycle complete, if this is Big Data 2.0, can we skip to 3.0 already2?

  1. Gagan Mehra: Chief evangelist for Software AG 

  2. Not that these are boring, but they sound like pre-Big Data problems. 

Original title and link: Big Data 2.0: the next generation of Big Data (NoSQL database©myNoSQL)


Cloudera's strategy for Hadoop

Alex Woodie about Cloudera’s strategy for Hadoop:

Cloudera has gone further than other Hadoop vendors in articulating a business-oriented strategy for converting Hadoop R&D into a profitable business model. The company unveiled its “enterprise data hub” strategy at the Strata + Hadoop World conference in October, in which it envisions Hadoop at the center of a new data-focused architecture. Every type of data, whether it’s analytical or transactional in nature, goes through Hadoop on its way to somewhere else. (Hortonworks, MapR Technologies, and Pivotal, for what it’s worth, have similar strategies in play, but Cloudera has jumped out front in articulating the marketing message in the cleanest manner.)

In the early days, a coherent strategy is not critical, as the technology alone can win adopters quite easily through its direct value. Later, when penetrating the enterprise world, a big-picture strategy is at least a way to keep the conversation going, even if in the end the deployed solutions are highly customized.

Original title and link: Cloudera’s strategy for Hadoop (NoSQL database©myNoSQL)


LinkedIn's Hourglass: Incremental data processing in Hadoop

Matthew Hayes introduces a very interesting new framework from LinkedIn:

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
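The efficiency gain Hourglass aims for can be sketched in plain Python (a hypothetical illustration, not Hourglass's actual API): for a fixed-length window over day-partitioned data, instead of re-aggregating every partition each day, an incremental pass reuses yesterday's aggregate, adds the newest partition, and subtracts the one falling out of the window. A fixed-start window is the same idea minus the subtraction step.

```python
# Minimal sketch of incremental fixed-length window aggregation over
# day-partitioned data, assuming each day's partition is a Counter of
# per-visitor counts. Names here are illustrative, not from Hourglass.
from collections import Counter

def full_recompute(daily_counts, window):
    """Naive approach: re-aggregate every partition in the window."""
    total = Counter()
    for day in daily_counts[-window:]:
        total.update(day)
    return total

def incremental_update(prev_total, entering_day, leaving_day):
    """Incremental approach: reuse the previous window's aggregate."""
    total = Counter(prev_total)
    total.update(entering_day)    # add the partition entering the window
    total.subtract(leaving_day)   # drop the partition leaving the window
    return +total                 # unary + strips zero/negative entries

# Example: per-day visitor counts, 3-day window
days = [Counter({"alice": 1}), Counter({"bob": 2}),
        Counter({"alice": 1, "bob": 1}), Counter({"carol": 4})]
w0 = full_recompute(days[:3], 3)               # window over days 0..2
w1 = incremental_update(w0, days[3], days[0])  # slide to days 1..3
assert w1 == full_recompute(days[1:4], 3)      # same result, one day touched
```

The incremental version reads only two partitions per day instead of all thirty, which is where the MapReduce savings come from.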

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)


Is Big Data the inevitable future for data professionals?

TDWI interviewing Jonas Olsson, CEO of Graz:

Q: Is big data the inevitable future for data professionals?

Jonas Olsson: Of course not. Big data and data warehousing are two different technologies solving two different sets of challenges. Big data focuses on volume and unstructured data; data warehousing focuses on structured data and traceability. Which technology will suit your organization best depends on many factors. Using traditional data warehouse technology to analyze sensor-generated data is probably not a good idea because of the high volume of data, just as using big data technology to perform regulatory reporting is not a good idea due to poor traceability.

Wrong question. Very wrong answer.

Original title and link: Is Big Data the inevitable future for data professionals? (NoSQL database©myNoSQL)


Hadoop will be made better through engineering

Dan Woods prefacing an interview with Scott Gnau of Teradata:

In this vision, because Hadoop can store unstructured and structured information, because it can scale massively, because it is open source, because it allows many forms of analysis, because it has a thriving ecosystem, it will become the one repository to rule them all.

In my view, the most extreme advocates for Hadoop need to sober up and right size both expectations and time frames. Hadoop is important but it won’t replace all other repositories. Hadoop will change the world of data, but not in the next 18 months. The open source core of Hadoop is a masterful accomplishment, but like many open source projects, it will be made better through engineering.

You have to agree: there’s no engineering behind Hadoop. Just a huge number of intoxicated… brogrammers.

Original title and link: Hadoop will be made better through engineering (NoSQL database©myNoSQL)


Pigs can build graphs too for graph analytics

Extremely interesting and intriguing usage and extension of Pig at Intel:

Pigs eat everything and Pig can ingest many data formats and data types from many data sources, in line with our objectives for Graph Builder. Also, Pig has native support for local file systems, HDFS, and HBase, and tools like Sqoop can be used upstream of Pig to transfer data into HDFS from relational databases. One of the most fascinating things about Pig is that it only takes a few lines of code to define a complex workflow comprised of a long chain of data operations. These operations can map to multiple MapReduce jobs and Pig compiles the logical plan into an optimized workflow of MapReduce jobs. With all of these advantages, Pig seemed like the right tool for graph ETL, so we re-architected Graph Builder 2.0 as a library of User-Defined Functions (UDFs) and macros in Pig Latin.
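The graph-ETL pattern described above — a short chain of data operations turning raw records into a weighted edge list — can be sketched in plain Python (a hypothetical illustration of the dataflow; in Graph Builder this logic lives in Pig UDFs and macros and compiles down to MapReduce jobs):

```python
# Hypothetical sketch of graph ETL: extract co-occurrence edges from
# records, then aggregate duplicate edges into weights. The two steps
# mirror a FOREACH...GENERATE followed by a GROUP/COUNT in Pig Latin.
from collections import Counter
from itertools import combinations

records = [
    {"doc": "d1", "terms": ["hadoop", "pig", "hbase"]},
    {"doc": "d2", "terms": ["pig", "hbase"]},
    {"doc": "d3", "terms": ["hadoop", "pig"]},
]

# Step 1: emit one edge per pair of terms co-occurring in a record,
# sorted so (a, b) and (b, a) collapse to the same undirected edge.
edges = (tuple(sorted(pair))
         for rec in records
         for pair in combinations(set(rec["terms"]), 2))

# Step 2: group identical edges and count them as edge weights.
weighted_edges = Counter(edges)

assert weighted_edges[("hbase", "pig")] == 2
```

The point of the quote is that Pig expresses exactly this kind of chained transformation in a handful of declarative statements, while the planner worries about mapping it onto MapReduce stages.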

Original title and link: Pigs can build graphs too for graph analytics (NoSQL database©myNoSQL)