bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence

Optimizing Joins running on HDInsight Hive on Azure

Two notable things in Denny Lee’s post about optimizing some of the Hive joins used by Microsoft’s Online Services Division:

  1. Microsoft is drinking its own HDInsight on Azure champagne. This will take the HDInsight product far, as the team will always have first-hand feedback about the parts of the system that need improvement.
  2. Know the different types of JOINs supported by Hive and don’t be afraid of experimenting.

✚ An extra point for the link to Liyin Tang and Namit Jain’s Join strategies in Hive (PDF)
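To make the second point concrete, here is a minimal Python sketch of the idea behind two join strategies commonly described for Hive: the common (shuffle, reduce-side) join and the map-side join. This is an illustration of the concepts only, not Hive's implementation; the function names and the sample tables are hypothetical.

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows, key):
    """Map-side (broadcast) join sketch: load the small table into an
    in-memory hash table, then stream the big table past it. No shuffle."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**row, **match}


def shuffle_join(left_rows, right_rows, key):
    """Shuffle (reduce-side) join sketch: group rows from both inputs by
    key, then pair left and right rows inside each group."""
    groups = defaultdict(lambda: ([], []))
    for row in left_rows:
        groups[row[key]][0].append(row)
    for row in right_rows:
        groups[row[key]][1].append(row)
    for lefts, rights in groups.values():
        for left in lefts:
            for right in rights:
                yield {**left, **right}


# Hypothetical sample data: a large fact table and a small dimension table.
clicks = [
    {"uid": 1, "url": "/x"},
    {"uid": 1, "url": "/y"},
    {"uid": 3, "url": "/z"},
]
users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]

map_result = list(map_side_join(clicks, users, "uid"))
shuffle_result = list(shuffle_join(clicks, users, "uid"))
```

Both functions produce the same inner-join rows. The difference is where the work happens: the map-side variant only pays off when one input is small enough to fit in memory, which is why experimenting with the join type can matter so much.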

Original title and link: Optimizing Joins running on HDInsight Hive on Azure (NoSQL database©myNoSQL)

via: http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/


HBase migration to the new Hadoop Metrics2 system

Elliott Clark explains a bit of the work that he's doing to migrate the HBase metrics to Hadoop’s Metrics2 system:

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.

The post focuses on the specific implementation details rather than on the wide range of metrics HBase already supports or on how the new system would unify and extend them.

Original title and link: HBase migration to the new Hadoop Metrics2 system (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics


RCFile - OCFile - Parquet: Storing Big Data With Hive

Christian Prokopp explaining the advantages of the RCFile storage:

The state-of-the-art solution for Hive is the RCFile. The format has been co-developed by Facebook, which is running the largest Hadoop and Hive installation in the world. RCFile has been adopted by the Hive and Pig projects as the core format for table like data storage. The goal of the format development was “(1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns,” as can be seen in this PDF from the development teams.
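As a rough illustration of the layout idea usually attributed to RCFile (split rows into row groups, then store each group column by column), here is a small Python sketch. It assumes that description is accurate and ignores compression, headers, and the on-disk encoding; the function names and the sample rows are hypothetical.

```python
def to_row_groups(rows, columns, group_size):
    """Split rows horizontally into groups of `group_size`, then store
    each group as one list of values per column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({col: [row[col] for row in chunk] for col in columns})
    return groups


def scan_column(groups, column):
    """Read a single column across all row groups without touching
    the values of the other columns."""
    for group in groups:
        yield from group[column]


# Hypothetical sample table.
rows = [
    {"id": 1, "country": "SE", "bytes": 120},
    {"id": 2, "country": "UK", "bytes": 300},
    {"id": 3, "country": "SE", "bytes": 50},
]
groups = to_row_groups(rows, ["id", "country", "bytes"], group_size=2)
total_bytes = sum(scan_column(groups, "bytes"))
```

A query that needs only one column reads only that column's values from each group, which is where the "fast query processing" and "efficient storage space utilization" goals come from in a columnar design.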

Questions:

  1. Is there any connection between RCFile and Parquet, the new columnar storage format? At first glance, the goals of the two are pretty similar.
  2. It looks like there’s already a new format that will supersede RCFile: ORC Files. Are these three approaches independent of each other? If yes, then what are the pros and cons of each of them?

Original title and link: RCFile - OCFile - Parquet: Storing Big Data With Hive (NoSQL database©myNoSQL)

via: http://www.bigdatarepublic.com/author.asp?section_id=2840&doc_id=262756


Hadoop, the Swap, and the OOM Killer

Stories from Spotify in the Hadoop trenches:

Who and why could be a killer? The answer probably could be only one. The kernel out-of-memory killer that under desperately low memory conditions, starts murdering processes according to their “badness” score. It looks that the OOM killer takes out a Hadoop process (in this case TaskTracker). You can read how “badness” score is calculated here, but in case of “tradional” Hadoop slave servers, TaskTracker usually becomes the prime candidate to be killed, because together with its child processes (JVM running map and reduce tasks, and potentially an external scripts invoking map and reduce functions, if Hadoop Streaming is used), it consumes a lot of memory.
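On Linux, the kernel exposes each process's current score in `/proc/<pid>/oom_score`, so you can check which processes on a slave node are the likeliest victims. Below is a minimal Python sketch that ranks processes by that score; the function name `oom_candidates` and its parameters are hypothetical, and it reports per-process scores only (it does not add up child processes).

```python
import os


def oom_candidates(proc_root="/proc", top=5):
    """Return up to `top` (pid, oom_score, command name) tuples,
    highest score first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited in the meantime, or file unreadable
        rows.append((int(entry), score, name))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]
```

Calling `oom_candidates()` on a Linux host lists the processes with the highest scores; the `proc_root` parameter exists only so the function can be pointed at a test directory.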

Original title and link: Hadoop, the Swap, and the OOM Killer (NoSQL database©myNoSQL)

via: http://hakunamapdata.com/two-memory-related-issues-on-the-apache-hadoop-cluster/


What is Apache Bigtop?

Project founder Roman Shaposhnik defines what Apache Bigtop is:

The elevator pitch for Bigtop has always been: Bigtop is to Hadoop what Debian is to Linux. The most surprising development to me was how well that message resonates with the commercial vendors in the Big Data space. I’m still amazed at how quickly the “Powered by Bigtop” list is growing.

Original title and link: What is Apache Bigtop? (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/


Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL

Nokia’s big data ecosystem consists of a centralized, petabyte-scale Hadoop cluster that is interconnected with a 100-TB Teradata enterprise data warehouse (EDW), numerous Oracle and MySQL data marts, and visualization technologies that allow Nokia’s 60,000+ users around the world tap into the massive data store. Multi-structured data is constantly being streamed into Hadoop from the relational systems, and hundreds of thousands of Scribe processes run every day to move data from, for example, servers in Singapore to a Hadoop cluster in the UK. Nokia is also a big user of Apache Sqoop and Apache HBase.

In the coming years you’ll increasingly hear stories (sales pitches, really) about single unified platforms solving all these problems at once. But the platforms that will survive and thrive are those that accomplish two things:

  1. keep the data gates open: in and out.
  2. work with other platforms to make this efficient for users.

Original title and link: Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/04/customer-spotlight-nokias-big-data-ecosystem-connects-cloudera-teradata-oracle-and-others/


Big Data Industry Atlas

Forbes published this chart based on Wikibon data:

It’s an $18 billion industry heading to $50 billion in five years, according to tech researchers at Wikibon. Make note of the names in the inner circle. They’re the pure plays with the newest science—and are likely to get gobbled up by the growth-hungry incumbents on the outside.

To save your eyes, in the inner circle:

  • LucidWorks
  • Datameer
  • Kognitio
  • Couchbase
  • Basho
  • DataStax
  • Hortonworks
  • Fractal Analytics
  • MapR
  • Paraccel (nb: Paraccel has already been acquired by Actian)
  • Guavus
  • Alteryx
  • 10gen
  • 1010data
  • Actian
  • Cloudera
  • Palantir
  • Mu Sigma
  • Opera Solutions
  • Splunk
  • Sisense
  • Rainstor
  • Calpont
  • Think Big Analytics
  • Aerospike
  • Digital Reasoning

The big data market is still taking shape. But soon (though not very soon), we’ll see some clear segments with leaders and challengers. And then we will see a lot of acquisitions and mergers.

Original title and link: Big Data Industry Atlas (NoSQL database©myNoSQL)

via: http://www.forbes.com/special-report/2013/industry-atlas.html


Apache Hive 0.11: Stinger Phase 1 Delivered

Owen O’Malley on Hortonworks’ blog:

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook , Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them.

This is indeed the power of open. But don’t forget that too much bragging might diminish it: keep repeating a word and its value will slowly vanish.

Original title and link: Apache Hive 0.11: Stinger Phase 1 Delivered (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/


6 Key Hardware Considerations for Deploying Hadoop in Your Environment

To deploy, configure, manage and scale Hadoop clusters in a way that optimizes performance and resource utilization there is a lot to consider.

The 6 aspects presented in the post: OS, MapReduce slots available across nodes, memory, storage, capacity, network. It would be a lot more useful to put these in some order based on the scenarios the Hadoop cluster will have to solve.

Original title and link: 6 Key Hardware Considerations for Deploying Hadoop in Your Environment (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/6-key-hardware-considerations-for-deploying-hadoop-in-your-environment/


Hadoop, Security, and DataStax Enterprise

But the eWeek article demonstrates that the same concerns [nb: about security] exist where Hadoop implementations are concerned. The article says: “It [Hadoop] was not written to support hardened security, compliance, encryption, policy enablement and risk management.”

The story goes like this: in the early days of NoSQL, when no NoSQL database had any sort of security features, people behind the projects answered: “it’s too early. we’re focusing on more important features. and you can still get around security by placing your database behind firewalls”. Today, when more and more NoSQL databases are adding security features, the story these same people are telling is quite different: “ohhh, security is critical. we don’t really see how you could run a database without these features”.

Security is always critical. And exactly the same can be said about maintaining a solid, coherent story of what you are telling your users.

Original title and link: Hadoop, Security, and DataStax Enterprise (NoSQL database©myNoSQL)

via: http://www.datastax.com/2013/04/hadoop-security-and-the-enterprise


Hadoop for graphs - GraphLab picks up $6.75m from Madrona and NEA

Robin Wauters for TNW:

Seattle startup GraphLab claims it is building the “fastest machine-learning analytics engine for graph datasets”, based on the popular open-source distributed graph computation framework with the same name, and it has just raised capital to come through on its promise.

Good luck to GraphLab’s team.

✚ Here’s a short list of MapReduce implementations for graphs.

Original title and link: Hadoop for graphs - GraphLab picks up $6.75m from Madrona and NEA (NoSQL database©myNoSQL)

via: http://thenextweb.com/insider/2013/05/14/graphlab-funding/


Hadoop, Moh's Law and Corollaries

Robert Novak proposes Moh’s law and Rob’s corollaries for Hadoop and Big Data:

  1. Hadoop is hard.
  2. Make sure you’re measuring what you think you’re measuring.
  3. Make sure you’re measuring what you need to be measuring.

For the first, I’m fairly confident that Cloudera, Hortonworks, and others will solve it over time. But for the other two, you alone are responsible. Not even a SaaS can save you.

Original title and link: Hadoop, Moh’s Law and Corollaries (NoSQL database©myNoSQL)

via: https://rsts11.wordpress.com/2013/05/14/mohs-law-and-big-data-rsts11/