Hadoop: All content tagged as Hadoop in NoSQL databases and polyglot persistence

Project Savanna: Hadoop and OpenStack

Timothy Prickett Morgan for The Register about Project Savanna, a collaboration between Mirantis, Hortonworks, and Red Hat:

Batman and Robin. Peanut butter and chocolate. OpenStack and Hadoop. These are things that go together, with the latter pairing being something that commercial OpenStack distie Mirantis, commercial Hadoop distie Hortonworks, and commercial KVM and Linux distie (and soon to be OpenStack commercializer) Red Hat are putting together under a new OpenStack effort dubbed Project Savanna.

Hadoop is at the age where everyone tries to package it and claim they’ll be the Red Hat of the Hadoop ecosystem. I cannot really dot the i’s and cross the t’s, but my gut feeling is that right now all these efforts are more similar to the attempts to bring Linux to the desktop.

We know how successful these have been so far.

Original title and link: Project Savanna: Hadoop and OpenStack (NoSQL database©myNoSQL)

via: http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/


Project Falcon: Tackling Hadoop Data Lifecycle Management

Venkatesh Seetharam announcing a new Apache incubating project in the Hadoop ecosystem open sourced by InMobi and Hortonworks:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

I think this diagram describes Project Falcon best:

Project Falcon at a Glance

✚ Was there any other project addressing this space?

Original title and link: Project Falcon: Tackling Hadoop Data Lifecycle Management (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/


Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!

Andy Feng wrote a post on the YDN blog about the data processing architecture Yahoo! uses to deliver personalized content: it analyzes billions of events for 700 million users and 2.2 billion content pieces every day using a combination of batch processing (Hadoop) and stream processing (Storm):

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.

✚ I don’t think I’ve seen the term micro-batch processing used before. Any ideas why it is used as an alternative to the well-established term stream processing?
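One possible reading of the term, sketched below in plain Python: instead of handling each event the moment it arrives, the stream is cut into small groups and each group is processed like a tiny batch job. This is only an illustration of the idea, not Storm's API; the function names, the sample events, and the choice to cut batches by count rather than by time window are all assumptions.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event iterator into lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_by_key(batch):
    """A tiny 'batch job': count how often each key occurs in one batch."""
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical stream of content categories clicked by users.
events = ["sports", "news", "sports", "finance", "news", "sports", "news"]

# Each micro-batch is processed independently, shortly after it fills up.
results = [count_by_key(batch) for batch in micro_batches(events, 3)]
```

With a batch size of 1 this degenerates into per-event stream processing, and with a very large batch size it looks like a classic batch job, which may be why the two terms are mentioned side by side.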

Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (NoSQL database©myNoSQL)

via: http://developer.yahoo.com/blogs/ydn/storm-hadoop-convergence-big-data-low-latency-processing-54503.html


Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility

Ofer Mendelevitch for Hortonworks blog:

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

Most often when speaking about Hadoop, people refer to costs (commodity servers), parallelism and scalability. I do not remember how many times I’ve written that the main difference between Hadoop and traditional data warehouses is in the agility it offers.

One Hadoop tagline could be: “Collect data today. Analyse it when and how you want.”
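A minimal sketch of what schema on read can look like, assuming raw events are stored as JSON lines exactly as they arrived. The sample data, the schema format (field name mapped to a cast function and a default), and the function name are all made up for illustration; this is not how Hive or any other Hadoop tool implements it.

```python
import json

# Raw events are kept exactly as they arrived; nothing is validated on write.
RAW_EVENTS = [
    '{"user": "alice", "action": "click", "ts": 1366243200}',
    '{"user": "bob", "action": "view", "ts": 1366243260, "page": "/home"}',
    '{"user": "carol", "ts": "1366243320"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field -> (cast, default)) while reading each record."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two different applications read the same raw data with different schemas.
CLICK_SCHEMA = {"user": (str, None), "action": (str, "unknown")}
TIMELINE_SCHEMA = {"user": (str, None), "ts": (int, 0)}

clicks = list(read_with_schema(RAW_EVENTS, CLICK_SCHEMA))
timeline = list(read_with_schema(RAW_EVENTS, TIMELINE_SCHEMA))
```

The third record has no `action` field and stores `ts` as a string. Neither problem blocked the write; each reader decides how to handle it when the data is actually used.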

Original title and link: Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/hadoop-data-agility/


Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler

Eric Baldeschwieler’s keynote from Hadoop Summit has been published on YouTube. It’s mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).

✚ Datanami has published a summary of the keynote.

Original title and link: Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler (NoSQL database©myNoSQL)


Hadoop Security Design Paper

Speaking of the buzz around Dataguise’s field-level encryption for Apache Hadoop and their 10 best practices for securing sensitive data in Hadoop, after the break you can find the “Hadoop Security Design” paper written by a team at Yahoo.


Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop

  1. Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This will prevent the possibility of damaging compliance exposure for the company and avoid unpredictability in the roll out schedule.

  2. Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.

  3. Discover whether sensitive data is embedded in the environment, assembled or will be assembled in Hadoop.

  4. Determine the compliance exposure risk based on the information collected.

  5. Determine whether business analytic needs require access to real data or if desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation while encryption provides the most flexibility, should future needs evolve.

  6. Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.

  7. Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimensions.

  8. Determine whether a tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit.

  9. Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.

  10. Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.

Wait… where’s point 11, buy Dataguise?
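Point 7 (consistent masking, "Joe becomes Dave in all files") is the one item on the list that is easy to illustrate. Below is a rough sketch assuming a keyed hash is used to pick the replacement, so the same input always maps to the same masked value no matter which file it appears in. This is not Dataguise's method; the key, the replacement list, and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a key management system.
SECRET_KEY = b"demo-key-not-for-production"

# Hypothetical pool of replacement values. A pool this small will map several
# real names to the same masked name; a real tool needs a much larger space.
REPLACEMENT_NAMES = ["Dave", "Erin", "Frank", "Grace", "Heidi"]


def mask_name(name):
    """Deterministically map a real name to a replacement name."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


# Two separate "files" that mention the same people.
orders = [("Joe", 120), ("Ann", 75), ("Joe", 30)]
logins = [("Joe", "2013-04-01"), ("Ann", "2013-04-02")]

masked_orders = [(mask_name(name), amount) for name, amount in orders]
masked_logins = [(mask_name(name), day) for name, day in logins]
```

Because the mapping depends only on the key and the input, joins and aggregations across the masked files still line up, which is the property the press release is after.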

Original title and link: Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop (NoSQL database©myNoSQL)

via: http://www.businesspress24.com/pressrelease1213023.html


Scaling Big Data Mining Infrastructure at Twitter

I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:

DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”

and then the reality check:

  1. Your boss says something vague
  2. You think very hard on how to move the needle
  3. Where’s the data?
  4. What’s in this dataset?
  5. What’s all the f#$#$ crap in the data?
  6. Clean the data
  7. Run some off-the-shelf data mining algorithm
  8. Productionize, act on the insight
  9. Rinse, repeat

Enjoy!


Hadoop + Terracotta BigMemory: Run, Elephant, Run!

While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.

Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.

✚ The top speed of an African bush elephant is reportedly 24.9 mph (40 km/h).

Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (NoSQL database©myNoSQL)

via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/


Field-Level Encryption for Apache Hadoop From Dataguise

Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.

DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.

$25,000 a piece (hopefully not a piece of encrypted data though).

Apache Accumulo, an open source BigTable-inspired implementation, is known to offer cell-based access control.

Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (NoSQL database©myNoSQL)

via: http://news.techworld.com/security/3437999/dataguise-introduces-field-level-encryption-for-apache-hadoop-database/


Happy Birthday Hadoop!

On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.

Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!

Original title and link: Happy Birthday Hadoop! (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/


Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop

Tajo:

  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudiment ETL that transforms one data format to another data format.
  • Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo

Just another example of the way of the future.

Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (NoSQL database©myNoSQL)

via: http://tajo.incubator.apache.org/