hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence
Friday, 5 April 2013
Hadoop Security Design Paper
Speaking about the buzz around Dataguise’s field-level encryption for Apache Hadoop and their 10 best practices for securing sensitive data in Hadoop, after the break1, you can find the “Hadoop Security Design” paper written by a team at Yahoo.
Thursday, 4 April 2013
Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop
Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This will prevent the possibility of damaging compliance exposure for the company and avoid unpredictability in the roll out schedule.
Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.
Discover whether sensitive data is embedded in the environment, assembled or will be assembled in Hadoop.
Determine the compliance exposure risk based on the information collected.
Determine whether business analytic needs require access to real data or if desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation while encryption provides the most flexibility, should future needs evolve.
Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.
Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimensions.
Determine whether a tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit. ?
Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.
Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.
Wait… where’s point 11, buy Dataguise?
Original title and link: Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop (©myNoSQL)
via: http://www.businesspress24.com/pressrelease1213023.html
Wednesday, 3 April 2013
Scaling Big Data Mining Infrastructure at Twitter
I’m almost always enjoying the lessons learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, have been used at HadoopSummit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- …
- Productionize, act on the insight
- Rinse, repeat
Enjoy!
Hadoop + Terracotta BigMemory: Run, Elephant, Run!
While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the > BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.
Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.
✚ The top speed of an african bush elephant is 24.9mph/40kmh. According to this.
Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (©myNoSQL)
via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/
Tuesday, 2 April 2013
Field-Level Encryption for Apache Hadoop From Dataguise
Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.
DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.
$25.000 a piece (hopefully not a piece of encrypted data though).
✚ Apache Accumulo is known to offer a BigTable inspired open source implementation with cell-based access control.
Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (©myNoSQL)
Monday, 1 April 2013
Happy Birthday Hadoop!
On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.
Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!
Original title and link: Happy Birthday Hadoop! (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/
Wednesday, 27 March 2013
Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop
- Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
- Rudiment ETL that transforms one data format to another data format.
- Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
- Command line interface to allow users to submit SQL queries
- Java API to enable clients to submit SQL queries to Tajo
Just another example of the way of the future.
Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (©myNoSQL)
Tuesday, 26 March 2013
Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop
Ben Werther announcing the general availability of the Platfora BI:
At Platfora, we made a bet that Hadoop’s destiny wasn’t simply to be a cheaper, slower cousin of the relational data warehouse. […] Hadoop is superb at two things — it provides a near-infinite data reservoir where data of all kinds can be landed without needing to figure out how it will be used ahead of time, and it is a slow lumbering freight-train of an engine for crunching and aggregating batches of millions or billions of rows.
They are neither the first, nor the last to understand and bet on Hadoop. But in some cases this bet originates only in the financial potential of the Hadoop market and less so on the technological potential.
Indeed it’s rarely the case that these two can leave alone. When they do, it leads to either a smaller market segment or to a shorter life time. Looking around at what’s happening in the Hadoop space, technologically and business wise, I assume many economists would recognize the signs of a long lived opportunity.
As a side note, I find it interesting that very few articles are looking at two other fundamental aspects of the Hadoop platform, which, in my opinion, were, are and will remain critical to the growth of this market: open source and extensibility. Without any of these two, what would we see would be tons of copy cats wasting resources in creating small indistinguishable clones, plus countless and endless negotiations to extend and integrate the platform. Hadoop is open source and the open source developers working on it have built it with extensibility in mind. The proof is out there and is clear: look at the breadth and depth of the tools around Hadoop.
That’s the power of open source. The way of the future.
Original title and link: Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop (©myNoSQL)
Monday, 25 March 2013
GIS Tools for Hadoop by Esri
Interesting project, GIS Tools for Hadoop:
GIS Tools for Hadoop is an open source toolkit intended for Big Spatial Data Analytics. The toolkit provides different libraries:
- Esri Geometry API for Java: A generic geometry library, can be used to extend Hadoop core with vector geometry types and operations, and enables developers to build MapReduce applications for spatial data.
- Spatial Framework for Hadoop: Extends Hive and is based on the Esri Geometry API, to enable Hive Query Language users to leverage a set of analytical functions and geometry types. In addition to some utilities for JSON used in ArcGIS.
- Geoprocessing Tools for Hadoop: Contains a set of ready to use ArcGIS Geoprocessing tools, based on the Esri Geometry API and Spatial Framework for Hadoop. Developers can download the source code of the tools and customize it; they can also create new tools and contribute it to the open source project. Through these tools ArcGIS users can move their spatial data and execute a pre-defined workflow inside Hadoop.
I recently learned about GeoJSON — JSON Geometry and Feature Description, but the two don’t seem to be related.
Original title and link: GIS Tools for Hadoop by Esri (©myNoSQL)
Wednesday, 20 March 2013
White Elephant: Task Statistics for Hadoop
From LinkedIn’s engineering team:
While tools like Ganglia provide system-level metrics, we wanted to be able to understand what resources were being used by each user and at what times. White Elephant parses Hadoop logs to provide visual drill downs and rollups of task statistics for your Hadoop cluster, including total task time, slots used, CPU time, and failed job counts.
Isn’t this a form of resource usage auditing? Based on this, next you could build support for resource quotas and then start enforcing them.
Original title and link: White Elephant: Task Statistics for Hadoop (©myNoSQL)
via: http://engineering.linkedin.com/hadoop/white-elephant-hadoop-tool-you-never-knew-you-needed
Monday, 18 March 2013
How Does MapR Compare to Cloudera?
Staying in the MapR land, the question of comparing MapR to Cloudera is answered by people from all sides (MapR, Cloudera and Hortonworks). My summary: “cool proprietary technology addressing some of the current limitations of the Hadoop, but also missing some of the features the Hadoop community has come up with”.
Original title and link: How Does MapR Compare to Cloudera? (©myNoSQL)
via: http://www.quora.com/How-does-MapR-plan-to-compete-with-Cloudera
Friday, 15 March 2013
Paper: YSmart - Yet Another SQL-to-MapReduce Translator
Another weekend read, this time from Facebook and The Ohio State University and closer to the hot topic of the last two weeks: SQL, MapReduce, Hadoop:
MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptable low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to- MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
Most Popular Articles
- Translate SQL to MongoDB MapReduce
- Tutorial: Getting Started With Cassandra
- CouchDB vs MongoDB: An attempt for a More Informed Comparison
- Cassandra @ Twitter: An Interview with Ryan King
- A Couple of Nice GUI Tools for MongoDB
- NoSQL benchmarks and performance evaluations
- Ehcache: Distributed Cache or NoSQL Store?
- Document Databases Compared: CouchDB, MongoDB, RavenDB
- Quick Review of Existing Graph Databases
- NoSQL Data Modeling