mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence
Wednesday, 24 April 2013
Testing MapReduce With MRUnit
Mansoor Ashraf about MRUnit:
Testing and debugging multi threaded programs is hard. Now take the same programs and massively distribute them across multiple JVMs deployed on a cluster of machines and the complexity goes off the roof. One way to overcome this complexity is to do testing in isolation and catch as many bugs as possible locally. MRUnit is a testing framework that lets you test and debug Map Reduce jobs in isolation without spinning up a Hadoop cluster. In this blog post we will cover various features of MRUnit by walking through a simple MapReduce job.
The code samples look quite legible and there doesn’t seem to be a lot of boilerplate code involved. That’s a great thing for a testing framework.
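The idea behind this kind of testing can be sketched without Hadoop at all. The snippet below is not MRUnit (which is a Java library); it is a plain Python illustration, with hypothetical function names, of checking a word-count mapper and reducer in isolation and then together through a tiny in-process stand-in for the shuffle phase.

```python
from collections import defaultdict


def word_count_map(line):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def run_locally(lines):
    """In-process stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in word_count_map(line):
            grouped[key].append(value)
    return dict(word_count_reduce(key, values) for key, values in grouped.items())


# Each stage is checked on its own, with no cluster involved.
assert list(word_count_map("Hadoop hadoop MRUnit")) == [
    ("hadoop", 1),
    ("hadoop", 1),
    ("mrunit", 1),
]
assert word_count_reduce("hadoop", [1, 1]) == ("hadoop", 2)

# Then the whole pipeline is checked end to end.
assert run_locally(["a b", "b c"]) == {"a": 1, "b": 2, "c": 1}
```

Keeping the mapper and reducer as plain functions is what makes this cheap: a failing assertion points at one stage rather than at a whole job run.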
Original title and link: Testing MapReduce With MRUnit (©myNoSQL)
via: http://m-mansur-ashraf.blogspot.com/2013/02/testing-mapreduce-with-mrunit.html
Monday, 22 April 2013
Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility
Ofer Mendelevitch for Hortonworks blog:
Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.
Most often when speaking about Hadoop, people refer to costs (commodity servers), parallelism and scalability. I do not remember how many times I’ve written that the main difference between Hadoop and traditional data warehouses is in the agility it offers.
One Hadoop tagline could be: “Collect data today. Analyse it when and how you want.”
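A minimal sketch of what “schema on read” means in practice, assuming raw events stored as JSON lines. The records, field names and the `read_with_schema` helper are all made up for illustration. Nothing is validated when the data is written, and each consumer applies its own schema when it reads.

```python
import json

# Write side: raw events are stored as-is; no schema is enforced.
raw_events = [
    '{"user": "ann", "amount": "12.50", "ts": "2013-04-22"}',
    '{"user": "bob", "amount": "3", "country": "RO"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast them, use None when missing."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two applications read the same raw data through different schemas.
billing_view = list(read_with_schema(raw_events, {"user": str, "amount": float}))
geo_view = list(read_with_schema(raw_events, {"user": str, "country": str}))

print(billing_view)
print(geo_view)
```

The second record never declared a timestamp and the first never declared a country, yet both were accepted at write time. Each reader decides later which fields matter.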
Original title and link: Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility (©myNoSQL)
Saturday, 6 April 2013
Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler
Eric Baldeschwieler’s keynote from HadoopSummit has been published on YouTube. It’s mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).
✚ Datanami has published a summary of the keynote.
Original title and link: Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler (©myNoSQL)
Friday, 5 April 2013
Hadoop Security Design Paper
Speaking about the buzz around Dataguise’s field-level encryption for Apache Hadoop and their 10 best practices for securing sensitive data in Hadoop, after the break, you can find the “Hadoop Security Design” paper written by a team at Yahoo.
Thursday, 4 April 2013
Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop
1. Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This will prevent the possibility of damaging compliance exposure for the company and avoid unpredictability in the roll out schedule.
2. Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.
3. Discover whether sensitive data is embedded in the environment, assembled or will be assembled in Hadoop.
4. Determine the compliance exposure risk based on the information collected.
5. Determine whether business analytic needs require access to real data or if desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation while encryption provides the most flexibility, should future needs evolve.
6. Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.
7. Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimension.
8. Determine whether a tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit.
9. Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.
10. Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.
Wait… where’s point 11, buy Dataguise?
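The consistent-masking practice in the list above (Joe becomes Dave in all files) can be illustrated with a keyed hash. This is only a sketch of one possible approach, not Dataguise’s method. The key, the `mask_name` helper and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; a real deployment would load it from a secrets store.
SECRET_KEY = b"demo-key-not-for-production"


def mask_name(name, key=SECRET_KEY):
    """Map a name to a stable pseudonym: the same input always gives the same token."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


# Two separate "files" that both mention Joe.
orders = [("Joe", 20), ("Ann", 5)]
tickets = [("Joe", "late delivery")]

masked_orders = [(mask_name(name), amount) for name, amount in orders]
masked_tickets = [(mask_name(name), text) for name, text in tickets]

# Joe gets the same pseudonym in both files, so joins and aggregations still line up.
assert masked_orders[0][0] == masked_tickets[0][0]
# Different people stay distinguishable after masking.
assert masked_orders[0][0] != masked_orders[1][0]
```

Because the mapping depends only on the key and the input, files masked at different times still agree with each other, as long as the same key is used.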
Original title and link: Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop (©myNoSQL)
via: http://www.businesspress24.com/pressrelease1213023.html
Wednesday, 3 April 2013
Scaling Big Data Mining Infrastructure at Twitter
I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at HadoopSummit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- …
- Productionize, act on the insight
- Rinse, repeat
Enjoy!
Hadoop + Terracotta BigMemory: Run, Elephant, Run!
While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.
Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.
✚ The top speed of an African bush elephant is reportedly 24.9 mph (40 km/h).
Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (©myNoSQL)
via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/
Tuesday, 2 April 2013
Field-Level Encryption for Apache Hadoop From Dataguise
Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.
DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.
$25,000 a piece (hopefully not a piece of encrypted data though).
✚ Apache Accumulo is known to offer a BigTable-inspired open source implementation with cell-based access control.
Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (©myNoSQL)
Monday, 1 April 2013
Happy Birthday Hadoop!
On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.
Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!
Original title and link: Happy Birthday Hadoop! (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/
Wednesday, 27 March 2013
Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop
- Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
- Rudiment ETL that transforms one data format to another data format.
- Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
- Command line interface to allow users to submit SQL queries
- Java API to enable clients to submit SQL queries to Tajo
Just another example of the way of the future.
Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (©myNoSQL)
Tuesday, 26 March 2013
Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop
Ben Werther announcing the general availability of the Platfora BI:
At Platfora, we made a bet that Hadoop’s destiny wasn’t simply to be a cheaper, slower cousin of the relational data warehouse. […] Hadoop is superb at two things — it provides a near-infinite data reservoir where data of all kinds can be landed without needing to figure out how it will be used ahead of time, and it is a slow lumbering freight-train of an engine for crunching and aggregating batches of millions or billions of rows.
They are neither the first, nor the last to understand and bet on Hadoop. But in some cases this bet originates only in the financial potential of the Hadoop market and less so on the technological potential.
Indeed it’s rarely the case that these two can live apart. When they do, it leads to either a smaller market segment or a shorter lifetime. Looking around at what’s happening in the Hadoop space, technologically and business-wise, I assume many economists would recognize the signs of a long-lived opportunity.
As a side note, I find it interesting that very few articles look at two other fundamental aspects of the Hadoop platform, which, in my opinion, were, are and will remain critical to the growth of this market: open source and extensibility. Without either of these two, what we would see would be tons of copycats wasting resources on creating small, indistinguishable clones, plus countless and endless negotiations to extend and integrate the platform. Hadoop is open source and the open source developers working on it have built it with extensibility in mind. The proof is out there and is clear: look at the breadth and depth of the tools around Hadoop.
That’s the power of open source. The way of the future.
Original title and link: Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop (©myNoSQL)
Monday, 25 March 2013
GIS Tools for Hadoop by Esri
Interesting project, GIS Tools for Hadoop:
GIS Tools for Hadoop is an open source toolkit intended for Big Spatial Data Analytics. The toolkit provides different libraries:
- Esri Geometry API for Java: A generic geometry library, can be used to extend Hadoop core with vector geometry types and operations, and enables developers to build MapReduce applications for spatial data.
- Spatial Framework for Hadoop: Extends Hive and is based on the Esri Geometry API, to enable Hive Query Language users to leverage a set of analytical functions and geometry types. In addition to some utilities for JSON used in ArcGIS.
- Geoprocessing Tools for Hadoop: Contains a set of ready to use ArcGIS Geoprocessing tools, based on the Esri Geometry API and Spatial Framework for Hadoop. Developers can download the source code of the tools and customize it; they can also create new tools and contribute it to the open source project. Through these tools ArcGIS users can move their spatial data and execute a pre-defined workflow inside Hadoop.
I recently learned about GeoJSON — JSON Geometry and Feature Description, but the two don’t seem to be related.
Original title and link: GIS Tools for Hadoop by Esri (©myNoSQL)