bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence
Thursday, 4 April 2013
Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop
Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This will prevent the possibility of damaging compliance exposure for the company and avoid unpredictability in the roll out schedule.
Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.
Discover whether sensitive data is embedded in the environment, assembled or will be assembled in Hadoop.
Determine the compliance exposure risk based on the information collected.
Determine whether business analytic needs require access to real data or if desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation while encryption provides the most flexibility, should future needs evolve.
Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.
Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimension.
Determine whether a tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit.
Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.
Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.
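The "Joe becomes Dave in all files" requirement is easier to picture with code. Below is a minimal sketch of consistent masking, assuming a keyed hash (HMAC-SHA256) picks the replacement. The names `mask_name`, `mask_records`, and `REPLACEMENT_NAMES` are hypothetical, and nothing here describes how Dataguise actually implements it.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names. A pool this small will map several
# real names to the same replacement; a real deployment would need a far larger one.
REPLACEMENT_NAMES = ["Dave", "Alice", "Maria", "Chen", "Omar", "Priya", "Lars", "Nora"]


def mask_name(name: str, secret_key: bytes) -> str:
    """Map a name to a replacement; the same name and key always give the same result."""
    normalized = name.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


def mask_records(records, secret_key, field="name"):
    """Return copies of the records with the given field masked."""
    return [{**record, field: mask_name(record[field], secret_key)} for record in records]


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # assumption: one key shared by all masking jobs
    orders = [{"name": "Joe", "amount": 40}, {"name": "Ann", "amount": 15}]
    visits = [{"name": "Joe", "page": "/pricing"}]

    masked_orders = mask_records(orders, key)
    masked_visits = mask_records(visits, key)

    # Joe gets the same replacement in both "files", so joins on name still line up.
    print(masked_orders[0]["name"] == masked_visits[0]["name"])  # prints True
```

Because the mapping depends only on the input and the key, independent jobs masking different files agree without sharing a lookup table.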
Wait… where’s point 11, buy Dataguise?
Original title and link: Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop (©myNoSQL)
via: http://www.businesspress24.com/pressrelease1213023.html
Wednesday, 3 April 2013
Scaling Big Data Mining Infrastructure at Twitter
I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were used at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- …
- Productionize, act on the insight
- Rinse, repeat
Enjoy!
Big Data Is…
I’ve seen this tweet from Tim O’Reilly quoting George Dyson on Keen’s post:
Big data is what happened when the cost of keeping information became less than the cost of throwing it away.
Smart. So smart. And true.
Original title and link: Big Data Is… (©myNoSQL)
Hadoop + Terracotta BigMemory: Run, Elephant, Run!
While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.
Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.
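To make the caching analogy concrete, here is a minimal read-through cache sketch in Python. It is a generic illustration, not the BigMemory-Hadoop Connector API: `ReadThroughCache` and `slow_lookup` are hypothetical names, and the sleep merely stands in for a slow backing store.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory and fall back to a slow loader on a miss."""

    def __init__(self, load_fn, ttl_seconds=60.0, clock=time.monotonic):
        self._load_fn = load_fn      # called only when the key is missing or expired
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without waiting
        self._entries = {}           # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._load_fn(key)
        self._entries[key] = (value, now + self._ttl)
        return value


def slow_lookup(key):
    # Stand-in for a slow backing store (a database query or a batch job result).
    time.sleep(0.05)
    return f"result-for-{key}"


if __name__ == "__main__":
    cache = ReadThroughCache(slow_lookup, ttl_seconds=30.0)
    cache.get("daily-report")   # miss: pays the slow lookup
    cache.get("daily-report")   # hit: served from memory
    print(cache.hits, cache.misses)  # prints: 1 1
```

The trade-off is the usual one: reads get faster, but the cache may serve results up to `ttl_seconds` old.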
✚ The top speed of an African bush elephant is 24.9 mph (40 km/h). According to this.
Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (©myNoSQL)
via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/
A Data Scientist's Real Job: Storytelling
Jeff Bladt and Bob Filbin for HBR:
Data gives you the what, but humans know the why.
I thought the process was a bit different: humans hypothesize the why, and data knows how true that is. Am I wrong?
Original title and link: A Data Scientist’s Real Job: Storytelling (©myNoSQL)
via: http://blogs.hbr.org/cs/2013/03/a_data_scientists_real_job_sto.html
Tuesday, 2 April 2013
Field-Level Encryption for Apache Hadoop From Dataguise
Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.
DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.
$25,000 a piece (hopefully not a piece of encrypted data though).
✚ Apache Accumulo is known to offer a BigTable-inspired open source implementation with cell-based access control.
Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (©myNoSQL)
The Data Scientist Concept Will Die
Kathryn Kelly for SmartDataCollective:
This is the one that really got people. Companies need solutions that enable them to use and customize their data easily, because it is the whole team, not just the individual analyst, that knows the business best. By offering business users intuitive data solutions, we bypass the need for the data scientist, who works in isolation. In fact, most data scientists are associated with the old school of business intelligence, where systems were so complicated that they needed someone with a data science background to run and get value from them. The new generation of solutions, on the other hand, is making it easy for business users to engage big data. An interdisciplinary team will see and use the visuals provided, and collaborate on the best decisions on a regular basis.
It’s better not to make predictions when you miss the point.
Original title and link: The Data Scientist Concept Will Die (©myNoSQL)
Monday, 1 April 2013
Happy Birthday Hadoop!
On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.
Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!
Original title and link: Happy Birthday Hadoop! (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/
Wednesday, 27 March 2013
Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop
- Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
- Rudimentary ETL that transforms one data format to another data format.
- Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
- Command line interface to allow users to submit SQL queries
- Java API to enable clients to submit SQL queries to Tajo
Just another example of the way of the future.
Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (©myNoSQL)
Tuesday, 26 March 2013
Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop
Ben Werther announcing the general availability of the Platfora BI:
At Platfora, we made a bet that Hadoop’s destiny wasn’t simply to be a cheaper, slower cousin of the relational data warehouse. […] Hadoop is superb at two things — it provides a near-infinite data reservoir where data of all kinds can be landed without needing to figure out how it will be used ahead of time, and it is a slow lumbering freight-train of an engine for crunching and aggregating batches of millions or billions of rows.
They are neither the first, nor the last to understand and bet on Hadoop. But in some cases this bet originates only in the financial potential of the Hadoop market and less so on the technological potential.
Indeed it’s rarely the case that these two can live alone. When they do, it leads to either a smaller market segment or a shorter lifetime. Looking around at what’s happening in the Hadoop space, technologically and business-wise, I assume many economists would recognize the signs of a long-lived opportunity.
As a side note, I find it interesting that very few articles look at two other fundamental aspects of the Hadoop platform, which, in my opinion, were, are and will remain critical to the growth of this market: open source and extensibility. Without either of these two, what we would see would be tons of copycats wasting resources on creating small indistinguishable clones, plus countless and endless negotiations to extend and integrate the platform. Hadoop is open source and the open source developers working on it have built it with extensibility in mind. The proof is out there and is clear: look at the breadth and depth of the tools around Hadoop.
That’s the power of open source. The way of the future.
Original title and link: Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop (©myNoSQL)
Will Hadoop Become Dominant Platform? Short Answer Is Yes
EMC’s David Menninger and IBM’s James Kobielus argue for and against the future of Hadoop. As you’d expect, the company behind each of them has a lot to say about the topic. But if you get rid of the chaff, you’ll notice that the two pretty much agree: SQL married with Hadoop is the way of the future. One thing to keep in mind though: SQL and Hadoop does not mean getting rid of the other approaches to accessing Hadoop-stored data (like Pig or pure MapReduce).
Original title and link: Will Hadoop Become Dominant Platform? Short Answer Is Yes (©myNoSQL)
Monday, 25 March 2013
Big Data and a Renewed Debate Over Privacy
The NY Times reports on a paper, “Unlocking the Value of Personal Data: From Collection to Usage”, suggesting stricter control over the usage of data:
The forum report suggests a future in which all collected data would be tagged with software code that included an individual’s preferences for how his or her data is used. All uses of data would have to be registered, and there would be penalties for violators.
I already like it. A lot.
You can download the paper directly from here.
Original title and link: Big Data and a Renewed Debate Over Privacy (©myNoSQL)