BigData: All content tagged as BigData in NoSQL databases and polyglot persistence
Wednesday, 10 April 2013
IBM Accelerates Its Big Data Portfolio
Jeff Kelly takes a look at IBM’s data solutions portfolio:
IBM has the broadest and deepest Big Data product and services portfolio in the industry, as well as the market leading revenue to show for it. But IBM’s greatest asset also lies at the heart of its biggest challenge. With such a diverse set of Big Data capabilities, IBM has struggled to unify them into distinct, compelling offerings. How IBM responds to the challenge of bringing together such a broad and deep set of technologies and services - many the result of $16 billion worth of analytics-related acquisitions since 2005 - into consumable and effective product offerings will largely determine the company’s success (or failure) in the Big Data space and will have major implications for enterprise CIOs.
There are two things that I’m not sure I understand:
- Is having a confusing product portfolio a known strategy for driving sales? Basically you offer so many products that customers get so confused they have to hire your consultants to make the buying decision.
- When ranking companies by sales, wouldn’t it make more sense to compare revenue per employee rather than raw numbers? Which company is better: one with 2 sales people generating $1mil in revenue, or one with 100 sales people and 100 consultants generating $20mil?
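To put rough numbers on the revenue/employee question, here is a back-of-the-envelope sketch. The function name is mine, and it assumes headcount is just the sales people plus consultants mentioned above, which is a simplification.

```python
def revenue_per_employee(revenue_usd: float, headcount: int) -> float:
    """Return revenue divided by headcount (a crude productivity measure)."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return revenue_usd / headcount


# Company A: 2 sales people, $1mil in revenue.
small = revenue_per_employee(1_000_000, 2)

# Company B: 100 sales people + 100 consultants, $20mil in revenue.
large = revenue_per_employee(20_000_000, 100 + 100)

print(f"Company A: ${small:,.0f} per employee")  # Company A: $500,000 per employee
print(f"Company B: ${large:,.0f} per employee")  # Company B: $100,000 per employee
```

By raw revenue the second company is 20 times bigger, but per employee the first one brings in 5 times more.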
Original title and link: IBM Accelerates Its Big Data Portfolio (©myNoSQL)
via: http://wikibon.org/wiki/v/IBM_Accelerates_Its_Big_Data_Portfolio
Saturday, 6 April 2013
Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler
Eric Baldeschwieler’s keynote from HadoopSummit has been published on YouTube. It’s mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).
✚ Datanami has published a summary of the keynote here
Original title and link: Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler (©myNoSQL)
Friday, 5 April 2013
Best NoSQL April’s Fool
I know a few people who avoid the Internet completely on April Fools’ Day. After being tricked every year by my dad, I’m very careful about what I post on that day. This year has been easy on me, but that doesn’t mean there weren’t a couple of good ones.
My favorites:
- The Title of the Year: Chief Hadoop Officer
- [#HADOOP-9448] Submitted Patch for Hadoop: Reimplement Things
- The Real-Time Cure: Slow and Steady
Original title and link: Best NoSQL April’s Fool (©myNoSQL)
Hadoop Security Design Paper
Speaking of the buzz around Dataguise’s field-level encryption for Apache Hadoop and their 10 best practices for securing sensitive data in Hadoop: after the break, you can find the “Hadoop Security Design” paper written by a team at Yahoo.
Thursday, 4 April 2013
Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop
Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This will prevent the possibility of damaging compliance exposure for the company and avoid unpredictability in the roll out schedule.
Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.
Discover whether sensitive data is embedded in the environment, assembled or will be assembled in Hadoop.
Determine the compliance exposure risk based on the information collected.
Determine whether business analytic needs require access to real data or if desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation while encryption provides the most flexibility, should future needs evolve.
Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.
Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimension.
Determine whether a tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit.
Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.
Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.
Wait… where’s point 11, buy Dataguise?
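The “Joe becomes Dave in all files” point is the one that’s easiest to show in code. Below is a minimal sketch of consistent masking using a keyed hash (HMAC-SHA256) from the Python standard library. This is my own illustration, not how Dataguise implements it: the key, the field name, the `name_` token prefix, and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would come from a key management system.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically map a sensitive value to an opaque token.

    The same input and key always give the same token, so the masked
    value stays consistent across every file it appears in.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "name_" + digest[:16]


def mask_records(records, field="name"):
    """Return copies of the records with the given field masked."""
    return [{**record, field: mask_value(record[field])} for record in records]


# Two hypothetical "files" that both mention Joe.
orders = [{"name": "Joe", "total": 40}, {"name": "Ann", "total": 15}]
visits = [{"name": "Joe", "page": "/home"}]

masked_orders = mask_records(orders)
masked_visits = mask_records(visits)

# Joe gets the same token in both data sets, so joins and
# aggregations on the masked field still line up.
print(masked_orders[0]["name"] == masked_visits[0]["name"])  # True
```

Because the token depends only on the key and the input, any file masked with the same key stays joinable with the others. It is one-way masking though, not encryption: there is no way to get “Joe” back from the token.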
Original title and link: Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop (©myNoSQL)
via: http://www.businesspress24.com/pressrelease1213023.html
Wednesday, 3 April 2013
Scaling Big Data Mining Infrastructure at Twitter
I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were used at HadoopSummit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- …
- Productionize, act on the insight
- Rinse, repeat
Enjoy!
Big Data Is…
I’ve seen this tweet from Tim O’Reilly quoting George Dyson on Keen’s post:
Big data is what happened when the cost of keeping information became less than the cost of throwing it away.
Smart. So smart. And true.
Original title and link: Big Data Is… (©myNoSQL)
Hadoop + Terracotta BigMemory: Run, Elephant, Run!
While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.
Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.
✚ The top speed of an African bush elephant is 24.9 mph (40 km/h), according to this.
Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (©myNoSQL)
via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/
A Data Scientist's Real Job: Storytelling
Jeff Bladt and Bob Filbin for HBR:
Data gives you the what, but humans know the why.
I thought the process was a bit different: humans hypothesize the why, and data tells how true that is. Am I wrong?
Original title and link: A Data Scientist’s Real Job: Storytelling (©myNoSQL)
via: http://blogs.hbr.org/cs/2013/03/a_data_scientists_real_job_sto.html
Tuesday, 2 April 2013
Field-Level Encryption for Apache Hadoop From Dataguise
Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.
DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.
$25,000 a piece (hopefully not a piece of encrypted data though).
✚ Apache Accumulo is known to offer a BigTable-inspired open source implementation with cell-based access control.
Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (©myNoSQL)
The Data Scientist Concept Will Die
Kathryn Kelly for SmartDataCollective:
This is the one that really got people. Companies need solutions that enable them to use and customize their data easily, because it is the whole team, not just the individual analyst, that knows the business best. By offering business users intuitive data solutions, we bypass the need for the data scientist, who works in isolation. In fact, most data scientists are associated with the old school of business intelligence, where systems were so complicated that they needed someone with a data science background to run and get value from them. The new generation of solutions, on the other hand, is making it easy for business users to engage big data. An interdisciplinary team will see and use the visuals provided, and collaborate on the best decisions on a regular basis.
It’s better not to make predictions when you miss the point.
Original title and link: The Data Scientist Concept Will Die (©myNoSQL)
Monday, 1 April 2013
Happy Birthday Hadoop!
On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.
Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!
Original title and link: Happy Birthday Hadoop! (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/