BigData: All content tagged as BigData in NoSQL databases and polyglot persistence

IBM Accelerates Its Big Data Portfolio

Jeff Kelly takes a look at IBM’s data solutions portfolio:

IBM has the broadest and deepest Big Data product and services portfolio in the industry, as well as the market leading revenue to show for it. But IBM’s greatest asset also lies at the heart of its biggest challenge. With such a diverse set of Big Data capabilities, IBM has struggled to unify them into distinct, compelling offerings. How IBM responds to the challenge of bringing together such a broad and deep set of technologies and services - many the result of $16 billion worth of analytics-related acquisitions since 2005 - into consumable and effective product offerings will largely determine the company’s success (or failure) in the Big Data space and will have major implications for enterprise CIOs.

There are two things that I’m not sure I understand:

  1. Is it a known strategy that a confusing portfolio of products leads to more sales?

    Basically, you offer so many products that customers get confused enough to hire your consultants just to make the buying decision.

  2. When ranking companies by sales, wouldn’t it make more sense to compare revenue per employee rather than raw numbers?

    Which company is better? A company with 2 salespeople generating $1 million in revenue, or a company with 100 salespeople and 100 consultants generating $20 million?
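To make the comparison concrete, here is a minimal sketch of the revenue-per-employee arithmetic for the two hypothetical companies above. The function name is made up for illustration, and the figures are the ones from the question, not real company data.

```python
def revenue_per_employee(revenue_usd: float, headcount: int) -> float:
    """Return revenue divided by the number of people who produced it."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return revenue_usd / headcount


# Company A: 2 salespeople, $1 million in revenue.
small = revenue_per_employee(1_000_000, 2)

# Company B: 100 salespeople plus 100 consultants, $20 million in revenue.
large = revenue_per_employee(20_000_000, 100 + 100)

print(f"Company A: ${small:,.0f} per person")
print(f"Company B: ${large:,.0f} per person")
```

By this measure the smaller company brings in five times more revenue per person, even though its raw revenue is twenty times smaller.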

Original title and link: IBM Accelerates Its Big Data Portfolio (NoSQL database©myNoSQL)

via: http://wikibon.org/wiki/v/IBM_Accelerates_Its_Big_Data_Portfolio


Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler

Eric Baldeschwieler’s keynote from HadoopSummit has been published on YouTube. It’s mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).

✚ Datanami has published a summary of the keynote here

Original title and link: Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler (NoSQL database©myNoSQL)


Best NoSQL April’s Fool

I know a few people who avoid the Internet completely on April Fools’ Day. After being tricked every year by my dad, I’m very careful about what I post on that day. This year has been easy on me, but that doesn’t mean there weren’t a couple of good ones.

My favorites:

Original title and link: Best NoSQL April’s Fool (NoSQL database©myNoSQL)


Hadoop Security Design Paper

Speaking about the buzz around Dataguise’s field-level encryption for Apache Hadoop and their 10 best practices for securing sensitive data in Hadoop, after the break, you can find the “Hadoop Security Design” paper written by a team at Yahoo.


Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop

  1. Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This will prevent the possibility of damaging compliance exposure for the company and avoid unpredictability in the roll out schedule.

  2. Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.

  3. Discover whether sensitive data is embedded in the environment, assembled or will be assembled in Hadoop.

  4. Determine the compliance exposure risk based on the information collected.

  5. Determine whether business analytic needs require access to real data or if desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation while encryption provides the most flexibility, should future needs evolve.

  6. Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.

  7. Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimension.

  8. Determine whether tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit.

  9. Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.

  10. Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.

Wait… where’s point 11, buy Dataguise?
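Practice 7 in the list above (consistent masking, where Joe becomes Dave in every file) is the one that is easiest to show in code. What follows is only a minimal sketch of the general idea, assuming a keyed hash (HMAC) picks the replacement; it is not Dataguise’s implementation, and the `REPLACEMENTS` pool, the `mask_name` helper, the key, and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names; a real pool would be far larger
# so that two different inputs rarely map to the same replacement.
REPLACEMENTS = ["Dave", "Erin", "Frank", "Grace", "Heidi", "Ivan"]


def mask_name(name: str, secret_key: bytes) -> str:
    """Map a name to a replacement deterministically, keyed by secret_key."""
    digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENTS)
    return REPLACEMENTS[index]


key = b"example-key-not-for-production"

# Two separate "files" that mention the same person.
orders = [("Joe", 120), ("Ann", 75)]
tickets = [("Joe", "refund"), ("Joe", "login issue")]

masked_orders = [(mask_name(name, key), amount) for name, amount in orders]
masked_tickets = [(mask_name(name, key), topic) for name, topic in tickets]

print(masked_orders)
print(masked_tickets)
```

Because the mapping depends only on the input and the key, the same person gets the same masked name in both data sets, so joins and aggregations on the masked column still line up. With a pool this small, distinct names can collide; that is a limitation of the sketch.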

Original title and link: Dataguise Presents 10 Best Practices for Securing Sensitive Data in Hadoop (NoSQL database©myNoSQL)

via: http://www.businesspress24.com/pressrelease1213023.html


Scaling Big Data Mining Infrastructure at Twitter

I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at HadoopSummit. Besides the technical and practical details, there are two things that I really like:

DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”

and then the reality check:

  1. Your boss says something vague
  2. You think very hard on how to move the needle
  3. Where’s the data?
  4. What’s in this dataset?
  5. What’s all the f#$#$ crap in the data?
  6. Clean the data
  7. Run some off-the-shelf data mining algorithm
  8. Productionize, act on the insight
  9. Rinse, repeat

Enjoy!


Big Data Is…

I’ve seen this tweet from Tim O’Reilly quoting George Dyson on Keen’s post:

Big data is what happened when the cost of keeping information became less than the cost of throwing it away.

Smart. So smart. And true.

Original title and link: Big Data Is… (NoSQL database©myNoSQL)


Hadoop + Terracotta BigMemory: Run, Elephant, Run!

While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.

Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.
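The caching-layer analogy can be sketched in a few lines. This is a generic read-through cache in front of a slow lookup, assuming a plain in-process dictionary as the store; it says nothing about how BigMemory or the connector actually work, and `ReadThroughCache` and `slow_lookup` are hypothetical names.

```python
import time
from typing import Callable, Dict


class ReadThroughCache:
    """Serve repeated reads from memory; hit the slow backend only on a miss."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._store: Dict[str, str] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        if key not in self._store:
            # First request for this key: pay the cost of the slow backend once.
            self.misses += 1
            self._store[key] = self._load(key)
        return self._store[key]


def slow_lookup(key: str) -> str:
    """Stand-in for a slow database query or a batch job result on disk."""
    time.sleep(0.01)
    return key.upper()


cache = ReadThroughCache(slow_lookup)
first = cache.get("report")   # goes to the slow backend
second = cache.get("report")  # served from memory
print(first, second, cache.misses)
```

The sketch leaves out eviction and invalidation, which is where caching layers usually get complicated.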

✚ The top speed of an African bush elephant is 24.9 mph (40 km/h). According to this.

Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (NoSQL database©myNoSQL)

via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/


A Data Scientist's Real Job: Storytelling

Jeff Bladt and Bob Filbin for HBR:

Data gives you the what, but humans know the why.

I thought the process was a bit different: humans hypothesize the why, and data tells you how true that is. Am I wrong?

Original title and link: A Data Scientist’s Real Job: Storytelling (NoSQL database©myNoSQL)

via: http://blogs.hbr.org/cs/2013/03/a_data_scientists_real_job_sto.html


Field-Level Encryption for Apache Hadoop From Dataguise

Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.

DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.

$25,000 apiece (hopefully not per piece of encrypted data, though).

Apache Accumulo is known to offer a BigTable inspired open source implementation with cell-based access control.

Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (NoSQL database©myNoSQL)

via: http://news.techworld.com/security/3437999/dataguise-introduces-field-level-encryption-for-apache-hadoop-database/


The Data Scientist Concept Will Die

Kathryn Kelly for SmartDataCollective:

This is the one that really got people. Companies need solutions that enable them to use and customize their data easily, because it is the whole team, not just the individual analyst, that knows the business best. By offering business users intuitive data solutions, we bypass the need for the data scientist, who works in isolation. In fact, most data scientists are associated with the old school of business intelligence, where systems were so complicated that they needed someone with a data science background to run and get value from them. The new generation of solutions, on the other hand, is making it easy for business users to engage big data. An interdisciplinary team will see and use the visuals provided, and collaborate on the best decisions on a regular basis.

It’s better not to make predictions when you miss the point.

Original title and link: The Data Scientist Concept Will Die (NoSQL database©myNoSQL)

via: http://smartdatacollective.com/kathryn1723/101841/no-data-scientists-required-big-data-all-about-business-users


Happy Birthday Hadoop!

On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.

Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!

Original title and link: Happy Birthday Hadoop! (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/