


data science: All content tagged as data science in NoSQL databases and polyglot persistence

Data Scientists Are Hot

Based on a couple of searches on job sites and an email from a headhunter, GigaOM’s Barb Darrow concludes that data scientists are in high demand these days:

My client is one of the largest professional services firms in the world and they are looking for very senior data analytics experts who can apply his/her advanced analytics, predictive modeling, and data visualization skills to the fraud/dispute arena.  Exceptional compensation packages are available in the $300,000 to $500,000 range for the appropriate technical and leadership experience.

There’s no denying that data scientists are hot, and Darrow is not the first to write about it. Hal Varian, Chief Economist at Google, said years ago: “I keep saying that the sexy job in the next 10 years will be statisticians”. Many others have already agreed that the future belongs to the companies and people that turn data into products. And I remember recently reading about reports that predict 150,000 to 200,000 jobs in this market over the next couple of years.

On the other hand, there are various myths about the data scientist’s role. Job descriptions mention many years of experience with Hadoop and Big Data. But even if there are some hints about what makes a good data scientist and how to hire the right data geeks, there’s no agreement on what data science is or what the role of the data scientist involves.

This still feels like the early days, when requirements and expectations change overnight. But these are also the days when most of those involved are having a lot of fun learning, discovering new ways to deal with data, and defining tomorrow.

Original title and link: Data Scientists Are Hot (NoSQL database©myNoSQL)

SQL or Hadoop: What Tools Should I Use to Process My Data?

A great decision flowchart created by Aaron Cordova to help answer the question of which tools to use to process your data:

[Figure: SQL or Hadoop decision flowchart. Credit: Aaron Cordova]


Data Science and BI: Similarities and Differences

Data science and BI differ in the foci of their  investigations. DS is consumed with supporting the development of data products. As Monica Rogati of LinkedIn notes, “On one side, I’ve been working on building products … The other side is finding interesting stories in the data.” BI, on the other hand, is all about measuring and managing business performance. At their best, though, both disciplines have an evidenced-based “science of business” foundation that makes me reject the contention by some that data science has a higher calling and is more scientifically sophisticated than BI.

Steve Miller puts the accent on the difference in maturity between the two fields. I’d say the difference in their approaches is even more important.



Statistical Advances: The Maximal Information Coefficient a New Method to Uncover Hidden Data Relationships

Yakir Reshef (main researcher):

“If you have a data set with 22 million relationships, the 500 relationships in there that you care about are effectively invisible to a human.”

The statistical method that Reshef and his colleagues have devised aims to crack those problems. It can spot many superimposed correlations between variables and measure exactly how tight each relationship is, on the basis of a quantity that the team calls the maximal information coefficient (MIC). The MIC is calculated by plotting data on a graph and looking for all ways of dividing up the graph into blocks or grids that capture the largest possible number of data points. MIC can then be deduced from the grids that do the best job.

The original article, Detecting Novel Associations in Large Data Sets, was published in Science, but is behind a paywall.
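To make the grid idea in the quote more concrete, here is a rough Python sketch. It is not the published algorithm: the real method searches over where to place the grid lines, while this version only tries equal-frequency grids of different sizes, so it yields an approximation that can underestimate the true MIC. The function names are mine, and the cap on grid size (number of cells at most n^0.6) is an assumption about the method’s default setting.

```python
import math
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Map each value to a bin index so bins hold roughly equal counts.

    Tied values always share a bin, so a constant series ends up in one bin.
    """
    n = len(values)
    first_rank = {}
    for rank, v in enumerate(sorted(values)):
        first_rank.setdefault(v, rank)
    return [first_rank[v] * n_bins // n for v in values]


def grid_mutual_information(xb, yb):
    """Mutual information (in bits) of two binned series of equal length."""
    n = len(xb)
    joint = Counter(zip(xb, yb))
    px = Counter(xb)
    py = Counter(yb)
    mi = 0.0
    for (i, j), c in joint.items():
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi


def approximate_mic(x, y, exponent=0.6):
    """Best normalized grid score over equal-frequency grids.

    Assumption: grids are limited to at most n**exponent cells, and grid
    lines are NOT optimized, unlike in the published method.
    """
    n = len(x)
    max_cells = n ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= max_cells:
        ny = 2
        while nx * ny <= max_cells:
            xb = equal_frequency_bins(x, nx)
            yb = equal_frequency_bins(y, ny)
            # Normalize so that a perfect relationship scores 1.
            score = grid_mutual_information(xb, yb) / math.log2(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return best


if __name__ == "__main__":
    xs = list(range(200))
    print(approximate_mic(xs, [v * v for v in xs]))  # monotonic relationship
    print(approximate_mic(xs, [5] * 200))            # constant series
```

The normalization by the logarithm of the smaller grid dimension is what lets scores from grids of different sizes be compared on the same 0 to 1 scale.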



What Makes a Good Data Scientist?

Watch this interview with DJ Patil, formerly LinkedIn chief scientist and now data scientist in residence at Greylock Partners, to find the answer.

Teaser: a passion for really getting to an answer.

Data Jujitsu and Data Karate

David F. Carr in an article about DJ Patil and his work on Big Data at LinkedIn:

That is what he means by data jujitsu, where jujitsu is the art of using an opponent’s leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution—without investing disproportionate resources at the early experimental stage. That’s as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.



The Data Deluge Makes the Scientific Method Obsolete

Chris Anderson in a 2008 article:

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.



Data Scientist Summit Videos

After seeing the excerpt from Jonathan Harris’ talk at the Data Scientist Summit, I really wanted to post a link to some of the videos. But they are all behind a registration gateway. In case you want to watch them (there are indeed some interesting titles), you’ll find them here.


You Need to Hire a Data Geek

What to look for when hiring a data geek, another name for the now-established data scientist role:

  • A strong background in computer science is essential. Dealing with information is not easy. The data geek needs to be able to collect the data, which in many cases involves knowing about databases, some networking, and Web programming technologies (XML, HTML, etc.), for a start.
  • Statistics and mathematics are part of the game. Your data geek needs to know statistics inside out and backwards, and the software for manipulating them to develop an analysis.
  • Data visualization is key. You need data visualization tools that are in equal parts useful and appealing. Your data geek should have an eye for graphs, maps, and charts, with a feel for the right dashboards, scorecards, data mashups, or even Excel workbooks—to generate the right mix of information for the right people.
  • A bit of creativity goes a long way. The right data geek will use all the above skills to create new and improve existing ways to increase the return on investment (ROI) of your organization’s BI solutions.

There are many different opinions on what data scientists should know and do.



Data Scientist and Cloud Architect: The 6 Hottest New Jobs in IT

InfoWorld published a non-scientific survey of the hottest new jobs in IT, and data scientists and cloud architects made it into the top 6.

About data scientists:

According to Norman Nie, CEO of Revolution Analytics, data science jobs will require workers with a spectrum of skills, from entry-level data cleaners to the high-level statisticians, yielding a range of opportunities for newcomers to the field. As the business world gets increasingly social, the demand for people to plumb the depths of all that social networking clickstream data will only increase. The cliché going around is that “data is the new oil.” A career in refining that raw material sounds like a good bet.

Cloud architects:

In addition to establishing and managing a private cloud infrastructure, Ron Gula, CEO of Tenable Network Security, says cloud architects will increasingly need to be experts in choosing public cloud services. “When you get into the nuances of SLAs, you become less of an IT person and more of a lawyer,” says Gula. The ultimate goal is the hybrid cloud, where cloud architects and business management decide which cloud services make the most sense to run internally and which should be farmed out on a pay-per-use basis.



Data Science & The Role of the Data Scientist

From the Wikibon blog infographic about data science and the data scientist:

Data science can be broken down into four essential parts:

  • mining data: collecting and formatting the information
  • statistics: information analysis
  • interpret: representation or visualization
  • leverage: implications of the data, application of the data, interaction using the data and predictions formed from studying it

The skills of a data scientist:

  • Hacking and Computer Science: knowing how to take advantage of computers and the internet to create data-mining formulas
  • Expertise in Mathematics, Statistics, Data Mining: Pulling important statistics and coherently organizing them using mathematic prowess and computer formulas
  • Creativity and Insight: Knowing what statistics are important and how to leverage them

In a recent post under the title Data beats math, Jeff Jonas[1] wrote:

Over the years, folks have often asked me what kind of math am I using to create large scale, real-time, context accumulating systems (e.g., NORA).  Some fond of Bayesian speculate I am using Bayesian techniques.  Some ask if I am using neural networks or heuristics.  A math professor said I was doing advanced work in the field of Set Theory.

My answer is always, “I don’t know any math.  I didn’t finish high school.  But I can explain how it works, step-by-step, and it is really quite simple.”

So data science starts with a passionate interest in the data. Then you add tools, processes, algorithms, and science to discover the secrets hidden inside it.

A Taxonomy of Data Science

Hilary Mason[1] and Chris Wiggins[2]:

We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to `look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy — we call it the Snice* taxonomy — of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).

The clearest list of what a modern data scientist is supposed to know and do.
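As a toy illustration of the five OSEMN steps, here is a minimal Python sketch with one function per step. Everything in it is made up for the example: the tiny inline CSV, the function names, and the choice of a straight-line fit as the "model". Real work at each step is far messier; this only shows how the stages hand data to each other.

```python
import statistics

# Hypothetical raw data with two dirty rows (a missing and a non-numeric value).
RAW = "day,visits\n1,10\n2,12\n3,\n4,18\n5,bad\n6,22\n"


def obtain():
    """Obtain: fetch the raw data (here, just an inline string)."""
    return RAW


def scrub(raw):
    """Scrub: parse rows and drop the ones that cannot be converted."""
    rows = []
    for line in raw.strip().splitlines()[1:]:  # skip the header line
        day, visits = line.split(",")
        try:
            rows.append((int(day), float(visits)))
        except ValueError:
            continue
    return rows


def explore(rows):
    """Explore: summary statistics to get a feel for the data."""
    visits = [v for _, v in rows]
    return {
        "count": len(rows),
        "mean": statistics.mean(visits),
        "min": min(visits),
        "max": max(visits),
    }


def model(rows):
    """Model: ordinary least-squares line, visits = slope * day + intercept."""
    xs = [d for d, _ in rows]
    ys = [v for _, v in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def interpret(slope):
    """iNterpret: turn the fitted number into a statement a person can use."""
    trend = "growing" if slope > 0 else "not growing"
    return f"Visits are {trend}, changing by about {slope:.1f} per day."


if __name__ == "__main__":
    rows = scrub(obtain())
    print(explore(rows))
    slope, intercept = model(rows)
    print(interpret(slope))
```

Keeping each step as its own function mirrors the taxonomy’s "roughly chronological" ordering and makes it easy to swap in a different source, cleaning rule, or model.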

  1. Hilary Mason: Chief Scientist at  

  2. Chris Wiggins: Associate Professor in the Department of Applied Physics and Applied Mathematics at Columbia  
