Data science: All content tagged as Data science in NoSQL databases and polyglot persistence

Does Big Data Need Big Budgets?

If you asked me this question, I’m sure my initial answer would be: “absolutely”. And I guess I would not be alone. But is that the right answer?

While watching GigaOm’s Structure Big Data event, there were two talks that gave me a different perspective on this question.

The first was the interview with Kevin Krim, the Global Head of Bloomberg Digital, who told the story of adopting, mining, and making use of Big Data inside a corporation that neither believed in it nor allocated large budgets to it. The result: collecting more than a terabyte of data every day from 100 data points for every pageview and running 15 different parallel algorithms to make recommendations that sometimes led to 10x clickthrough rates. The interview is embedded at the end of this post.

The second story, coming from Pete Warden, founder of OpenHeatMap, is even more exciting. Pete used a combination of the right tools deployed on the cloud to mine Facebook data: 500 million pages for $100. That was the cost before being sued by Facebook.

Pete Warden distilled his experience with these tools and has made available at datasciencetoolkit.org a collection of data tools and open APIs, both as an Amazon AMI to run on the cloud and as a VMware image to run locally. I highly recommend watching Pete’s talk, which I’ve embedded below.

While it depends on which definition of BigData we use, both talks lead to a simple conclusion:

  • you need imagination to get started with Big Data
  • you need to use the right tools for getting good results

Is this going to work at the scale of Twitter, LinkedIn, Facebook, Google? Probably not. But before getting to that size, you need to start somewhere. And both these talks suggest a clear answer to the question “does big data need big budgets?”: not always.


R and the web in 2011

The last couple of posts were about BigData, and Jeffrey Horner’s presentation is in line with this topic:

If there is ever a time to learn R and web application development, it is now…in the age of Big Data. The upcoming release of R 2.13 will provide basic functionality for developing R web applications on the desktop via the internal HTTP server, but the interface is incompatible with rApache. Jeffrey will talk about Rack, a web server interface and package for R, and how you can start creating your own Big Data stories from the comfort of your own desktop.

Note: The video is missing the beginning and it is not a generic talk about R, so it will be interesting mostly to those using R and planning to develop web applications directly from R.

Original title and link: R and the web in 2011 (NoSQL databases © myNoSQL)


Strategies for Exploiting Large-scale Data

In a guest post on the Cloudera blog, Bob Gourley[1] enumerates the characteristics of working with Big Data from the perspective of federal agencies.

I think these can be generalized to all businesses and problems that require big data:

Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.

For a long time each business worked in its own silo.

Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.

federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities

If you are not confronted with this problem it is just because you didn’t realize it. If you think single sources of data are good enough, your business might be at risk.

Large-scale distributed analysis over large data sets is often expected to return results almost instantly.

Name a single manager or a business or a problem solver that wouldn’t like to get immediate answers.

  • Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.

  • increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores

Ditto

considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models

Werner Vogels mentioned in his Strata talk using Amazon Mechanical Turk for adding human-based processing for data control, data validation and correction, and data enrichment.


  1. Bob Gourley: editor of CTOvision.com and a former Defense Intelligence Agency (DIA) CTO, @bobgourley  



The Fourth Paradigm: Data-Intensive Scientific Discovery

This book is about a new, fourth paradigm for science based on data-intensive computing. In such scientific research, we are at a stage of development that is analogous to when the printing press was invented. Printing took a thousand years to develop and evolve into the many forms it takes today. Using computers to gain understanding from data created and stored in our electronic data stores will likely take decades — or less.

In Jim Gray’s last talk to the Computer Science and Telecommunications Board on January 11, 2007, he described his vision of the fourth paradigm of scientific research. He outlined a two-part plea for the funding of tools for data capture, curation, and analysis, and for a communication and publication infrastructure. He argued for the establishment of modern stores for data and documents that are on par with traditional libraries.

Microsoft Research has made the book available for free here. During his Strata conference presentation, Werner Vogels encouraged everyone to read it.



Big Data: Millionfold Mashups and the Shape of Data

Philip (flip) Kromer (infochimps.com) talks about the origins of big data, generating big data, and some ideas on using big data. Very interesting talk.



Big Data Analysis at BackType

RWW has a nice post diving into the data flow and the tools used by BackType, a company with only 3 engineers, to deal with and analyze large amounts of data.

They’ve invented their own language, Cascalog, to make analysis easy, and their own database, ElephantDB, to simplify delivering the results of their analysis to users. They’ve even written a system to update traditional batch processing of massive data sets with new information in near real-time.

Some highlights:

  • 25 terabytes of compressed binary data, over 100 billion individual records
  • all services and data storage are on Amazon S3 and EC2
  • from 60 up to 150 EC2 instances servicing an average of 400 requests/s
  • Clojure and Python as platform languages
  • Hadoop, Cascading and Cascalog are central pieces of BackType’s platform
  • Cascalog, a Clojure-based query language for Hadoop, was created and open sourced by BackType’s engineer Nathan Marz
  • ElephantDB, the storage solution, is a read-only cluster built on top of BerkeleyDB files
  • Crawlers place data in Gearman queues for processing and storing

BackType data flow is presented in the following diagram:

BackType data flow

Included below is an interview with Nathan about Cascalog:

@pharkmillups


via: http://www.readwriteweb.com/hack/2011/01/secrets-of-backtypes-data-engineers.php


The Beauty of Data Visualization

David McCandless talking at TED about data visualization:

Data science is the future and there cannot be data science without data visualization and vice versa.

Or, in the words of the Frank Sinatra song from Al Bundy’s show: You can’t have one without the other.



What is Big Data Used for

Philipp Janert [1]:

It falls into one of two camps. The first is reporting. […].

The other camp is what I consider “generalized search.” These are scenarios like: If User A likes movies B, C, and D, what other specific movie might User A want? That’s a form of searching because you’re not actually trying to create a conceptual model of user behavior. You’re comparing individual data points; you’re trying to find the movie that has the greatest similarity to a very specific other set of predefined movies. For this kind of generalized, exhaustive search, you need a lot of data because you look for the individual data points. But that’s not really analysis as I understand it, either.

I guess the ☞ Netflix competition was a bit more than generalized search, as it required both inductive and deductive research.
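To make the “generalized search” idea from the quote a bit more concrete, here is a minimal sketch of it. It does not build any model of user behavior; it only compares individual data points. Everything in it is my own assumption for illustration: the toy data, the function names, and the choice of Jaccard similarity between movie audiences as the similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_movie(liked, fans_by_movie):
    """Exhaustively search for the not-yet-liked movie whose audience overlaps
    most, on average, with the audiences of the movies in `liked`.

    `fans_by_movie` maps a movie to the set of users who liked it
    (hypothetical data layout, chosen for this sketch).
    """
    best_movie, best_score = None, -1.0
    # Sorted iteration keeps the result deterministic when scores tie.
    for movie in sorted(fans_by_movie):
        if movie in liked:
            continue
        score = sum(
            jaccard(fans_by_movie[movie], fans_by_movie[m]) for m in liked
        ) / len(liked)
        if score > best_score:
            best_movie, best_score = movie, score
    return best_movie, best_score


# Toy data: which users liked which movie (invented for the example).
fans_by_movie = {
    "B": {"u1", "u2", "u3"},
    "C": {"u1", "u2"},
    "D": {"u2", "u3"},
    "E": {"u1", "u2", "u3"},
    "F": {"u4"},
}

movie, score = most_similar_movie({"B", "C", "D"}, fans_by_movie)
print(movie, round(score, 3))
```

With this toy data the search picks movie E, whose audience overlaps with all three liked movies. Note how the loop touches every candidate and every liked movie: that is the exhaustive part, and it is why this kind of search needs a lot of data rather than a clever model.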


[1] Philipp Janert: author of ☞ Data Analysis with Open Source Tools


via: http://radar.oreilly.com/2010/11/the-data-analysis-path-curiosi.html


Mining of Massive Datasets: Free PDF

Anand Rajaraman (Kosmix, Inc.) and Jeffrey D. Ullman (Stanford Univ.) have made their book Mining of Massive Datasets available online ☞ here (PDF). Enjoy.

@mikeolson



Big Data is snake oil

It’s because data is powerful but fickle. A lot of theoretically promising approaches don’t work because there’s so many barriers between spotting a possible relationship and turning it into something useful and actionable. […] Here’s some of the hurdles you’ll have to jump:

  • Acquisition
  • Coverage
  • Over-determination
  • Poor correlations
  • Noise

Differently put: 1) data is not the goal, but only the means and 2) what you’ll discover behind data will (many times) be different than your initial assumptions/expectations.


via: http://petewarden.typepad.com/searchbrowser/2010/12/data-is-snake-oil.html


Names You Need to Know in 2011: R Data Analysis Software

Steve McNally (Forbes):

Simply put by one of its staunchest advocates, “R is the most powerful statistical computing language on the planet; there is no statistical equation that cannot be calculated in R.”

If you say data scientists or Big Data, then you are saying Hadoop and R.


via: http://blogs.forbes.com/smcnally/2010/11/10/names-you-need-to-know-in-2011-r-data-analysis-software/


Data Science and Data Scientists

Hal Varian[1] said a couple of years ago[2]:

The sexy job in the next ten years will be statisticians.

While Hal Varian calls them statisticians, others have been using terms like data scientists. But what is data science? O’Reilly has a long but very interesting article on this subject:

The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. There’s a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.

While reading these articles, a question arose in my mind: is there a way to prepare yourself for being a data scientist? Are there any data scientists’ secrets? Michael E. Driscoll lists on the Dataspora blog seven secrets for successful data scientists:

  1. Choose the right-sized tool
  2. Compress everything: we live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth
  3. Split up your data: “monolithic” is a bad word in software development
  4. Sample your data
  5. Smart borrows, but genius uses open source
  6. Keep your head in the cloud
  7. Don’t be clever: when dealing with big data, embrace standards and use commonly available tools. Most of all, keep it simple, because simplicity scales.
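Two of these secrets, “compress everything” and “sample your data”, are easy to try out with nothing but a standard library. The snippet below is only a sketch under my own assumptions, not something taken from Driscoll’s post: the function names are made up, reservoir sampling (Algorithm R) is my choice of sampling technique, and gzip is my choice of compressor.

```python
import gzip
import random


def reservoir_sample(records, k, seed=None):
    """Keep a uniform random sample of at most k records from an iterable
    of unknown length, reading it once and holding only k items in memory."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Pick a slot in [0, i]; the new record survives only if the
            # slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


def compress_lines(lines):
    """Join text lines with newlines and gzip them into a bytes object."""
    return gzip.compress("\n".join(lines).encode("utf-8"))


# Sample 10 records out of a stream of 100,000 without loading it all.
sample = reservoir_sample(range(100_000), 10, seed=42)
print(len(sample))

# Repetitive log-like data compresses well, which saves disk and network IO.
lines = ["GET /index.html 200"] * 1000
raw_size = len("\n".join(lines).encode("utf-8"))
print(raw_size, len(compress_lines(lines)))
```

The sampling function works on any iterable, so it can be pointed at a file object or a generator instead of `range`. How much gzip saves depends entirely on the data; highly repetitive records like the ones above are the best case.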

As with every “craft”, there’s no simple path other than learning the technologies and the tools for the job, and keeping your mind and eyes open.


  1. Hal Varian: Google Chief Economist
  2. The part relevant to BigData:

    The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

    I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills - of being able to access, understand, and communicate the insights you get from data analysis - are going to be extremely important. Managers need to be able to access and understand the data themselves.

    The complete interview with Hal Varian can be found ☞ here
