In the blue corner we have IBM with Netezza as analytic database, Cognos for BI, and SPSS for predictive analytics. In the green corner we have EMC with Greenplum and the partnership with SAS. And in the open source corner we have Hadoop and R.
Update: there’s also another corner I don’t know how to color where Teradata and its recently acquired Aster Data partner with SAS.
Who is ready to bet on which of these platforms will be processing the most data in the coming years?
It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.
These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.
and how Hadoop can help:
Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.
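The core idea behind that parallel processing is MapReduce: a map phase emits key/value pairs, Hadoop shuffles them by key across the cluster, and a reduce phase aggregates each key's values. Here is a minimal, single-machine Python sketch of that flow (a real job would run these two functions as Hadoop Streaming tasks over HDFS, not in one process):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, 1) for every word -- e.g. counting terms in log lines."""
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum counts per key. Hadoop would route each key's pairs
    to the same reducer; here we simulate that with a local dict."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["error disk full", "error network down", "info disk ok"]
print(reduce_phase(map_phase(lines)))
```

The point of the pattern is that both phases are embarrassingly parallel, which is what lets Hadoop spread the same logic over petabytes of data on commodity machines.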
Even if the name of this TED talk is “The birth of a word”, I could have called it anything from “the future of data science” to “extreme data analysis” to “brilliant information visualization”. Anyway, it is a must see:
If you asked me this question, I’m sure my initial answer would be: “absolutely”. And I suspect I would not be alone. But is that the right answer?
While watching GigaOm’s Structure Big Data event, there were two talks that gave me a different perspective on this question.
First was the interview with Kevin Krim, Global Head of Bloomberg Digital, who told the story of adopting, mining, and materializing Big Data inside a corporation that initially neither believed in it nor allocated large budgets to it. The result: collecting more than a terabyte of data every day (100 data points for every pageview) and running 15 different algorithms in parallel to make recommendations that sometimes led to 10x clickthrough rates. The interview is embedded at the end of this post.
The second story, from Pete Warden, founder of OpenHeatMap, is even more exciting. Pete used a combination of the right tools deployed on the cloud to mine Facebook data: 500 million pages for $100 — that was the cost before being sued by Facebook.
Pete Warden distilled his experience with these tools and has made available at datasciencetoolkit.org a collection of data tools and open APIs in both an Amazon AMI format to be run on the cloud and as a VMWare image to run locally. I highly recommend watching Pete’s talk which I’ve embedded below.
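To give a flavor of what the toolkit offers, here is a hedged sketch of calling one of its JSON-over-HTTP APIs. The endpoint path (`/street2coordinates`, the toolkit's geocoder) and the response shape assumed below are taken from the toolkit's documentation, but treat them as assumptions rather than a verified contract:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "http://www.datasciencetoolkit.org"

def geocode_url(address):
    """Build a street2coordinates request URL (endpoint name assumed from the DSTK docs)."""
    return f"{BASE}/street2coordinates/{quote(address)}"

def parse_coordinates(payload, address):
    """Extract (latitude, longitude) from a JSON object keyed by the input address."""
    entry = payload.get(address) or {}
    return entry.get("latitude"), entry.get("longitude")

def fetch_coordinates(address):
    """Live call -- requires network access and the public service being up."""
    with urlopen(geocode_url(address)) as resp:
        return parse_coordinates(json.load(resp), address)

# Illustrative example of the response shape the parser expects (values made up):
sample = {"1 Market St, San Francisco": {"latitude": 37.79, "longitude": -122.39}}
print(parse_coordinates(sample, "1 Market St, San Francisco"))
```

Because the same image ships as an Amazon AMI, you can point `BASE` at your own EC2 instance instead of the public server and run the identical calls at whatever scale you pay for.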
While it depends on which definition of Big Data we use, both talks lead to a simple conclusion:
- you need imagination to get started with Big Data
- you need the right tools to get good results
Is this going to work at the scale of Twitter, LinkedIn, Facebook, Google? Probably not. But before reaching that size, you need to start somewhere. And both these talks suggest a clear answer to the question “does big data need big budgets?”: not always.
The last couple of posts have been about Big Data, and Jeffrey Horner’s presentation is in line with this topic:
If there is ever a time to learn R and web application development, it is now…in the age of Big Data. The upcoming release of R 2.13 will provide basic functionality for developing R web applications on the desktop via the internal HTTP server, but the interface is incompatible with rApache. Jeffrey will talk about Rack, a web server interface and package for R, and how you can start creating your own Big Data stories from the comfort of your own desktop.
Note: The video is missing the beginning and it is not a generic talk about R, so it will be interesting mostly to those using R and planning to develop web applications directly from R.
I think these can be generalized to all businesses and problems that require big data:
Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.
For a long time each business worked in its own silo.
Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.
federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities
If you are not confronted with this problem it is just because you didn’t realize it. If you think single sources of data are good enough, your business might be at risk.
Large-scale distributed analysis over large data sets is often expected to return results almost instantly.
Name a single manager or a business or a problem solver that wouldn’t like to get immediate answers.
Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.
increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores
considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models
Werner Vogels mentioned in his Strata talk using Amazon Mechanical Turk for adding human-based processing for data control, data validation and correction, and data enrichment.
This book is about a new, fourth paradigm for science based on data-intensive computing. In such scientific research, we are at a stage of development that is analogous to when the printing press was invented. Printing took a thousand years to develop and evolve into the many forms it takes today. Using computers to gain understanding from data created and stored in our electronic data stores will likely take decades — or less.
In Jim Gray’s last talk to the Computer Science and Telecommunications Board on January 11, 2007, he described his vision of the fourth paradigm of scientific research. He outlined a two-part plea for the funding of tools for data capture, curation, and analysis, and for a communication and publication infrastructure. He argued for the establishment of modern stores for data and documents that are on par with traditional libraries.
Original title and link: The Fourth Paradigm: Data-Intensive Scientific Discovery (NoSQL databases © myNoSQL)
Philip (flip) Kromer (infochimps.com) talking about the origins of big data, generating big data, and some ideas on using big data. Very interesting talk.
Original title and link: Big Data: Millionfold Mashups and the Shape of Data (NoSQL databases © myNoSQL)
David McCandless talking at TED about data visualization:
Data science is the future and there cannot be data science without data visualization and vice versa.
In the Frank Sinatra words the Bundys made famous: you can’t have one without the other.