Data science: All content tagged as Data science in NoSQL databases and polyglot persistence
After seeing the excerpt from Jonathan Harris’ talk at Data Scientist Summit I really wanted to post a link to some of the videos. But they are all behind a registration gateway. Just in case you want to watch them — there are indeed some interesting titles — you’ll find them here.
From the Wikibon blog infographic about data science and the data scientist:
Data science can be broken down into four essential parts:
- mining data: collecting and formatting the information
- statistics: information analysis
- interpretation: representation or visualization
- leverage: implications of the data, application of the data, interaction with the data, and predictions formed from studying it
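The four parts above can be sketched in miniature as a toy pipeline. Everything here — the data, the growth model, the ASCII "visualization" — is invented purely for illustration:

```python
from statistics import mean

# 1. Mining: collect and format raw records (hypothetical daily visit counts).
raw = ["2011-06-01,120", "2011-06-02,140", "2011-06-03,160"]
visits = [int(line.split(",")[1]) for line in raw]

# 2. Statistics: analyze the formatted data.
avg = mean(visits)

# 3. Interpretation: a minimal text "visualization" of the series.
for line, n in zip(raw, visits):
    print(line.split(",")[0], "#" * (n // 20))

# 4. Leverage: apply the analysis, e.g. a naive next-day prediction
#    from the average daily growth.
growth = (visits[-1] - visits[0]) / (len(visits) - 1)
prediction = visits[-1] + growth
print(avg, prediction)
```

Real projects swap each stage for heavier machinery (crawlers, statistical models, charting libraries, recommendation engines), but the shape of the pipeline stays the same.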
The skills of a data scientist:
- Hacking and Computer Science: knowing how to take advantage of computers and the internet to create data-mining formulas
- Expertise in Mathematics, Statistics, Data Mining: Pulling important statistics and coherently organizing them using mathematical prowess and computer formulas
- Creativity and Insight: Knowing what statistics are important and how to leverage them
Over the years, folks have often asked me what kind of math I am using to create large scale, real-time, context accumulating systems (e.g., NORA). Some, fond of Bayesian methods, speculate that I am using Bayesian techniques. Some ask if I am using neural networks or heuristics. A math professor said I was doing advanced work in the field of Set Theory.
My answer is always, “I don’t know any math. I didn’t finish high school. But I can explain how it works, step-by-step, and it is really quite simple.”
So data science starts with a passionate interest in the data. Then you add tools, processes, algorithms, and science to discover the secrets hidden inside the data.
In the blue corner we have IBM with Netezza as analytic database, Cognos for BI, and SPSS for predictive analytics. In the green corner we have EMC with Greenplum and the partnership with SAS. And in the open source corner we have Hadoop and R.
Update: there’s also another corner I don’t know how to color where Teradata and its recently acquired Aster Data partner with SAS.
Who is ready to bet on which of these platforms will be processing more data in the coming years?
It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.
These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.
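The intrusion-detection idea above — watch message patterns, flag the machines that deviate — can be sketched with a crude statistical filter. The hosts and counts below are made up, and a z-score on message volume is far simpler than real machine learning, but it shows the principle:

```python
from statistics import mean, stdev

# Hypothetical per-host message counts collected over the same interval
# (e.g., parsed from switch or router logs).
messages_per_host = {
    "10.0.0.2": 120, "10.0.0.3": 115, "10.0.0.4": 130,
    "10.0.0.5": 125, "10.0.0.6": 980,  # an unusually chatty host
}

def flag_outliers(counts, z_threshold=1.5):
    """Flag hosts whose message volume deviates strongly from the mean."""
    values = list(counts.values())
    mu, sigma = mean(values), stdev(values)
    return [host for host, n in counts.items()
            if sigma and abs(n - mu) / sigma > z_threshold]

print(flag_outliers(messages_per_host))
```

A production system would look at timing, destinations, and message content too, and would learn a baseline per host rather than compare hosts against each other — but the core move is the same: model "normal" traffic and surface what falls outside it.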
and how Hadoop can help:
Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.
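The "spread algorithms across a massively parallel infrastructure" part refers to Hadoop's MapReduce model. The classic word-count example can be simulated in-process — this is an illustration of the map/shuffle/reduce idea, not the actual Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Mapper: emit a (word, 1) pair for each word in an input line.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts for a word.
    return key, sum(values)

lines = ["big data big answers", "data science"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)
```

On a real cluster, mappers and reducers run on many machines at once against data stored in HDFS, which is what lets the same pattern scale to petabytes.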
Even though the name of this TED talk is “The birth of a word”, I could have called it anything from the future of data science to extreme data analysis to brilliant information visualization. Anyway, it is a must see: