data science: All content tagged as data science in NoSQL databases and polyglot persistence
Thursday, 3 November 2011
Data Jujitsu and Data Karate
David F. Carr in an article about DJ Patil and his work on Big Data at LinkedIn:
That is what he means by data jujitsu, where jujitsu is the art of using an opponent’s leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution—without investing disproportionate resources at the early experimental stage. That’s as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.
Original title and link: Data Jujitsu and Data Karate (©myNoSQL)
Wednesday, 12 October 2011
The Data Deluge Makes the Scientific Method Obsolete
Chris Anderson in a 2008 article:
Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.
The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.
via: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Sunday, 19 June 2011
Data Scientist Summit Videos
After seeing the excerpt from Jonathan Harris’ talk at the Data Scientist Summit, I really wanted to post a link to some of the videos, but they are all behind a registration gateway. Just in case you want to watch them (there are indeed some interesting titles), you’ll find them here.
Wednesday, 15 June 2011
You Need to Hire a Data Geek
What to look for when hiring a data geek, a different name for the now-established data scientist role:
- A strong background in computer science is essential. Dealing with information is not easy. The data geek needs to be able to collect the data, which in many cases involves knowing about databases, some networking, and Web programming technologies (XML, HTML, etc.), for a start.
- Statistics and mathematics are part of the game. Your data geek needs to know statistics inside out and backwards, and the software for manipulating them to develop an analysis.
- Data visualization is key. You need data visualization tools that are in equal parts useful and appealing. Your data geek should have an eye for graphs, maps, and charts, with a feel for the right dashboards, scorecards, data mashups, or even Excel workbooks—to generate the right mix of information for the right people.
- A bit of creativity goes a long way. The right data geek will use all the above skills to create new and improve existing ways to increase the return on investment (ROI) of your organization’s BI solutions.
There are many different opinions on what data scientists should know and do.
Data Scientist and Cloud Architect: The 6 Hottest New Jobs in IT
InfoWorld published non-scientific research on the hottest new jobs in IT, and data scientists and cloud architects both made it into the top 6.
About data scientists:
According to Norman Nie, CEO of Revolution Analytics, data science jobs will require workers with a spectrum of skills, from entry-level data cleaners to the high-level statisticians, yielding a range of opportunities for newcomers to the field. As the business world gets increasingly social, the demand for people to plumb the depths of all that social networking clickstream data will only increase. The cliché going around is that “data is the new oil.” A career in refining that raw material sounds like a good bet.
Cloud architects:
In addition to establishing and managing a private cloud infrastructure, Ron Gula, CEO of Tenable Network Security, says cloud architects will increasingly need to be experts in choosing public cloud services. “When you get into the nuances of SLAs, you become less of an IT person and more of a lawyer,” says Gula. The ultimate goal is the hybrid cloud, where cloud architects and business management decide which cloud services make the most sense to run internally and which should be farmed out on a pay-per-use basis.
via: http://www.pcworld.com/businesscenter/article/230285/the_6_hottest_new_jobs_in_it.html
Tuesday, 19 April 2011
Data Science & The Role of the Data Scientist
From the Wikibon blog infographic about data science and the data scientist:
Data science can be broken down into four essential parts:
- mining data: collecting and formatting the information
- statistics: information analysis
- interpret: representation or visualization
- leverage: implications of the data, application of the data, interaction using the data and predictions formed from studying it
The skills of a data scientist:
- Hacking and Computer Science: knowing how to take advantage of computers and the internet to create data-mining formulas
- Expertise in Mathematics, Statistics, Data Mining: Pulling important statistics and coherently organizing them using mathematic prowess and computer formulas
- Creativity and Insight: Knowing what statistics are important and how to leverage them
In a recent post under the title Data beats math, Jeff Jonas[1] wrote:
Over the years, folks have often asked me what kind of math am I using to create large scale, real-time, context accumulating systems (e.g., NORA). Some fond of Bayesian speculate I am using Bayesian techniques. Some ask if I am using neural networks or heuristics. A math professor said I was doing advanced work in the field of Set Theory.
My answer is always, “I don’t know any math. I didn’t finish high school. But I can explain how it works, step-by-step, and it is really quite simple.”
So data science starts with a passionate interest in the data. Then you add tools, processes, algorithms, and science to discover the secrets hidden inside it.
Thursday, 7 April 2011
A Taxonomy of Data Science
Hilary Mason[1] and Chris Wiggins[2]:
We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to `look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy — we call it the Snice* taxonomy — of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).
The clearest list of what a modern data scientist is supposed to know and do.
via: http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
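To make the five OSEMN steps a bit more concrete, here is a minimal Python sketch that walks a toy dataset through Obtain, Scrub, Explore, Model, and iNterpret. Everything in it is an assumption made for illustration: the sample CSV, the function names, and the choice of a per-city mean as the "model" are not from Mason and Wiggins' article.

```python
import csv
import io
import statistics

# Hypothetical raw data: one missing value and one inconsistently cased city.
RAW = """city,temp_c
Oslo,4.0
Oslo,
Lima,19.5
Lima,20.5
oslo,6.0
"""

def obtain(text):
    """Obtain: read raw rows from a CSV source (an in-memory string here)."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: drop rows with missing readings and normalize city names."""
    clean = []
    for row in rows:
        if not row["temp_c"]:
            continue
        clean.append({"city": row["city"].strip().title(),
                      "temp_c": float(row["temp_c"])})
    return clean

def explore(rows):
    """Explore: group readings by city to see what the data looks like."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row["city"], []).append(row["temp_c"])
    return by_city

def model(by_city):
    """Model: the simplest possible model, a per-city mean."""
    return {city: statistics.mean(temps) for city, temps in by_city.items()}

def interpret(means):
    """iNterpret: turn the model output into a statement a person can use."""
    warmest = max(means, key=means.get)
    return f"{warmest} is the warmest city on average ({means[warmest]:.1f} C)"

means = model(explore(scrub(obtain(RAW))))
summary = interpret(means)
print(summary)
```

Real projects rarely run this in a straight line; the point of the sketch is only that each step hands a slightly more refined artifact to the next one.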
Wednesday, 6 April 2011
R: What Is and How Can It Help?
From Loraine Lawson’s interview with Jeff Erhardt[1]:
What is R?
R is an open source statistical programming language. The easiest way to think about it is the largest commercial competitor in the states is a company called SAS, and while it’s not a perfect analogy, one way to think about R is as an open source version of SAS. It’s not perfectly correct, but for people who have not heard of R, that’s one way to explain it.
Where can R help?
- analyzing and gaining meaning from collected data
- developing models and extracting the insight from data
- implementing these analytics within an enterprise and disseminating the knowledge across the enterprise
Now, are you ready to bet on what the data processing platform of tomorrow will be?
[1] Jeff Erhardt: COO of Revolution Analytics, the company offering products and services for R
The Data Processing Platform for Tomorrow
In the blue corner we have IBM with Netezza as its analytic database, Cognos for BI, and SPSS for predictive analytics. In the green corner we have EMC with Greenplum and its partnership with SAS[1]. And in the open source corner we have Hadoop and R.
Update: there’s also another corner, which I don’t know how to color, where Teradata and its recently acquired Aster Data partner with SAS.
Who is ready to bet on which of these platforms will be processing more data in the coming years?
Monday, 4 April 2011
Origin of BigData and How Hadoop Can Help
Michael Olson[1] on the origins of BigData, in an interview with ODBMS Industry Watch:
It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.
These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.
And on how Hadoop can help:
Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.
[1] Michael Olson: CEO of Cloudera, former CEO of Sleepycat Software (makers of Berkeley DB, acquired by Oracle), @mikeolson
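The idea of spreading an analysis across parallel infrastructure can be sketched in a few lines of plain Python. This is not Hadoop's API, just an illustration of the map/reduce style it popularized: each chunk of (hypothetical) machine-generated log lines is counted independently, and the partial results are then merged. The log format, host names, and function names are all made up for the example, and a thread pool stands in for a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "sender receiver" log lines, pre-split into chunks the way a
# distributed file system splits a large file into blocks.
CHUNKS = [
    ["router-1 switch-2", "server-9 disk-3", "router-1 server-9"],
    ["server-9 disk-3", "router-1 switch-2"],
    ["switch-2 server-9", "router-1 disk-3"],
]

def map_chunk(lines):
    """Map: count messages per sending host within a single chunk."""
    return Counter(line.split()[0] for line in lines)

def reduce_counts(partials):
    """Reduce: merge the per-chunk counts into one overall total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Each chunk is processed independently, so the map phase can run in parallel.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(map_chunk, CHUNKS))

totals = reduce_counts(partial_counts)
print(totals.most_common(1))
```

Because the map step never looks outside its own chunk, adding more chunks (or more machines) does not change the code, which is the property that lets this style of processing scale out.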
Saturday, 2 April 2011
Stop Trying to Put a Monetary Value on Data - It's the Wrong Path
Rob Karel:
data in and of itself has no value!
The only value data/information has to offer – and the reason I do still consider it an “asset” at all – is in the context of the business processes, decisions, customer experiences, and competitive differentiators it can enable.
Just a different way of correctly saying that BigData is snake oil.
The Birth of a Word: The Future of Data Science
Even though this TED talk is named “The birth of a word”, I would have called it anything from “the future of data science” to “extreme data analysis” or “brilliant information visualization”. Anyway, it is a must-see.