Big Data: All content tagged as Big Data in NoSQL databases and polyglot persistence
Szilard Pafka has run a survey about the tools used by data scientists, and he presents an overview of the results in the video embedded below.
As I’ve learned over time, the R language is the preferred data analysis tool, and the survey confirms it. What surprised me was seeing Excel come in second place. Python and Unix shell tools follow SAS to complete the top five.
Szilard Pafka: founder and organizer of the Los Angeles R user group
While writing quite a bit lately about Big Data marketplaces, I thought it would be worth mentioning Tim Berners-Lee’s 5-star deployment scheme for Linked Open Data:
- make your stuff available on the Web (whatever format) under an open license
- make it available as structured data (e.g., Excel instead of image scan of a table)
- use non-proprietary formats (e.g., CSV instead of Excel)
- use URIs to identify things, so that people can point at your stuff
- link your data to other data to provide context
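As a rough illustration of the last three stars, here is a minimal Python sketch that starts from CSV (a non-proprietary format), gives each row a URI, and attaches a link to another dataset. The URIs, the sample row, and the function name `to_linked_records` are all made up for this example, and the key names are only loosely borrowed from JSON-LD; this is not a complete or validated linked-data serialization.

```python
import csv
import io
import json

# Hypothetical sample data: a tiny CSV export (3 stars: open, structured, non-proprietary).
CSV_TEXT = "id,name,population\n1,Springfield,30000\n"

# Hypothetical base URI used to identify each row (4 stars).
BASE_URI = "http://example.org/city/"

# Hypothetical links from our row ids to records in someone else's dataset (5 stars).
LINKS = {"1": "http://example.com/places/springfield"}


def to_linked_records(csv_text, base_uri, links):
    """Turn CSV rows into dictionaries that carry a URI and, where known, an outbound link."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {
            "@id": base_uri + row["id"],  # a URI people can point at
            "name": row["name"],
            "population": int(row["population"]),
        }
        if row["id"] in links:
            record["sameAs"] = links[row["id"]]  # link to other data for context
        records.append(record)
    return records


records = to_linked_records(CSV_TEXT, BASE_URI, LINKS)
print(json.dumps(records, indent=2))
```

The point is only that each step adds something a consumer can use: a parseable format, a stable identifier, and a pointer to related data.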
See Tim Berners-Lee talking about the star scheme at Gov 2.0 Expo:
I’m not sure I got the rest of the post, but I really liked these two definitions of big data:
Big data means nothing. It’s a well meaning term for (literally) big piles of data, sitting in various massive balls of infrastructure, randomly scattered around our enterprise. More common terms include data warehouses or decision support systems, etc.
Big data is created by copying transactional data and sticking it on another system. We copy ALL our transactional data and stick it on these systems. Over time, those systems become supersets of our transactional systems. We make lots of copies and put them in lots of big data systems.
Interesting question and answer on the HBase mailing list:
[…] is it feasible to use HBase table in “read-mostly” mode with trillions of rows, each contains small structured record (~200 bytes, ~15 fields). Does anybody know a successful case when tables with such number of rows are used with HBase?
My follow-up questions:
- where is that data currently stored?
- how will you migrate it?
- if this is just what you estimate you’ll get, how soon will you reach these numbers?
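To put the numbers from the question in perspective, here is a back-of-the-envelope estimate in Python. It counts only the raw record payload (~200 bytes per row) and ignores per-cell key overhead, compression, and indexes, so real HBase usage could differ substantially. The replication factor of 3 is an assumption (the common HDFS default), and `raw_storage_bytes` is just a helper name for this sketch.

```python
TB = 10**12  # decimal terabyte


def raw_storage_bytes(rows, bytes_per_row, replication=1):
    """Raw payload size in bytes, optionally multiplied by a replication factor."""
    return rows * bytes_per_row * replication


one_trillion = 10**12

# Payload only: one trillion rows at ~200 bytes each.
payload = raw_storage_bytes(one_trillion, 200)

# Same data with an assumed replication factor of 3.
replicated = raw_storage_bytes(one_trillion, 200, replication=3)

print(payload / TB)     # 200.0
print(replicated / TB)  # 600.0
```

So every trillion rows means roughly 200 TB of payload before replication, which is why the questions about where the data lives today and how it would be migrated matter.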
Volumes aside, why is this classification important?
- Curt Monash: Examples and definition of machine-generated data
- Daniel Abadi: Machine vs human generated data
- The Economist: It’s a smart world (nb: recommended read)
Judging also by the definitions Curt and Daniel came up with, I still think it’s a useless classification.