datascience: All content tagged as datascience in NoSQL databases and polyglot persistence

Learn R and become a data scientist

An interactive, quite polished, guide for learning R… to become a data scientist.

As for a definition of data science, here’s the most up:

A data scientist is a statistician who lives in San Francisco.

Data Science is statistics on a Mac.

A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

Original title and link: Learn R and become a data scientist (NoSQL database©myNoSQL)


Data processing command line-style

Very often I jump to using Python for any sort of data processing. And I totally forget about the powerful tools available on pretty much every Linux/Mac box [1].

Jeroen Janssens’s 7 command-line tools for data science presents 6 command-line tools for fetching, filtering, and transforming data: jq, json2csv, csvkit, scrape, xml2json, sample.

Then Leonardo Trabuco’s Working with data on the command line gives a quick roundup of the standard Linux tools: head, tail, less, awk, cut, sort, uniq, wc, grep, shuf.

If you understand the philosophy of Linux tools and get familiar with some of the tools listed above (I’ve never gotten too deep into awk, and sed almost always tricks me), you’ll be able to do some nice data processing experimentation directly from the command line.
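To make the comparison concrete, here is a minimal Python sketch of what a typical pipeline built from those standard tools does. It is only an illustration, not something taken from either linked article: the sample data, the column choice, and the `top_values` helper are all made up. Assuming a comma-separated file with a header row, it is roughly what `tail -n +2 data.csv | cut -d, -f2 | sort | uniq -c | sort -rn | head -n 3` would give you.

```python
import csv
import io
from collections import Counter


def top_values(lines, column, n=3):
    """Count the values in one CSV column and return the n most common.

    `lines` is any iterable of text lines (an open file, io.StringIO, ...).
    This helper is hypothetical; it only mirrors the shell pipeline above.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row, like `tail -n +2`
    # Pick one field per row (like `cut`) and tally it (like `sort | uniq -c`).
    counts = Counter(row[column] for row in reader if len(row) > column)
    # Highest counts first, limited to n entries (like `sort -rn | head -n`).
    return counts.most_common(n)


# Made-up sample data, used only so the sketch runs on its own.
SAMPLE = """date,activity,miles
2013-01-02,run,3.1
2013-01-05,walk,1.0
2013-01-07,run,5.0
2013-01-09,bike,12.4
2013-01-12,run,6.2
2013-01-15,walk,2.0
"""

if __name__ == "__main__":
    for value, count in top_values(io.StringIO(SAMPLE), column=1):
        print(f"{count:>4} {value}")
```

The shell version is a single line and needs no script file, which is the point of the articles above. The Python version is easier to step through in a debugger, which is more or less the excuse in the footnote.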

  1. The one excuse I usually find for myself when doing this is that debugging the behavior of command-line tools is not as pleasant as debugging a Python script. _Sort of an OK argument, but still an excuse._


Visualizing RunKeeper data in R

In Academic torrents: Almost 1.7TB of research data available, I complained about the lack of interesting open data. Dan Goldin’s Visualizing RunKeeper data in R is a good example of what I mean. While learning R, he used his own data about his running results. That made it both interesting and fun.

What better way to celebrate running 1000 miles in 2013 than dumping the data into R and generating some visualizations? It’s also a step in my quest to replace Excel with R.

I hope no one will argue that this is a more exciting experience than learning a new technology while using the Enron email archive.


Academic torrents: Almost 1.7TB of research data available

The Academic Torrents initiative:

The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

Over the weekend, I played a bit with the Python data crunching toolkit: pandas, NumPy, and matplotlib. Truth is, I started with A pandas cookbook by Julia Evans, but ended up spending most of the time trying to get the latest version of matplotlib installed on OS X and convincing it to display XKCD-styled plots. This aside, after getting everything working, I got stuck at the “what now” phase: what data can I use to play with? This situation reminded me of past experiences when trying to learn or build demos around data.

We’re talking about Big Data and the lack of trained people in this space. But if you look around, you’ll realize that: 1) there’s very little data that those interested in learning can use; and 2) most of it is boring.

Plus I’m sure not everyone is inclined to spend months hacking OkCupid and having 88 dates to validate their methods and algorithms.
