Daniel Gutierrez posted a pretty good summary of the recent discussions about which data processing environment, R or Python, is preferred, most productive, or most used:
While R has traditionally been the programming language of choice for data
scientists, some believe it is ceding ground to Python. Here is a short list
of some of the arguments I’ve heard of late, along with my personal
assessment.
The summary of a summary is that this conversation can be reduced to familiarity vs highly specialized algorithms.
Original title and link: Data Science Wars: Python vs. R
This is a sample code snippet from the Getting started guide for the recently announced Google Cloud Datastore:
req = datastore.BlindWriteRequest()
entity = req.mutation.upsert.add()
path = entity.key.path_element.add()
path.kind = 'Greeting'
path.name = 'foo'
message = entity.property.add()
message.name = 'message'
value = message.value.add()
value.string_value = 'to the cloud and beyond!'
try:
    # the write RPC itself is elided in this excerpt of the guide
    ...
except datastore.RPCError as e:
    # remember to do something useful with the exception
    pass
It’s been a while since I’ve seen such a terrible API. It makes me wonder what was wrong with the Google App Engine API; this one is even more verbose than XML.
Original title and link: A sample of Google Cloud Datastore Python API ( ©myNoSQL)
Bloom filters are super efficient data structures that allow us to
tell if an object is most likely in a data set or not by checking a
few bits. Bloom filters return some false positives but no false
negatives. Luckily, we can control the false positive rate with a
trade-off of time and memory.
Explanations and code included.
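To make the idea concrete, here is a minimal, illustrative Bloom filter of my own (not the code from the linked post), using salted SHA-1 digests to derive the bit positions:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        # more bits and more hash functions lower the false positive
        # rate, at the cost of memory and time respectively
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # derive k bit positions by salting the item with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # False is definitive; True only means "most likely present"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("redis")
print("redis" in bf)  # True: added items are always found
```

Growing `size` or tuning `num_hashes` is exactly the time/memory trade-off the quote mentions.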
Original title and link: Creating a Simple Bloom Filter in Python ( ©myNoSQL)
Most Pig tutorials you will find assume that you are working with
data where you know all the column names ahead of time, and that the
column names themselves are just labels, versus being composites of
labels and data. When working with HBase, for example, it’s actually
not uncommon for both of those assumptions to be false. Because HBase
is a columnar database, it’s very common to work with rows that have
thousands of columns. Under that circumstance, it’s also common for
the column names themselves to encode dimensions, such as date
and counter type.
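The flattening described above might look roughly like this in plain Python (a sketch of mine, not the post's code; in Pig it would be registered as a Jython UDF, and the composite column format is an assumption):

```python
def flatten_family(rowkey, family_map):
    # family_map: an HBase column-family map of column name -> value,
    # where column names are composites like 'YYYY-MM-DD:countertype'.
    # Each column becomes its own output tuple.
    rows = []
    for column, value in family_map.items():
        date, counter = column.split(":", 1)
        rows.append((rowkey, date, counter, value))
    return rows

flatten_family("user42", {"2013-05-01:clicks": "7"})
# [('user42', '2013-05-01', 'clicks', '7')]
```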
Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs ( ©myNoSQL)
Uri Laserson dives into the world of Python frameworks for Hadoop:
So my first order of business was to investigate some of the options
that exist for working with Hadoop from Python.
In this post, I will provide an unscientific, ad hoc review of my
experiences with some of the Python frameworks that exist for
working with Hadoop, including:
- Hadoop Streaming
For easy access, links to these frameworks:
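For reference, Hadoop Streaming is the lowest-level of these options: any executable that reads lines on stdin and emits tab-separated key/value pairs on stdout can serve as a mapper or reducer. A bare-bones word-count mapper (my own sketch, not taken from the post):

```python
#!/usr/bin/env python
# Hadoop Streaming word-count mapper: reads raw lines on stdin and
# prints one "word\t1" pair per word on stdout.
import sys

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

if __name__ == "__main__":
    for pair in mapper(sys.stdin):
        print(pair)
```

The higher-level frameworks the post reviews mostly exist to hide this stdin/stdout plumbing.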
Original title and link: A Guide to Python Frameworks for Hadoop ( ©myNoSQL)
Nice data experiment run by Sebastien Goasguen against the CloudStack mailing list:
To get the graphs I grabbed the emails archive from Apache. I used
Python to load the mbox files into single Mongo collections. I
cleaned the data to avoid replications of senders as well as remove
JIRA and Review Board entries. Then, with a little bit of PyMongo, I
ran the queries and built the graph with NetworkX. I finished up with
the graph visualization and calculations using Gephi. Since there
are thousands of emails and threads, there is still some work to
pre-process the data, avoid duplicates and match individuals to
multiple email addresses.
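A stdlib-only sketch of the graph-building step (my approximation of the pipeline; the actual code uses PyMongo and NetworkX, and the field names here are assumptions):

```python
from collections import Counter

def reply_edges(messages):
    # messages: iterable of dicts with a 'from' field and, for replies,
    # an 'in_reply_to_from' field naming the author being answered.
    # Returns directed edge weights (sender -> replied-to author), the
    # adjacency that NetworkX/Gephi would turn into a social graph.
    edges = Counter()
    for msg in messages:
        target = msg.get("in_reply_to_from")
        if target and target != msg["from"]:
            edges[(msg["from"], target)] += 1
    return edges
```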
- would using a graph database have made this experiment easier?
- would Linkurious be able to generate these graphics?
- is the code available anywhere so someone else could try to use a graph database and maybe run other types of visualizations?
Original title and link: Social Network Analysis of Apache CloudStack ( ©myNoSQL)
MongoMem, a Python tool from the Wish tech team:
Today, we’re releasing the first of these tools, MongoMem. MongoMem
solves the age-old problem of figuring out how much memory each
collection is using. In MongoDB, keeping your working set in memory
is pretty important for most apps. The problem is, there’s not
really a way to get visibility into the working set or what’s in
memory beyond looking at resident set size or page fault rate.
Original title and link: MongoMem: Memory Usage by Collection in MongoDB ( ©myNoSQL)
Marcel Caraciolo describes a MapReduce-based friend recommendation engine:
That’s a simple algorithm used at Atépassar for recommending friends using some basic graph analysis concepts.
Considering the network only has 140k users, the first question that came to my mind was why MapReduce and not a graph database?
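For reference, the classic map/reduce formulation of "people you may know" scores pairs by their number of mutual friends; a compact local sketch of mine (the general pattern, not Atépassar's actual code):

```python
from collections import defaultdict
from itertools import combinations

def map_friend_pairs(user, friends):
    # map step: every pair of this user's friends shares 'user'
    # as a mutual friend
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), user

def reduce_recommendations(pairs, friendships):
    # reduce step: count mutual friends per pair, skipping pairs
    # that are already connected
    scores = defaultdict(int)
    for (a, b), _mutual in pairs:
        if b not in friendships.get(a, set()):
            scores[(a, b)] += 1
    return scores
```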
Original title and link: Recommending Friends With MapReduce and Python ( ©myNoSQL)
A nice demo of r3, the recently announced MapReduce engine written in Python, run against commit histories from GitHub:
It is pretty simple to get r3 to do some cool calculations for us. I got the whole sample working in a very short amount of time. It took me more time to write this post than to make r3 calculate the committer percentages.
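As a rough idea of the calculation, split into map and reduce steps the way r3 does (this inline sketch is mine and does not use r3's actual API):

```python
from collections import Counter

def map_commits(commits):
    # map step: one (author, 1) pair per commit
    for commit in commits:
        yield commit["author"], 1

def reduce_percentages(mapped):
    # reduce step: aggregate counts and convert to percentages
    counts = Counter()
    for author, n in mapped:
        counts[author] += n
    total = sum(counts.values())
    return {author: 100.0 * n / total for author, n in counts.items()}
```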
Original title and link: Demoing the Python-Based Map-Reduce R3 Against GitHub Data ( ©myNoSQL)
A.Jesse Jiryu Davis:
PyMongo is three and a half years old. The core module is 3,000 source lines of code. There are hundreds of improvements and bugfixes, and 7,000 lines of unit tests. Anyone who tries to make a non-blocking version of it has a lot of work cut out for them, and will inevitably fall behind development of the official PyMongo. With Motor’s technique, I can wrap and reuse PyMongo whole, and when we fix a bug or add a feature to PyMongo, Motor will come along for the ride, for free.
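Motor's actual technique uses greenlets so PyMongo's blocking socket I/O can suspend the caller inside Tornado's event loop; a much cruder illustration of the "wrap the synchronous library whole" idea (my sketch, not Motor's code) is to proxy every method call onto a thread pool:

```python
import concurrent.futures

class AsyncWrapper:
    # Proxies any method of the wrapped object onto a thread pool,
    # returning a Future instead of blocking the caller. The delegate's
    # own code is reused unchanged, which is the point of the approach.
    def __init__(self, delegate, executor=None):
        self._delegate = delegate
        self._executor = executor or concurrent.futures.ThreadPoolExecutor()

    def __getattr__(self, name):
        method = getattr(self._delegate, name)
        def async_call(*args, **kwargs):
            return self._executor.submit(method, *args, **kwargs)
        return async_call
```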
Original title and link: How I Asynchronized MongoDB Python Synchronous Library ( ©myNoSQL)
Anyway, to cut a long story short, my attempt eventually failed because my mathematical naivety hid the fact that a brute-force attack would result in far too many hours of computation and a database that was simply too vast. I gave up after running it for three hours: it had computed 24 million board states, still had 18 million un-computed child boards to investigate, and had a 23 GB database. I think it is still possible to do this almost completely with brute force if I remove symmetrical board states (apparently, if done right, there are only 23 million possible board states when symmetry is considered), but that is way beyond just investigating the technology and object orientation.
Peg Solitaire sounds like a good excuse to look into Python and MongoDB.
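The symmetry pruning mentioned in the quote comes down to storing one canonical representative per equivalence class of boards. For a square grid (a simplification of mine; the real English peg-solitaire board is cross-shaped) that might look like:

```python
def symmetries(board):
    # board: tuple of row-tuples; yields all 4 rotations of the board
    # and the mirror image of each, i.e. the 8 dihedral symmetries
    b = board
    for _ in range(4):
        b = tuple(zip(*b[::-1]))             # rotate 90 degrees
        yield b
        yield tuple(row[::-1] for row in b)  # mirrored rotation

def canonical(board):
    # the lexicographically smallest symmetry; store only this form,
    # so all equivalent boards collapse to a single database entry
    return min(symmetries(board))
```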
Original title and link: Peg Solitaire With Python and MongoDB ( ©myNoSQL)
A list of DynamoDB libraries covering quite a few popular languages and frameworks:
A couple of things I’ve noticed (and that could be helpful to other NoSQL database companies):
- Amazon provides official libraries for a couple of major programming languages (Java, .NET, PHP, Ruby)
- Amazon is not shy about promoting libraries that are not official but have established themselves as good ones (e.g. Python’s Boto)
- The list doesn’t seem to include anything for C or Objective-C (Objective-C being the language of iOS and Mac apps)
Original title and link: DynamoDB Libraries, Mappers, and Mock Implementations ( ©myNoSQL)