python: All content tagged as python in NoSQL databases and polyglot persistence

Data Science Wars: Python vs. R

Daniel Gutierrez posted a pretty good summary of the recent discussions about which data processing environment, R or Python, is the preferred, most productive, or most used:

While R has traditionally been the programming language of choice for data scientists, some believe it is ceding ground to Python. Here is a short list of some of the arguments I’ve heard of late, along with my personal assessment of each…

The summary of a summary is that this conversation can be reduced to familiarity vs. highly specialized algorithms¹.


  1. While Python can get many of the specialized tools available in R, R has a lot more work to do to become a familiar environment for devs. 

Original title and link: Data Science Wars: Python vs. R (NoSQL database©myNoSQL)

via: http://inside-bigdata.com/2013/12/09/data-science-wars-python-vs-r/


A sample of Google Cloud Datastore Python API

This is a sample code snippet from the Getting started guide for the recently announced Google Cloud Datastore:

# assumes the Cloud Datastore client module from the Getting Started
# guide, e.g.: import googledatastore as datastore
def WriteEntity():
  req = datastore.BlindWriteRequest()
  # an upsert mutation: insert the entity, or replace it if it already exists
  entity = req.mutation.upsert.add()
  # the key path identifies the entity: kind 'Greeting', name 'foo'
  path = entity.key.path_element.add()
  path.kind = 'Greeting'
  path.name = 'foo'
  # a single property 'message' holding one string value
  message = entity.property.add()
  message.name = 'message'
  value = message.value.add()
  value.string_value = 'to the cloud and beyond!'
  try:
    # one blind (non-transactional) write RPC carrying the mutation
    datastore.blind_write(req)
  except datastore.RPCError as e:
    # remember to do something useful with the exception
    pass

I haven’t seen such a terrible API in a while. It makes me wonder what was wrong with the Google App Engine API; this one is more verbose than even XML.
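
For contrast, and purely as my own illustration, the rough equivalent with App Engine’s ndb API is a model class plus a one-line upsert:

from google.appengine.ext import ndb

class Greeting(ndb.Model):
    message = ndb.StringProperty()

# put() with an explicit key name acts as an upsert
Greeting(id='foo', message='to the cloud and beyond!').put()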

Original title and link: A sample of Google Cloud Datastore Python API (NoSQL database©myNoSQL)


Creating a Simple Bloom Filter in Python

Max Burstein:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade-off of time and memory.

Explanations and code included.
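
To make the mechanics concrete, here is a minimal, unoptimized sketch of my own (not Burstein’s code), deriving the k hash functions from salted SHA-256 digests:

import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # k bit positions, each from a differently salted hash of the item
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f'{salt}:{item}'.encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add('mongodb')
print('mongodb' in bf)  # True: probably in the set
print('riak' in bf)     # False: definitely not in the set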

Original title and link: Creating a Simple Bloom Filter in Python (NoSQL database©myNoSQL)

via: http://maxburstein.com/blog/creating-a-simple-bloom-filter/


Flatten Entire HBase Column Families With Pig and Python UDFs

Chase Seibert:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
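
A minimal sketch of the approach (my own, with hypothetical names): HBaseStorage can load a whole column family as a Pig map, and a Jython UDF can then explode that map into one tuple per column. Pig’s Jython engine supplies the outputSchema decorator:

# udfs.py, registered from Pig with something like:
#   REGISTER 'udfs.py' USING jython AS udfs;
#   rows = LOAD 'hbase://mytable'
#          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
#          AS (rowkey:chararray, cf:map[]);
#   flat = FOREACH rows GENERATE rowkey, FLATTEN(udfs.flatten_columns(cf));
@outputSchema('columns:bag{t:tuple(name:chararray, value:chararray)}')
def flatten_columns(col_map):
    # one output tuple per (column name, value) pair in the family
    if col_map is None:
        return []
    return [(name, value) for name, value in col_map.items()]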

Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (NoSQL database©myNoSQL)

via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html


A Guide to Python Frameworks for Hadoop

Uri Laserson dives into the world of Python frameworks for Hadoop:

So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

  • Hadoop Streaming
  • mrjob
  • dumbo
  • hadoopy
  • pydoop
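
For a flavor of the simplest of these, the canonical word-count job in mrjob looks like this (my example; the post compares how each framework expresses the same thing):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum the 1s emitted for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()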


Original title and link: A Guide to Python Frameworks for Hadoop (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/


Social Network Analysis of Apache CloudStack

Nice data experiment run by Sebastien Goasguen against the CloudStack mailing list:

To get the graphs I grabbed the emails archive from Apache. I used Python to load the mbox files into single Mongo collections. I cleaned the data to avoid replications of senders as well as remove JIRA and Review Board entries. Then with a little bit of PyMongo I made the queries and built the graph with NetworkX. I finished up with the graph visualization and calculations using Gephi. Since there are thousands of emails and threads, there is still some work to do to pre-process the data, avoid duplicates, and match individuals to multiple email addresses.
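
A rough sketch of that pipeline (my own reconstruction; file names and field choices are invented):

import mailbox
import pymongo
import networkx as nx

# load the mbox archive into a Mongo collection
emails = pymongo.MongoClient().cloudstack.emails
for msg in mailbox.mbox('cloudstack-users.mbox'):
    emails.insert_one({
        'sender': msg['From'],
        'message_id': msg['Message-ID'],
        'in_reply_to': msg.get('In-Reply-To'),
    })

# build a reply graph: an edge from each replier to the original sender
senders = {doc['message_id']: doc['sender'] for doc in emails.find()}
g = nx.DiGraph()
for doc in emails.find({'in_reply_to': {'$ne': None}}):
    if doc['in_reply_to'] in senders:
        g.add_edge(doc['sender'], senders[doc['in_reply_to']])

# export for visualization and metrics in Gephi
nx.write_gexf(g, 'cloudstack.gexf')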


Three questions:

  1. would using a graph database have made this experiment easier?
  2. would Linkurious be able to generate these graphics?
  3. is the code available anywhere so someone else could try to use a graph database and maybe run other types of visualizations?

Original title and link: Social Network Analysis of Apache CloudStack (NoSQL database©myNoSQL)

via: http://sebgoa.blogspot.ch/2013/01/social-network-analysis-of-apache.html


MongoMem: Memory Usage by Collection in MongoDB

MongoMem, a Python tool by the Wish tech team:

Today, we’re releasing the first of these tools, MongoMem. MongoMem solves the age-old problem of figuring out how much memory each collection is using. In MongoDB, keeping your working set in memory is pretty important for most apps. The problem is, there’s not really a way to get visibility into the working set or what’s in memory beyond looking at resident set size or page fault rate.
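
For reference, the server-wide numbers the quote alludes to are easy to get from PyMongo; MongoMem’s point is going a level deeper, to per-collection figures:

import pymongo

client = pymongo.MongoClient()
status = client.admin.command('serverStatus')
# resident/virtual/mapped sizes in MB, for the whole mongod process
print(status['mem'])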

Original title and link: MongoMem: Memory Usage by Collection in MongoDB (NoSQL database©myNoSQL)

via: http://eng.wish.com/mongomem-memory-usage-by-collection-in-mongodb/


Recommending Friends With MapReduce and Python

Marcel Caraciolo describes Atépassar’s MapReduce-based friend recommendation engine:

That’s a simple algorithm used at Atépassar for recommending friends using some basic graph analysis concepts.
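
The core idea is classic friends-of-friends scoring: recommend people you are not yet connected to, ranked by how many friends you share. A toy, non-MapReduce version of my own:

from collections import defaultdict

friends = {
    'ana':   {'bob', 'carla'},
    'bob':   {'ana', 'carla', 'dan'},
    'carla': {'ana', 'bob', 'dan'},
    'dan':   {'bob', 'carla'},
}

def recommend(user):
    scores = defaultdict(int)
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                scores[candidate] += 1  # one shared friend
    return sorted(scores, key=scores.get, reverse=True)

print(recommend('ana'))  # ['dan']: shares bob and carla with ana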

Considering the network only has 140k users, the first question that came to my mind was: why MapReduce and not a graph database?

Original title and link: Recommending Friends With MapReduce and Python (NoSQL database©myNoSQL)

via: http://aimotion.blogspot.ca/2012/10/atepassar-recommendations-recommending.html


Demoing the Python-Based Map-Reduce R3 Against GitHub Data

A nice demo of r3, the recently announced Python-based map-reduce engine¹, run against commit histories from GitHub:

It is pretty simple to get r3 to do some cool calculations for us. I got the whole sample in a very short amount of time. It took me more time to write this post than to make r3 calculate the committer percentages.


  1. r3 is a Python-based map-reduce engine using Redis as a backend 
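
This is not r3’s actual API (see the post for that); just to show the general shape of a Redis-backed map-reduce, using plain redis-py:

import json
import redis

r = redis.Redis()

def map_worker():
    # each worker pops raw commits and emits (key, value) pairs
    while True:
        raw = r.lpop('mr:input')
        if raw is None:
            break
        commit = json.loads(raw)
        r.rpush('mr:mapped', json.dumps([commit['author'], 1]))

def reduce_all():
    # a single reducer aggregates the emitted pairs
    counts = {}
    while True:
        raw = r.lpop('mr:mapped')
        if raw is None:
            break
        key, value = json.loads(raw)
        counts[key] = counts.get(key, 0) + value
    return counts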

Original title and link: Demoing the Python-Based Map-Reduce R3 Against GitHub Data (NoSQL database©myNoSQL)

via: http://blog.heynemann.com.br/2012/08/04/r3-a-quick-demo-of-usage/


How I Asynchronized MongoDB Python Synchronous Library

A.Jesse Jiryu Davis:

PyMongo is three and a half years old. The core module is 3000 source lines of code. There are hundreds of improvements and bugfixes, and 7000 lines of unittests. Anyone who tries to make a non-blocking version of it has a lot of work cut out for them, and will inevitably fall behind development of the official PyMongo. With Motor’s technique, I can wrap and reuse PyMongo whole, and when we fix a bug or add a feature to PyMongo, Motor will come along for the ride, for free.
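
Motor’s actual implementation used greenlets under Tornado; purely to illustrate the wrap-the-whole-library idea in today’s terms, here is an asyncio-flavored sketch of mine that proxies blocking PyMongo calls to a worker thread:

import asyncio
import functools

class AsyncCollection:
    """Wraps a synchronous pymongo Collection without reimplementing it."""

    def __init__(self, sync_collection):
        self._sync = sync_collection

    def __getattr__(self, name):
        method = getattr(self._sync, name)

        async def wrapper(*args, **kwargs):
            # run the blocking PyMongo call on a worker thread
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(
                None, functools.partial(method, *args, **kwargs))

        return wrapper

# usage: coll = AsyncCollection(client.db.collection)
#        doc = await coll.find_one({'_id': 42})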

Original title and link: How I Asynchronized MongoDB Python Synchronous Library (NoSQL database©myNoSQL)

via: http://emptysquare.net/blog/motor-internals-how-i-asynchronized-a-synchronous-library/


Peg Solitaire With Python and MongoDB

David Taylor:

Anyway, to cut a long story short, my attempt eventually failed because my mathematical naivety hid the fact that a brute-force attack would result in far too many hours of computation and a database that was simply too vast. I gave up after running it for three hours: it had computed 24 million board states, still had 18 million un-computed child boards to investigate, and had a 23 GB database. I think it is still possible to do this almost completely with brute force if I remove symmetrical board states (apparently, if done right, there are only 23 million possible board states when symmetry is considered), but that is way beyond just investigating the technology and object orientation.
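
The symmetry pruning he mentions amounts to storing each board under a canonical form, i.e. the smallest of its eight rotations and reflections. A sketch of my own, with boards as tuples of row tuples:

def symmetries(board):
    # all eight rotations and reflections of a square board
    def rotate(b):  # 90 degrees clockwise
        return tuple(zip(*b[::-1]))
    variants = []
    b = board
    for _ in range(4):
        variants.append(b)
        variants.append(tuple(row[::-1] for row in b))  # mirror image
        b = rotate(b)
    return variants

def canonical(board):
    # keying the database on this dedupes symmetric board states
    return min(symmetries(board))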

Peg Solitaire sounds like a good excuse to look into Python and MongoDB.

Original title and link: Peg Solitaire With Python and MongoDB (NoSQL database©myNoSQL)

via: http://davidandrewtaylor.blogspot.co.uk/2012/05/python-and-nosql-after-listening-to.html


DynamoDB Libraries, Mappers, and Mock Implementations

Amazon has published a list of DynamoDB libraries, mappers, and mock implementations covering quite a few popular languages and frameworks.


A couple of things I’ve noticed (and that could be helpful to other NoSQL database companies):

  1. Amazon provides official libraries for a couple of major programming languages (Java, .NET, PHP, Ruby)
  2. Amazon is not shy about promoting libraries that are not official but have established themselves as good ones (e.g. Python’s Boto; see the snippet after this list)
  3. The list doesn’t seem to include anything for C or Objective-C (Objective-C being the language of iOS and Mac apps)
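
As an aside, here is what a basic DynamoDB write and read looks like with Boto’s modern successor, boto3 (which postdates this list; the table name and item shape are made up):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Greetings')
# write an item keyed by 'id', then read it back
table.put_item(Item={'id': 'foo', 'message': 'to the cloud and beyond!'})
print(table.get_item(Key={'id': 'foo'})['Item'])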

Original title and link: DynamoDB Libraries, Mappers, and Mock Implementations (NoSQL database©myNoSQL)

via: http://aws.typepad.com/aws/2012/04/amazon-dynamodb-libraries-mappers-and-mock-implementations-galore.html