


clojure: All content tagged as clojure in NoSQL databases and polyglot persistence

Parkour - Idiomatic Clojure for Map Reduce

If you are running out of interesting projects to experiment with during this seasonal break, take a look at Parkour, a Clojure library for writing MapReduce jobs.

From Marshall Bockrath-Vandegrift’s guest post on Cloudera’s blog:

Parkour is our new Clojure library that carries this philosophy to Apache Hadoop’s MapReduce platform. Instead of hiding the underlying MapReduce model behind new framework abstractions, Parkour exposes that model with a clear, direct interface. Everything possible in raw Java MapReduce is possible with Parkour, but usually with a fraction of the code.
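That model is small enough to sketch locally. Here is a minimal, in-memory illustration of the map/shuffle/reduce cycle in plain Clojure — this shows the model Parkour exposes, not Parkour's actual API:

```clojure
(require '[clojure.string :as str])

;; A local, in-memory sketch of the MapReduce model: a mapper emits
;; [key value] pairs, the "shuffle" groups them by key, and a reducer
;; folds each group's values. Parkour runs this same model on Hadoop.
(defn map-reduce [mapper reducer coll]
  (->> coll
       (mapcat mapper)                        ; map phase: seq of [k v]
       (group-by first)                       ; shuffle: group by key
       (map (fn [[k kvs]] [k (reducer k (map second kvs))]))
       (into {})))

;; Word count, the canonical example:
(defn tokenize [line]
  (for [w (str/split line #"\s+")] [w 1]))

(map-reduce tokenize (fn [_ vs] (reduce + vs))
            ["the quick fox" "the lazy dog"])
;; => {"the" 2, "quick" 1, "fox" 1, "lazy" 1, "dog" 1}
```

Parkour's contribution is letting you write the mapper and reducer as ordinary Clojure functions like these while Hadoop supplies the distributed shuffle.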

Original title and link: Parkour - Idiomatic Clojure for Map Reduce (NoSQL database © myNoSQL)

Cascalog and Cascading: Productivity Solutions for Data Scientists

A good explanation of why Cascading, Cascalog, and other frameworks hiding away the details of MapReduce are making things easier for non-programmers:

Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.

Original title and link: Cascalog and Cascading: Productivity Solutions for Data Scientists (NoSQL database © myNoSQL)


Thoughts About Datomic

Sergio Bossa has left a great comment on Fogus’s blog post about Datomic that encapsulates in much more detail all my notes (and some more) about Datomic:

I waited for the Datomic announcement with great excitement, and I’d like now to share some thoughts, hoping they will be food for more comments or blog posts.

Datomic certainly provides interesting features, most notably:

  1. Clojure-style data immutability, separating entity values in time.
  2. Declarative query language with powerful aggregation capabilities.

But unfortunately, my list of concerns is way longer, maybe because some lower level aspects weren’t addressed in the whitepaper, or maybe because my expectations were really too high. Let’s try to briefly enumerate the most relevant ones:

  1. Datomic provides powerful aggregation/processing capabilities, but violates one of the most important rules in distributed systems: collocating processing with data, as data must be moved from storage to peers’ working set in order to be aggregated/processed. In my experience, this is a huge penalty when dealing with even medium-sized datasets, and just answering that “we expect it to work for most common use cases” isn’t enough.

    My comment: The answer to similar comments pointed to the local caches. But I think it is still a very valid observation.

  2. In-process caching of working sets usually leads in my experience to compromising overall application reliability: that is, the application usually ends up spending lots of time dealing with the working set cache, either faulting/flushing objects or gc’ing them, rather than doing its own business.

  3. Transactors are both a Single Point Of Bottleneck and a Single Point Of Failure: you may not care about the former (I would, by the way), but you have to care about the latter.

    My comment: The Datomic paper contains an interesting formulation about the job of transactors for reads and writes:

    When reads are separated from writes, writes are never held up by queries. In the Datomic architecture, the transactor is dedicated to transactions, and need not service reads at all!

    In an ACID system, though, both reads and writes are transactions.

  4. You say you avoid sharding, but with the transactor being a single point of bottleneck, the time will come when you have too much data for a single-transactor system, and then you’ll have to, guess what, shard; and Datomic apparently has no support for this.

  5. There’s no mention about how Datomic deals with network partitions.

I think that’s enough. I’ll be happy to read any feedback about my points.

Like Sergio Bossa, I’d really love to hear some answers from the Datomic team.

Original title and link: Thoughts About Datomic (NoSQL database © myNoSQL)

Connection Management in MongoDB and CongoMongo

Are connections pooled or not? Konrad Garus digs in to find the answer:

Easy. Too easy and comfortable. Coming from good old heavyweight JDBC/SQL, I felt uneasy about the connection management. How does it work? Does it just open a connection and leave it dangling in the air the whole time? That might be good for a quick spike in the REPL, but not for a real application which needs concurrency, is supposed to be running for days and weeks, and so on. How do you maintain it properly?
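Whatever the driver does under the hood (CongoMongo wraps the MongoDB Java driver, which keeps its own internal pool), the idea being asked about is easy to sketch in plain Clojure. A toy pool, not CongoMongo's implementation:

```clojure
;; Toy connection pool: a ref holding the available "connections".
;; Real drivers add blocking waits, health checks, and timeouts.
(defn make-pool [conns] (ref (vec conns)))

(defn borrow [pool]
  (dosync
    (when-let [c (peek @pool)]   ; nil when the pool is exhausted
      (alter pool pop)
      c)))

(defn give-back [pool conn]
  (dosync (alter pool conj conn)))

;; Usage: borrow around each unit of work, then return the connection.
(def pool (make-pool [:conn-a :conn-b]))
(let [c (borrow pool)]
  ;; ... run a query with c ...
  (give-back pool c))
```

The point of pooling is exactly the concern raised above: connections are reused across requests instead of dangling open per caller or being re-opened on every query.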

Original title and link: Connection Management in MongoDB and CongoMongo (NoSQL database © myNoSQL)


Setting Up, Modeling and Loading Data in HBase With Hadoop and Clojure: NoSQL Tutorials

Even if you are not familiar with Clojure, you’ll still enjoy this fantastic HBase tutorial:

And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.
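One concrete instance of that up-front design work is the row key. HBase sorts rows lexicographically by key, so the key must encode the access pattern; a common trick (an illustrative example, not taken from the tutorial) is a composite key with a reversed, zero-padded timestamp so the newest rows for an entity sort first:

```clojure
;; HBase scans return rows in lexicographic key order, so encode the
;; query you want into the key. Reversing the timestamp makes a prefix
;; scan on "user-id|" yield the newest events first.
(defn row-key [user-id ts-millis]
  (format "%s|%013d" user-id (- 9999999999999 ts-millis)))

(row-key "alice" 1300000000000)
;; => "alice|8699999999999"
```

Getting a detail like this wrong is why loading and re-loading a small test set before cramming in the full dataset pays off.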

Original title and link: Setting Up, Modeling and Loading Data in HBase With Hadoop and Clojure: NoSQL Tutorials (NoSQL database © myNoSQL)


Creating a Query DSL Using Clojure and MongoDB

Christopher Maier:

Not only does creating a DSL make querying easy (particularly with complex conditions), but it also insulates your application from change in a few important ways. Especially in the initial, exploratory stages of a project, it is common to change and evolve a data schema, and NoSQL environments make this very simple. Using a DSL will shield your code from these changes; you only need to change the DSL “atoms” that the schema change affects.
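A toy version of the idea (my sketch, not Maier's actual DSL): a handful of combinators that build MongoDB-style query maps, so call sites never spell out `$gt`/`$lt` and a schema change touches only these small definitions:

```clojure
;; Toy query DSL: each combinator returns a MongoDB-style query map;
;; application code composes them instead of writing raw operator maps.
(defn gt  [field v] {field {:$gt v}})
(defn lt  [field v] {field {:$lt v}})
(defn eq  [field v] {field v})
(defn all [& clauses] (apply merge clauses))

(all (eq :status "active") (gt :age 21))
;; => {:status "active", :age {:$gt 21}}
```

If `:age` is later renamed in the schema, only the call sites (or a single wrapper) change; no raw operator syntax is scattered through the codebase.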

In case you missed it, Foursquare open sourced their type-safe Scala DSL for MongoDB.

Original title and link: Creating a Query DSL Using Clojure and MongoDB (NoSQL database © myNoSQL)


Simhashing in Hadoop with MapReduce, Cascalog and Cascading

Simhashing in MapReduce is a quick way to find clusters in a huge amount of data. By using Cascading and Cascalog we’re able to work with MapReduce jobs at the level of functions rather than individual map-reduce phases.

Chris K. Wensel
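For the curious, the core of simhash itself fits in a few lines of plain Clojure (a local sketch of the algorithm, independent of the Hadoop machinery): hash each feature, tally +1/-1 per bit position, and keep the sign bits. Near-duplicate feature sets come out with a small Hamming distance.

```clojure
;; Simhash over a set of features: for each of 32 bit positions, add 1
;; when a feature's hash has that bit set, subtract 1 otherwise; the
;; final fingerprint keeps a 1 wherever the tally is positive.
(defn simhash [features]
  (let [tally (reduce (fn [acc f]
                        (let [h (hash f)]
                          (mapv (fn [w i] (if (bit-test h i) (inc w) (dec w)))
                                acc (range 32))))
                      (vec (repeat 32 0))
                      features)]
    (reduce (fn [out [i w]] (if (pos? w) (bit-set out i) out))
            0
            (map-indexed vector tally))))

;; Similar documents have fingerprints with a small Hamming distance.
(defn hamming [a b] (Long/bitCount (bit-xor a b)))
```

The MapReduce part is then just computing `simhash` per document in the map phase and grouping by (near-)equal fingerprints, which is what Cascading/Cascalog let you express at the level of functions.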

Original title and link: Simhashing in Hadoop with MapReduce, Cascalog and Cascading (NoSQL databases © myNoSQL)


Using CouchDB with Clojure

All CouchDB basic features explained using Clojure:

This article shows how to access the CouchDB APIs using Clojure, a dynamic language for the JVM. Examples use the Clutch API and clj-http library in parallel to illustrate a higher-level CouchDB API and lower-level REST-based calls, respectively.
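The "lower-level REST-based calls" half is worth seeing in isolation: CouchDB's API is plain HTTP plus JSON, so creating a document with a known id is just a PUT to `/<db>/<doc-id>`. A sketch that only builds the request description as data (a real client would hand this map to clj-http; no CouchDB server is assumed here):

```clojure
;; CouchDB speaks plain HTTP + JSON: creating a document with a known
;; id is a PUT to /<db>/<doc-id>. This builds the request as data; a
;; real client would execute it with clj-http.
(defn put-doc-request [base db id doc-json]
  {:method  :put
   :url     (str base "/" db "/" id)
   :headers {"Content-Type" "application/json"}
   :body    doc-json})

(put-doc-request "http://localhost:5984" "albums" "a1"
                 "{\"artist\":\"Kraftwerk\"}")
```

Clutch wraps exactly this kind of request behind idiomatic Clojure functions, which is why the article can show the two levels side by side.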

Original title and link: Using CouchDB with Clojure (NoSQL databases © myNoSQL)


Riak Map/Reduce Queries in Clojure

Over this week I’ve been working on a proof of concept to see if it’s possible to use Clojure as the map/reduce language for Riak, in the same way we can now use JavaScript and Erlang for that purpose. To accomplish that, I needed a way to call Clojure code from Erlang, so I set up a very simple server in Clojure that runs as an Erlang node using Closerl.

Theoretically nice… practically I’d say there is a fundamental problem with this idea (beyond the ones listed in the article). Map and reduce functions are supposed to run on the nodes hosting the data[1]. If you have to ship that data over the wire, you are effectively reimplementing MapReduce inside your application, and the data locality property is lost. Not to mention that by adding another variable to the equation (the JVM), your distributed system becomes more sensitive to failures.

  1. As mentioned in this question about Riak MapReduce, currently Riak runs the map functions on all nodes, while the reduce function runs only on the node receiving the request.

Original title and link for this post: Riak Map/Reduce Queries in Clojure (published on the NoSQL blog: myNoSQL)


Quick Guide for Riak with Clojure

From installation to using ☞ clj-riak, the Clojure library for Riak, including MapReduce with Riak:

This brief introduction leaves many aspects of Riak unaddressed. For example, we have not looked at throughput, scalability, fault tolerance, conflict resolution, or production operations – all critical to a complete understanding of the datastore.

Quick Guide for Riak with Clojure originally posted on the NoSQL blog: myNoSQL


CouchDB: 5.5k inserts/sec with fire-and-forget and bulk ops

After MongoDB’s default fire-and-forget behavior was criticized as wrong, the CouchDB community welcomed this sample Clojure code showing 5500 inserts/second implemented with fire-and-forget behavior and bulk inserts:

So I contemplated the problem some and wondered whether Clojure’s STM (Software Transactional Memory) could be leveraged. As requests come in, instead of connecting immediately to the database, why not queue them up until we have an optimal number and then do a bulk insert?
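The queuing idea can be sketched in plain Clojure with the bulk-insert function passed in (a simplified version of the approach; a production batcher would also need a timer so a half-full batch eventually flushes):

```clojure
;; STM batcher: enqueue documents in a ref; once batch-size is reached,
;; atomically take the whole batch and hand it to bulk-insert! outside
;; the transaction (side effects must not run inside dosync, since
;; transactions may retry).
(defn make-batcher [batch-size bulk-insert!]
  (let [queue (ref [])]
    (fn enqueue [doc]
      (when-let [batch (dosync
                         (alter queue conj doc)
                         (when (>= (count @queue) batch-size)
                           (let [b @queue]
                             (ref-set queue [])
                             b)))]
        (bulk-insert! batch)))))

;; With CouchDB, bulk-insert! would POST the whole batch in one request.
(def batches (atom []))
(def enqueue (make-batcher 3 #(swap! batches conj %)))
(doseq [i (range 7)] (enqueue i))
@batches ;; => [[0 1 2] [3 4 5]]  (the 7th doc waits for the next batch)
```

Each bulk request amortizes the per-request overhead across the whole batch, which is where the throughput gain comes from.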