Complex data manipulation in Cascalog, Pig, and Hive

Bruno Bonacci brings up some very good points about why using a single, coherent solution to manipulate data results in higher productivity, by comparing what Pig and Hive require:

In languages like Pig and Hive, in order to do complex manipulation of your data you have to write User Defined Functions (UDFs). UDFs are a great way to extend the basic functionality, but for Hive and Pig you have to use a different language to write them, as the basic SQL or Pig Latin languages have only a handful of functions and lack basic control structures. Both offer the possibility of writing UDFs in a number of different languages (which is great), but this requires a programming paradigm switch by the developer. Pig allows you to write UDFs in Java, Jython, JavaScript, Groovy, Ruby, and Python; for Hive you need to write them in Java (good article here). I won't use Java UDFs as the example since the comparison wouldn't be fair (life is too short to write them in Java), but let's assume that you want to write a UDF for Pig and you want to use Python. If you go for the JVM platform version (Jython), you won't be able to use existing modules from the Python ecosystem (unless they are pure Python). The same goes for Ruby and JavaScript. If you decide to use CPython instead, you take on the setup burden of installing Python and all the modules you intend to use on every Hadoop task node. So: you start with a language such as Pig Latin or SQL; you have to write, compile, and bundle UDFs in a different language; you are constrained to the plain language without importing modules, or face the extra burden of additional setup; and, as if that were not enough, you have to smooth over the type differences between the two languages as they communicate back and forth with the UDF. For me that's enough to say that we can do better. Cascalog is a Clojure DSL, so your main language is Clojure, your custom functions are Clojure, the data is represented in Clojure data types, and the runtime is the JVM: no switch required, no additional compilation required, no installation burden, and you can use all available libraries in the JVM ecosystem.
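
To make the paradigm switch concrete, here is a minimal sketch of what a Pig UDF in Jython looks like; the function and field names are illustrative, and the Pig Latin registration is shown in the comments:

```python
# udfs.py: a sketch of a Pig UDF written for the Jython engine.
# Registered and used from Pig Latin along these lines:
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   B = FOREACH A GENERATE myudfs.normalize(name);

@outputSchema("normalized:chararray")  # Pig injects outputSchema into Jython UDFs
def normalize(value):
    # Only pure-Python logic is usable here: CPython extension modules
    # (numpy, etc.) cannot be imported under Jython.
    if value is None:
        return None
    return value.strip().lower()
```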

I’m not a big fan of SQL, except in the cases where it really belongs; SQL-on-Hadoop is my least favorite topic, second perhaps only to the whole complexity of the ecosystem. In the space of multi-format/unstructured data I’ve always liked the pragmatism and legibility of Pig. But the OP is definitely right about the added complexity.

This also reminded me of the Python vs R “war”.

Original title and link: Complex data manipulation in Cascalog, Pig, and Hive (NoSQL database©myNoSQL)

via: http://blog.brunobonacci.com/2014/06/01/cascalog-by-examples-part1/


Tamr - new data cleanup company from Michael Stonebraker

If you think this sounds a lot like Trifacta, well, the difference seems to be a single 0:

Palmer, who along with Stonebraker created database company Vertica Systems (which HP bought in 2011), said what separates the company’s new product from other similar ones, like Trifacta, is the emphasis on analyzing thousands of data sources as opposed to hundreds with humans acting as the guiding light.

Original title and link: Tamr - new data cleanup company from Michael Stonebraker (NoSQL database©myNoSQL)

via: http://gigaom.com/2014/05/19/michael-stonebrakers-new-startup-tamr-wants-to-help-get-messy-data-in-shape/


Time to regulate Big Data?

I had a conversation recently on this subject. As someone born and raised in a communist country, I find the prospect of having no control over what data is collected about you, and who owns it, very concerning. Terrifying.

For years, data brokers have been collecting and selling billions of pieces of your personal information — from your income to your shopping habits to your medical ailments. Now federal regulators say it’s time you have more control over what’s collected and whether it will be used at all.

After reading this post I was, finally, close to tears. Then I realized that this bill would need to pass first. And with the right lobbying, that might actually never happen (as in “But so far Rockefeller’s bill has gone nowhere”).

Original title and link: Time to regulate Big Data? (NoSQL database©myNoSQL)

via: http://money.cnn.com/2014/05/27/pf/ftc-big-data/


Causality: A discussion of causality, vector clocks, version vectors, and CRDTs

If you have a quiet Sunday and want to listen to something extremely awesome, you should try this episode of the ThinkDistributed podcast covering causality in distributed systems, with guests Peter Bailis, Carlos Baquero, and Marek Zawirski.

The links in the show notes will fill your reading list for a good while.
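
For a taste of the subject before pressing play, here is a minimal sketch of how vector clocks capture causality (a toy illustration, not something from the episode):

```python
def happens_before(a, b):
    """True if the event with clock `a` causally precedes the event with clock `b`.

    Clocks are dicts mapping node id -> counter; missing entries count as 0.
    """
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

def concurrent(a, b):
    # Neither event precedes the other: they are causally concurrent,
    # which is exactly the case CRDTs are designed to merge safely.
    return not happens_before(a, b) and not happens_before(b, a)

# Example: two writes that neither node observed from the other
assert concurrent({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1})
```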

Original title and link: Causality: A discussion of causality, vector clocks, version vectors, and CRDTs (NoSQL database©myNoSQL)


A Tour of Machine Learning Algorithms

After we understand the type of machine learning problem we are working with, we can think about the type of data to collect and the types of machine learning algorithms we can try. In this post we take a tour of the most popular machine learning algorithms. It is useful to tour the main algorithms to get a general idea of what methods are available.

Timeless

Original title and link: A Tour of Machine Learning Algorithms (NoSQL database©myNoSQL)

via: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms


New PostgreSQL guns for NoSQL market

Joab Jackson (PCWorld):

Embracing the widely used JSON data-exchange format, the new version of the PostgreSQL open-source database takes aim at the growing NoSQL market of nonrelational data stores, notably the popular MongoDB.

I’ve always appreciated the openness of the PostgreSQL developers to consider new features and their efforts to bring these to a relational database. What’s missing from the picture is how many users are actually using these features.
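
For a concrete sense of what this looks like from application code, here is a minimal sketch assuming PostgreSQL 9.4’s jsonb type and the psycopg2 driver; the table and document fields are made up:

```python
import json
import psycopg2  # assumes a PostgreSQL 9.4+ server with jsonb support

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

# Schemaless-style storage inside an ordinary relational table
cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, doc jsonb)")
cur.execute("INSERT INTO events (doc) VALUES (%s)",
            [json.dumps({"type": "click", "user": "alice", "count": 3})])

# Reach into the document with ->> and containment (@>), much like a document store
cur.execute("SELECT doc ->> 'user' FROM events WHERE doc @> %s",
            [json.dumps({"type": "click"})])
print(cur.fetchall())

conn.commit()
conn.close()
```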

Original title and link: New PostgreSQL guns for NoSQL market (NoSQL database©myNoSQL)

via: http://www.pcworld.com/article/2155780/new-postgresql-guns-for-nosql-market.html


Monitoring CouchDB with Munin

A long but extremely useful list of metrics to get from CouchDB:

Most monitoring system plugins for CouchDB are unable to handle all the described cases, since they work with just the /_stats resource. That’s good but, as you may have noted, not enough to see the full picture of your CouchDB.

However, at least for Munin there is a plugin that handles all of this post’s recommendations.
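
If you want to poll these endpoints yourself, the core of such a plugin is small. A minimal sketch, assuming a CouchDB 1.x node; the URL and the choice of metrics are illustrative:

```python
import json
from urllib.request import urlopen

COUCH = "http://127.0.0.1:5984"  # assumes a local CouchDB node

def get(path):
    with urlopen(COUCH + path) as resp:
        return json.load(resp)

# /_stats covers request rates, status codes, open databases, and so on
stats = get("/_stats")
print(stats["httpd"]["requests"])
print(stats["couchdb"]["open_databases"])

# /_active_tasks is needed for compaction and replication progress,
# which /_stats alone does not expose
for task in get("/_active_tasks"):
    print(task["type"], task.get("progress"))
```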

Original title and link: Monitoring CouchDB with Munin (NoSQL database©myNoSQL)

via: http://gws.github.io/munin-plugin-couchdb/guide-to-couchdb-monitoring.html


A map to reviewing RavenDB code base

Ayende provides answers to a series of questions about specific details of the RavenDB code base. This could be a very good starting point for those interested in how RavenDB is implemented.

Original title and link: A map to reviewing RavenDB code base (NoSQL database©myNoSQL)

via: http://ayende.com/blog/166658/a-map-to-reviewing-ravendb?Key=6566461c-3541-40c3-8094-7ae313c036f3


Paper: Parallel Graph Partitioning for Complex Networks

Authored by a team from the Karlsruhe Institute of Technology, the paper “Parallel graph partitioning for complex networks” presents a parallelized and adapted label propagation technique for partitioning graphs:

The graph partitioning problem is NP-complete [3], [4] and there is no approximation algorithm with a constant ratio factor for general graphs [5]. Hence, heuristic algorithms are used in practice.

A successful heuristic for partitioning large graphs is the multilevel graph partitioning (MGP) approach depicted in Figure 1, where the graph is recursively contracted to achieve smaller graphs which should reflect the same basic structure as the input graph.
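
For intuition, the label propagation step that the paper parallelizes and adapts looks roughly like this in its plain sequential form (a sketch; the paper’s version adds size constraints and runs inside the multilevel scheme):

```python
import random
from collections import Counter

def label_propagation(adj, max_rounds=10):
    """Plain sequential label propagation over a graph.

    adj: dict mapping each node to a list of its neighbors.
    """
    labels = {v: v for v in adj}  # every node starts in its own block
    nodes = list(adj)
    for _ in range(max_rounds):
        random.shuffle(nodes)  # visit order matters, so randomize it
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            # adopt the label that is most frequent among the neighbors
            best = Counter(labels[u] for u in adj[v]).most_common(1)[0][0]
            if labels[v] != best:
                labels[v] = best
                changed = True
        if not changed:  # converged
            break
    return labels
```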


To SQL or to NoSQL?

Bob Lambert’s thoughts on a post about migrating from a NoSQL database back to a relational database:

I really liked this post, but particularly for these two points:

  • All data is relational, but NoSQL is useful because sometimes it isn’t practical to treat it as such for volume/complexity reasons.
  • In the comments, Jonathon Fisher remarked that NoSQL is really old technology, not new. (Of course, you have to like any commenter who uses the word “defenestrated”.)

I have lost count of how many times I’ve read exactly these arguments. But let’s take a different look at this post and the original article:

  1. the first thing that strikes me is that there’s no mention of which NoSQL database was used; on top of that, there’s no explanation of what led to choosing that database in the first place. What if it was just an experiment? What if the initial implementation was just a fashionable decision?

  2. all data is relational

    A more accurate statement would be “all data is connected“. The way we represent these connections can take many different forms and many times depends on the ways we use the data. This is exactly the core principle when doing data modeling in the NoSQL world too.

    The relational model is the most common as a consequence of the popularity of relational databases. One quick example of data that is connected but not relational is hierarchical data, an area where relational databases still don’t excel (even if some have built custom solutions).

  3. “data in a relational model is optimized for the set of all possible operations”.

    Actually, the relational model optimizes for space efficiency and set operations; there is no model that optimizes for everything. Take graph data and traversal operations as an obvious counterexample: relationships and operations that fall outside the sweet spot of a relational database (see the sketch after this list). And there are quite a few other examples: massive sparse matrices, etc.

  4. “Todd Homa recounts one horror story that shows how NoSQL data modelers must be aware of the corners into which they paint themselves as they optimize one access path at the expense of others.”

    This is like saying that everyone using a relational database has given up any chance of growing their application beyond a single server. Both claims are quite inaccurate.
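
To make the traversal counterexample from point 3 concrete, here is a minimal sketch: a multi-hop reachability query that is a short loop over an adjacency structure, but would take one self-join per hop (or a recursive CTE) in SQL. The data and names are illustrative:

```python
from collections import deque

def reachable_within(adj, start, max_hops):
    """Nodes reachable from `start` in at most `max_hops` edges.

    adj: dict mapping node -> iterable of neighbors.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

friends = {"ann": ["bob"], "bob": ["cid"], "cid": ["ann", "dee"]}
print(reachable_within(friends, "ann", 2))  # {'ann', 'bob', 'cid'}
```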

Last, I think we should change the “choose the right tool for the job” advice to something a bit clearer: “understand and choose the trade-offs that correspond to your requirements”. It doesn’t sound as nice, but I think it’s better.

Original title and link: To SQL or to NoSQL? (NoSQL database©myNoSQL)

via: http://robertlambert.net/2014/05/to-sql-or-to-nosql/


The data analytics handbook

A free book based on interviews with data scientists, data analysts, and researchers. Available here.

Original title and link: The data analytics handbook (NoSQL database©myNoSQL)


MMS and the state of backups in MongoDB land

So just to be clear, if you are doing it yourself, you are probably settling for something other than a consistent snapshot. Even then, it’s not simple.

I’m always fascinated by companies introducing products by calling out how shitty and complicated their other products are. Axion. Now cleans 10 times better than before.

Original title and link: MMS and the state of backups in MongoDB land (NoSQL database©myNoSQL)

via: http://blog.mms.mongodb.com/post/83410779867/mongodb-backup-for-sharded-clusters