
Pivotal People

I really like the people page on the website of the recently announced Pivotal, and in particular a couple of the individual pictures:

Pivotal People

✚ I did some digging but came up empty on the relationship between Pivotal and Sir Tim Berners-Lee and Professor Joseph M. Hellerstein (CEO of Trifacta). Update: according to this post, the page lists people that inspire the Pivotal team.

Original title and link: Pivotal People (NoSQL database©myNoSQL)


Boundary for Splunk app for correlating alerts

Alex Williams for TechCrunch:

Boundary’s application performance monitoring technology is now integrated into Splunk’s enterprise platform, providing a window into apps that increasingly are distributed across cloud and on-premise virtualized environments.

At first I thought this meant Boundary would use Splunk as the backend for its data. But Boundary is a service, so that’s not the case. Plus, Splunk can already be used for network management and monitoring.

According to the post, “Splunk real-time alerts are tagged as annotations in Boundary’s time-series graphs. Customers can then correlate alerts against application flow and performance data.” So basically this is monitoring your monitoring system, right?

Original title and link: Boundary for Splunk app for correlating alerts (NoSQL database©myNoSQL)

via: http://techcrunch.com/2013/04/25/new-boundary-app-for-splunk-predicts-root-cause-of-app-brownouts/


Actian/Pervasive

My thanks to Actian/Pervasive for sponsoring this week of myNoSQL to promote Actian Rushloader, their “pull data from pretty much anywhere and load it into Hadoop” tool.

It looks like Actian wants to play an important role in the Big Data market, as they have recently announced the acquisition of the Amazon-funded ParAccel, whose main tool powers the Amazon Redshift data warehouse service.

Original title and link: Actian/Pervasive (NoSQL database©myNoSQL)

via: http://www.actian.com


Project Falcon: Tackling Hadoop Data Lifecycle Management

Venkatesh Seetharam announcing a new Apache incubating project in the Hadoop ecosystem open sourced by InMobi and Hortonworks:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

I think this diagram describes Project Falcon best:

Project Falcon at a Glance

✚ Was there any other project addressing this space?

Original title and link: Project Falcon: Tackling Hadoop Data Lifecycle Management (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/


Quick Intro to LevelDB

If you haven’t heard of LevelDB before or you forgot some of the details, read Rod Vagg’s short post that will give you an overview of the basics and internals. Particularly interesting is the file organization in LevelDB (which also gives it the name).

Original title and link: Quick Intro to LevelDB (NoSQL database©myNoSQL)

via: http://dailyjs.com/2013/04/19/leveldb-and-node-1/


PostgreSQL Transaction System

This is a gem.

Original title and link: PostgreSQL Transaction System (NoSQL database©myNoSQL)


3 Big Data Use Cases in Banking

An article on Sys-Con about three high-level, generic use cases of Big Data in banking:

  1. Customer experience
  2. Risk management
  3. Operations optimization

The first and the third are common across multiple fields. Risk management is critical to banks’ core business and I assume this is the domain where most of the technology investment happens.

Original title and link: 3 Big Data Use Cases in Banking (NoSQL database©myNoSQL)

via: http://bigdata.sys-con.com/node/2623407/print


Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!

Andy Feng wrote a post on the YDN blog about the data processing architecture Yahoo! uses to deliver personalized content: it analyzes billions of events for 700 million users and 2.2 billion content pieces every day using a combination of batch processing (Hadoop) and stream processing (Storm):

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.

✚ I don’t think I’ve seen the term micro-batch processing used before. Any ideas why it is used as an alternative to the well-established term stream processing?

Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (NoSQL database©myNoSQL)

via: http://developer.yahoo.com/blogs/ydn/storm-hadoop-convergence-big-data-low-latency-processing-54503.html


MongoDB Pub/Sub With Capped Collections

Rick Copeland designs a MongoDB Pub/Sub system based on:

  • MongoDB’s capped collections,
  • tailable data-awaiting cursors,
  • sequences (using find_and_modify()),
  • a “poorly documented option” of capped collections: oplog_replay [1].

If you’ve been following this blog for any length of time, you know that my NoSQL database of choice is MongoDB. One thing that MongoDB isn’t known for, however, is building a publish / subscribe system. Redis, on the other hand, is known for having a high-bandwidth, low-latency pub/sub protocol. One thing I’ve always wondered is whether I can build a similar system atop MongoDB’s capped collections, and if so, what the performance would be. Read on to find out how it turned out…

The solution is definitely ingenious and it could probably work for systems with modest pub/sub requirements. It’s also a good exercise in combining some interesting features of MongoDB (I like the capped collections and the tailable data-awaiting cursors).
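To make the moving parts more concrete, here is a minimal in-memory sketch of the idea. It is not Rick Copeland's code and it does not talk to MongoDB: the `CappedLog` and `Subscriber` classes are hypothetical stand-ins, a bounded `deque` plays the role of the capped collection, a counter plays the role of the find_and_modify() sequence, and `poll()` only approximates what a tailable cursor does by remembering the last sequence number it has seen.

```python
from collections import deque
from itertools import count


class CappedLog:
    """Hypothetical stand-in for a capped collection: fixed size, oldest entries dropped."""

    def __init__(self, max_messages):
        self._messages = deque(maxlen=max_messages)
        self._sequence = count(1)  # stand-in for a find_and_modify()-style sequence

    def publish(self, topic, payload):
        seq = next(self._sequence)
        self._messages.append((seq, topic, payload))
        return seq

    def read_after(self, last_seq, topic):
        # Return (sequence, payload) pairs newer than last_seq for one topic.
        return [(s, p) for s, t, p in self._messages if s > last_seq and t == topic]


class Subscriber:
    """Remembers its position in the log, roughly like a tailable cursor would."""

    def __init__(self, log, topic):
        self.log = log
        self.topic = topic
        self.last_seq = 0

    def poll(self):
        # Non-blocking: returns whatever is new, possibly an empty list.
        fresh = self.log.read_after(self.last_seq, self.topic)
        if fresh:
            self.last_seq = fresh[-1][0]
        return [payload for _, payload in fresh]


log = CappedLog(max_messages=3)
subscriber = Subscriber(log, "news")

log.publish("news", "a")
log.publish("sports", "b")
first = subscriber.poll()

# Publishing four more messages overflows the log, so the oldest one ("c") is lost.
for payload in ["c", "d", "e", "f"]:
    log.publish("news", payload)
second = subscriber.poll()
```

The second poll shows the trade-off of a capped structure: a subscriber that falls too far behind silently misses messages, which is one reason this approach suits systems with modest delivery guarantees.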

✚ I’m wondering whether tailable data-awaiting cursors behave like non-blocking polls.


  1. I don’t really understand how this works. 

Original title and link: MongoDB Pub/Sub With Capped Collections (NoSQL database©myNoSQL)

via: http://blog.pythonisito.com/2013/04/mongodb-pubsub-with-capped-collections.html


Counting in MongoDB Just Got Much Faster

Antoine Girbal about counts in MongoDB:

Doing counts in MongoDB has always been a slow operation even on an indexed field… until now. To do the count, it would iterate through every single element in the index and try to match the key, giving a response time of several seconds for just a million documents. It would be especially slow on values with high cardinality, meaning that the count is high.

The speedup comes from a bug fix and an optimization that uses MongoDB’s B-trees.
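As a rough illustration of why an ordered index can help, the sketch below contrasts the two strategies. It is not MongoDB's implementation: a sorted Python list stands in for the B-tree, and the function names `count_by_scan` and `count_by_bounds` are made up for this example. The first compares every key, as in the slow behavior quoted above; the second only locates the two ends of the matching range.

```python
from bisect import bisect_left, bisect_right


def count_by_scan(index_keys, value):
    # Compare every key in the index against the value.
    return sum(1 for key in index_keys if key == value)


def count_by_bounds(index_keys, value):
    # In an ordered index all equal keys are adjacent, so the count is
    # the distance between the first and last matching positions.
    return bisect_right(index_keys, value) - bisect_left(index_keys, value)


# A sorted list standing in for an index on a single field.
index_keys = sorted(["a"] * 3 + ["b"] * 5 + ["c"] * 2)

scan_result = count_by_scan(index_keys, "b")
bounds_result = count_by_bounds(index_keys, "b")
```

Both functions return the same count; the difference is that the second does a number of comparisons that grows with the logarithm of the index size rather than with the index size itself.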

Original title and link: Counting in MongoDB Just Got Much Faster (NoSQL database©myNoSQL)

via: http://edgystuff.tumblr.com/post/47080433433/counting-in-mongodb-just-got-much-faster


Testing MapReduce With MRUnit

Mansoor Ashraf about MRUnit:

Testing and debugging multi threaded programs is hard. Now take the same programs and massively distribute them across multiple JVMs deployed on a cluster of machines and the complexity goes off the roof. One way to overcome this complexity is to do testing in isolation and catch as many bugs as possible locally. MRUnit is a testing framework that lets you test and debug Map Reduce jobs in isolation without spinning up a Hadoop cluster. In this blog post we will cover various features of MRUnit by walking through a simple MapReduce job.

The code samples look quite legible and there doesn’t seem to be a lot of boilerplate code involved. That’s a great thing for a testing framework.
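The underlying idea, exercising map and reduce logic locally without a cluster, can be sketched in a few lines. The example below is plain Python, not MRUnit and not the Java code from the linked post; `word_count_mapper`, `word_count_reducer`, and `run_locally` are hypothetical names, and the in-memory grouping step is only an assumption about how a test harness might stand in for the shuffle phase.

```python
from collections import defaultdict


def word_count_mapper(_key, line):
    # Emit (word, 1) for every word on the input line.
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    # Sum all the counts collected for one word.
    yield word, sum(counts)


def run_locally(mapper, reducer, records):
    """Run mapper and reducer in-process, grouping by key in between."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    results = []
    for out_key in sorted(grouped):
        results.extend(reducer(out_key, grouped[out_key]))
    return results


# Test the mapper on its own, then the whole job.
mapper_output = list(word_count_mapper(0, "Hadoop hadoop Storm"))
job_output = run_locally(
    word_count_mapper,
    word_count_reducer,
    [(0, "Hadoop hadoop Storm"), (1, "storm")],
)
```

Because the mapper and reducer are ordinary functions, each can be checked against a handful of hand-written records before the job ever reaches a cluster.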

Original title and link: Testing MapReduce With MRUnit (NoSQL database©myNoSQL)

via: http://m-mansur-ashraf.blogspot.com/2013/02/testing-mapreduce-with-mrunit.html


Bitly Forget Table - Building Categorical Distributions in Redis

In the comment thread of the post “Using Redis as an external index for surfacing interesting content”, Micha Gorelick pointed to a post covering a similar solution used at Bitly:

We store the categorical distribution as a set of event counts, along with a ‘normalising constant’ which is simply the number of all the events we’ve stored. […]

All this lives in a Redis sorted set where the key describes the variable which, in this case, would simply be bitly_country and the value would be a categorical distribution. Each element in the set would be a country and the score of each element would be the number of clicks from that country. We store a separate element in the set (traditionally called z) that records the total number of clicks stored in the set. When we want to report the categorical distribution, we extract the whole sorted set, divide each count by z, and report the result.

Storing the categorical distribution in this way allows us to make very rapid writes (simply increment the score of two elements of the sorted set) and means we can store millions of categorical distributions in memory. Storing a large number of these is important, as we’d often like to know the normal behavior of a particular key phrase, or the normal behavior of a topic, or a bundle, and so on.
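The storage scheme described in the quote can be sketched without a Redis server. In the example below a Python dictionary stands in for one sorted set (member to score); the class name `CategoricalDistribution` and its methods are invented for illustration and are not taken from Bitly's Forget Table code. Against a real Redis instance, each of the two increments in `record` would presumably be a ZINCRBY call on the same key.

```python
from collections import defaultdict


class CategoricalDistribution:
    """In-memory stand-in for one Redis sorted set: member -> score."""

    NORMALISER = "z"  # reserved member holding the total number of events

    def __init__(self):
        self.scores = defaultdict(float)

    def record(self, category, count=1):
        # A write touches exactly two elements: the category and the total.
        self.scores[category] += count
        self.scores[self.NORMALISER] += count

    def probabilities(self):
        # Read the whole set and divide each count by z.
        z = self.scores[self.NORMALISER]
        if z == 0:
            return {}
        return {
            category: n / z
            for category, n in self.scores.items()
            if category != self.NORMALISER
        }


# One distribution per variable, here clicks per country.
bitly_country = CategoricalDistribution()
for country in ["US", "US", "DE", "US"]:
    bitly_country.record(country)

distribution = bitly_country.probabilities()
```

This covers only the counting and normalising steps quoted above; anything else the real project does with the stored counts is outside the scope of the sketch.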

The Bitly team has open sourced their solution, named Forget Table, and the code is available on GitHub.

Original title and link: Bitly Forget Table - Building Categorical Distributions in Redis (NoSQL database©myNoSQL)

via: http://word.bitly.com/post/41284219720/forget-table