
Apache Kafka: Next generation distributed messaging system

Abhishek Sharma, in a 3,000-word article on InfoQ:

Its architecture consists of the following components:

  • A stream of messages of a particular type is defined as a topic. A Message is defined as a payload of bytes and a Topic is a category or feed name to which messages are published.
  • A Producer can be anyone who can publish messages to a Topic.
  • The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
  • A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.

A Producer can choose its preferred serialization method to encode the message content. For efficiency, it can send a set of messages in a single publish request. The following code example shows how to create a Producer and send messages.
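The article's own example isn't reproduced here; as a minimal sketch, assuming the Kafka 0.8-era Java producer API (the broker addresses and topic name below are placeholders), creating a Producer and sending a message looks roughly like this:

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Brokers used to bootstrap topic metadata; placeholder addresses.
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            // Pluggable serialization; here, plain UTF-8 strings.
            props.put("serializer.class", "kafka.serializer.StringEncoder");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));

            // send() also accepts a List<KeyedMessage>, which is how a set of
            // messages gets batched into a single publish request.
            producer.send(new KeyedMessage<String, String>("page-views", "hello"));
            producer.close();
        }
    }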

Kafka is an amazing system. I just wish the article had actually looked into what makes it unique and how it compares to systems like RabbitMQ or ActiveMQ.

✚ Cameron Purdy in one of the comments:

If you carefully read the article, you’ll note that Kafka is not actually a message queue. It’s just a specialized database with some messaging semantics in its API. That means if you need the behaviors that you would associate with a message queue, you can’t get them with Kafka (or if you can, the performance will plummet.)

Original title and link: Apache Kafka: Next generation distributed messaging system (NoSQL database©myNoSQL)

via: http://www.infoq.com/articles/apache-kafka


Apache Mesos company Mesosphere raises $10M

Ron Miller (TechCrunch):

According to Matt Trifiro, SVP at Mesosphere, this is possible because of containerization technology developed by Google. Building on Google’s concept, Mesos allows system administrators to take complex applications that run at scale and use the resources of the entire datacenter as a single unit, using containerization to isolate the processes.

Meanwhile, last week at its first conference, Docker announced version 1.0, and Google was quick to announce both support and additional tools for it.

Original title and link: Apache Mesos company Mesosphere raises $10M (NoSQL database©myNoSQL)

via: http://techcrunch.com/2014/06/09/mesosphere-grabs-10m-in-series-a-funding-to-transform-datacenters/


Two questions about the Oracle in-memory database

Two questions about the Oracle in-memory database, announced in Sep. 2013, re-announced now, and coming… sometime later:

  1. Why would the performance improvement be visible only on specific hardware?

    Ellison said users can expect real-time analytics queries 100 times faster and online transaction processing that is two times faster as long as they are using hardware that supports the Oracle 12c database.

    I’ll assume that this could only mean that these results will be seen when data fits in memory. And not that one will need custom hardware to enable this feature. As a side note, I’m not sure I’m reading the announcement correctly, but it looks like a paying Oracle database customer will have to pay extra for the in-memory option.

  2. Can anyone explain how data can be stored both in columnar and row format?

    Additionally, the software will allow people to store data in both columns (used for analytics) and rows (used for transactions) as opposed to only one method; Ellison described this function as being “the magic of Oracle.”

    Magic has very little to do with databases and performance; a less magical possibility is sketched below.
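The usual, unmagical answer for dual-format stores is to keep two transactionally consistent in-memory copies of the same data and route each workload to the layout that serves it best. A deliberately naive illustration of the idea, assuming nothing about Oracle's actual implementation (the table, columns, and class below are invented for the example):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch, not Oracle's implementation: one table kept in two
    // in-memory layouts that are always written together.
    final class DualFormatTable {

        // Row layout: one object per record; good for point lookups and
        // updates (transactions).
        static final class Row {
            final long id;
            final String region;
            final double amount;

            Row(long id, String region, double amount) {
                this.id = id;
                this.region = region;
                this.amount = amount;
            }
        }

        private final List<Row> rows = new ArrayList<>();

        // Column layout: one array per attribute; good for scans and
        // aggregates (analytics).
        private final List<String> regionCol = new ArrayList<>();
        private final List<Double> amountCol = new ArrayList<>();

        // Every write goes to both layouts, keeping them consistent.
        void insert(long id, String region, double amount) {
            rows.add(new Row(id, region, amount));
            regionCol.add(region);
            amountCol.add(amount);
        }

        // An analytics query scans only the columns it needs.
        double totalAmountFor(String region) {
            double sum = 0.0;
            for (int i = 0; i < regionCol.size(); i++) {
                if (regionCol.get(i).equals(region)) {
                    sum += amountCol.get(i);
                }
            }
            return sum;
        }
    }

The trade-off is extra memory and a second write per insert, not magic.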

Original title and link: Two questions about the Oracle in-memory database (NoSQL database©myNoSQL)


Complex data manipulation in Cascalog, Pig, and Hive

Bruno Bonacci brings up some very good points about why using a single, coherent solution to manipulate data results in higher productivity, comparing it with what Pig and Hive require:

In languages like Pig and Hive, in order to do complex manipulation of your data you have to write User Defined Functions (UDFs). UDFs are a great way to extend the basic functionality, but for Hive and Pig you have to use a different language to write them, as SQL and Pig Latin have only a handful of built-in functions and lack basic control structures. Both offer the possibility of writing UDFs in a number of different languages (which is great); however, this requires a programming-paradigm switch by the developer. Pig allows UDFs in Java, Jython, JavaScript, Groovy, Ruby and Python; for Hive you need to write them in Java (good article here).

I won't use Java UDFs as the example, as the comparison wouldn't be fair (life is too short to write them in Java). But let's assume you want to write a UDF for Pig and you want to use Python. If you go for the JVM version (Jython), you won't be able to use existing modules from the Python ecosystem (unless they are pure Python); the same goes for Ruby and JavaScript. If you decide to use CPython instead, you face the setup burden of installing Python and every module you intend to use on each Hadoop task node.

So: you start with a language such as Pig Latin or SQL; you have to write, compile and bundle UDFs in a different language; you are either constrained to the plain language without importing modules or face the extra burden of additional setup; and, as if that weren't enough, you have to smooth over the type differences between the two languages as data moves back and forth to the UDF. For me, that's enough to say that we can do better.

Cascalog is a Clojure DSL, so your main language is Clojure, your custom functions are Clojure, the data is represented in Clojure data types, and the runtime is the JVM: no switch required, no additional compilation, no installation burden, and all available libraries in the JVM ecosystem at your disposal.
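To make the switching cost concrete: the classic minimal Hive UDF, roughly as it appears in Hive's documentation, looks like this in Java (a sketch; only the UDF base class and reflection convention are Hive's, the rest is illustrative):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A minimal Hive UDF that lowercases a string column.
    // Hive discovers the evaluate() method by reflection.
    public final class Lower extends UDF {
        public Text evaluate(final Text s) {
            if (s == null) {
                return null;
            }
            return new Text(s.toString().toLowerCase());
        }
    }

Before a query can call it, this still has to be compiled, bundled into a jar, and registered with ADD JAR and CREATE TEMPORARY FUNCTION; in Cascalog, the equivalent is just another Clojure function in the same file.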

I’m not a big fan of SQL, except in the cases where it really belongs; SQL-on-Hadoop is my least favorite topic, second probably only to the sheer complexity of the whole ecosystem. In the space of multi-format/unstructured data, I’ve always liked the pragmatism and legibility of Pig. But the OP is definitely right about the added complexity.

This also reminded me of the Python vs. R “war”.

Original title and link: Complex data manipulation in Cascalog, Pig, and Hive (NoSQL database©myNoSQL)

via: http://blog.brunobonacci.com/2014/06/01/cascalog-by-examples-part1/


Tamr - new data cleanup company from Michael Stonebraker

If you think this sounds a lot like Trifacta, well, the difference seems to be a single zero:

Palmer, who along with Stonebraker created database company Vertica Systems (which HP bought in 2011), said what separates the company’s new product from other similar ones, like Trifacta, is the emphasis on analyzing thousands of data sources as opposed to hundreds with humans acting as the guiding light.

Original title and link: Tamr - new data cleanup company from Michael Stonebraker (NoSQL database©myNoSQL)

via: http://gigaom.com/2014/05/19/michael-stonebrakers-new-startup-tamr-wants-to-help-get-messy-data-in-shape/


Time to regulate Big Data?

I had a conversation recently on this subject. As someone born and raised in a communist country, I find the prospect of having no control over what data is collected about you, and who owns it, very concerning. Terrifying.

For years, data brokers have been collecting and selling billions of pieces of your personal information — from your income to your shopping habits to your medical ailments. Now federal regulators say it’s time you have more control over what’s collected and whether it will be used at all.

After reading this post, I came close to finally crying with relief. Then I realized that this bill would need to pass first. And with the right lobbying, that might actually never happen (as in: “But so far Rockefeller’s bill has gone nowhere”).

Original title and link: Time to regulate Big Data? (NoSQL database©myNoSQL)

via: http://money.cnn.com/2014/05/27/pf/ftc-big-data/


Causality: A discussion of causality, vector clocks, version vectors, and CRDTs

If you have a quiet Sunday and want to listen to something extremely awesome, you should try this episode of the ThinkDistributed podcast, covering causality in distributed systems with guests Peter Bailis, Carlos Baquero, and Marek Zawirski.

The links in the show notes will fill your reading list for a good while.

Original title and link: Causality: A discussion of causality, vector clocks, version vectors, and CRDTs (NoSQL database©myNoSQL)


A Tour of Machine Learning Algorithms

After we understand the type of machine learning problem we are working with, we can think about the type of data to collect and the types of machine learning algorithms we can try. In this post we take a tour of the most popular machine learning algorithms. It is useful to tour the main algorithms to get a general idea of what methods are available.

Timeless

Original title and link: A Tour of Machine Learning Algorithms (NoSQL database©myNoSQL)

via: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms


New PostgreSQL guns for NoSQL market

Joab Jackson (PCWorld):

Embracing the widely used JSON data-exchange format, the new version of the PostgreSQL open-source database takes aim at the growing NoSQL market of nonrelational data stores, notably the popular MongoDB.

I’ve always appreciated the openness of the PostgreSQL developers to consider new features and their efforts to bring these to a relational database. What’s missing from the picture is how many users are actually using these features.

Original title and link: New PostgreSQL guns for NoSQL market (NoSQL database©myNoSQL)

via: http://www.pcworld.com/article/2155780/new-postgresql-guns-for-nosql-market.html


Monitoring CouchDB with Munin

A long, but extremely useful list of metrics to get from CouchDB:

Most monitoring-system plugins for CouchDB are unable to handle all the described cases, since they try to work with just the /_stats resource. That’s good, but, as you may have noticed, not enough to see the full picture of your CouchDB.

However, at least for Munin, there is one plugin that handles all of this post’s recommendations.
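For reference, everything those limited plugins do boils down to polling a single endpoint. A minimal sketch of that call, assuming a local CouchDB at the default address and Java 11+:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Fetches CouchDB's /_stats resource, the single endpoint most
    // monitoring plugins rely on.
    public class CouchStats {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://127.0.0.1:5984/_stats"))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON body groups counters under "couchdb", "httpd", etc.;
            // the post's point is that a full picture needs more than this.
            System.out.println(response.body());
        }
    }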

Original title and link: Monitoring CouchDB with Munin (NoSQL database©myNoSQL)

via: http://gws.github.io/munin-plugin-couchdb/guide-to-couchdb-monitoring.html


A map to reviewing RavenDB code base

Ayende answers a series of questions about specific details of the RavenDB code base. This could be a very good starting point for those interested in how RavenDB is implemented.

Original title and link: A map to reviewing RavenDB code base (NoSQL database©myNoSQL)

via: http://ayende.com/blog/166658/a-map-to-reviewing-ravendb?Key=6566461c-3541-40c3-8094-7ae313c036f3


Paper: Parallel Graph Partitioning for Complex Networks

Authored by a team from the Karlsruhe Institute of Technology, the paper “Parallel graph partitioning for complex networks” presents a parallelized and adapted label propagation technique for partitioning graphs:

The graph partitioning problem is NP-complete [3], [4] and there is no approximation algorithm with a constant ratio factor for general graphs [5]. Hence, heuristic algorithms are used in practice.

A successful heuristic for partitioning large graphs is the multilevel graph partitioning (MGP) approach depicted in Figure 1, where the graph is recursively contracted to achieve smaller graphs which should reflect the same basic structure as the input graph.
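The core of the label propagation heuristic the authors parallelize fits in a few lines. A sequential sketch, leaving out the size constraints the paper adds to keep blocks balanced:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public final class LabelPropagation {
        // adj.get(v) lists the neighbors of vertex v.
        // Returns a block label for every vertex.
        static int[] propagate(List<List<Integer>> adj, int rounds) {
            int n = adj.size();
            int[] label = new int[n];
            for (int v = 0; v < n; v++) {
                label[v] = v; // every vertex starts in its own block
            }
            for (int r = 0; r < rounds; r++) {
                for (int v = 0; v < n; v++) {
                    // Adopt the label most frequent among v's neighbors.
                    Map<Integer, Integer> freq = new HashMap<>();
                    for (int u : adj.get(v)) {
                        freq.merge(label[u], 1, Integer::sum);
                    }
                    int best = label[v];
                    int bestCount = freq.getOrDefault(best, 0);
                    for (Map.Entry<Integer, Integer> e : freq.entrySet()) {
                        if (e.getValue() > bestCount) {
                            best = e.getKey();
                            bestCount = e.getValue();
                        }
                    }
                    label[v] = best;
                }
            }
            return label;
        }
    }

Densely connected regions quickly converge to a common label, which is the property the paper builds on when it adds parallelism and size constraints for partitioning.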