Project Rhino goal: at-rest encryption for Apache Hadoop

Although network encryption has been available in the Apache Hadoop platform for some time (since Hadoop 2.0.2-alpha/CDH 4.1), at-rest encryption, the encryption of data stored on persistent storage such as disk, is not yet available. To meet that requirement in the platform, Cloudera and Intel are working with the rest of the Hadoop community under the umbrella of Project Rhino — an effort to bring a comprehensive security framework for data protection to Hadoop, which also now includes Apache Sentry (incubating) — to implement at-rest encryption for HDFS (HDFS-6134 and HADOOP-10150).
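
For readers new to the term: "at rest" simply means that the bytes reaching persistent storage are ciphertext. A toy javax.crypto demo of the idea (illustrative only; it has nothing to do with the actual HDFS design, which also needs key management and streaming-friendly ciphers):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    // At-rest encryption in one file: encrypt before writing to disk,
    // decrypt after reading. (A real system would use an authenticated
    // cipher mode and never keep the key next to the data.)
    public class AtRestDemo {
        public static void main(String[] args) throws Exception {
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();

            Cipher enc = Cipher.getInstance("AES");
            enc.init(Cipher.ENCRYPT_MODE, key);
            Path file = Files.write(Path.of("record.enc"),
                    enc.doFinal("sensitive record".getBytes(StandardCharsets.UTF_8)));

            Cipher dec = Cipher.getInstance("AES");
            dec.init(Cipher.DECRYPT_MODE, key);
            System.out.println(new String(
                    dec.doFinal(Files.readAllBytes(file)), StandardCharsets.UTF_8));
        }
    }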

Looks like I got this wrong: Apache Sentry will become part of Project Rhino.

via: http://blog.cloudera.com/blog/2014/06/project-rhino-goal-at-rest-encryption/


Hadoop security: unifying Project Rhino and Sentry

One result of Intel’s investment in Cloudera is that the two companies’ teams now work together on the same projects:

As the goals of Project Rhino and Sentry to develop more robust authorization mechanisms in Apache Hadoop are in complete alignment, the efforts of the engineers and security experts from both companies have merged, and their work now contributes to both projects. The specific goal is “unified authorization”, which goes beyond setting up authorization policies for multiple Hadoop components in a single administrative tool; it means setting an access policy once (typically tied to a “group” defined in an external user directory) and having it enforced across all of the different tools that this group of people uses to access data in Hadoop – for example access through Hive, Impala, search, as well as access from tools that execute MapReduce, Pig, and beyond.
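
To make "setting an access policy once" concrete: in Sentry's file-based policy model, a group from the external user directory is bound to a role, and the role carries the privileges that every access path then enforces. A hypothetical snippet (the syntax is only illustrative and all names are made up):

    [groups]
    # "analysts" is defined in the external user directory (e.g. LDAP)
    analysts = analyst_role

    [roles]
    # read-only access to the sales database, enforced identically
    # whether the query arrives via Hive, Impala, or search
    analyst_role = server=server1->db=sales->table=*->action=select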

A great first step.

You know what would be even better? A single security framework for Hadoop instead of two.

via: http://vision.cloudera.com/project-rhino-and-sentry-onward-to-unified-authorization/


Hortonworks’ Hadoop secret weapon is... Yahoo

Derrick Harris:

Hortonworks was working right alongside Yahoo all through that process. They’ve also worked together on things like rolling upgrades so Hadoop users can upgrade software without taking down a cluster.

  1. who didn’t know about Hortonworks and Yahoo’s collaboration?
  2. what company and product management team would choose not to work with one of the largest users of the technology it is working on?

    This is the perfect example of testing and validating new ideas and learning about the pain your customers face in real life. Basically, by-the-book product/market fit.

via: http://gigaom.com/2014/06/16/when-it-comes-to-hadoop-yahoo-is-still-hortonworks-secret-weapon/


Storing, processing, and computing with graphs

Marko Rodriguez is on a roll with yet another fantastic article about graphs:

To the adept, graph computing is not only a set of technologies, but a way of thinking about the world in terms of graphs and the processes therein in terms of traversals. As data is becoming more accessible, it is easier to build richer models of the environment. What is becoming more difficult is storing that data in a form that can be conveniently and efficiently processed by different computing systems. There are many situations in which graphs are a natural foundation for modeling. When a model is a graph, then the numerous graph computing technologies can be applied to it.
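
As a toy illustration of "model as a graph, ask as a traversal" (my sketch, not from the article): people and who-knows-whom become vertices and edges, and a question becomes a short walk over them.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // A toy adjacency-list graph: "who knows whom".
    public class ToyGraph {
        private final Map<String, List<String>> knows = new HashMap<>();

        public void addKnows(String from, String to) {
            knows.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        }

        // The traversal is the question: "who are the friends of my friends?"
        public Set<String> friendsOfFriends(String start) {
            Set<String> result = new HashSet<>();
            for (String friend : knows.getOrDefault(start, Collections.emptyList())) {
                result.addAll(knows.getOrDefault(friend, Collections.emptyList()));
            }
            result.remove(start); // a friend of a friend may be yourself
            return result;
        }

        public static void main(String[] args) {
            ToyGraph g = new ToyGraph();
            g.addKnows("marko", "josh");
            g.addKnows("josh", "peter");
            System.out.println(g.friendsOfFriends("marko")); // [peter]
        }
    }

In a real graph database the adjacency structure lives in the engine, and the same question shrinks to a one-line declarative traversal.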

✚ If you missed it, the other recent article I’m referring to is “Knowledge representation and reasoning with graph databases”.

via: http://www.javacodegeeks.com/2014/06/on-graph-computing.html


Consensus-based replication in HBase

Konstantin Boudnik (WANdisco):

The idea behind consensus-based replication is pretty simple: instead of trying to guarantee that all replicas of a node in the system are synced post-factum to an operation, such a system will coordinate the intent of an operation. If a consensus on the feasibility of an operation is reached, it will be applied by each node independently. If consensus is not reached, the operation simply won’t happen. That’s pretty much the whole philosophy.

Not enough details, but doesn’t this sound like Paxos applied earlier?
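
In code terms, the coordinate-the-intent flow reads something like this (a minimal sketch with made-up types, not WANdisco's implementation):

    import java.util.List;

    // Sketch: replicas agree on the *intent* of an operation before anyone
    // applies it, instead of syncing replica state after the fact.
    interface Operation { }

    interface Consensus {
        // e.g. one round of a Paxos-like protocol; true iff a quorum accepted
        boolean propose(Operation op);
    }

    interface Replica {
        void apply(Operation op);
    }

    class ConsensusReplicator {
        private final Consensus consensus;
        private final List<Replica> replicas;

        ConsensusReplicator(Consensus consensus, List<Replica> replicas) {
            this.consensus = consensus;
            this.replicas = replicas;
        }

        boolean submit(Operation op) {
            if (!consensus.propose(op)) {
                return false; // no consensus: the operation simply won't happen
            }
            // consensus reached: each node applies the operation independently,
            // in the agreed order; no post-factum re-sync of replicas is needed
            for (Replica r : replicas) {
                r.apply(op);
            }
            return true;
        }
    }

The contrast is with primary/backup replication, where the operation is applied first and replicas are reconciled after the fact.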

via: http://blogs.wandisco.com/2014/06/16/consunsus-based-replication-hbase/


Where to look for Hadoop reliability problems

Dan Woods (Forbes) gets a list of 10 possible problems in Hadoop from Raymie Stata (CEO of Altiscale), which can be summarized as:

  1. using default configuration options
  2. doing no tuning
  3. not understanding Amazon Elastic MapReduce’s behavior

via: http://www.forbes.com/sites/danwoods/2014/06/16/solving-the-mystery-of-hadoop-reliability/


A Story of graphs, DBs, and graph databases

After Marko Rodriguez’s “Knowledge representation and reasoning with graph databases”, another great introduction to graph databases is Joshua Shinavier’s slide deck of the same name (embedded in the original post).


Knowledge representation and reasoning with graph databases

A graph database and its ecosystem of technologies can yield elegant, efficient solutions to problems in knowledge representation and reasoning. To get a taste of this argument, we must first understand what a graph is.

And Marko Rodriguez delivers a dense but very readable intro to modeling with graphs.

via: http://www.javacodegeeks.com/2014/06/knowledge-representation-and-reasoning-with-graph-databases.html


Dude, missing indexes? Seriously….

[CommitStrip comic: “The problem is not the tool itself”]

I didn’t know about CommitStrip. Until now.


Apache Kafka: Next generation distributed messaging system

Abhishek Sharma in a 3,000-word article on InfoQ:

Its architecture consists of the following components:

  • A stream of messages of a particular type is defined as a topic. A Message is defined as a payload of bytes and a Topic is a category or feed name to which messages are published.
  • A Producer can be anyone who can publish messages to a Topic.
  • The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
  • A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.

A producer can choose its favorite serialization method to encode the message content. For efficiency, the producer can send a set of messages in a single publish request. The following code example shows how to create a Producer and send messages.
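
The article's listing isn't reproduced here; as a stand-in, a minimal producer against the Kafka 0.8-era Java API looks roughly like this (broker address and topic name are placeholders):

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // placeholder broker list; the serializer is the producer's choice
            props.put("metadata.broker.list", "localhost:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));

            // send() also accepts a List<KeyedMessage>, which is the batched
            // "set of messages in a single publish request" mentioned above
            producer.send(new KeyedMessage<String, String>("page-views", "hello"));
            producer.close();
        }
    }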

Kafka is an amazing system. I just wish the article had actually looked into what makes it unique and how it compares with systems like RabbitMQ or ActiveMQ.

✚ Cameron Purdy in one of the comments:

If you carefully read the article, you’ll note that Kafka is not actually a message queue. It’s just a specialized database with some messaging semantics in its API. That means if you need the behaviors that you would associate with a message queue, you can’t get them with Kafka (or if you can, the performance will plummet.)

via: http://www.infoq.com/articles/apache-kafka


Apache Mesos company Mesosphere raises $10M

Ron Miller (TechCrunch):

According to Matt Trifiro, SVP at Mesosphere, this is possible because of containerization technology developed by Google. Building on Google’s concept, Mesos allows system administrators to take complex applications that run at scale and use the resources of the entire datacenter as a single unit, using containerization to isolate the processes.

Meanwhile, last week at its first conference, Docker announced version 1.0, and Google was pretty quick to announce both support and additional tools for it.

via: http://techcrunch.com/2014/06/09/mesosphere-grabs-10m-in-series-a-funding-to-transform-datacenters/


Two questions about the Oracle in-memory database

Two questions about the Oracle in-memory database, announced in Sep. 2013, re-announced now, and coming… sometime later:

  1. Why would the performance improvement be visible only on specific hardware?

    Ellison said users can expect real-time analytics queries 100 times faster and online transaction processing that is two times faster as long as they are using hardware that supports the Oracle 12c database.

    I’ll assume that this could only mean that these results will be seen when data fits in memory. And not that one will need custom hardware to enable this feature. As a side note, I’m not sure I’m reading the announcement correctly, but it looks like a paying Oracle database customer will have to pay extra for the in-memory option.

  2. Can anyone explain how data can be stored in both columnar and row format? (One possible approach is sketched after this list.)

    Additionally, the software will allow people to store data in both columns (used for analytics) and rows (used for transactions) as opposed to only one method; Ellison described this function as being “the magic of Oracle.”

    Magic has very little to do with databases and performance.
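
For what it's worth, "both formats" need not be magic. A minimal sketch of a dual-format store (entirely hypothetical, not Oracle's implementation): the row store stays the system of record for transactions, while an in-memory columnar projection of the same data serves scans.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical dual-format table: rows for OLTP, a mirrored
    // in-memory columnar copy for analytics.
    public class DualFormatTable {
        private final List<Map<String, Object>> rowStore = new ArrayList<>();
        private final Map<String, List<Object>> columnStore = new HashMap<>();

        // OLTP path: insert the full row, then mirror each value into its column
        public void insert(Map<String, Object> row) {
            rowStore.add(row);
            for (Map.Entry<String, Object> e : row.entrySet()) {
                columnStore.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                           .add(e.getValue());
            }
        }

        // OLAP path: scan a single column without touching the rest of each row
        public double sum(String column) {
            double total = 0;
            for (Object v : columnStore.getOrDefault(column, Collections.emptyList())) {
                total += ((Number) v).doubleValue();
            }
            return total;
        }
    }

The obvious cost is that every write now pays for two representations, which would also explain why such a feature ships as a separately priced option rather than a default.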
