


rdf: All content tagged as rdf in NoSQL databases and polyglot persistence

Distributed Temporal Graph Database Using Datomic

Davy Suvee describes the solution in the Gremlin group and shares the code on GitHub:

Last week I spent some time implementing the Blueprints interface on top of Datomic. The RDF and SPARQL feel of the Datomic data model and query approach makes it a good target for implementing a property graph. I finished the implementation and all unit tests are passing. Now, what makes it really cool is that it is the only distributed “temporal” graph database that I’m aware of. It allows performing queries against a version of the graph in the past.

This is the first solution I’ve read about that addresses the time dimension in a graph model.
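Datomic’s actual API is Clojure/Java-based, but the temporal idea itself is easy to sketch: keep an append-only log of graph assertions and retractions, each stamped with a transaction time, and answer queries against the set of facts valid as of a given time. A minimal, hypothetical Python illustration of the concept (not Datomic’s API):

```python
# Append-only fact log: each entry is (tx_time, added?, (subject, predicate, object)).
# Queries "as of" a time replay the log up to that point -- the idea behind
# temporal graph queries, sketched in plain Python (hypothetical, not Datomic).

class TemporalGraph:
    def __init__(self):
        self.log = []  # [(tx, added, triple)], ordered by transaction time

    def assert_(self, tx, triple):
        self.log.append((tx, True, triple))

    def retract(self, tx, triple):
        self.log.append((tx, False, triple))

    def as_of(self, tx):
        """Return the set of triples valid at transaction time tx."""
        facts = set()
        for t, added, triple in self.log:
            if t > tx:
                break  # later transactions are invisible "as of" tx
            (facts.add if added else facts.discard)(triple)
        return facts

g = TemporalGraph()
g.assert_(1, ("alice", "knows", "bob"))
g.assert_(2, ("alice", "knows", "carol"))
g.retract(3, ("alice", "knows", "bob"))

print(g.as_of(1))  # {('alice', 'knows', 'bob')}
print(g.as_of(3))  # {('alice', 'knows', 'carol')}
```

The append-only log is what makes past versions queryable for free: nothing is ever overwritten, so “the graph at time t” is just a replay boundary.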

Original title and link: Distributed Temporal Graph Database Using Datomic (NoSQL database©myNoSQL)

NoSQL Databases Best Practices and Emerging Trends

Jans Aasman (CEO AllegroGraph) interviewed by Srini Penchikala:

InfoQ: What best practices and architecture patterns should the developers and architects consider when using a solution like this one in their software applications?

Jans: If your application requires simple straight joins and your schema hardly changes, then any RDBMS will do.

If your application is mostly document based, where a document can be looked at as a pre-joined nested tree (think a Facebook page, think a nested JSON object) and where you don’t want to be limited by an RDB schema then key-value stores and document stores like MongoDB are a good alternative.

If you want what is described in the previous paragraph but you have to perform complex joins or apply graph algorithms then the MongoGraph approach might be a viable solution.

Thinking about the products and projects I’ve been working on, most of them have had to deal with all these aspects in different areas of the applications and with different importance to the final solution. Mistakenly though, in most cases they ended up using only a relational database. With polyglot persistence, this shouldn’t happen anymore. That’s not to say every project must use all of these technologies just because they are available; but it could use any of them, alone or combined.

InfoQ: What are the emerging trends in combining the NoSQL data stores?

Jans: From the perspective of a Semantic Web - Graph database vendor, what we see is that nearly all graph databases now perform their text indexing with Lucene-based indexing (Solr or Elasticsearch), and I wouldn’t be surprised if most vendors soon allow JSON objects as first-class objects in graph databases. It was surprisingly straightforward to mix the JSON and triple/graph paradigms. We are also experimenting with key-value stores to see how they mix with the triple/graph paradigm.

This topic was also discussed during my NoSQL Applications panel, but due to the panel’s time constraints we couldn’t reach a conclusion. It’s definitely an interesting perspective, though.

Original title and link: NoSQL Databases Best Practices and Emerging Trends (NoSQL database©myNoSQL)


InfiniteGraph and RDF Tuples or Why Using a Specialized Solution Is the Way to Go

An excellent explanation for why it makes sense to use a specialized tool for the job:

Yes, InfiniteGraph can be used to analyze triples and RDF. But if that’s all you want to do, then you really should just use a triple store.

Our graph database trades some of the runtime flexibility (but not a lot) for well-defined types and performance. RDF is fine for all the examples that have been circulated; if I just want to list all my friends or all the people I know who are married, it’s no big deal, because the fanout of a single degree is extremely small. In fact, you could probably even do it in MySQL for that matter. When we talk about scalability, however, it’s not really about how much data we can store, but how quickly we can run across it. Storing RDF makes this effort slower. It’s hard to make RDF perform, because the whole graph is self-describing and therefore computationally expensive to parse… Think of it like representing data in XML versus a defined binary format. XML is lovely to work with, basically human readable, but it is very verbose and inefficient.

The little secret here is that a generic solution will usually work in the beginning. And if a specialized solution implies bigger costs or a longer time to market, starting with what you know is just fine. But once your application grows, a specialized solution will not only be optimized for the problem, but will also get you past the scaling problems that come with growth.

Original title and link: InfiniteGraph and RDF Tuples or Why Using a Specialized Solution Is the Way to Go (NoSQL database©myNoSQL)


1 Trillion RDF Triples With Franz’s AllegroGraph

Patrick Durusau mentioned on his blog a new record set by Franz’s AllegroGraph: 1 trillion RDF triples. This comes only two months after Franz’s previous AllegroGraph record of 310 billion triples.

My first thought was: why is this important? It was one of the few times I’ve found the answer in the PR announcement:

A trillion RDF Statements […] is a primary interest for companies like Amdocs that use triples to represent real-time knowledge about telecom customers. Per-customer, Amdocs uses about 4,000 triples, so a large telecom like China Mobile would easily need 2 trillion triples to have detailed knowledge about each single customer.
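The arithmetic behind the quoted claim is simple: at roughly 4,000 triples per customer, 2 trillion triples corresponds to about 500 million customers, i.e. a subscriber base on the scale of a very large telecom:

```python
# Back-of-the-envelope check of the figures quoted in the announcement.
triples_per_customer = 4_000
total_triples = 2_000_000_000_000  # 2 trillion

customers = total_triples // triples_per_customer
print(customers)  # 500000000 -> about 500 million customers
```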

Original title and link: 1 Trillion RDF Triples With Franz’s AllegroGraph (NoSQL database©myNoSQL)

Redis Based Triplestore Database

Using Redis as a triple store back-end requires an interesting combination of data types, operations, and multi-commands:

A combination of SUNIONSTORE, SINTERSTORE, SDIFFSTORE, SORT and similar commands allows for interesting use cases. A sufficiently complex query can be expressed as a dataflow between inputs, with intermediate results stored as temporary keys and the output returned as optionally sorted and paged values. The ability to store intermediate results in temporary keys is essential for performance: it avoids round-tripping intermediate results between the Redis database and the application server. In addition, Redis enables pipelined execution, where multiple commands are sent without waiting for replies. This creates conditions for “stored procedure”-like execution with a single round trip to the database. Experimentally, Redis supports embedded Lua scripting, where instead of pipelining multiple do-and-store commands it is possible to submit a single EVAL command with an equivalent Lua script.
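The pattern is easiest to see with concrete keys. The sketch below simulates the Redis set commands with plain Python sets so it runs without a server; with redis-py the calls map onto `r.sadd` and `r.sinterstore` against the same key names. The `po:<predicate>:<object>` index layout holding subject sets is an assumption of this sketch, not taken from the post:

```python
# Simulated Redis keyspace: key -> set. With redis-py these become
# r.sadd(key, member) and r.sinterstore(dest, *keys) against a real server.
db = {}

def sadd(key, *members):
    db.setdefault(key, set()).update(members)

def sinterstore(dest, *keys):
    # Store the intersection under a temporary key, as in the post:
    # the intermediate result never round-trips to the application server.
    db[dest] = set.intersection(*(db.get(k, set()) for k in keys))
    return len(db[dest])

# Index triples by predicate+object, so one key holds all matching subjects
# (a hypothetical layout chosen for this sketch).
triples = [
    ("alice", "knows", "bob"),
    ("carol", "knows", "bob"),
    ("alice", "worksAt", "acme"),
]
for s, p, o in triples:
    sadd(f"po:{p}:{o}", s)

# Query: who knows bob AND works at acme? One SINTERSTORE; the result stays
# server-side under a temp key, ready for SORT or paging.
sinterstore("tmp:q1", "po:knows:bob", "po:worksAt:acme")
print(db["tmp:q1"])  # {'alice'}
```

With pipelining, the SADD calls and the final SINTERSTORE would be sent in one batch, which is exactly the single-round-trip, stored-procedure-like execution the post describes.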

The post exemplifies a query dataflow:

Redis Triple Store Query

Original title and link: Redis Based Triplestore Database (NoSQL databases © myNoSQL)


State of the Linking Open Data Cloud

The following diagram visualizes the data sets in the LOD cloud as well as their interlinkage relationships. Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that RDF links exist between items in the two connected data sets. Heavier arcs roughly correspond to a greater number of links between two data sets, while bidirectional arcs indicate that each data set contains outward links to the other.

Linked Open Data cloud

There is a section in the document focusing on compliance with best practices for data provisioning that is a detailed explanation of Sir Tim Berners-Lee’s 5-star deployment scheme for linked open data.

Original title and link: State of the Linking Open Data Cloud (NoSQL databases © myNoSQL)


Why do we need so many different databases?

My ideal database would borrow from RDBMS (like SQL Server), Document databases (like MongoDB), Graph Databases and Semantic Web Triple Stores; it would be the perfect hybrid of all of these and it would configure itself to be as efficient as possible answering queries.

That’s exactly the definition of polyglot persistence.

Every application could benefit from using different data models: the data ingestion module backed by a document store, a reporting module using relational data, another a graph model, and so on.

But if all these models existed in the same tool, it would be a mammoth: a tool good for everything, best at none. Doesn’t that sound familiar? Too heavy, too complicated, not agile.

Think of programming languages and multi-paradigms: object-oriented, functional, logic, etc. I’d love to be able to use any of them. But having a single language supporting all of these, I don’t know.

What I’d like is to have the option, plus good (or, even better, standardized) inter-communication. Put differently, what I don’t want is a monolith, nor a highly heterogeneous environment.

Original title and link: Why do we need so many different databases? (NoSQL databases © myNoSQL)


RDF Stores are The Silverbullet

I wasn’t aware that RDF stores are the silver bullet for storage:

Modern triplestores have developed to the point where they offer the rigor of relational databases, the scalability of big data systems, and still support big complicated joins.

What confused me about the article are the criteria by which triplestores come out on top in terms of flexibility and win hands down for complex event analysis. Maybe someone could explain.

Original title and link: RDF Stores are The Silverbullet (NoSQL databases © myNoSQL)


Storing RDF in Wide-Column Databases (Cassandra, HBase)

You basically have two options in how to store RDF data in wide-column databases like HBase and Cassandra: the resource-centric approach and the statement-centric approach.

In the statement-oriented approach, each RDF statement corresponds to a row key (for instance, a UUID) and contains subject, predicate and object columns. In Cassandra, each of these would be supercolumns that would then contain subcolumns such as type and value, to differentiate between RDF literals, blank nodes and URIs. If you needed to support named graphs, each row could also have a context column that would contain a list of the named graphs that the statement was part of.


In view of the previous considerations, the resource-oriented approach is generally a more natural fit for storing RDF data in wide-column databases. In this approach, each RDF subject/resource corresponds to a row key, and each RDF predicate/property corresponds to a column or supercolumn. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs.
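The two layouts are easy to contrast with a toy model in which a column family is a dict of row key to columns (plain Python standing in for the Cassandra/HBase API; the data is hypothetical):

```python
import uuid

# Statement-oriented: one row per triple, keyed by a UUID,
# with subject/predicate/object columns.
statements = {}  # row key -> columns

def add_statement(s, p, o):
    statements[str(uuid.uuid4())] = {"subject": s, "predicate": p, "object": o}

# Resource-oriented: one row per subject, one column per predicate,
# multiple values per column.
resources = {}  # subject row key -> {predicate: [objects]}

def add_resource_triple(s, p, o):
    resources.setdefault(s, {}).setdefault(p, []).append(o)

for s, p, o in [("ex:alice", "foaf:knows", "ex:bob"),
                ("ex:alice", "foaf:name", '"Alice"'),
                ("ex:bob", "foaf:name", '"Bob"')]:
    add_statement(s, p, o)
    add_resource_triple(s, p, o)

# Everything known about a subject is a single row read here...
print(resources["ex:alice"])
# ...but a scan over all rows in the statement-oriented layout.
alice_rows = [c for c in statements.values() if c["subject"] == "ex:alice"]
print(len(alice_rows))  # 2
```

The single-row read versus full scan is the practical reason the resource-oriented approach fits wide-column stores better: row-key lookups are the one operation these systems make cheap.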

Leaving aside Cassandra, HBase, Riak, and the over a dozen other existing solutions, you can always build a triple store on MongoDB.

Original title and link: Storing RDF in Wide-Column Databases (Cassandra, HBase) (NoSQL databases © myNoSQL)


A Semantic Triple Store Built on MongoDB

An interesting semantic triple store data modeling exercise with MongoDB:

In the MongoDB version of my semantic store I take a different approach to storing the basic building blocks of semantic knowledge representation. For starters, I decided that typical ABox and TBox knowledge has really quite different storage requirements, and that smashing all the complex TBox assertions into simple triples and stringing them together with meta fields, only to immediately join them back up whenever needed, just seemed like a bad idea from the NoSQL / document-database perspective.
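A much simpler MongoDB layout than the ABox/TBox split described in the post (and not taken from it) is one document per triple, with compound indexes over the s/p/o fields. The sketch below uses plain dicts to mimic a collection and MongoDB’s query-by-example matching; with pymongo the same documents would go through `collection.insert_one` and `collection.find`:

```python
# A triple as a MongoDB-style document: {"s": ..., "p": ..., "o": ...}.
# A plain list stands in for the collection; pymongo's find() performs the
# same query-by-example matching against a real server.
collection = []

def insert_one(doc):
    collection.append(doc)

def find(query):
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]

insert_one({"s": "ex:alice", "p": "foaf:knows", "o": "ex:bob"})
insert_one({"s": "ex:alice", "p": "foaf:name", "o": "Alice"})
insert_one({"s": "ex:bob", "p": "foaf:name", "o": "Bob"})

# Triple-pattern queries become partial-match queries; in real MongoDB,
# compound indexes on (s, p), (p, o), (o, s) keep each access path fast.
print(find({"p": "foaf:name"}))                     # both name triples
print(find({"s": "ex:alice", "p": "foaf:knows"}))   # alice's knows triple
```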

Original title and link: A Semantic Triple Store Built on MongoDB (NoSQL databases © myNoSQL)

Riak: Storing an RDF Graph

I wrote before that Riak links are a very web-like and interesting feature (nb: truth be told, this is not the only Riak-exclusive feature). Michael Hausenblas has tried to use Riak links for storing an RDF graph in Riak:

The main issue then was how to map the RDF graph into Riak buckets, objects and keys. Here is what I came up with so far: I use an RDF resource-level approach with a special object key that I call :id, which is the RDF resource URI or the bNode. Further, in order to maintain the graph provenance, I store the original RDF document URI in the metadata of the Riak bucket. Each RDF resource is mapped into a Riak object; for each literal RDF object value, the literal value is stored directly via a Riak object key, and for each resource object (URI ref or bNode) I use a Link header.
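Hausenblas’s mapping can be mimicked with a toy bucket model: one Riak object per RDF resource, literal values stored directly under object keys, resource-valued objects recorded as links, and the source document URI kept in the bucket metadata (plain Python with hypothetical data, not his actual code):

```python
# Toy model of a Riak bucket: metadata plus objects keyed by RDF resource URI.
bucket = {
    "metadata": {"source": "http://example.org/data.rdf"},  # graph provenance
    "objects": {},
}

def store_triple(s, p, o, is_literal):
    # One Riak object per RDF resource; ":id" holds the resource URI/bNode.
    obj = bucket["objects"].setdefault(s, {":id": s, "values": {}, "links": []})
    if is_literal:
        obj["values"][p] = o        # literal value -> stored under an object key
    else:
        obj["links"].append((p, o)) # resource value -> a Link header entry

store_triple("ex:alice", "foaf:name", "Alice", is_literal=True)
store_triple("ex:alice", "foaf:knows", "ex:bob", is_literal=False)

alice = bucket["objects"]["ex:alice"]
print(alice["values"])  # {'foaf:name': 'Alice'}
print(alice["links"])   # [('foaf:knows', 'ex:bob')]
```

The appeal of the link-based representation is that graph traversal becomes Riak link walking, which the store can follow without the client reassembling triples.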

Due to their high connectivity, graphs are difficult to scale, and this could be an interesting approach.

Original title and link: Riak: Storing an RDF Graph (NoSQL databases © myNoSQL)


3 Differences between RDF Databases and Other NoSQL Solutions

RDF database systems form the largest subset of this last NoSQL category. RDF data can be thought of in terms of a decentralized directed labeled graph wherein the arcs start with subject URIs, are labeled with predicate URIs, and end up pointing to object URIs or scalar values.

Bottom line: it sounds like there’s only one difference, standardization.

  • A simple and uniform standard data model: all RDF database systems share the same well-specified and W3C-standardized data model at their base.
  • A powerful standard query language: SPARQL is a very big win for RDF databases here, providing a standardized and interoperable query language that even non-programmers can make use of, and one which meets or exceeds SQL in its capabilities and power while retaining much of the familiar syntax.
  • Standardized data interchange formats: RDF databases, by contrast, all have import/export capability based on well-defined, standardized, entirely implementation-agnostic serialization formats such as N-Triples and N-Quads.
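The “decentralized directed labeled graph” model is concrete enough to query in a few lines. The sketch below does naive triple-pattern matching, the primitive that SPARQL basic graph patterns are built from (hypothetical data; a real store would of course use indexes rather than scanning):

```python
# Triples as (subject, predicate, object); None in a pattern means "any",
# playing the role of a SPARQL ?variable.
triples = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "rdf:type", "foaf:Person"),
}

def match(pattern, data):
    """Return all triples matching a (s, p, o) pattern with None wildcards."""
    return {t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))}

# Roughly: SELECT ?o WHERE { ex:alice foaf:knows ?o }
print(match(("ex:alice", "foaf:knows", None), triples))
# {('ex:alice', 'foaf:knows', 'ex:bob')}
```

A full SPARQL engine is essentially this operation plus joins between patterns sharing variables, which is why the standardized data model translates so directly into a standardized query language.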