NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence

Unstructured Data: What Is It?

Paige Roberts writes in a post about integrating predictive analytics with Hadoop:

Unstructured is really a misnomer. I think it was Curt Monash who coined the term polystructured. That makes a lot more sense, since if data was truly without structure, even humans wouldn’t be able to make sense of it. In every seemingly unstructured dataset, there is some form of structure. An email has structure. A web page has structure. A Twitter stream has structure. Facebook interactions have structure. Machine generated log files have structure. But none of those structures are remotely alike. Nor are they remotely similar to the structure of a standard transactional record.

I don’t think many people think of unstructured data as data with a completely random structure. As I understand it, the term unstructured covers three dimensions:

  1. variability: data representing the same entities can take different forms and contain different details. The simplest example I could think of is the information about a video shared on two different platforms.
  2. multi-purpose: data is not representing a single entity, but rather a set of related entities in an aggregated or compo
  3. data closer to natural language than mathematical structure: take for example some normal English text—according to the grammar rules it has structure, but it’s not easily understandable by machines (nb: maybe machine descriptiveness would be a better way to name this dimension)
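The variability dimension can be illustrated with a small sketch. Everything below is invented for illustration: the two record shapes, the field names, and the values are hypothetical and do not come from any real platform. The point is only that both records describe the same video, yet a consumer has to map each shape onto a common form before it can compare them.

```python
# Hypothetical records: the same video as two made-up platforms might describe it.
record_a = {"title": "Intro to HBase", "length_seconds": 754, "tags": ["hbase", "nosql"]}
record_b = {"name": "Intro to HBase", "duration": "12:34", "uploader": {"handle": "dbfan"}}


def normalize(record):
    """Map either hypothetical shape onto a common (title, seconds) pair."""
    title = record.get("title") or record.get("name")
    if "length_seconds" in record:
        seconds = record["length_seconds"]
    else:
        # The second shape stores the duration as a "minutes:seconds" string.
        minutes, secs = record["duration"].split(":")
        seconds = int(minutes) * 60 + int(secs)
    return title, seconds
```

Both records have structure; it is just not the same structure, which is what makes the data awkward to handle with a single fixed schema.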

Original title and link: Unstructured Data: What Is It? (NoSQL database©myNoSQL)

How to Organize Your HBase Keys

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

This sounds similar to how Amazon DynamoDB hash and range type primary keys or Oracle NoSQL major-minor keys work.
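The funnel idea can be sketched without HBase at all. The snippet below is a minimal illustration, not HBase client code: a sorted Python list stands in for the lexicographically ordered row keys, and the separator, the key components (customer, date, event id), and the helper names are all assumptions made for the example.

```python
import bisect

# Assumed separator: a byte that sorts before any character used in a component.
SEP = "\x00"


def make_key(*components):
    """Serialize components in funnel order: broadest first, most specific last."""
    return SEP.join(components)


def prefix_scan(sorted_keys, *known_components):
    """Return the keys whose leading components match, relying on sort order."""
    prefix = make_key(*known_components) + SEP
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches


# Hypothetical rows keyed by (customer, date, event id).
keys = sorted(
    make_key(customer, date, event)
    for customer, date, event in [
        ("acme", "2012-07-01", "e1"),
        ("acme", "2012-07-02", "e2"),
        ("zenith", "2012-07-01", "e3"),
    ]
)

acme_rows = prefix_scan(keys, "acme")                    # efficient: leading component
acme_day = prefix_scan(keys, "acme", "2012-07-02")       # efficient: first two components
by_date_only = [k for k in keys if SEP + "2012-07-01" + SEP in k]  # needs a full scan
```

Querying by date alone cannot use the ordering, which is the limitation the quote describes. The same sketch also hints at the caveat: if most traffic targets a single leading value such as "acme", all of it lands in one contiguous key range.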



Virtual Nodes Strategies

Interesting post by Acunu on three strategies for virtual nodes:

At Acunu we wanted to bring virtual nodes support to Cassandra to alleviate some of these operations headaches and I spent some time looking at some of the different approaches we could take. The three main variants that I found could be broadly divided into the following three categories: 1) random token assignment; 2) fixed partition assignment; 3) automatic sharding
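To make the first variant more concrete, here is a minimal sketch of random token assignment on a hash ring. It is not Acunu's or Cassandra's implementation: the 64-bit token space, the MD5-based key hash, the fixed seed, and the function names are all assumptions chosen to keep the example small and repeatable.

```python
import bisect
import hashlib
import random


def build_ring(nodes, tokens_per_node, seed=0):
    """Give every node several random tokens; return the ring sorted by token."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return sorted(
        (rng.getrandbits(64), node)
        for node in nodes
        for _ in range(tokens_per_node)
    )


def owner(ring, key):
    """Find the node owning the first token at or after the key's hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big")
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, position) % len(ring)  # wrap around the ring
    return ring[index][1]


ring = build_ring(["node-a", "node-b", "node-c"], tokens_per_node=8)
node_for_user = owner(ring, "user:42")
```

Because each node holds many small, scattered ranges instead of one large range, adding or removing a node moves a little data from many peers rather than a lot of data from one neighbor.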



NoSQL Database Virtual Panel: Architectural Patterns, Use Cases, Limitations and Constraints

In a sort of follow-up to the NoSQL panel I hosted at QCon, InfoQ published a virtual panel covering the following topics:

  1. What is the current state of NoSQL databases in terms of enterprise adoption?
  2. Can you discuss the core architectural patterns your NoSQL database product supports in the areas of data persistence, retrieval, and other database related concerns?
  3. What are the use cases or applications that are best suited to use your product?
  4. What are the limitations or constraints of using the product?
  5. Cross-Store Persistence concept is getting a lot of attention lately. This can be used to persist the data into different databases including NoSQL and Relational databases. Can you talk about this approach and how it can help the application developers who need to connect to multiple databases from a single application?
  6. What do you see as the role of In-memory data grids (IMDG) in the polyglot persistence space?
  7. What is the current state of security in this space and what’s coming?
  8. What is the future road map for your product in terms of new features and enhancements?

I’d suggest reading the answers to questions 2 to 6 at least.



The Behavior of EC2/EBS Metadata Replicated Datastore

The Amazon post about the service disruption that happened late last month provides an interesting description of the behavior of the Amazon EC2 and EBS metadata datastores:

The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.
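The behavior described in the quote can be modeled as a small state machine. This is only a toy sketch of the described behavior, not Amazon's implementation: the class, method names, and zone names are hypothetical, and real replication between copies is left out entirely.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class ReplicatedMetadataStore:
    """Toy model: one primary copy plus replicas in other zones (hypothetical)."""

    def __init__(self, primary_zone, replica_zones):
        self.primary_zone = primary_zone
        self.replica_zones = list(replica_zones)
        self.mode = Mode.READ_WRITE
        self.data = {}

    def read(self, key):
        # Reads keep working in both modes.
        return self.data.get(key)

    def write(self, key, value):
        if self.mode is not Mode.READ_WRITE:
            raise RuntimeError("datastore is in read-only mode")
        self.data[key] = value

    def primary_lost_power(self):
        # Protect against corruption: stop accepting writes, keep serving reads.
        self.mode = Mode.READ_ONLY

    def power_restored(self):
        self.mode = Mode.READ_WRITE

    def promote(self, zone):
        # Called only once it has been judged safe to promote another copy.
        if zone not in self.replica_zones:
            raise ValueError("unknown replica zone: " + zone)
        self.replica_zones.remove(zone)
        self.primary_zone = zone
        self.mode = Mode.READ_WRITE
```

The trade-off is visible even in the toy version: during the outage, metadata stays readable, but any API call that needs to change it fails until power returns or a new primary is promoted.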



NoSQL Data Models and Adoption

In an interview for the DataStax blog, Philippe Modard, engineer and CTO at V2i, says:

The big difference over relational databases is the data model. Once we understood how things needed to be modeled and defined, everything else was a piece of cake.

Indeed, the new data models are the first obstacle developers encounter when considering a NoSQL database. Some might think the obstacle is using new APIs, lacking a query language like SQL, or having to use a different one. But I don’t think these are the real causes.

The first time I experienced the unfamiliarity of a new data model was back in 2005, when I started using Jackrabbit, a JCR implementation (a hierarchical model). Then, a couple of years later, I had the same feeling when first using the Google App Engine data store.

It wasn’t about the new APIs though. And it wasn’t about the query languages either. For me it was about rethinking how I store and access data. It was striking to realize how used I was to thinking in terms of a relational model, even if not everything I had implemented before was purely relational.

Looking at the various NoSQL databases around, you can see that those that started with a data model that felt closer to the relational model have seen faster adoption. And I don’t think the main reason behind it is better data models per se, but just familiarity.


The Myth of Auto Scaling as a Capacity Planning Approach

A quite old but very educational post by James Golick dissecting the mythical extra server capacity:

There’s this idea floating around that we can scale out our data services “just in time”. Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It’s a great sounding story, but sadly, things don’t quite work that way.

This is the Mythical Man-Month of the IT department.

John Allspaw



NO DB - the Center of Your Application Is Not the Database

Uncle Bob:

The center of your application is not the database. Nor is it one or more of the frameworks you may be using. The center of your application are the use cases of your application. […] If you get the database involved early, then it will warp your design. It’ll fight to gain control of the center, and once there it will hold onto the center like a scruffy terrier. You have to work hard to keep the database out of the center of your systems. You have to continuously say “No” to the temptation to get the database working early.



The Grand Picture of Big Data and the Impact on the Architecture of Systems

In a recent interview for AllThingsD, Mike Rhodin, the senior vice president of IBM’s Software Solutions Group, gave a very realistic description of what the future of data looks like:

[…] it comes out of the digitization of the physical world, the instrumentation of physical processes that’s going to generate huge amounts of new data, which is going to drive issues around storage, and what to do with all the data, how to analyze it. That pushes you toward real-time analytics and streaming technologies, because with real time, you don’t have to save the data — you want to look for anomalies as they occur.

This is indeed the grand picture of Big Data.
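The "look for anomalies as they occur" part of the quote can be sketched with a detector that never stores the stream. This is a minimal illustration under my own assumptions, not anything IBM describes: it uses Welford's online algorithm for the running mean and variance, and the threshold, warm-up length, and sample readings are arbitrary.

```python
import math


class RunningStats:
    """Welford's online mean/variance: constant memory, one pass over the data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Yield readings further than `threshold` standard deviations from the mean."""
    stats = RunningStats()
    for x in stream:
        sd = stats.stddev()
        if stats.count >= warmup and sd > 0 and abs(x - stats.mean) > threshold * sd:
            yield x
        stats.update(x)  # only the three running numbers are kept, never the data


# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 11, 10, 50, 10]
anomalies = list(detect_anomalies(readings))
```

The detector holds three numbers regardless of how long the stream runs, which is the storage argument the quote makes for real-time analytics.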

Now think for a second how many companies have such systems in place. Not many. Think now how many companies can offer as-complete-as-possible integrated systems to address these challenges. Very few.

These two answers reveal an interesting perspective on the future of the Big Data market.

On one side we have vendors building top notch solutions—consider the new features in the relational databases, NoSQL databases, Hadoop, etc. By looking at this space you’ll have to agree that all these are excellent solutions for tackling a sub-space of the overall problem. They are getting closer and closer to offering local optimum solutions.

On the other side there are the system integrators and platform vendors. Their systems may not be the best at solving every aspect of a problem, but their focus is on addressing and solving the complete problem. Their sales pitch is integration and/or specialization.

As someone writing about polyglot persistence, the 1001 NoSQL and NewSQL databases, and the evolution of relational databases, I could be tempted to think that every company has the budget, the know-how, and the time to take top-notch sub-systems and create solutions crafted to its problem. But looking back in time and also applying the lessons from other markets, I think it is safe to say that integrated solutions are preferred.

The lesson to be learned by both NoSQL and relational database vendors, and actually by all (sub)system vendors playing in the Big Data market, is to design products with openness and integration in mind. Very few, if any, sub-systems will be part of the grand solution if they are architected as silos. They can continue to provide the ultimate local optimum solutions, but as long as they are not architected to be part of a collaborative integrated platform they’ll lose important segments of the market. Many products I’m writing about already follow this principle, many are taking steps towards being friendlier in terms of integration, and many are still taking the silver bullet approach.


Neo4j Data Modeling: What Question Do You Want to Answer?

Mark Needham:

Over the past few weeks I’ve been modelling ThoughtWorks project data in neo4j and I realised that the way that I’ve been doing this is by considering what question I want to answer and then building a graph to answer it.

This same principle should be applied to modeling with any NoSQL database. Thinking in terms of access patterns is one of the major differences between doing data modeling in the NoSQL space and the relational world, which is driven, at least in the first phases and theoretically, by the normalization rules.
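A small sketch shows what modeling from the question looks like. It uses plain Python dictionaries rather than Neo4j, and the question ("who has worked with this person on the same project?"), the class, and the sample names are all hypothetical. The structure exists only because that question needs answering.

```python
from collections import defaultdict


class ProjectGraph:
    """Person-to-project edges, indexed in both directions for one question."""

    def __init__(self):
        self.projects_of = defaultdict(set)  # person -> projects worked on
        self.people_on = defaultdict(set)    # project -> people who worked on it

    def add_worked_on(self, person, project):
        self.projects_of[person].add(project)
        self.people_on[project].add(person)

    def colleagues(self, person):
        """Answer the driving question by walking person -> project -> person."""
        found = set()
        for project in self.projects_of.get(person, ()):
            found |= self.people_on[project]
        found.discard(person)
        return found


graph = ProjectGraph()
graph.add_worked_on("alice", "billing")
graph.add_worked_on("bob", "billing")
graph.add_worked_on("alice", "search")
graph.add_worked_on("carol", "search")
graph.add_worked_on("dave", "reporting")

alice_colleagues = graph.colleagues("alice")
```

A different question, such as "which projects share the most people?", would likely call for different edges or indexes, which is the contrast with a normalization-driven relational design.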



In Defense of ORMs

Martin Fowler in a post defending the role of ORMs:

As you might have gathered, I think NoSQL is technology to be taken very seriously. If you have an application problem that maps well to a NoSQL data model - such as aggregates or graphs - then you can avoid the nastiness of mapping completely. Indeed this is often a reason I’ve heard teams go with a NoSQL solution.

Truth is that some NoSQL databases are getting more mapping libraries and frameworks than I’ve ever seen for relational databases. Even worse, there are attempts to hide all NoSQL databases behind the same libraries, specs, or APIs. I think that the old principle of decoupling the application from the underlying database is way too ingrained in the software community, so much so that almost nobody is asking the real questions: will I really need to change the data storage, will it actually work as planned, and what will I lose if I add just another indirection layer?

As Martin Fowler correctly emphasizes in his post, the real benefit of ORMs comes from eliminating the plumbing:

A framework that allows me to avoid 80% of that [boiler-plate code] is worthwhile even if it is only 80%. The problem is in me for pretending it’s 100% when it isn’t.

But that doesn’t mean that everyone should create their own mapping library/framework for each and every project.



The Future of NoSQL With Java EE

Markus Eisele:

We already have a lot in place for the so-called “NoSQL” DBs. And the groundwork for integrating this into new Java EE standards is promising. Control of embedded NoSQL instances should be done via JSR 322 (Java EE Connector Architecture) with this being the only allowed place spawn threads and open files directly from a filesystem. I’m not a big supporter of having a more general data abstraction JSR for the platform comparable to what Spring is doing with Spring Data. To me the concepts of the different NoSQL categories are too different than to have a one-size-fits-all approach. 

