NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence

Migrating Between Two Different Types of NoSQL Databases

Teacher asking a student:

After the Presentation the team leader asked me how it is, to migrate the db’s under various types. […] Can I migrate from a, key-value store db, like dynamo, to a, document store db, like mongoDB?

I’m not sure if this would have been reflected in the final grade, but I would have asked how many times the teacher had to, really had to, migrate data between multiple relational databases. And how many times did it work automatically? If allowed, I’d have followed up with a very brief dialogue about the complexity of migrating applications to different programming languages (even when they use the same programming paradigm) and brought up examples of important differences in how data structures are accessed and mutated. In the end I might have failed the exam though.

On a more serious note, there are so many aspects to migrating data that it is very difficult to give a good short answer to this question. A sign of this problem’s complexity is the wide range of companies and products trying to solve ETL.
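To make one of those aspects concrete, here is a minimal Python sketch of moving records from a key-value layout to a document layout. Everything in it is an assumption made for illustration: the `kv_store` dictionary stands in for an exported key-value dataset, and `to_document` is a hypothetical helper, not part of any Dynamo or MongoDB API. It only shows that values which happen to be JSON objects map naturally to documents, while opaque values can be copied but not queried.

```python
import json

# Hypothetical key-value export: keys are strings, values are opaque bytes.
kv_store = {
    "user:1": b'{"name": "Ada", "langs": ["en", "fr"]}',
    "user:2": b"\x00\x01binary-session-blob",
}

def to_document(key, value):
    """Turn one key-value pair into a document-style dict.

    Values that decode as JSON objects become real fields. Anything else is
    kept as an opaque hex payload, which a document store cannot query into.
    """
    doc = {"_id": key}
    try:
        decoded = json.loads(value.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and bad JSON
        decoded = None
    if isinstance(decoded, dict):
        doc.update(decoded)
    else:
        doc["payload_hex"] = value.hex()
    return doc

documents = [to_document(k, v) for k, v in sorted(kv_store.items())]
for doc in documents:
    print(doc)
```

The second record is where the short answer breaks down: the copy succeeds, but nothing was gained, because the application still has to know how to interpret the blob.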

Original title and link: Migrating Between Two Different Types of NoSQL Databases (NoSQL database©myNoSQL)

via: https://groups.google.com/forum/?fromgroups#!topic/nosql-discussion/_Oj4rJ2Q4q8%5B1-25%5D


Addiction to Familiar Systems

Marco Arment about switching from familiar systems or programming languages to better ones:

The fear of making the “wrong” choice actually makes the familiar, mastered PHP more attractive. […] If you can get PHP programmers to agree that they need to stop using it, the first question that comes up is what to use instead, and they’re met with a barrage of difficult choices and wildly different opinions and recommendations.

The same problem plagues anyone interested in switching to Linux (Which distro? Which desktop environment? Which package manager?), and the paralysis of choice-overload usually leads people to abandon the choice and just stick with Windows or OS X. But the switching costs of choosing the “wrong” programming language for a project are much larger.

Once you master a programming language or system you start seeing the other options from a different perspective. It doesn’t mean you have a better or an objective perspective though. What you’ve got is a new dimension you are considering in all decisions: familiarity. Every other option you have will go through your familiarity filter: does it feel familiar? does it allow me to do what I’ve been doing all this time? does it work in a similar way?

You might think that using a familiar system is all about productivity. I think that is only partially true. A familiar system doesn’t come with a learning curve and so in the early stages it feels productive. But many times you’ll just have to write the same things over and over again, avoid the same traps, and make the tweaks you’ve learned. In a way this part of being productive feels like repetition.

But what does all this have to do with databases? The answer is probably obvious.

Familiarity is in so many cases the main reason new systems start with a relational database. It feels familiar. It is familiar. As your application grows and new features are needed there will be cases when the relational database would become a less optimal solution. But in the name of familiarity, you’ll be tempted to stick with it. Make a change here and there, declare a feature too complicated, tweak it, optimize it. Repeat.

After a while, taking a step back might make you realize that what you’ve built is no longer familiar. Or maybe it’s still familiar to you, but to a new project team member it will feel different and new. Or maybe it is very similar to a different database that you could have started with.

The costs of sticking with familiar programming languages, systems, or databases could be much larger than you’d think.

Original title and link: Addiction to Familiar Systems (NoSQL database©myNoSQL)


The Benefits of Virtual Nodes and Performance Results

Sam Overton and Tom Wilkie of Acunu explain the advantages of using virtual nodes in distributed data storage engines and the performance they’ve measured after introducing virtual nodes in the Acunu platform, compared with Apache Cassandra:

One of the factors that limits the amount of data that can be stored on each node is the amount of time it takes to re-replicate that data when a node fails. That time matters, because it is a period during which the cluster is more vulnerable than normal to data loss. The challenge is that the more data stored on a node, the longer it takes to re-replicate it. Therefore, to store more data per node safely, we want to reduce the time taken to return to normal. This was one of our aims with virtual nodes.

Virtual Nodes reduces the time taken to re-replicate data as it involves every node in the cluster in the operation. In contrast, Apache Cassandra v1.1 will only involve a number of nodes equal to the Replication Factor (RF) of your keyspace. What’s more, with Virtual Nodes, the cluster remains balanced after this operation - you do not need to shuffle the tokens on the other nodes to compensate for the loss!
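The effect described in the quote can be illustrated with a toy ring model in Python. This is a sketch under stated assumptions, not Acunu's or Cassandra's implementation: tokens are random floats, each range is replicated on the first RF distinct nodes found clockwise, and the names `build_ring`, `replicas`, and `rebuild_peers` are made up for this example. It counts how many surviving nodes share at least one replica set with a failed node, that is, how many nodes can help re-replicate its data.

```python
import random

def build_ring(num_nodes, tokens_per_node, seed=42):
    """Return a sorted list of (token, node) pairs with random tokens."""
    rng = random.Random(seed)
    ring = [(rng.random(), node)
            for node in range(num_nodes)
            for _ in range(tokens_per_node)]
    return sorted(ring)

def replicas(ring, index, rf):
    """First `rf` distinct nodes found walking clockwise from ring[index]."""
    found = []
    for step in range(len(ring)):
        node = ring[(index + step) % len(ring)][1]
        if node not in found:
            found.append(node)
            if len(found) == rf:
                break
    return found

def rebuild_peers(ring, failed, rf):
    """Nodes that share at least one replica set with the failed node."""
    peers = set()
    for index in range(len(ring)):
        owners = replicas(ring, index, rf)
        if failed in owners:
            peers.update(n for n in owners if n != failed)
    return peers

NODES, RF = 12, 3
single = rebuild_peers(build_ring(NODES, 1), failed=0, rf=RF)
virtual = rebuild_peers(build_ring(NODES, 64), failed=0, rf=RF)
print("one token per node:", len(single), "peers")
print("64 tokens per node:", len(virtual), "peers")
```

In this model a node with a single token only ever shares data with a small, fixed set of ring neighbours, no matter how large the cluster is. With many tokens per node the failed node's ranges are scattered around the ring, so far more nodes can take part in the rebuild.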

Original title and link: The Benefits of Virtual Nodes and Performance Results (NoSQL database©myNoSQL)

via: http://www.acunu.com/2/post/2012/07/virtual-nodes-performance-results.html


Unstructured Data: What Is It?

Paige Roberts writes in a post about integrating predictive analytics with Hadoop:

Unstructured is really a misnomer. I think it was Curt Monash who coined the term polystructured. That makes a lot more sense, since if data was truly without structure, even humans wouldn’t be able to make sense of it. In every seemingly unstructured dataset, there is some form of structure. An email has structure. A web page has structure. A Twitter stream has structure. Facebook interactions have structure. Machine generated log files have structure. But none of those structures are remotely alike. Nor are they remotely similar to the structure of a standard transactional record.

I don’t think many people think of unstructured data as data with a completely random structure. As I understand it, the term unstructured covers three dimensions:

  1. variability: data representing the same entities can take different forms and contain different details. The simplest example I could think of is the information about a video shared on two different platforms.
  2. multi-purpose: data is not representing a single entity, but rather a set of related entities in an aggregated or compo
  3. data closer to natural language than mathematical structure: take for example some normal English text—according to the grammar rules it has structure, but it’s not easily understandable by machines (nb: maybe machine descriptiveness would be a better way to name this dimension)

Original title and link: Unstructured Data: What Is It? (NoSQL database©myNoSQL)


How to Organize Your HBase Keys

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

This sounds similar, in a way, to how Amazon DynamoDB hash and range type primary keys or Oracle NoSQL major-minor keys work.
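The funnel idea can be sketched in a few lines of Python. This is not HBase client code: a sorted list of byte strings stands in for a table ordered lexicographically by row key, and the key layout (app, date, event), the NUL separator, and the helpers `make_key` and `prefix_scan` are all assumptions made for illustration.

```python
from bisect import bisect_left

SEP = b"\x00"

def make_key(*parts):
    """Join components with a NUL separator (assumes no part contains NUL)."""
    return SEP.join(p.encode("utf-8") for p in parts)

# Hypothetical rows keyed by (app, date, event), kept in lexicographic order.
rows = sorted(
    make_key(app, day, event)
    for app, day, event in [
        ("app1", "2012-07-01", "launch"),
        ("app1", "2012-07-01", "purchase"),
        ("app1", "2012-07-02", "launch"),
        ("app2", "2012-07-01", "launch"),
    ]
)

def prefix_scan(sorted_keys, *leading_parts):
    """Return keys whose leading components match (partial prefixes only)."""
    prefix = make_key(*leading_parts) + SEP
    start = bisect_left(sorted_keys, prefix)  # jump straight to the range
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: nothing further can match
        result.append(key)
    return result

print(len(prefix_scan(rows, "app1")))                # 3
print(len(prefix_scan(rows, "app1", "2012-07-01")))  # 2

# Filtering on the second component alone cannot use the ordering:
# every key has to be inspected.
by_day = [k for k in rows if k.split(SEP)[1] == b"2012-07-01"]
print(len(by_day))                                   # 3
```

The same sketch hints at the caveat in the quote: if almost every key starts with `app1`, nearly all reads and writes land in one contiguous slice of the sorted key space.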

Original title and link: How to Organize Your HBase Keys (NoSQL database©myNoSQL)

via: http://tech.flurry.com/137492485


Virtual Nodes Strategies

Interesting post by Acunu on 3 strategies for virtual nodes:

At Acunu we wanted to bring virtual nodes support to Cassandra to alleviate some of these operations headaches and I spent some time looking at some of the different approaches we could take. The three main variants that I found could be broadly divided into the following three categories: 1) random token assignment; 2) fixed partition assignment; 3) automatic sharding
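As a rough illustration of the second category, here is a Python sketch of one plausible reading of fixed partition assignment: the key space is cut into a fixed number of equal partitions up front, and only the partition-to-node mapping changes when the cluster grows. The quote does not spell out the details, so the functions `assign` and `add_node` and the rebalancing rule (take one partition at a time from the currently most loaded node) are assumptions for this example, not a description of the Acunu design.

```python
def assign(num_partitions, nodes):
    """Round-robin a fixed set of equal-sized partitions over the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

def load(owners):
    """Count how many partitions each node owns."""
    counts = {}
    for node in owners.values():
        counts[node] = counts.get(node, 0) + 1
    return counts

def add_node(assignment, new_node):
    """Hand partitions to `new_node` until it holds its fair share.

    Partition boundaries never change; only ownership moves.
    Returns the new mapping and the list of partitions that moved.
    """
    owners = dict(assignment)
    node_count = len(set(owners.values()) | {new_node})
    target = len(owners) // node_count
    moved = []
    while load(owners).get(new_node, 0) < target:
        counts = load(owners)
        donor = max(sorted(counts), key=counts.get)  # most loaded node
        partition = min(p for p, n in owners.items() if n == donor)
        owners[partition] = new_node
        moved.append(partition)
    return owners, moved

before = assign(12, ["a", "b", "c"])
after, moved = add_node(before, "d")
print(load(before))
print(load(after), "moved partitions:", moved)
```

Only the moved partitions need to be streamed to the new node; every other partition stays where it was.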

Original title and link: Virtual Nodes Strategies (NoSQL database©myNoSQL)

via: http://www.acunu.com/2/post/2012/07/virtual-nodes-strategies.html


NoSQL Database Virtual Panel: Architectural Patterns, Use Cases, Limitations and Constraints

In a sort of follow-up to the NoSQL panel I hosted at QCon, InfoQ published a virtual panel covering the following topics:

  1. What is the current state of NoSQL databases in terms of enterprise adoption?
  2. Can you discuss the core architectural patterns your NoSQL database product supports in the areas of data persistence, retrieval, and other database related concerns?
  3. What are the use cases or applications that are best suited to use your product?
  4. What are the limitations or constraints of using the product?
  5. Cross-Store Persistence concept is getting a lot of attention lately. This can be used to persist the data into different databases including NoSQL and Relational databases. Can you talk about this approach and how it can help the application developers who need to connect to multiple databases from a single application?
  6. What do you see as the role of In-memory data grids (IMDG) in the polyglot persistence space?
  7. What is the current state of security in this space and what’s coming?
  8. What is the future road map for your product in terms of new features and enhancements?

I’d suggest reading the answers to questions 2 to 6 at least.

Original title and link: NoSQL Database Virtual Panel: Architectural Patterns, Use Cases, Limitations and Constraints (NoSQL database©myNoSQL)

via: http://www.infoq.com/articles/virtual-panel-nosql-database-patterns


The Behavior of EC2/EBS Metadata Replicated Datastore

The Amazon post about the service disruption that happened late last month provides an interesting description of the behavior of the Amazon EC2 and EBS metadata datastores:

The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.
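The behavior described in the quote can be modeled as a tiny state machine. The Python sketch below is purely illustrative: `MetadataStore`, its methods, and the zone names are hypothetical and say nothing about how Amazon actually implements these datastores. It only captures the rule that losing the primary flips the store to read-only until power returns or another copy is promoted.

```python
class ReadOnlyError(Exception):
    """Raised when a write arrives while the store is in read-only mode."""

class MetadataStore:
    """Toy model of a replicated store with one primary copy."""

    def __init__(self, zones, primary):
        self.zones = list(zones)
        self.primary = primary
        self.read_only = False
        self.data = {}

    def primary_lost_power(self):
        # Guard against corruption: keep serving reads, refuse writes.
        self.read_only = True

    def power_restored(self):
        # The original primary is back, so writes can resume.
        self.read_only = False

    def promote(self, zone):
        # Assumed to happen only once it is judged safe to do so.
        if zone not in self.zones:
            raise ValueError("unknown zone: " + zone)
        self.primary = zone
        self.read_only = False

    def put(self, key, value):
        if self.read_only:
            raise ReadOnlyError("store is read-only; primary unavailable")
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

store = MetadataStore(zones=["zone-a", "zone-b", "zone-c"], primary="zone-a")
store.put("vol-1", "attached")
store.primary_lost_power()
print(store.get("vol-1"))        # reads still work
try:
    store.put("vol-2", "creating")
except ReadOnlyError as err:
    print("write rejected:", err)
store.promote("zone-b")
store.put("vol-2", "creating")   # writes work again
```

The trade-off is explicit in the model: during the read-only window existing metadata stays available, but any API call that needs to create or change a resource fails.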

Original title and link: The Behavior of EC2/EBS Metadata Replicated Datastore (NoSQL database©myNoSQL)

via: http://aws.amazon.com/message/67457/


NoSQL Data Models and Adoption

In an interview for the DataStax blog, Philippe Modard, engineer and CTO at V2i:

The big difference over relational databases is the data model. Once we understood how things needed to be modeled and defined, everything else was a piece of cake.

Indeed, the new data models are the first obstacle developers encounter when considering a NoSQL database. Some might think it’s about using new APIs, lacking a query language like SQL, or having to use a different one. But I don’t think these are the real causes.

The first time I experienced the unfamiliarity of a new data model was back in 2005, when I started using the Jackrabbit JCR implementation (a hierarchical model). Then a couple of years later I had the same feeling when first using the Google App Engine data store.

It wasn’t about the new APIs though. And it wasn’t about the query languages either. For me it was about rethinking how I store and access data. It was striking to realize how used I was to thinking in terms of a relational model, even if not everything I had implemented before was purely relational.

Looking at the various NoSQL databases around, you could see how those that started with a data model that felt closer to the relational model have seen faster adoption. And I don’t think the main reason behind it is better data models per se, but just familiarity.

Original title and link: NoSQL Data Models and Adoption (NoSQL database©myNoSQL)


The Myth of Auto Scaling as a Capacity Planning Approach

A rather old, but very instructive post by James Golick dissecting the mythical extra server capacity:

There’s this idea floating around that we can scale out our data services “just in time”. Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It’s a great sounding story, but sadly, things don’t quite work that way.

This is the Mythical Man-Month of the IT department.

John Allspaw

Original title and link: The Myth of Auto Scaling as a Capacity Planning Approach (NoSQL database©myNoSQL)

via: http://jamesgolick.com/2010/10/27/we-are-experiencing-too-much-load-lets-add-a-new-server..html


NO DB - the Center of Your Application Is Not the Database

Uncle Bob:

The center of your application is not the database. Nor is it one or more of the frameworks you may be using. The center of your application are the use cases of your application. […] If you get the database involved early, then it will warp your design. It’ll fight to gain control of the center, and once there it will hold onto the center like a scruffy terrier. You have to work hard to keep the database out of the center of your systems. You have to continuously say “No” to the temptation to get the database working early.

Original title and link: NO DB - the Center of Your Application Is Not the Database (NoSQL database©myNoSQL)

via: http://blog.8thlight.com/uncle-bob/2012/05/15/NODB.html


The Grand Picture of Big Data and the Impact on the Architecture of Systems

In a recent interview for AllThingsD, Mike Rhodin, the senior vice president of IBM’s Software Solutions Group gave a very realistic description of what the future of data looks like:

[…] it comes out of the digitization of the physical world, the instrumentation of physical processes that’s going to generate huge amounts of new data, which is going to drive issues around storage, and what to do with all the data, how to analyze it. That pushes you toward real-time analytics and streaming technologies, because with real time, you don’t have to save the data — you want to look for anomalies as they occur.

This is indeed the grand picture of Big Data.

Now think for a second how many companies have such systems in place. Not many. Think now how many companies can offer as-complete-as-possible integrated systems to address these challenges. Very few.

These two answers reveal an interesting perspective on the future of the Big Data market.

On one side we have vendors building top notch solutions—consider the new features in the relational databases, NoSQL databases, Hadoop, etc. By looking at this space you’ll have to agree that all these are excellent solutions for tackling a sub-space of the overall problem. They are getting closer and closer to offering local optimum solutions.

On the other side there are the system integrators and platform vendors. Their systems may not be the best at solving every aspect of a problem, but their focus is on addressing and solving the complete problem. Their sales pitch is integration and/or specialization.

As someone writing about polyglot persistence and the 1001 NoSQL, NewSQL, and the development of the relational databases, I could be tempted to think that every company would have the budget, the know-how, and the time to take top-notch sub-systems and create solutions crafted to their problem. But looking back in time and also applying the lessons from other markets, I think it is safe to say that integrated solutions are preferred.

The lesson to be learned by both NoSQL and relational database vendors, actually by all (sub)system vendors that are playing in the Big Data market is to design products with openness and integration in mind. Very few, if any, sub-systems will be part of the grand solution if they are architected as silos. They can continue to provide the ultimate local optimum solutions, but as long as they are not architected to be part of a collaborative integrated platform they’ll lose important segments of the market. Many products I’m writing about are already following this principle, many are making steps towards being friendlier in terms of integration, and many are still taking the silver bullet approach.

Original title and link: The Grand Picture of Big Data and the Impact on the Architecture of Systems (NoSQL database©myNoSQL)