NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence
In a recent interview for AllThingsD, Mike Rhodin, the senior vice president of IBM’s Software Solutions Group gave a very realistic description of what the future of data looks like:
[…] it comes out of the digitization of the physical world, the instrumentation of physical processes that’s going to generate huge amounts of new data, which is going to drive issues around storage, and what to do with all the data, how to analyze it. That pushes you toward real-time analytics and streaming technologies, because with real time, you don’t have to save the data — you want to look for anomalies as they occur.
This is indeed the grand picture of Big Data.
Now think for a second how many companies have such systems in place. Not many. Think now how many companies can offer as-complete-as-possible integrated systems to address these challenges. Very few.
These two answers are revealing an interesting perspective about the future of the Big Data market.
On one side we have vendors building top notch solutions—consider the new features in the relational databases, NoSQL databases, Hadoop, etc. By looking at this space you’ll have to agree that all these are excellent solutions for tackling a sub-space of the overall problem. They are getting closer and closer to offering local optimum solutions.
On other side there are the system integrators and platform vendors. Their systems may not be the best in solving every aspect of a problem, but their focus is in addressing and solving the complete problem. Their sales pitch is integration and/or specialization.
As someone writing about polyglot persistence and the 1001 NoSQL, NewSQL, and the development of the relational databases, I could be tempted to think that every company would have the budget, the know-how, and the time to take top-notch sub-systems and create solutions crafted to their problem. But looking back in time and also applying the lessons from other markets, I think it is safe to say that integrated solutions are preferred.
The lesson to be learned by both NoSQL and relational database vendors, actually by all (sub)system vendors that are playing in the Big Data market is to design products with openness and integration in mind. Very few, if any, sub-systems will be part of the grand solution if they are architected as silos. They can continue to provide the ultimate local optimum solutions, but as long as they are not architected to be part of a collaborative integrated platform they’ll lose important segments of the market. Many products I’m writing about are already following this principle, many are making steps towards being friendlier in terms of integration, and many are still taking the silver bullet approach.
Original title and link: The Grand Picture of Big Data and the Impact on the Architecture of Systems ( ©myNoSQL)
Patrick Valduriez, co-author of the “Principles of Distributed Database Systems” book, has published a paper Principles of Distributed Data Management in 2020? (pdf) translating the main topic into the following 3 questions:
- What are the fundamental principles behind the emerging solutions?
- Is there any generic architectural model, to explain those principles?
- Do we need new foundations to look at data distribution?
Wrt (2), I showed that emerging solutions can still be explained along the three main dimensions of distributed data management (distribution, autonomy, heterogeneity), yet pushing the scales of the dimensions high up. However, I raised the question of how generic should distributed data management be, without hampering application-specific optimizations. Emerging NOSQL solutions tend to rely on a specific data model (e.g. Bigtable, MapReduce) with a simple set of operators easy to use from or with a programming language. It is also interesting to witness the development of algebras, with specific operators, to raise the level of abstraction in a way that enables optimization . What is missing to explain the principles of emerging solutions is one or more dimensions on generic/specific data model and data processing.
What I think this paper does is actually looking at two different questions, a bit less generic but still useful in proving that the new generation of distributed database systems was clearly triggered by the new requirements and the evolution of the current applications:
- Is there a need for new approaches in distributed data management systems?
- What are some of the approaches used by the emerging solution to deal with the challenges posed by today’s data-intensive applications?
You can read or download Patrick Valduriez’s paper here:
Download the slides, set aside 1 hour and 10 minutes of uncontended time, click the Maximize button, and watch this great presentation by Martin Thompson and Michael Barker diving into the Intel x86_64 processors and memory models for implementing lock-free algorithms. Once you’re done make sure to also read The Single Writer Principle by the same Martin Thompson.
Original title and link: Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works ( ©myNoSQL)
A reminder to those thinking that networks never fail and automation can solve everything. Christina Ilvento, on behalf of the App Engine team:
The root cause of the outage was a combination of two factors during a scheduled network maintenance in one of our datacenters. As part of the scheduled maintenance, network capacity to and from this datacenter was reduced. This alone was expected, and was not a problem. However, this maintenance exposed a previously existing misconfiguration in the system that manages network bandwidth capacity.
Ordinarily, the bandwidth management system helps isolate and prioritize traffic. When capacity is reduced because of maintenance, network failure, or due to an excess of normal traffic, the bandwidth management system keeps things running smoothly by throttling back the rate of low priority traffic. However, as mentioned, the bandwidth management system had a latent misconfiguration which did not show up until capacity was reduced due to the scheduled maintenance. This misconfiguration under-reported the available network capacity to and from the datacenter, causing the network modeler to believe that there was less overall capacity than actually existed.
The configuration error in the bandwidth management system, when combined with an expected reduction in capacity due to the scheduled maintenance, led the system to conclude that there was insufficient bandwidth available for current traffic demand to and from this datacenter. (In reality, there was more than sufficient excess capacity, as otherwise the maintenance would not have been allowed to go forward.) Because of this combination of misconfiguration and scheduled maintenance, a number of services were automatically blocked from sending network traffic. […]
The outage occurred because two independent systems failed at the same time, which resulted in mistakes in our usual escalation procedures which significantly impacted the duration of the outage.
Original title and link: Networks Never Fail ( ©myNoSQL)