
And the prize goes to… FoundationDB Fault Tolerance Demo

Best. Demo. Ever.

Some details about the demo and the setup can be found in the blog post and in the HN thread.

Original title and link: And the prize goes to… FoundationDB Fault Tolerance Demo (NoSQL database©myNoSQL)


4 Modern Data Loading Tips to Save Time, Money & Your Sanity

April Healy for Attunity [1]:

Here are 4 tips for modern data loading that can save you a lot of time, money, and your sanity!

  1. Ditch the development — Look into data ingestion/integration software that frees up your developer hours. Save those hours for revenue generating projects that exploit the opportunities you uncover with all the new data you are able to analyze.
  2. Find your “wingman” — Partner with a company that has the right solutions for your business needs. Things you might need: the capability to handle heterogeneous data sources, minimal impact on the source systems, automated processes, and/or graphical interfaces that make set up a snap. Your wingman should be able to help you get the data into your BI environment when you want it without a lot of hassle.
  3. Consider ingesting all the data in raw form — Gone are the days when data had to be cleansed, no pristine, before it entered the hallowed halls of the data warehouse. Let your data analysts sort out what is valuable within the BI environment. Who knows what nuggets they will find.
  4. Boost your BI System ROI by loading smarter — Don’t wait weeks or months to access new data sources; enhance your loading processes so you can “do it now!”

All good advice (well, the first three at least). After considering them, draw a line and compare:

  1. What will you gain? (The answer is probably a combination of solving the problem faster and learning about the process from experts in the field.)
  2. What will you lose? (The answer is probably that all your eggs will be in that vendor’s basket: whatever the vendor charges for the next update, change, or upgrade, you’ll have to pay.)

My generic answer is that I don’t believe in pure white or pure black: the optimal solution is probably a combination of finding a set of good tools and hiring people who can take care of them and evolve the solution. That said, there might be companies out there where having a solution right now could show great ROI, thus making building a team a secondary priority. Or vice versa.


  1. You guessed it right: they sell ETL solutions. 

Original title and link: 4 Modern Data Loading Tips to Save Time, Money & Your Sanity (NoSQL database©myNoSQL)

via: http://www.attunity.com/blog/4-modern-data-loading-tips-save-time-money-your-sanity


28msec - query data from any source in real time

Derrick Harris writing about the still-in-stealth-mode 28msec and its generic query language:

Their solution was to create a platform able to extract data from any of these sources, transform it into a standard format, and then let users analyze it using a single query language that looks a lot like the SQL they already know. 28msec is based on the open source JSONiq and Zorba query languages and will be available as a cloud service.

This sounds like a variant of an ETL process: Extract-Transform-Query. But it got me thinking of what Daniel Abadi has written about the difference between Hadapt and PolyBase or HAWQ (just replace Hadoop with another source of data and SQL with JSONiq):

[…] they all can access data in Hadoop, but there needs to be some sort of structured schema defined in order for the database to understand how to access it via SQL. So, bottom line, Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.

The question is not whether this process will work (ETL processes have been around for quite a while), but what you can do to optimize this extract-transform-query process.
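To make the extract-transform-query idea concrete, here is a minimal sketch in Python. It is not 28msec’s implementation and uses neither JSONiq nor Zorba; the two inline sources, the field names, and the helper functions are all assumptions made up for illustration. It only shows the shape of the pipeline: pull records from unlike sources, normalize them into one format, then run a single query over the result.

```python
import csv
import io
import json

# Two hypothetical sources in different formats (illustrative data only).
CSV_SOURCE = "Customer,Amount\nalice,10.5\nbob,99\n"
JSON_SOURCE = '[{"customer": "carol", "amount": 250}]'


def extract_csv(text):
    """Extract: read CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read a JSON array of objects into a list of dicts."""
    return json.loads(text)


def transform(record):
    """Transform: lowercase the keys and make the amount numeric."""
    row = {key.lower(): value for key, value in record.items()}
    row["amount"] = float(row["amount"])
    return row


def query(records, predicate):
    """Query: one filter that works the same way for every source."""
    return [record for record in records if predicate(record)]


def run():
    extracted = extract_csv(CSV_SOURCE) + extract_json(JSON_SOURCE)
    records = [transform(record) for record in extracted]
    return query(records, lambda record: record["amount"] > 50)
```

Every query in this sketch pays the full extract and transform cost again, which is exactly where the optimization question comes in.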

Original title and link: 28msec - query data from any source in real time (NoSQL database©myNoSQL)

via: http://gigaom.com/2013/06/11/stealth-mode-28msec-wants-to-build-a-tower-of-babel-for-databases/


With New Product Packaging, Adopting the Platform for Big Data is Even Easier | Apache Hadoop for the Enterprise | Cloudera

In addition, by choosing Cloudera Enterprise, you open the door to add other capabilities to your subscription as you wish – powerful tools like:

  • Cloudera Enterprise RTD (Real Time Delivery) – Support for HBase
  • Cloudera Enterprise RTQ (Real Time Query) – Support for Impala
  • Cloudera Enterprise BDR (Backup and Disaster Recovery) - Support for BDR
  • Cloudera Navigator – Data management for your Cloudera Enterprise deployment

And when Cloudera Search (beta) becomes generally available, you’ll be able to add:

  • RTS (Real Time Search) – Support for Cloudera Search

Isn’t this called nickel-and-diming?

Original title and link: With New Product Packaging, Adopting the Platform for Big Data is Even Easier | Apache Hadoop for the Enterprise | Cloudera (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/06/adopting-cloudera-platform-even-easier/


IBM and 10gen are collaborating on a standard that would make it easier to write applications that can access data from both MongoDB and relational systems such as IBM DB2

The details are pretty confusing [1]:

[…] the new standard — which encompasses the MongoDB API, data representation (BSON), query language and wire protocol — appears to be all about establishing a way for mobile and other next-generation applications to connect with enterprise database systems such as IBM’s popular DB2 database and its WebSphere eXtreme Scale data grid.

But the juicy part is in the comments, if you can ignore the pitches.


  1. If this is a new standard and it is all based on the already existing MongoDB API, BSON, and wire protocol, then 1) what’s new about it, and 2) what exactly will make it a standard?

Original title and link: IBM and 10gen are collaborating on a standard that would make it easier to write applications that can access data from both MongoDB and relational systems such as IBM DB2 (NoSQL database©myNoSQL)

via: http://gigaom.com/2013/06/04/ibm-throws-its-weight-behind-mongodb-for-mobile-apps/


Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H

Daniel Abadi in an email to Curt Monash analyzing the Microsoft Polybase paper [1]:

The basic difference between Polybase and Hadapt is the following. With Polybase, the basic interface to the user is the MPP database software (and DBMS storage) that Microsoft is selling. Hadoop is viewed as a secondary source of data — if you have a dataset stored inside Hadoop instead of the database system for whatever reason, then the database system can access that Hadoop data on the fly and include that data in query processing alongside data that is already stored inside the database system. However, the user must be aware that she might want to query the data in Hadoop in advance — she must register this Hadoop data to the MPP database through an external table definition (and ideally statistics should be generated in advance to help the optimizer). Furthermore, the Hadoop data must be structured, since the external table definition requires this (so you can’t really access arbitrary unstructured data in Hadoop). The same is true for SQL-H and Hawq — they all can access data in Hadoop (in particular data stored in HDFS), but there needs to be some sort of structured schema defined in order for the database to understand how to access it via SQL. So, bottom line, Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.

It’s a long paragraph, but the difference Daniel Abadi is emphasizing is critical: “Hadoop/HDFS data that could theoretically have been stored in DBMS all along”.
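The two constraints Abadi describes (the data must be registered in advance, and it must be structured) can be illustrated with a small Python sketch. This is not Polybase, SQL-H, or HAWQ code; the `ExternalTables` class, the `orders` table, and the sample lines standing in for a file in HDFS are all hypothetical.

```python
# Lines standing in for a delimited file stored in HDFS (illustrative data).
RAW_LINES = ["1|alice|10.5", "2|bob|99.0"]


class ExternalTables:
    """Toy catalog: raw data is only queryable once a schema is registered."""

    def __init__(self):
        self._tables = {}

    def register(self, name, lines, columns, delimiter="|"):
        """Declare the structure up front; columns is a list of (name, cast)."""
        self._tables[name] = (lines, columns, delimiter)

    def scan(self, name):
        """Parse the raw lines into rows using the registered schema."""
        if name not in self._tables:
            raise LookupError(f"external table {name!r} was never registered")
        lines, columns, delimiter = self._tables[name]
        rows = []
        for line in lines:
            fields = line.split(delimiter)
            rows.append({col: cast(raw) for (col, cast), raw in zip(columns, fields)})
        return rows


catalog = ExternalTables()
catalog.register("orders", RAW_LINES, [("id", int), ("customer", str), ("amount", float)])
big_orders = [row for row in catalog.scan("orders") if row["amount"] > 50]
```

Scanning a name that was never registered fails, and data that does not fit a column list cannot be registered in any useful way, which is the limitation the quote points at.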


  1. According to the Microsoft Gray Systems Lab page on Polybase:

    […] the goal of the Polybase project is to allow SQL Server PDW users to execute queries against data stored in Hadoop, specifically the Hadoop distributed file system (HDFS). Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance.

    And here are my (very) brief thoughts about Polybase when I first learned about it.

Original title and link: Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H (NoSQL database©myNoSQL)

via: http://www.dbms2.com/2013/06/02/sql-hadoop-architectures-compared/


Cloudant's phenomenal response time

James Mundy writing about using Cloudant from his app deployed on Microsoft Azure cloud:

When I began implementing Cloudant’s CouchDB based distributed database as a service (daas) to replace our NoSQL Azure Table solution I had some reservations about the time making calls from our Azure Web Roles to their separate data centre would add to response times.

Turns out that really wasn’t anything to worry about at all.

This is very interesting (even if James’s experiment is not really a benchmark). I assume that the way Cloudant pulls this off is by offering their service only from top-notch, well-connected datacenters. That, on top of making sure the service is correctly tuned.

Original title and link: Cloudant’s phenomenal response time (NoSQL database©myNoSQL)

via: http://mendez.quora.com/Cloudants-phenomenal-response-time?srid=3nu1&share=1


Announcing Open Source, Interactive Search on Hadoop

In a webinar with all the big-name analysts listening, Cloudera announced Cloudera Search:

Cloudera Search brings full-text, interactive search and scalable indexing to your data in Hadoop. Cloudera Search adds to and extends the value of Apache Solr™, the enterprise standard for open source search. With Cloudera’s 100% open source Big Data platform, CDH, Cloudera Search gains the same fault tolerance, scale, visibility, and flexibility provided to other workloads, like MapReduce, Apache Hive™, and Cloudera Impala.

You know who did this first, right? DataStax. And it was over a year ago.

Original title and link: Announcing Open Source, Interactive Search on Hadoop (NoSQL database©myNoSQL)

via: http://app.go.cloudera.com/e/es.aspx?s=1465054361&e=9583&elq=2a81ee10fb714c3c9afc2225da89700c


Bill Gates: Four Areas of Technology I’d look into

Bill Gates in a tweet-based interview:

Q: @fesja: @BillGates if you were 20 years old now, what would you do? which area?

A: Bill Gates: When it comes to technology, there are four areas where I think a lot of exciting things will happen in the coming decades: big data, machine learning, genomics, and ubiquitous computing. So if I were 20 years old today, I’d be looking into one (or maybe more!) of those fields.

To say that Bill Gates always had a great understanding of technology trends would be an understatement.

Original title and link: Bill Gates: Four Areas of Technology I’d look into (NoSQL database©myNoSQL)


Thoughts on Intel's Hadoop distribution

Gwen Shapira has a very interesting theory about what led Intel to create its own distribution of Hadoop:

Intel is doing for Hadoop the same thing it did for C compilers – make sure they use the best hardware enhancements available in the CPUs and other hardware components available from Intel. The nice thing is that the enhancements are available as open source – Intel doesn’t care that the software is free, since they are selling the hardware!

But I don’t believe it.

  1. Intel is already using Hadoop internally. They can test the hell out of Hadoop and create improvements.
  2. As far as I know, Intel already has Hadoop committers who could push these improvements out. Even if Intel didn’t have Hadoop committers, I don’t think anyone from the Hadoop community would veto patches providing better utilization of cluster resources.

On the other hand, I don’t have an alternative theory [1]. But I have a hypothesis related to the list of partners Intel signed when announcing their distribution [2].


  1. I’m not going to state the obvious that Intel wants to make sure Hadoop works best on Intel so they don’t lose any market share. 

  2. Hint: check how many in that list are server/cluster/appliance vendors. 

Original title and link: Thoughts on Intel’s Hadoop distribution (NoSQL database©myNoSQL)

via: http://www.pythian.com/blog/intel-hadoop-distribution/


4 Good Things About CouchDB

Will Conant:

CouchDB has four features that really make it stand out:

  1. It has no read locks.
  2. You can back up a database with cp without shutting it down.
  3. Any record (row, document, whatever) can participate in any index any number of times.
  4. Replication is easy and can be bidirectional.
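Point 3 is easiest to see with a view whose map function emits more than one key per document. The following is a Python simulation of that idea, not CouchDB code (real CouchDB views are typically written as JavaScript map functions); the documents, the `map_tags` function, and the `build_index` helper are assumptions for illustration.

```python
from collections import defaultdict

# Illustrative documents; "_id" mirrors CouchDB's document id field.
DOCS = [
    {"_id": "post-1", "tags": ["couchdb", "replication"]},
    {"_id": "post-2", "tags": ["couchdb"]},
    {"_id": "post-3"},  # no tags: emits nothing, so it is absent from the index
]


def map_tags(doc):
    """Hypothetical map function: emit one (key, value) pair per tag."""
    for tag in doc.get("tags", []):
        yield tag, doc["_id"]


def build_index(docs, map_fn):
    """Run the map function over every document and group values by key."""
    index = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            index[key].append(value)
    return dict(sorted(index.items()))


INDEX = build_index(DOCS, map_tags)
```

Here `post-1` appears under two keys, `post-2` under one, and `post-3` under none, all from the same map function.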

I totally agree with the author. But when using a database, it’s not only about the features that stand out. It’s also about the unique features that fit the project, the missing features, the frequency with which those missing features are addressed. And I could go on for a while.

CouchDB’s bidirectional replication has always been its strongest, differentiating feature. But in my book, users had to fight too much with other parts of the database.

Original title and link: 4 Good Things About CouchDB (NoSQL database©myNoSQL)

via: http://willconant.com/posts/2013-06-02/4-good-things-about-couchdb


PostgreSQL as NoSQL with Data Validation

Szymon Guz writes about JSON support in PostgreSQL:

So, I’ve shown you how you can use PostgreSQL as a simple NoSQL database storing JSON blobs of text. The great advantage over the simple NoSQL databases storing blobs is that you can constrain the blobs, so they are always correct and you shouldn’t have any problems with parsing and getting them from the database.

You can also query the database very easily, with huge speed. The ad-hoc queries are really simple, much simpler than the map-reduce queries which are needed in many NoSQL databases.
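The combination of constrained blobs and ad-hoc queries can be sketched without a PostgreSQL server. The example below swaps in SQLite through Python’s standard `sqlite3` module, so it is not the code from Szymon’s post; it assumes a SQLite build that includes the JSON functions (`json_valid`, `json_extract`), and the `docs` table and the rule "every document needs a name" are made up for illustration.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory table whose JSON blobs must pass a CHECK constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE docs (
            body TEXT NOT NULL CHECK (
                -- CASE ensures json_extract only runs on well-formed JSON.
                CASE WHEN json_valid(body)
                     THEN json_extract(body, '$.name') IS NOT NULL
                     ELSE 0
                END
            )
        )
        """
    )
    return conn


def try_insert(conn, body):
    """Return True if the blob was stored, False if the constraint rejected it."""
    try:
        conn.execute("INSERT INTO docs (body) VALUES (?)", (body,))
        return True
    except sqlite3.IntegrityError:
        return False


def names_older_than(conn, age):
    """Ad-hoc query over the stored blobs, with no map-reduce job involved."""
    rows = conn.execute(
        "SELECT json_extract(body, '$.name') FROM docs "
        "WHERE json_extract(body, '$.age') > ? ORDER BY 1",
        (age,),
    )
    return [name for (name,) in rows]
```

Malformed JSON and documents without a `name` are rejected at insert time, so readers of the table never have to defend against them.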

Since before NoSQL was called NoSQL, I’ve always thought that there’s a market, and more importantly, there are use cases, for using single, unitary platforms for handling data. But there’s also a market, and the corresponding use cases, for using different platforms for handling data. And there are also the federated database systems and the logical data warehouses.

✚ I have this dream about how the databases will look in the future, but I never get around to putting together all the pieces, crossing the t’s and dotting the i’s.

Original title and link: PostgreSQL as NoSQL with Data Validation (NoSQL database©myNoSQL)

via: http://blog.endpoint.com/2013/06/postgresql-as-nosql-with-data-validation.html