ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

Why are you using MySQL?

Mark Callaghan puts out a great explanation of why pitching new databases to large MySQL users will almost always fail:

Leaving out quality of service, a simple definition for scalability is that a given workload requires A people, B hardware units and C lines of automation code. For something to scale better than MySQL it should reduce some of A, B and C. For many web-scale deployments the cost of C has mostly been paid and migrating to something new means a large cost for C. Note that B represents many potential bottlenecks. The value of B might be large to get more IOPs for IO-bound workloads with databases that are much bigger than RAM. It might be large to get more RAM to keep everything cached. Unfortunately, some deployments are not going to fully describe that context (some things are secret). The value of A is influenced by the features in C and the manageability features in the DBMS but most web-scale companies don’t disclose the values of B and A.

I can still see some good reasons why a new database “vendor” should continue to try to get big users to take a look at their product:

  1. it’s the only way to learn:

    1. what’s unique for these users
    2. what’s at the top of their concerns, and
    3. how they address them

    While you won’t be able to make them switch, if your product addresses their issues, it will be prepared for the next big user. 2. there’re still chances for smaller, greenfield, internal projects to start using your product

It’s common sense to build a product using the tools you are already familiar with — it helps reducing the risks and also cutting down the time to market . What happens though is that there are always people that don’t follow this rule starting new products using tools that are they are not that familiar with. There’re also products that grow faster than the team’s know-how evolves. These are just two quick examples where a new product that has learned from big users can help.

Original title and link: Why are you using MySQL? (NoSQL database©myNoSQL)

via: http://smalldatum.blogspot.com/2014/04/why-arent-you-using-x-version-2.html


Scalable Atomic Visibility with RAMP Transactions

We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are.

Still digesting Peter Bailis’s post and the accompanying Scalable Atomic Visibility with RAMP Transactions paper.

Original title and link: Scalable Atomic Visibility with RAMP Transactions (NoSQL database©myNoSQL)

via: http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/


Hydra takes on Hadoop

A good interview on InfoQ comparing Hadoop with AddThis’s open source Hydra:

What use case(s) is Hydra better suited for compared to Hadoop. When would Hadoop be a better choice?

Hydra is better at data exploration. You can follow a number of interesting leads from the results of a single, probably rather fast, map job. Queries on the resultant tree usually take on the order of seconds (or milliseconds).

Non-programmers can produce functioning products with a small amount of guidance. The web UI provides most everything that might be needed; it might be as simple as pressing clone on an existing job, changing the tree to use a couple different features and hitting go. In minutes they have a new URL endpoint to show your impressive new KPI on your company home page.

Hadoop has a few advantages though. It has stronger native support for very large, one-off joins. Technically speaking this just means more implicit sorting of files. Sorting huge numbers of things is expensive so we try pretty hard to avoid it, and as a result first order support for it is a little lacking. On the other hand, you might find that you don’t really need the full, perfect join and are instead content with a Bloom-filter-based probabilistic hybrid — in which case Hydra will once again save you some sweet cycles.

Original title and link: Hydra takes on Hadoop (NoSQL database©myNoSQL)

via: http://www.infoq.com/news/2014/04/hydra


Intel kills a Hadoop and feeds another

I seriously doubt you could have missed the 2nd part of this, but here’s the shortest executive summary:

  1. Intel has killed its own distribution of Hadoop — is there anyone that would disagree this is a good idea?
  2. Intel has invested $740mil in Cloudera (for 18%) — there’s no typo. 740 millions.

The main questions:

  1. where will Cloudera put the $900mil raised in the last round(s)?
  2. why Intel invested so much?

These questions were also asked by Dan Primack for CNN Money and after looking at different angles he comes out empty.

So let’s check other sources:

  1. TechCrunch has initially speculated that much of the investment went to existing shareholders.

    The post was later updated with a comment from Cloudera’s VP of marketing stating that the majority of the money went to the company. But no word on how they’ll be used.

  2. Reuters writes that Intel made the investment to ensure their leading position in server processors:

    Intel hopes that encouraging more companies to leap into Big Data analysis will lead to higher sales of its high- end Xeon server processors. The chipmaker believes that hitching its wagon to Cloudera’s version of Hadoop, instead of pushing its own version, will make that happen faster.

    Still no word on how Cloudera will be using the money.

  3. Derrick Harris for GigaOm writes that the deal makes a lot of sense for both companies1:

    Cloudera needs capital and Intel’s huge sales force to keep up its engineering efforts and grow the company internationally.

    As part of the deal, Cloudera will be an early adopter of Intel gear and will optimize its Hadoop software to run on Intel’s latest technologies. Intel will port some of its work into the Cloudera distribution and will maintain its own Hadoop engineering team that will work alongside Cloudera’s engineers to help unite the two company’s goals.

  4. Jeff Kelly for SiliconAngle emphasizes the same channel advantages:

    Cloudera’s biggest reseller partner is Oracle. Based on my reading of the Intel announcement, the deal is not an official reseller partnership, but Intel will “market and promote CDH and Cloudera Enterprise to its customers as its preferred Hadoop platform.” Not quite as nice as having the Intel salesforce closing deals for it, but Cloudera stands to gain significant new business from the arrangement.


So how about this short list on how this round will be used by Cloudera:

  1. a part goes for international expansion
  2. a larger part goes to early shareholders
  3. the largest part goes into acquisitions

As for Intel, what if this investment also sealed an exclusive deal for Hadoop-centric Cloudera-supported Intel-powered appliance?


  1. Insert snarky comment here about a $740m deal that would not make sense to one of the parties. How about not making sense to any of them? 

Original title and link: Intel kills a Hadoop and feeds another (NoSQL database©myNoSQL)


Scaling the Facebook data warehouse to 300 PB

Fascinating read, raising interesting observations on different levels:

  1. At Facebook, data warehouse means Hadoop and Hive.

    Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

  2. I don’t see how in-memory solutions, like Hana, will see their market expanding.

    In the Enterprise Data Warehouses and the first Hadoop squeeze, Rob Klopp predicted a squeeze of the EDW market under the pressure of in-memory DBMS and Hadoop. I still think that in-memory will become just a custom engine in the Hadoop toolkit and existing EDW products.

    On the always mentioned argument that “not everybody is Facebook”, I think that the part that is hidden under the rug is that today’s size of data is the smallest you’ll ever have.

    In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure.

  3. At Facebook’s scale, balancing availability and costs is again a challenge. But there’s no mention of network attached storage.

    There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS.

  4. For the nits and bolts of effectively optimizing compression, read the rest of the post which covers the optimization Facebook brought to the ORCFile format.

    There seem to be two competing formats at play: ORCFile (with support from Hortonworks and Facebook) and Parquet (with support from Twitter and Cloudera). Unfortunately I don’t have any good comparison of the two. And I couldn’t find one (why?).

Original title and link: Scaling the Facebook data warehouse to 300 PB (NoSQL database©myNoSQL)

via: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/


7 quick facts about R

Based on a slide deck by David Smith:

  1. R is the highest paid IT skill (Dice.com survey, January 2014)
  2. R is the most-used data science language after SQL (O’Reilly survey, January 2014)
  3. R is used by 70% of data miners (Rexer survey, October 2013)
  4. R is #15 of all programming languages (RedMonk language rankings, January 2014)
  5. R is growing faster than any other data science language (KDNuggets survey, August 2013)
  6. R is the #1 Google Search for Advanced Analytics software (Google Trends, March 2014)
  7. R has more than 2 million users worldwide (Oracle estimate, February 2012)

I can see a couple of actionable items based on this list:

  1. if you’re interested in data science, you should consider R
  2. if you are already using R, ask for a raise

Original title and link: 7 quick facts about R (NoSQL database©myNoSQL)


Three opinions about the future of Hadoop and Data Warehouse

Building on the same data coming from Gartner and a talk from Hadoop Summit (exactly the same), Matt Asay1 and Timo Elliott2 place Hadoop on the data warehouse map.

Matt Asay writes in the ReadWrite article that Hadoop is not replacing existing data warehouses, but it’s taking all new projects:

Hadoop (and its kissing cousin, the NoSQL database) isn’t replacing legacy technology so much as it’s usurping its place in modern workloads. This means enterprises will end up supporting both legacy technology and Hadoop/NoSQL to manage both existing and new workloads […]

Of course, given “the effective price of core Hadoop distribution software and support services is nearly zero” at this point, as Jeff Kelly highlights, more and more workloads will gravitate to Hadoop. So while data warehouse vendors aren’t dead—they’re not even gasping for breath—they risk being left behind for modern data workloads if they don’t quickly embrace Hadoop and other 21st Century data infrastructure.

On his blog, Timo Elliott makes sure that there’s some SAP in that future picture and uses their Hadoop partner, Hortonworks to depict it:

No. Ignoring the many advantages of Hadoop would be dumb. But it would be just as dumb to ignore the other revolutionary technology breakthroughs in the DW space. In particular, new in- memory processing opportunities have created a brand-new category that Gartner calls “hybrid transactional/analytic platforms” (HTAP)

hadoopmodernarchitecture_thumb

The future I’d like to see is the one where:

  1. there is an integrated data platform. Note that in this ideal world, integrated does not mean any form of ETL
  2. it supports and runs in isolation different workloads from online transactions and bulk upload to various forms of analytics
  3. data is stored on dedicated mediums (spinning disks, flash, memory) depending on the workloads that touch it
  4. data would move between these storage mediums automatically, but the platform would allow fine tuning for maintaining the SLAs of the different components

  1. Matt Asay is VP of business development and corporate strategy at MongoDB 

  2. Timo Elliott is an Innovation Evangelist for SAP 

Original title and link: Three opinions about the future of Hadoop and Data Warehouse (NoSQL database©myNoSQL)


When is MongoDB the Right Tool for the Job?

This puts me in a quandary, because my recent stint on the job market has shown that just about everybody is using MongoDB, and I’ve just never been in any situation that I have needed to use it.

I also can’t foresee any situation where there is a solid technical reason for choosing MongoDB over it’s competitors either, and the last thing I want to do is lead people astray or foist my preconceptions onto them.

2438326-laughing-hysterically

Then the top comment on reddit.

Original title and link: When is MongoDB the Right Tool for the Job? (NoSQL database©myNoSQL)

via: http://daemon.co.za/2014/04/when-is-mongodb-the-right-tool/


How to choose the best tool for your Big Data project

A decision matrix by Wenming Ye, Microsoft Research senior research program manager:

Wenming Ye

If things would be so easy, we wouldn’t even have a buzzword for Big Data.

Original title and link: How to choose the best tool for your Big Data project (NoSQL database©myNoSQL)

via: http://www.lifehacker.com.au/2014/04/how-to-choose-the-best-tool-for-your-big-data-project/


A genetic census of America

Nice little study and visualization example:

Where did the ancestors of today’s Americans come from? Do Americans in the Midwest hail from similar places of the world as in the Northeast, or as in the South?

A Genetic Census of America

Original title and link: A genetic census of America (NoSQL database©myNoSQL)

via: http://blogs.ancestry.com/ancestry/2014/04/04/a-genetic-census-of-america/


Aerospike has C and Node.js Clients for Mac OS X Available [sponsor]

Aerospike, sponsor of myNoSQL, has a quick announcement for developers using OS X:


In case you missed it, Aerospike C client 3.0.51 and Node.js client for the Mac OS X are now available. These client libraries also run on CentOS 6, RHEL 6, Debian 6+ and Ubuntu 12.04.

Check these out.

Original title and link: Aerospike has C and Node.js Clients for Mac OS X Available [sponsor] (NoSQL database©myNoSQL)


Big data: are we making a big mistake?

In a very entertaining article for FT.com, Tim Harford writes:

Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Unfortunately, these four articles of faith are at best optimistic oversimplifications.

  1. I’m very sure that the first wheel was not a Pirelli.
  2. If I’m yelling that I want a blue unicorn, I’m pretty sure sooner rather than later a bunch of people will try to sell me one.

✚ As you’d expect the Hacker News thread is also highly entertaining:

Another conclusion to draw from this article (which I really enjoyed, by the way) is that Big Data has been turned into one of the most abstract buzzwords ever. You thought “cloud” was bad? “Big Data” is far worse in its specificity.

Original title and link: Big data: are we making a big mistake? (NoSQL database©myNoSQL)

via: http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html