
Skytree Launches a Machine Learning Server

Skytree Server connects to any number of existing data stores, including Hadoop, and, says Hack, is tens of thousands of times faster than existing tools, performing in minutes tasks that would have taken hours or days. As of now, it’s tuned to five specific use cases the company says are the most common — recommendation systems, anomaly/outlier identification, predictive analytics, clustering and market segmentation, and similarity search.

Skytree Server Architecture

There’s a limited but free Skytree version available on demand, so I expect to read some more about it soon.

Original title and link: Skytree Launches a Machine Learning Server (NoSQL database©myNoSQL)

via: http://gigaom.com/cloud/skytree-intros-machine-learning-for-the-masses/


10gen Signs Partnerships to Strengthen MongoDB Hosting

Leaving aside for a second the immediate win for 10gen and the likely benefits for end users, I’m wondering if such partnerships (or the lack of them) could be part of the reason why only some NoSQL databases are present in managed hosting offers.

Here’s how MongoLab is introducing this partnership:

MongoLab provides, as always, primary support for operational issues (e.g. password resets, service plan upgrades, maintenance and monitoring) and usage guidance (e.g. index recommendations, schema design).  Starting now, 10gen provides support escalation for code-level database and driver issues, acting as our backstop to provide patches or effective workarounds to issues that can not be solved by configuration or architecture changes.

From my NoSQL market observer position, it looks like a win-win-win situation.

Original title and link: 10gen Signs Partnerships to Strengthen MongoDB Hosting (NoSQL database©myNoSQL)

via: http://blog.10gen.com/post/18067595934/three-new-cloud-providers-join-the-mongodb-ecosystem


WhySQL: MySQL/InnoDB ACID Guarantees for Evernote

Dave Engberg has published on the Evernote Techblog a post explaining why the Atomicity, Consistency, and Durability characteristics of a single replicated MySQL/InnoDB deployment are essential to the way Evernote operates.

While it’s difficult to argue about a technical decision with so few details available, I still wanted to point out a couple of things:

  1. Atomicity: most NoSQL databases offer atomic operations at the level of a single record. What distributed systems that avoid 2PC do not support is multi-row atomic operations.

    The example presented in the post does not require multi-row transactions, but rather guaranteed client operation ordering. This is achievable in most NoSQL databases.

  2. Consistency: the post talks about data consistency from the perspective of data integrity guarantees through usage of foreign keys.

    In the world of NoSQL similar behavior could be achieved by different data modeling solutions. Using Cassandra as an example for the notebook deletion scenario, one could store all the notes of a notebook in a single Cassandra row, thus making the delete operation safe.

    It’s also worth mentioning that many of the eventually consistent NoSQL databases offer configurable consistency levels for individual read and write operations.

  3. Durability: with just a few known exceptions, most NoSQL databases offer strong durability guarantees.

In conclusion, based only on the few details in the post, one could easily argue that a NoSQL database would fit the bill. But most of the time the reality behind such a decision is quite different, making technical decisions a tad more complicated.

Original title and link: WhySQL: MySQL/InnoDB ACID Guarantees for Evernote (NoSQL database©myNoSQL)

via: http://blog.evernote.com/tech/2012/02/23/whysql/


Generating Numerical Sequences With Redis

Thomas James:

The GUID/UUID data type is great for replacing the numerical ID of a record with something that can stand up to the challenges of distributed data, but they are not very suitable for use by the end-user. In some applications you really do still want to be able to generate a reliable numerical sequence number, such as for an invoice number. […] It may seem like overkill to add an additional component to our software stack just to handle the task of keeping a counter, but Redis fits the mould perfectly in this case. Its also simple to setup and use, not adding much to the solution’s overhead. Especially considering that, like CouchDB, the redis instance is “in the cloud”.

I do feel this is overkill and one could generate the sequence at the application level. And even if there is a proposed feature for adding a 128-Bit K-ordered unique id generator to Redis, for high scale decentralized ID generation one should look into Twitter’s Snowflake or Boundary’s Flake.
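To make the decentralized option more concrete, here is a minimal sketch of a Snowflake-style generator. It follows the bit layout commonly described for Snowflake (41 bits of milliseconds, 10 bits of worker id, 12 bits of per-millisecond sequence), but the epoch, the class name, and the `clock` parameter are my own assumptions for illustration, not code from Twitter or Boundary. Note that such ids are only roughly time-ordered and are not gapless, so they would not replace an invoice counter.

```python
import threading
import time

# Hypothetical custom epoch (2012-01-01 UTC, in milliseconds); any fixed
# point in the past works as long as all workers agree on it.
EPOCH_MS = 1325376000000
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class FlakeIdGenerator:
    """Illustrative 64-bit id generator: timestamp | worker id | sequence."""

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        # The clock is injectable so the behaviour can be checked in tests.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: spin until the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Each worker only needs a unique `worker_id`; no coordination is required at generation time, which is the property a central Redis counter cannot offer.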

Original title and link: Generating Numerical Sequences With Redis (NoSQL database©myNoSQL)

via: http://www.thomasvjames.com/2012/02/relax-unwind-with-a-little-redis/


Gremlin vs Cypher

Romiko Derbynew comparing Gremlin and Neo4j Cypher:

  • Simple graph traversals are much more efficient when using Gremlin
  • Queries in Gremlin are 30-50% faster for simple traversals
  • Cypher is ideal for complex traversals where back tracking is required
  • Cypher is our choice of query language for reporting
  • Gremlin is our choice of query language for simple traversals where projections are not required
  • Cypher has intrinsic table projection model, where Gremlins table projection model relies on AS steps which can be cumbersome when backtracking e.g. Back(), As() and _CopySplit, where cypher is just comma separated matches
  • Cypher is much better suited for outer joins than Gremlin, to achieve similar results in gremlin requires parallel querying with CopySplit, where as in Cypher using the Match clause with optional relationships
  • Gremlin is ideal when you need to retrieve very simple data structures
  • Table projection in gremlin can be very powerful, however outer joins can be very verbose

So in a nutshell, we like to use Cypher when we need tabular data back from Neo4j and is especially useful in outer joins.

Patrick Durusau

Original title and link: Gremlin vs Cypher (NoSQL database©myNoSQL)

via: http://romikoderbynew.com/2012/02/22/gremlin-vs-cypher-initial-thoughts-neo4j/


The Open Data Handbook: The Why, What, and How

The Open Data Handbook is available online under a Creative Commons Attribution license:

This handbook discusses the legal, social and technical aspects of open data. It can be used by anyone but is especially designed for those seeking to open up data. It discusses the why, what and how of open data – why to go open, what open is, and the how to ‘open’ data.

Original title and link: The Open Data Handbook: The Why, What, and How (NoSQL database©myNoSQL)


Data Scientist’s Anthem

Shamir Karkal:

Data Scientist’s anthem - We R Who We R

Andrei Savu

Original title and link: Data Scientist’s Anthem (NoSQL database©myNoSQL)


A Guide to Elastic MapReduce and Hadoop Streaming for Astrophysicists

Arfon Smith [1]:

A couple of months ago I wrote about how the astrophysics community should place more value on those individuals building tools for their community - the informaticians. One example of a tool that I don’t think is particularly well known in many areas of research is the Apache Hadoop software framework.

Hadoop is a great tool but it can be fiddly to configure. With Elastic MapReduce you can focus on the design of your map/reduce workflow rather than figuring out how to get your cluster setup. Next I’m planning on making some small changes to software used by radio astronomers to find astrophysical sources in data cubes of the sky to make it work with Hadoop Streaming - bring it on SKA!

Clearly Hadoop has issues. Meanwhile it helps local communities to plan for snow removal, geophysicists find oil in the oceans, and who knows exactly how many other similar problematic implementations are out there.
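For readers who have not seen Hadoop Streaming, the contract is simple: the mapper reads lines from standard input and writes tab-separated key/value lines, the framework sorts them by key, and the reducer reads the sorted lines. Below is a minimal sketch of that contract. The input format (comma-separated records whose first field is a source name) and the function names are assumptions I made up for illustration; this is not the software mentioned in the quote.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'key<TAB>1' for the first comma-separated field of each record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key = line.split(",")[0]
        yield f"{key}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Run as "map" or "reduce" so the same file can serve both steps.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `job.py`, the two steps can be tried locally with a plain shell pipeline (`map`, then `sort`, then `reduce`) before handing them to a cluster.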

Peter Skomoroch


  1. Arfon Smith is Director of Citizen Science at The Adler Planetarium, where he builds citizen science projects for The Zooniverse

Original title and link: A Guide to Elastic MapReduce and Hadoop Streaming for Astrophysicists (NoSQL database©myNoSQL)

via: http://arfon.org/getting-started-with-elastic-mapreduce-and-hadoop-streaming


Hadoop Has Promise but Also Problems… Show Me the Cheaper or Simpler Alternatives

Jessica E. Vascellaro for WSJ:

But some early adopters of Hadoop now say using the technology is challenging and rolling it out will take time.

[…]

Mr. Boroditsky says Hadoop is “immature” and comes with additional costs of hiring in-house expertise and consultants. “There is a very substantial cost to free software,” he says, declining to comment on dollar figures.

I’m starting to believe that the “Hadoop has problems and is complex” chorus is a vendor reaction very similar to the reaction they had to open source in general. Thus, before joining the group complaining about the complexity, costs, and lack of know-how, ask yourself the following questions:

  1. how many other tools can lead you to the same solution?

    Here are a couple of examples of what people choosing Hadoop had to say:

    • Infolinks using HBase and Hadoop:

      We started exploring the NoSQL solutions more than a year ago. We did some research on the available solutions and chose Hadoop/HBase for few reasons: 1. Java based 2. Open source 3. Hadoop - quite mature compared to other Java based solutions. Hadoop is also used by many web companies. 4. HBase - using Hadoop (so you get for free Hadoop stability, APIs etc.), like BigTable

      We tested this solution for 6 months (as a small cluster) and were very happy with it.

    • Zions Bancorporation after reaching the limits of Data Warehouse technologies:

      The quest for a solution began in 2009 with an investigation of Zion’s existing Microsoft and Oracle technologies, as well as other technologies within the firm and new solutions on the market, Wood relates. After developing a list of six potential vendors, he says, he and his team quickly focused on two Hadoop-based solutions. The team, Wood explains, recognized the potential in Hadoop for “making security decisions proactively rather than reactively, based on mining business intelligence and combining it with event data from security devices.”

  2. based on the list of tools helping you solve the same problem:

    1. how many are cheaper for your scenario?
    2. for how many of them will you find more resources?
    3. how many are operationally simpler?
  3. how many of these tools evolve as fast as Hadoop and its ecosystem?

  4. how many of them allow you to go beyond the initial scenario and start addressing other questions?

    Here is what people say about what happens after adopting Hadoop.

It would be great if Hadoop administration got simpler, operational costs went down, and know-how became easier to find. Rest assured that all of these will happen. And if, for the time being, these are problems you cannot overcome, tell me about the alternatives.

Original title and link: Hadoop Has Promise but Also Problems… Show Me the Cheaper or Simpler Alternatives (NoSQL database©myNoSQL)


A Tour of Amazon DynamoDB Features and API

Mathias Meyer’s walk through the DynamoDB features and API with commentary:

Sorted range keys, conditional updates, atomic counters, structured data and multi-valued data types, fetching and updating single attributes, strong consistency, and no explicit way to handle and resolve conflicts other than conditions. A lot of features DynamoDB has to offer remind me of everything that’s great about wide column stores like Cassandra, but even more so of HBase. This is great in my opinion, as Dynamo would probably not be well-suited for a customer-facing system. And indeed, Werner Vogel’s post on DynamoDB seems to suggest DynamoDB is a bastard child of Dynamo and SimpleDB, though with lots of sugar sprinkled on top.

Think of it as an extended, better articulated and closer to the API version of my notes about Amazon DynamoDB.
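Since conditions are the only conflict handling mechanism mentioned above, here is a small in-memory sketch of what conditional updates and atomic counters mean semantically. `TinyTable`, `put_if`, and `add` are hypothetical names of my own; this is a stand-in to show the idea, not the DynamoDB API.

```python
import threading


class ConditionFailed(Exception):
    """Raised when a conditional write finds unexpected current values."""


class TinyTable:
    """In-memory stand-in for a key-value table (illustration only)."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            item = self._items.get(key)
            return dict(item) if item is not None else None

    def put_if(self, key, item, expected):
        """Store item only if every attribute in `expected` currently matches.

        expected=None means the key must not exist yet.
        """
        with self._lock:
            current = self._items.get(key)
            if expected is None:
                ok = current is None
            else:
                ok = current is not None and all(
                    current.get(attr) == value for attr, value in expected.items()
                )
            if not ok:
                raise ConditionFailed(key)
            self._items[key] = dict(item)

    def add(self, key, attribute, delta):
        """Atomic counter: the increment happens in one step on the 'server'."""
        with self._lock:
            item = self._items.setdefault(key, {})
            item[attribute] = item.get(attribute, 0) + delta
            return item[attribute]
```

A writer that read version 1 and sends `expected={"version": 1}` fails if somebody else has already written version 2, and it is then up to the client to re-read and retry.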

Original title and link: A Tour of Amazon DynamoDB Features and API (NoSQL database©myNoSQL)

via: http://www.paperplanes.de/2012/1/30/a-tour-of-amazons-dynamodb.html


More Details About the Teradata and Hortonworks Partnership

Some more interesting bits about the Teradata and Hortonworks partnership in Timothy Prickett Morgan’s “Teradata grabs Hortonworks by trunk” on The Register:

The Cloudera deal from September 2010 provided a pipe from a Hadoop cluster into the Teradata data warehouses, while the Hortonworks partnership announced today is providing a pipe between Hadoop and Aster Data appliances.

Hortonworks and Teradata will do joint marketing and development, and are exploring ways to better integrate their respective software. This will specifically be done on Data Platform 1.0 from Hortonworks and Aster Database 5.0 from Teradata. Future engineering work could include running the HortonWorks and Aster Data programs on the same physical clusters, side-by-side, although this is not the way customers tend to do it today, according to Argyros.

Original title and link: More Details About the Teradata and Hortonworks Partnership (NoSQL database©myNoSQL)


Automating Cassandra Operations and Management With Netflix's Priam Tool

Priam is a new open source tool from Netflix (back in November, Netflix released Curator, a ZooKeeper library), used to simplify and automate the operations and management of a Cassandra cluster:

Priam is a co-process that runs alongside Cassandra on every node to provide the following functionality:

  • Backup and recovery
    • snapshot and incremental backups
    • compression and multipart off-site uploading
    • data recovery and data testing
  • Bootstrapping and automated token assignment

    Priam automates the assignment of tokens to Cassandra nodes as they are added, removed or replaced in the ring. Priam relies on centralized external storage (SimpleDB/Cassandra) for storing token and membership information, which is used to bootstrap nodes into the cluster. It allows us to automate replacing nodes without any manual intervention, since we assume failure of nodes, and create failures using Chaos Monkey. The external Priam storage also provides us valuable information for the backup and recovery process.

  • Centralized configuration management: All our clusters are centrally configured via properties stored in SimpleDB, which includes setup of critical JVM settings and Cassandra YAML properties.

  • RESTful monitoring and metrics: provides hooks that support external monitoring and automation scripts. They provide the ability to backup, restore a set of nodes manually and provide insights into Cassandra’s ring information. They also expose key Cassandra JMX commands such as repair and refresh.
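To give an idea of what automated token assignment involves, here is a minimal sketch that spaces tokens evenly over the RandomPartitioner token space (0 to 2^127) and uses a plain dictionary as a stand-in for the external membership storage. This is my own simplification and not Priam's actual algorithm; the function names and the slot-based bookkeeping are assumptions.

```python
RING_SIZE = 2 ** 127  # token space of Cassandra's RandomPartitioner


def evenly_spaced_tokens(node_count):
    """Return one token per node, spread evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be positive")
    return [i * RING_SIZE // node_count for i in range(node_count)]


def assign_token(membership, node_id, cluster_size):
    """Give node_id its token, claiming the lowest free slot if needed.

    `membership` maps slot number -> node id and plays the role of the
    centralized external storage (a dict here, purely for illustration).
    """
    for slot, owner in membership.items():
        if owner == node_id:
            return slot * RING_SIZE // cluster_size
    for slot in range(cluster_size):
        if slot not in membership:
            membership[slot] = node_id
            return slot * RING_SIZE // cluster_size
    raise RuntimeError("no free slot in the ring")
```

When a node dies and its slot is released, the replacement picks up the same slot and therefore the same token, which is what makes replacing nodes possible without manual intervention.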

Original title and link: Automating Cassandra Operations and Management With Netflix’s Priam Tool (NoSQL database©myNoSQL)

via: http://techblog.netflix.com/2012/02/announcing-priam.html