
cache: All content tagged as cache in NoSQL databases and polyglot persistence

What Is BigCache: Off-Heap Caching Solution for the JVM

BigCache:

BigCache addresses this problem by persisting the cached data in memory within the same JVM process, but outside the JVM heap. This prevents the Garbage Collector from interacting with the cache’s memory zone, allowing the JVM heap size to be scaled based on processing needs only. While this solution is slightly slower than in-heap data access, it is faster than disk or network data transfers. These aspects make BigCache a solution that not only delivers performance, but can also scale up to tens or hundreds of gigabytes of RAM on the same machine.


This sounds like an open source (Apache licensed) alternative to Terracotta’s BigMemory.

Sid Anand

Original title and link: What Is BigCache: Off-Heap Caching Solution for the JVM (NoSQL database©myNoSQL)


NoSQL Everywhere? Not So Fast

So how can big companies get in on the action? Let’s contrast the nature of data suited for NoSQL with the properties of enterprise data that requires the single-source-of-truth systems that we talked about. We’ll use three V’s: volume, velocity, and variety.

Just in case you want to read an InformationWeek post with no start, no end, and no logic, but (ab)using all the necessary buzzwords.

Original title and link: NoSQL Everywhere? Not So Fast (NoSQL database©myNoSQL)

via: http://www.informationweek.com/news/software/info_management/232901328?printer_friendly=this-page


Enterprise Caches Versus Data Grids Versus NoSQL Databases

Red Hat/JBoss's Manik Surtani:

[…] If you want to compare distributed systems, both data grids and NoSQL have kind of come from different starting points, if you will. They solve different problems, but where they stand today they’ve kind of converged. Data grids have been primarily in-memory but now they spill off onto disk and so on and so forth and they’ve added in-query and mapreduce onto it while NoSQL have primarily been on disk, but now cache stuff in-memory anyway for performance. They are starting to look the same now, or are very similar.

One big difference though that I see between data grids and NoSQL, something that still exists today, is how you actually interact with these systems. Data grids tend to be in VM, they tend to be embedded, you tend to launch a Java or JVM program, you tend to connect to a data grid API and you work with it whereas NoSQL tends to be a little bit more client server, a bit more like old-fashion databases where you open a socket to your NoSQL database or your NoSQL grid, if you will, and start talking to it. That’s the biggest difference I see today, but even that will eventually go away.

They seem to converge, but:

  • spilling off to disk is not equivalent to optimized disk access
  • distributed, sometimes even transactional, caches are not equivalent to single-node caches

Original title and link: Enterprise Caches Versus Data Grids Versus NoSQL Databases (NoSQL database©myNoSQL)

via: http://www.infoq.com/interviews/JSR347-Manik-Surtani


Distributed Caches, NoSQL Databases, and RDBMS

Greg Luck[1], following up on his article Ehcache: Distributed Cache or NoSQL Store?, talks about the architectural differences between distributed caches, NoSQL databases, and RDBMS, and where distributed caches fit:

NoSQL and RDBMS are generally on disk. Disks are mechanical devices and exhibit large latencies due to seek time as the head moves to the right track and read or write times dependent on the RPM of the disk platter. NoSQL tends to optimise disk use, for example, by only appending to logs with the disk head in place and occasionally flushing to disk. By contrast, caches are principally in memory. […] With RDBMS a cache is added to avoid these scale out difficulties. For NoSQL, scale out is built-in, so the cache will get used when lower latencies are required.


  1. Greg Luck: Founder and CTO, Ehcache  

Original title and link: Distributed Caches, NoSQL Databases, and RDBMS (NoSQL database©myNoSQL)

via: http://www.infoq.com/news/2011/11/distributed-cache-nosql-data-sto


MongoDB, memcached, EHCache: Compared as Distributed L2 Caches

As can be seen, whether the off-host process that manages the cache-data is MongoD or MemcacheD or Terracotta-Server, architecturally they all look equivalent - i.e. a pure-L2 with no-L1 - so that all data needs to be retrieved from over the network and then massaged into a POJO for consumption by the application.


When speaking about caching systems, I’d also include criteria like:

  • warm up strategy
  • locking strategy
  • single-machine memory spill strategy

Original title and link: MongoDB, memcached, EHCache: Compared as Distributed L2 Caches (NoSQL databases © myNoSQL)

via: http://javamuse.blogspot.com/2011/03/nosql-document-based-or-distributed.html


Ehcache: Distributed Cache or NoSQL Store?

This is a guest post by Greg Luck, Founder and CTO, Ehcache.

Is Ehcache a NoSQL store? No, I would not characterise it as that, but I have seen it used for some NoSQL use cases. In these situations it compared very well — with higher performance and more flexible consistency than the well-known NoSQL stores. Let me explain.

Ehcache is the de facto open source cache for Java. It is used to boost performance, offload databases and simplify scalability. Backed by the Terracotta Server Array, Ehcache becomes a linearly scalable distributed cache. It is a schema-less, key-value, Java-based distributed cache. It provides flexible consistency control, data durability, and with the release of Ehcache 2.4, search by key, value and attribute indexes.

Flexible Consistency

From the very first integration of Ehcache and Terracotta, we enabled coherent data sharing across a cluster. In recognition of the CAP theorem and the limitations of having a single hard-coded trade-off, we created a rich consistency model configurable on a cache-by-cache basis. And to ease understanding, we adopted the standard client consistency model to describe and configure it.

We offer a much richer consistency model than NoSQL solutions. Across a cluster we offer on a per cache basis:

  • Pessimistically locked strong consistency (the default)
  • Unlocked Weak Consistency, with read-your-writes, monotonic reads and monotonic writes
  • Optimistically locked Compare and Swap (“CAS”) atomic operations across the cluster
  • XA and Local transactions
  • An explicit locking API that allows custom consistency
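To make the optimistic compare-and-swap option in the list above concrete, here is a minimal sketch of the retry loop such an operation implies. It is an illustration only: a plain `ConcurrentHashMap` stands in for the clustered cache, and the class name `CasCounterSketch` and the `increment` method are hypothetical, not part of Ehcache.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CasCounterSketch {
    // Assumption: a local ConcurrentMap stands in for a clustered cache.
    // Its replace(key, expected, updated) call has compare-and-swap semantics.
    private final ConcurrentMap<String, Integer> cache = new ConcurrentHashMap<>();

    /** Atomically increments the counter stored under key, retrying on contention. */
    public int increment(String key) {
        while (true) {
            Integer current = cache.get(key);
            if (current == null) {
                // No value yet: putIfAbsent succeeds only if no other writer got there first.
                if (cache.putIfAbsent(key, 1) == null) {
                    return 1;
                }
            } else {
                int next = current + 1;
                // The swap succeeds only if the stored value still equals `current`.
                if (cache.replace(key, current, next)) {
                    return next;
                }
            }
            // Another writer won the race: re-read and try again.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounterSketch sketch = new CasCounterSketch();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    sketch.increment("hits");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        // 4 workers x 1000 increments each, with no lock held at any point.
        System.out.println(sketch.cache.get("hits"));
    }
}
```

No writer ever blocks another; a writer that loses the race simply retries. This suits workloads where conflicts are rare, while pessimistic locking is the safer choice when they are frequent.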

Performance

The Ehcache architecture is very different from NoSQL architectures. With Ehcache, each application server JVM has a resident hot set of the cache determined using an LRU algorithm. The size of that hot set can be 4-6 GB for heap storage, and with BigMemory, can be hundreds of GBs. This is the Level 1 (“L1”) cache and is entirely in-process. Access from the L1 is less than 1 μs.

The entire cache is always stored in the Terracotta Server Array. This is the Level 2 (“L2”) cache. Access from the L2 is less than 2 ms.
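The L1/L2 arrangement described above can be sketched in a few lines. This is a toy model under stated assumptions, not Ehcache's implementation: the L2 is a plain in-process `HashMap` standing in for the Terracotta Server Array, the L1 is an access-ordered `LinkedHashMap` used as an LRU hot set, and `TwoTierCacheSketch` and its members are hypothetical names.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class TwoTierCacheSketch {
    // Assumption: a local map stands in for the remote L2 that holds the entire cache.
    private final Map<String, String> l2 = new HashMap<>();
    // In-process hot set, bounded and evicted in least-recently-used order.
    private final Map<String, String> l1;
    int l1Hits = 0;
    int l2Hits = 0;

    public TwoTierCacheSketch(final int l1Capacity) {
        // accessOrder=true keeps the least recently used entry first in iteration order.
        this.l1 = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > l1Capacity;
            }
        };
    }

    public void put(String key, String value) {
        l2.put(key, value); // the L2 always receives every entry
        l1.put(key, value); // the L1 keeps it only while it stays hot
    }

    public String get(String key) {
        String value = l1.get(key);
        if (value != null) {
            l1Hits++;
            return value;
        }
        value = l2.get(key);
        if (value != null) {
            l2Hits++;
            l1.put(key, value); // promote to the hot set, possibly evicting the LRU entry
        }
        return value;
    }

    public static void main(String[] args) {
        TwoTierCacheSketch cache = new TwoTierCacheSketch(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3"); // L1 is full, so "a" is evicted from L1 but stays in L2
        cache.get("c");      // served from L1
        cache.get("a");      // L1 miss, served from L2 and promoted
        System.out.println("L1 hits: " + cache.l1Hits + ", L2 hits: " + cache.l2Hits);
    }
}
```

Running `main` prints `L1 hits: 1, L2 hits: 1`. In a real deployment the L2 lookup would be a network call, which is where the difference between microsecond and millisecond access comes from.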

The mileage you get from this architecture depends on the nature of your usage profile. The most common one, the Pareto distribution, will read from a correctly-sized L1 80% of the time, and from the L2 20% of the time. So the average latency for the most common case is less than 0.401 ms (0.8 × 1 μs + 0.2 × 2 ms). By comparison, the Yahoo! Cloud Serving Benchmark shows the average latencies of HBase and Cassandra ranging from 8 to 18 ms.

That makes Ehcache an order of magnitude faster than NoSQL.
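The arithmetic behind that figure is a weighted mean of the two tier latencies. The sketch below reproduces it using only the numbers quoted above (an 80/20 split, a 1 μs upper bound for the L1 and 2 ms for the L2); the class and method names are made up for illustration.

```java
import java.util.Locale;

public class AverageLatencySketch {
    /** Weighted mean of the two tier latencies, all values in milliseconds. */
    static double averageLatencyMs(double l1HitRatio, double l1Ms, double l2Ms) {
        return l1HitRatio * l1Ms + (1.0 - l1HitRatio) * l2Ms;
    }

    public static void main(String[] args) {
        double l1Ms = 0.001; // 1 microsecond, the quoted L1 upper bound
        double l2Ms = 2.0;   // 2 milliseconds, the quoted L2 upper bound
        double average = averageLatencyMs(0.80, l1Ms, l2Ms);
        System.out.println(String.format(Locale.ROOT, "average latency: %.4f ms", average));

        // Ratio against the 8 to 18 ms range quoted from the benchmark.
        System.out.println(String.format(Locale.ROOT, "speedup: %.1fx to %.1fx",
                8.0 / average, 18.0 / average));
    }
}
```

The weighted mean comes to about 0.4008 ms, so the quoted 8 to 18 ms range is roughly 20 to 45 times slower. Note that the result is dominated by the L2 term: the L1 hit ratio matters far more than the L1 latency itself.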

(d)urability

Caches can be set to be persistent, which gives durability and restartability. A write ahead log is used, and the cache will recover to a consistent state. When configured as a distributed cache, Ehcache uses the Terracotta Server Array as a Level 2 cache. Terracotta servers are usually deployed in pairs for each partition, giving HA. In addition, backups can be taken from Terracotta using our JMX tools, or directly from the underlying storage mechanism. Backups and recoveries can be done live. RDBMSs typically offer a very extensive tool set for archiving, ETL and so on. For that reason I think of Ehcache as being durable with a small “d”. Having said that, perhaps this also applies to many NoSQL implementations.

Big Data

The definition of Big Data is a moving target. Today it is generally understood to start at a few dozen terabytes and go up into petabytes.

By this definition we do not do Big Data, but we are close. We have seen users take Ehcache up to about 2 TB with the current implementation. And if the past is anything to go by, each major release of Ehcache supports data volumes that have increased by an order of magnitude.

Another way to look at Big Data is that it is defined as data that is too large or unwieldy to process using RDBMSs. Because the mean latency from the cache is much lower for data retrieval, Ehcache works well for applications requiring rapid response, and thus serves big data hot sets well.

Finally, with BigMemory we can create very high densities. A Terracotta server can hold 250 GB or more in memory. NoSQL solutions use a mix of memory and disk. Java-based ones like Cassandra are subject to garbage collection issues, so they are limited to running with very small heaps. BigMemory stores cache data off-heap but within the Java process using NIO's DirectByteBuffer. The end result is that you can get the same storage using a much smaller server deployment.
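For readers unfamiliar with off-heap storage, the following sketch shows the underlying JDK mechanism: `ByteBuffer.allocateDirect` returns a direct buffer whose contents live outside the garbage-collected heap. It is not BigMemory's code; the class `OffHeapSlotSketch` and its length-prefixed layout are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OffHeapSlotSketch {
    private final ByteBuffer offHeap;

    public OffHeapSlotSketch(int capacityBytes) {
        // A direct buffer's contents are allocated outside the Java heap,
        // so the garbage collector does not scan or move the stored bytes.
        this.offHeap = ByteBuffer.allocateDirect(capacityBytes);
    }

    /** Writes a length-prefixed UTF-8 value at offset; returns the next free offset. */
    public int write(int offset, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        offHeap.putInt(offset, bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            offHeap.put(offset + Integer.BYTES + i, bytes[i]);
        }
        return offset + Integer.BYTES + bytes.length;
    }

    /** Reads the length-prefixed value stored at offset back onto the heap. */
    public String read(int offset) {
        int length = offHeap.getInt(offset);
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = offHeap.get(offset + Integer.BYTES + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public boolean isDirect() {
        return offHeap.isDirect();
    }

    public static void main(String[] args) {
        OffHeapSlotSketch store = new OffHeapSlotSketch(1024);
        int second = store.write(0, "first value");
        store.write(second, "second value");
        System.out.println(store.isDirect());
        System.out.println(store.read(0));
        System.out.println(store.read(second));
    }
}
```

The cost is visible in `read` and `write`: every access serializes to or from bytes, which is why off-heap access is somewhat slower than holding live objects on the heap, yet still far cheaper than disk or network I/O.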

Enter Distributed Caching

So if Ehcache is not NoSQL, what is it? The answer is that Ehcache is a distributed cache. Like its NoSQL cousins, it is often used when the database cannot cope. But more than just data can be cached. For example, caching a web page or an expensive CPU computation are also common use cases.

The focus is on fast access to these cached results, not persistence. This fast access is expressed architecturally in Java as in-process caching, and over the network as in-memory cache stores.

Helping to draw the distinction, both Gartner and Forrester have in the last year created their respective definitions of Distributed Caching and Elastic Caching. According to Gartner, “Distributed caching platforms enable users to manage very large in-memory data stores to enable DBMS workload offloading, cloud and cloud transaction processing, extreme transaction processing, complex-event processing, and high-performance computing”. They also added distributed caches to their application Platform as a Service (“aPaaS”) reference architecture.

This makes sense. We see distributed caches being used alongside both RDBMSs and NoSQL.

But once the cache gets distributed and takes on enterprise features like search, the capabilities expand and overlap with NoSQL. Yet the difference in emphasis remains.

Use Cases

So what are some use cases where you might want to consider a distributed cache? In general, any use case where a ‘key-value plus search’-type NoSQL solution fits, as long as data volumes are less than 2 TB and the durability toolset is acceptable.

Specifically we see the following use cases:

  • General Purpose Caching

    e.g. Hibernate Caching. JDBC caching. Web caching. Collection caching. We do this one pretty thoroughly.

  • In conjunction with an in-house or third-party analytics engine, very fast lookup of analytics results.

    e.g. A credit card company needs to score real-time credit card transactions. There are hundreds of millions per day. Results of an in-house fraud model with transactions up to the close of business the previous day are loaded into the cache. The cache is further adjusted during the current business day for actual usage and can return fraud scores on billions of credit card numbers in a fraction of a second.

  • System of Record (“SOR”) for short to medium term business processes

    e.g. A phone company does mobile phone contract processing and provisioning with rapidly changing plan details. They create a value object for each plan and persist that to the cache, avoiding database schema changes thus adding business agility. The final result of the provisioning is recorded in the database after approximately two weeks. The value in the cache is akin to the document in a Document Store.

  • In-memory dataset search, including applications where all data can be held in memory as well as those where a partial (or hot) dataset is in memory

    e.g. A logistics company needs to look up consignment nodes by id, sender name, addressee name and date range. They generate 400 GB per fortnight and 98% of searches are within two weeks. The database is overloaded handling the volume of lookups. By storing the most used data in the cache a 98% database offload can be achieved.

Conclusion

Distributed Caches and NoSQL Stores have been born out of the same need to supplement the RDBMS in powering web scale architectures. Both are cognizant of the CAP theorem and the impossibility of taking along the old certainties into a large scale distributed world.

While NoSQL is aimed at replacing the durability feature of RDBMS, caches are aimed at low latency and speed. Moreover, the cache is also seeking to avoid forcing the application to go out to any store, whether it be RDBMS or NoSQL.

I see distributed caches and NoSQL as being two useful and complementary technologies that can supplement the RDBMS for what it cannot do, but also provide their own unique new features.

Original title and link: Ehcache: Distributed Cache or NoSQL Store? (NoSQL databases © myNoSQL)


Drupal 7 and MongoDB as Pluggable Storage or Cache

NoSQL adoption.

“New Drupal 7 features such as the field API, pluggable storage and cache, enabled us to use Mongo as a “NoSQL” solution for high performance and scalability, while the new unit testing framework ensures a very stable core, even over several large core-merge efforts throughout the project. Examiner.com, with its high traffic volume and instant publishing capabilities, would have been very difficult or impossible to implement on earlier versions of Drupal.” said Jim Davidson, President, Examiner.com.

Original title and link: Drupal 7 and MongoDB as Pluggable Storage or Cache (NoSQL databases © myNoSQL)

via: http://drupal.org/node/1015646


Caching and Replication: The Differences

As always a fantastic read from Jeff Darcy:

A replica is supposed to be complete and authoritative. A cache can be incomplete and/or non-authoritative.

I’m using “suppose” here in the almost old-fashioned sense of assume or believe, not demand or require. The assumption or belief might not actually reflect reality. A cache might in fact be complete, while a replica might be incomplete – and probably will be, when factors such as propagation delays and conflict resolution are considered. The important thing is how these two contrary suppositions guide behavior when a client requests data. This distinction is most important in the negative case: if you can’t find a datum in a replica then you proceed as if it doesn’t exist anywhere, but if you can’t find a datum in a cache then you look somewhere else. Here are several other possible distinctions that I think do not work as well.
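The negative case described above, where a miss means something different for a replica than for a cache, can be shown in a short sketch. Plain maps stand in for the replica, the cache and the authoritative store, and all names (`MissSemanticsSketch`, `readFromReplica`, `readThroughCache`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class MissSemanticsSketch {
    /** A replica is supposed to be complete: a miss is reported as "does not exist". */
    static Optional<String> readFromReplica(Map<String, String> replica, String key) {
        return Optional.ofNullable(replica.get(key));
    }

    /** A cache is supposed to be incomplete: a miss falls through to the store. */
    static Optional<String> readThroughCache(Map<String, String> cache,
                                             Map<String, String> store,
                                             String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        String fromStore = store.get(key);
        if (fromStore != null) {
            cache.put(key, fromStore); // remember it for the next lookup
        }
        return Optional.ofNullable(fromStore);
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("user:1", "alice");

        Map<String, String> laggingReplica = new HashMap<>(); // has not received user:1 yet
        Map<String, String> coldCache = new HashMap<>();      // empty, nothing loaded yet

        // Same missing entry, two different conclusions.
        System.out.println(readFromReplica(laggingReplica, "user:1").isPresent());
        System.out.println(readThroughCache(coldCache, store, "user:1").isPresent());
    }
}
```

The first lookup reports `false` and the second `true`, even though both local copies are equally empty. The difference lies entirely in what the reader supposes about the copy, which is the point of the quoted passage.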

Original title and link: Caching and Replication: The Differences (NoSQL databases © myNoSQL)

via: http://pl.atyp.us/wordpress/?p=3149


Questions about Caching and Persistence

Something to think about:

  • if you are using some caching in your application, would you call that the persistence layer?
  • if you are using a distributed cache, would you call that your persistence layer?
  • if you are using a replicated and distributed cache, would you call that your persistence?
  • if your replicated and distributed cache does some sort of snapshotting to disk, would you call that your persistence?

Some are saying RAM is the new disk, so I’m wondering what their answers to the above questions are.

Original title and link: Questions about Caching and Persistence (NoSQL databases © myNoSQL)


Grails/GORM for Redis Interview

Graeme Rocher (@graemerocher), Grails project lead, on Grails support for Redis:

Redis is a key/value store which supports extremely fast read operations so it’s useful in a number of situations where caching solutions have been used, however since Redis supports complex data types like sets, lists and hashes you can also do some more advanced querying compared to other key/value stores. This makes it appropriate for a range of scenarios from sorting data, calculating statistics, queueing jobs or just basic caching. As an example Redis’ set type allows you to store a unique set of values and easily return random entries or pop (remove) entries from a set. Implementing this type of functionality in a performant way on a relational database is typically much harder.

And unlike a pure caching solution, Redis will actually persist your data.

Original title and link for this post: Grails/GORM for Redis Interview (published on the NoSQL blog: myNoSQL)

via: http://jaxenter.com/gorm-for-redis-interview-31813.html