NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



terracotta: All content tagged as terracotta in NoSQL databases and polyglot persistence

Hadoop + Terracotta BigMemory: Run, Elephant, Run!

While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the > BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.

Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.

✚ The top speed of an african bush elephant is 24.9mph/40kmh. According to this.

Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (NoSQL database©myNoSQL)


Main Features of In-Memory Data Grids

Good article about In-Memory Data Grids on Cubrid’s blog by Ki Sun Song.

The features of IMDG can be summarized as follows:

  • Data is distributed and stored in multiple servers.
  • Each server operates in the active mode.
  • A data model is usually object-oriented (serialized) and non-relational.
  • According to the necessity, you often need to add or reduce servers.

Even if you don’t read it all, but plan to use an IMDG solution, the first two questions you want to ask your vendor are: what the approach you are proposing to deal with the limited memory capacity and what’s the strategy for reliability. You’ll get good answers from well established products, but these answers are not necessarily the ones that provide the exact requirements your solution will need.

Original title and link: Main Features of In-Memory Data Grids (NoSQL database©myNoSQL)


Dealing With JVM Limitations in Apache Cassandra

A couple of most notable NoSQL databases targeting large scalable systems are written in Java: Cassandra, HBase, BigCouch. Then there’s also Hadoop. Plus a series of caching and data grid solutions like Terracotta, Gigaspaces. They are all facing the same challenge: tuning the JVM garbage collector for predictable latency and throughput.

Jonathan Ellis’s slides presented at Fosdem 2012 are covering some of the problems with GC and the way Cassandra tackles them. While this is one of those presentations where the slides are not enough to understand the full picture, going through them will still give you a couple of good hints.

For those saying that Java and the JVM are not the platform for writing large concurrent systems, here’s the quote Ellis is finishing his slides with:

Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free.

Enjoy the slides after the break.

Distributed Caches, NoSQL Databases, and RDBMS

Greg Luck[1] following up on his article Ehcache: Distributed Cache or NoSQL Store? talks about architectural differences between distributed caches, NoSQL database, and RDBMS and where distributed caches fit:

NoSQL and RDBMS are generally on disk. Disks are mechanical devices and exhibit large latencies due to seek time as the head moves to the right track and read or write times dependent on the RPM of the disk platter. NoSQL tends to optimise disk use, for example, by only appending to logs with the disk head in place and occasionally flushing to disk. By contrast, caches are principally in memory. […] With RDBMS a cache is added to avoid these scale out difficulties. For NoSQL, scale out is built-in, so the cache will get used when lower latencies are required.

  1. Greg Luck: Founder and CTO, Ehcache  

Original title and link: Distributed Caches, NoSQL Databases, and RDBMS (NoSQL database©myNoSQL)


Ehcache: Distributed Cache or NoSQL Store?

This is a guest post by Greg Luck, Founder and CTO, Ehcache .

Is Ehcache a NoSQL store? No, I would not characterise it as that, but I have seen it used for some NoSQL use cases. In these situations it compared very well — with higher performance and more flexible consistency than the well-known NoSQL stores. Let me explain.

Ehcache is the de facto open source cache for Java. It is used to boost performance, offload databases and simplify scalability. Backed by the Terracotta Server Array, Ehcache becomes a linearly scalable distributed cache. It is a schema-less, key-value, Java-based distributed cache. It provides flexible consistency control, data durability, and with the release of Ehcache 2.4, search by key, value and attribute indexes.

Flexible Consistency

From the very first integration of Ehcache and Terracotta, we enabled coherent data sharing across a cluster. In recognition of the CAP theorem and the limitations of having a single hard-coded trade-off, we created a rich consistency model configurable on a cache-by-cache basis. And to ease understanding, we adopted the standard client consistency model to describe and configure it.

We offer a much richer consistency model than NoSQL solutions. Across a cluster we offer on a per cache basis:

  • Pessimistically locked strong consistency (the default)
  • Unlocked Weak Consistency, with read-your-writes, monotonic reads and monotonic writes
  • Optimistically locked Compare and Swap (“CAS”) atomic operations across the cluster
  • XA and Local transactions
  • An explicit locking API that allows custom consistency


The Ehcache architecture is very different from NoSQL architectures. With Ehcache, each application server JVM has a resident hot set of the cache determined using an LRU algorithm. The size of that hot set can be 4-6 GB for heap storage, and with BigMemory, can be hundreds of GBs. This is the Level 1 (“L1”) cache and is entirely in-process. Access from the L1 is less than 1 μs.

The entire cache is always stored in the Terracotta Server Array. This is the Level 2 (“L2”) cache. Access from the L2 is less than 2 ms.

The mileage you get from this architecture depends on the nature of your usage profile. The most common one, the Pareto distribution, will read from a correctly-sized L1 80% of the time, and from the L2 20% of the time. So the average latency for the most common case is less than .401 ms. By comparison, the Yahoo! Cloud Serving Benchmark shows the average latencies of HBase and Cassandra ranging from 8 to 18 ms.

That makes Ehcache an order of magnitude faster than NoSQL.


Caches can be set to be persistent, which gives durability and restartability. A write ahead log is used, and the cache will recover to a consistent state. When configured as a distributed cache, Ehcache uses the Terracotta Server Array as a Level 2 cache. Terracotta servers are usually deployed in pairs for each partition, giving HA. In addition, backups can be taken from Terracotta using our JMX tools, or directly from the underlying storage mechanism. Backups and recoveries can be done live. RDBMSs typically offer a very extensive tool set for archiving, ETL and so on. For that reason I think of Ehcache as being durable with a small “d”. Having said that, perhaps this also applies to many NoSQL implementations.

Big Data

The definition of Big Data is a moving target. Today it is generally understood to start at a few dozen terabytes and go up into petabytes.

By this definition we do not do Big Data, but we are close. We have seen users take Ehcache up to about 2 TB with the current implementation. And if the past is anything to go by, each major release of Ehcache supports data volumes that have increased by an order or magnitude.

Another way to look at Big Data is that it is defined as data that is too large or unwieldy to process using RDBMSs. Because the mean latency from the cache is much lower for data retrieval, Ehcache works well for applications requiring rapid response, and thus serves big data hot sets well.

Finally, with BigMemory we can create very high densities. A Terracotta server can hold 250GB or more in memory per server. NoSQL solutions use a mix of memory and disk. Java based ones like Cassandra are subject to garbage collection issues, so are limited to running with very small heaps. BigMemory stores cache data off-heap but within the Java process using NIO’s DirectByteBuffer. The end result is that you can get the same storage using a much smaller server deployment.

Enter Distributed Caching

So if Ehcahce is not NoSQL, what is it? The answer is that Ehcache is a distributed cache. Like its NoSQL cousins, it is often used when the database cannot cope. But more than just data can be cached. For example, caching a web page or an expensive CPU computation are also common use cases.

The focus is on fast access to these cached results, not persistence. This fast access is expressed architecturally in Java as in-process caching, and over the network as in-memory cache stores.

Helping to draw the distinction, both Gartner and Forrester have in the last year created their respective definitions of Distributed Caching and Elastic Caching. According to Gartner, “Distributed caching platforms enable users to manage very large in-memory data stores to enable DBMS workload offloading, cloud and cloud transaction processing, extreme transaction processing, complex-event processing, and high-performance computing”. They also added distributed caches to their application Platform as a Service (“aPaaS”) reference architecture.

This makes sense. We see distributed caches being used alongside both RDBMSs and NoSQL.

But once the cache gets distributed and takes on enterprise features like search, the capabilities expand and overlap with NoSQL. Yet the difference in emphasis remains.

Use Cases

So what are some use cases where you might want to consider a distributed cache? In general, any use case where a ‘key-value plus search’-type NoSQL solution fits. As long as data volumes are less than 2 TB and the durability toolset is acceptable.

Specifically we see the following use cases:

  • General Purpose Caching

    e.g. Hibernate Caching. JDBC caching. Web caching. Collection caching. We do this one pretty thoroughly.

  • In conjunction with an in-house or third-party analytics engine, very fast lookup of analytics results.

    e.g. A credit card company needs to score real-time credit card transactions. There are hundreds of millions per day. Results of an in-house fraud model with transactions up to the close of business the previous day are loaded into the cache. The cache is further adjusted during the current business day for actual usage and can return fraud scores on billions of credit card numbers in a fraction of a second.

  • System of Record (“SOR”) for short to medium term business processes

    e.g. A phone company does mobile phone contract processing and provisioning with rapidly changing plan details. They create a value object for each plan and persist that to the cache, avoiding database schema changes thus adding business agility. The final result of the provisioning is recorded in the database after approximately two weeks. The value in the cache is akin to the document in a Document Store.

  • In-memory dataset search, including applications where all data can be held in memory as well as those where a partial (or hot) dataset is in memory

    e.g. A logistics company needs to lookup consignment nodes by id, sender name, addressee name and date range. They generate 400 GB per fortnight and 98% of searches are within two weeks. The database is overloaded handling the volume of lookups. By storing the most used data in the cache a 98% database offload can be achieved.


Distributed Caches and NoSQL Stores have been born out of the same need to supplement the RDBMS in powering web scale architectures. Both are cognizant of the CAP theorem and the impossibility of taking along the old certainties into a large scale distributed world.

While NoSQL is aimed at replacing the durability feature of RDBMS, caches are aimed at low latency and speed. Moreover, the cache is also seeking to avoid forcing the application to go out to any store, whether it be RDBMS or NoSQL.

I see distributed caches and NoSQL as being two useful and complimentary technologies that can supplement the RDBMS for what it cannot do, but also provide their own unique new features.

Original title and link: Ehcache: Distributed Cache or NoSQL Store? (NoSQL databases © myNoSQL)

Terracotta and Hadoop

Ari Zilka (Terracotta):

Terracotta is generally part of read/write business systems that you might call transactional (not meaning XA / 2-phased commit). Hadoop is generally part of read-only analytics systems you might call business intelligence. Hadoop is used to create cool things like Yahoo’s predictive-typing feature in their search engine. Terracotta is used in financial services to execute more trades per hour or run algorithmic trading solutions faster or in gaming to handle more gamers per day. Terracotta is used to make sure no state is lost if Java or .Net apps stop running mid-day. Hadoop is used to crunch massive datasets looking for patterns.

It’s great that we don’t get back to elastic caching vs NoSQL.

Terracotta and Hadoop originally posted on the NoSQL blog: myNoSQL


Terrastore: A Consistent, Partitioned and Elastic Document Database

Terrastore is a very young Apache licensed document store solution built on top of the Terracotta (an in-memory clustering technology) that released its 0.2 version a couple of days ago.

I had the opportunity to chat with Sergio Bossa (@sbtourist) and have him answer a couple of questions about Terrastore.

Alex: What is it that made you create Terrastore in the first place?

Sergio: I wanted a scalable document store with consistency features, because I think that’s an uncovered topic/space in current implementations, which are all geared toward BASE.

Being a document database, Terrastore belongs to the same category as CouchDB, MongoDB, and Riak. In some regards (f.e. partitioning), Terrastore is similar to Riak. You should also check [1] to find out more about Terrastore and the CAP theorem.

Terracotta replication is not full, nor geared toward all nodes, but only those actually requiring the replicated data. This is more and more optimized in Terrastore, where, thanks to consistent hashing and partitioning, data is not duplicated at all. Terrastore also guarantees that data will never be duplicated among nodes, unless new nodes are joining or older nodes are leaving, thus requiring data redistribution. A Terrastore client doesn’t need to know where the data is: it can contact whatever Terrastore node and requests will be routed to the proper node holding the value (note: this is similar to the way Dynamo, Project Voldemort, Cassandra and other distributed stores are working)

At this point, more people have joined the chat and so more interesting questions and answers were coming up.

Alex: Considering Terrastore is built on top of Terracotta, is it an in-memory storage making it somehow similar to Redis?

Sergio: Correct, it stores everything in memory, but it is persistent as well. It is not as fast as Redis mainly due to some overhead related to its distributed features.

Paulo Gaspar: Terrastore looks very much like a persistent, transactional Memcached service.

Sergio: Persistent, transactional, and partitioned/sharded. An interesting difference is that afaik Memcached partitioning is done client side, while Terrastore has builtin support for data partitioning, distribution and access routing.

Terrastore is already HTTP and JSON friendly [2] and the future might bring support for the memcached protocol too.

Please see the following resources to learn more about Terrastore: