ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

Clustrix Sierra Clustered Database Engine

In the light of publicly announcing customers, I wanted to read a bit about Clustrix Clustered Database Systems.

The company homepage is describing the product:

  • scalable database appliaces for Internet-scale work loads
  • Linearly Scalable: fully distributed, parallel architecture provides unlimited scale
  • SQL functionality: full SQL relational and data consistency (ACID) functionality
  • Fault-Tolerant: highly available providing fail-over, recovery, and self-healing
  • MySQL Compatible: seamless deployment without application changes.

All these sounded pretty (too) good. And I’ve seen a very similar presentation for Xeround: Elastic, Always-on Storage Engine for MySQL.

So, I’ve continued my reading with the Sierra Clustered Database Engine whitepaper (PDF).

Here are my notes:

  • Sierra is composed of:
    • database personality module: translates queries into internal representation
    • distributed query planner and compiler
    • distributed shared-nothing execution engine
    • persistent storage
    • NVRAM transactional storage for journal changes
    • inter-node Infiniband
  • queries are decomposed into query fragments which are the unit of work. Query fragments are sent for execution to nodes containing the data.
  • query fragments are atomic operations that can:
    • insert, read, update data
    • execute functions and modify control flow
    • perform synchronization
    • send data to other nodes
    • format output
  • query fragments can be executed in parallel
  • query fragments can be cached with parameterized constants at the node level
  • determining where to sent the query fragments for execution is done using either range-based rules or hash function
  • tables are partitioned into slices, each slice having redundancy replicas
    • size of slices can be automatically determined or configured
    • adding new nodes to the cluster results in rebalancing slices
    • slices contained on a failed device are reconstructed using their replicas
  • one of the slices is considered primary
  • writes go to all replicas and are transactional
  • all reads fo the the slice primary

The paper also exemplifies the execution of 4 different queries:

SELECT * FROM T1 WHERE uid=10

SELECT uid, name FROM T1 JOIN T2 on T1.gid = T2.gid WHERE uid=10

SELECT * FROM T1 WHERE uid<100 and gid>10 ORDER BY uid LIMIT 5

INSERT INTO T1 VALUES (10,20)

Questions:

  • who is coordinating transactions that may be executed on different nodes?
  • who is maintains the topology of the slices? In case of a node failure, you’d need to determine:
    1. what slices where on the failing node
    2. where are the replicas for each of these slices
    3. where new replicas will be created
    4. when will new replicas become available for writes
  • who elects the slice primary?

Clustrix Sierra Clustered Database Engine

Original title and link: Clustrix Sierra Clustered Database Engine (NoSQL databases © myNoSQL)