BigTable: All content tagged as BigTable in NoSQL databases and polyglot persistence
After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I’ve realized there are no links to the corresponding docs. So here they are:
Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.
You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.
After a system-wide snapshot is performed, you can enable incremental backups on each node to back up data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a /backups subdirectory of the data directory (provided JNA is enabled).
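To make the Cassandra steps a bit more concrete, here is a minimal sketch that only builds the command lines for a per-node snapshot and for a cluster-wide snapshot through pssh; it does not run anything. The helper names, the hosts file, and the tag are made up for illustration. The `-t` (tag) and `-cf` (table) flags of `nodetool snapshot` and the `-h`/`-i` flags of pssh are what I'd expect on common versions, so check them against the docs for your release before relying on this.

```python
import shlex


def snapshot_command(tag, keyspace=None, table=None):
    """Build the argv for `nodetool snapshot` on a single node.

    With no keyspace, nodetool snapshots all keyspaces.
    The -t and -cf flags are assumed from common nodetool versions.
    """
    cmd = ["nodetool", "snapshot", "-t", tag]
    if table is not None:
        if keyspace is None:
            raise ValueError("a single-table snapshot needs its keyspace")
        cmd += ["-cf", table]
    if keyspace is not None:
        cmd.append(keyspace)
    return cmd


def cluster_snapshot_command(hosts_file, tag, keyspace=None):
    """Wrap the per-node command in a pssh call (hypothetical hosts file)."""
    remote = shlex.join(snapshot_command(tag, keyspace))
    # -h: file listing the nodes, -i: print each node's output inline
    return ["pssh", "-h", hosts_file, "-i", remote]


if __name__ == "__main__":
    print(shlex.join(cluster_snapshot_command("cassandra_hosts.txt", "nightly")))
```

Passing the resulting list to `subprocess.run` would execute it. Since each node snapshots on its own, the result is still only the eventually consistent backup described above.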
Basically there are three ways to back up MongoDB:
- Using MMS
- Copying underlying files
Riak’s backup operations differ quite a bit between its two main storage backends, Bitcask and LevelDB:
Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to back up both the ring and data directories that pertain to your configured backend.
Note: I’d be happy to update this entry with links to docs on what tools and solutions other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB) are providing.
✚ A backup is only as useful as your ability to restore from it, so I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases this might simplify the process a bit and increase users’ confidence.
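As a rough illustration of what such a lightweight check could look like, here is a sketch that records a SHA-256 checksum for every file at backup time and later re-checks the backup against that manifest. This is my own assumption of a possible approach, not a feature of any of the databases above, and all function names are hypothetical. It only catches missing or altered files; it says nothing about whether the database can actually load them, which is exactly why it is not equivalent to a restore.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(backup_dir):
    """Map each file's path (relative to backup_dir) to its digest."""
    root = Path(backup_dir)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(backup_dir, manifest):
    """List (path, problem) pairs for files that are missing or changed."""
    current = build_manifest(backup_dir)
    problems = []
    for name, digest in manifest.items():
        if name not in current:
            problems.append((name, "missing"))
        elif current[name] != digest:
            problems.append((name, "changed"))
    return problems
```

The manifest would be built right after the backup completes and stored next to it (for example as JSON); an empty list from `verify_manifest` means every recorded file is still present and unchanged.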
Original title and link: Quick links for how to backup different NoSQL databases ( ©myNoSQL)
Since the GA announcement a couple of weeks ago, I’ve noticed quite a few data-related posts on the Google Compute Engine blog:
- Mon., 9th: DataStax Enterprise feels right at home in Google Compute Engine
- Tue., 10th: DataTorrent offers massive-scale, real-time stream analytics on Google Compute Engine
- Thu., 12th: Qubole helps you run Hadoop on Google Compute Engine
If you look at these, you’ll notice a theme: covering data from every angle. Cassandra/DSE from DataStax for OLTP, DataTorrent for stream processing, Qubole for Hadoop, MapR for their Hadoop-like solution. I can see this continuing for a while and making Google Compute Engine a strong competitor to Amazon Web Services.
One question remains though: will they be able to come up with a good integration strategy for all these 3rd party tools?
Original title and link: Google Compute Engine and Data ( ©myNoSQL)
If you’ve never used Thrift (with or without HBase), the two articles authored by Jesse Anderson and posted on Cloudera’s blog will give you both a quick intro and
- How-to: Use the HBase Thrift Interface, Part 1: setting up, getting the language bindings, and connecting;
- How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows: using HBase’s Thrift API from Python
Original title and link: An intro to HBase’s Thrift interface ( ©myNoSQL)
A presentation by Todd Eisenberger about the archival system used by Dropbox based on MySQL and HBase:
- fast queries for known keys over a (relatively) small dataset
- high read throughput
- high write throughput
- large suite of pre-existing tools for distributed computation
- easier to perform large processing tasks
✚ Both are consistent
✚ Most of the benefits in HBase’s section point in the direction of data processing benefits (and not data storage benefits)
This is an important release for HBase. Both Hortonworks and Cloudera have posts covering it:
- Hortonworks: Announcing Apache HBase 0.96.0, More than 2000 issues resolved!
- Cloudera: HBase 0.96.0 Released!
Original title and link: Apache HBase 0.96.0 released after more than 2000 issues resolved ( ©myNoSQL)
Hortonworks, eBay and Scaled Risk have been collaborating on improving the mean time to recovery in HBase. After extensive testing performed at eBay, some results are now available for two scenarios:
- Node/RegionServer failures while writing
- Node/RegionServer failures while reading
Original title and link: Results of collaboration on improving the Mean Time to Recovery in HBase ( ©myNoSQL)
In 4 years of writing this blog I haven’t seen such a prolific month:
- Apache Hadoop 2.2.0 (more links here)
- Apache HBase 0.96 (here and here)
- Apache Hive 0.12 (more links here)
- Apache Ambari 1.4.1
- Apache Pig 0.12
- Apache Oozie 4.0.0
- Plus Presto.
Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.
Original title and link: A prolific season for Hadoop and its ecosystem ( ©myNoSQL)