Google: All content tagged as Google in NoSQL databases and polyglot persistence
Wednesday, 15 May 2013
Introducing Google Cloud Datastore
Urs Hölzle in a post summarizing some of the announcements at Google I/O:
Google Cloud Datastore is a fully managed and schemaless solution for storing non-relational data. Based on the popular App Engine High Replication Datastore, Cloud Datastore is a standalone service that features automatic scalability and high availability while still providing powerful capabilities such as ACID transactions, SQL-like queries, indexes and more.
I’m heading over to the project’s site to read more.
Original title and link: Introducing Google Cloud Datastore (©myNoSQL)
via: http://googlecloudplatform.blogspot.com/2013/05/ushering-in-next-generation-of.html
Wednesday, 20 February 2013
Amazon Preparing 'Disruptive' Big Data AWS Service?
Interesting speculation by The Register:
AWS already has the AWS Data Pipeline, which helps administrators schedule and shuttle data among various services, AWS Redshift for data warehousing which lets people store large quantities of data in the cloud and run queries on it, its NoSQL SSD-backed DynamoDB, and its Relational Database Service (RDS). So where does MADS fit?
The Reg’s take is that MADS will allow Amazon to build services that can net together the above components and help automate the passing of data among them. It may also become a standalone product in its own right, based on its similarities to the TransLattice and Google Spanner tech.
I almost never bet, but I’d say this could be Amazon’s Spanner.
via: http://www.theregister.co.uk/2013/02/19/amazon_new_big_data_aws_service/
Monday, 29 October 2012
Overview of Dremel-Like Solutions: Moving Beyond Hadoop for Big Data Needs
Until I learn more about the recently announced Cloudera Impala and Druid from Metamarkets, this article by Jaikumar Vijayan should offer (with some inherent mistakes [1]) a good overview of the solutions that aim to provide alternatives to the batch-processing nature of Hadoop:
- Google Dremel (BigQuery)
- Cloudera Impala
- Metamarkets Druid
- Nodeable StreamReduce
- SAP HANA integrated with Hadoop, etc.
[1] Just an example: “If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies”. Then: “The technology [n.b. Google Dremel] can run queries over trillion-row data tables in seconds…”
Maybe just one more: consider the title “Moving beyond Hadoop” and then the quote from Google’s Ju-kay Kwek: “Google uses Dremel in conjunction with MapReduce. […] Hadoop and Dremel are distributed computing technologies, but each was built to address very different problems.”
Monday, 1 October 2012
Google BigQuery Adds Support for JSON Import and Hierarchical Data
Besides performance and quota changes, Google BigQuery adds support for importing JSON data and nested/repeated fields:
If you’re using App Engine Datastore or other NoSQL databases, it’s likely you’re taking advantage of nested and repeated data in your data model. For example, a customer data entity might have multiple accounts, each storing a list of invoices. Now, instead of having to flatten that data, you can keep your data in a hierarchical format when you import to BigQuery.
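To make the nested-versus-flattened point concrete, here is a minimal Python sketch of the customer example from the quote. The field names (`name`, `accounts`, `account_id`, `invoices`, `number`, `total`) and both helper functions are made up for illustration; the only thing I'm assuming about the import format is that each record goes on its own line of JSON.

```python
import json

# Hypothetical customer record; every field name here is illustrative only.
customer = {
    "name": "Acme Corp",
    "accounts": [
        {"account_id": "A-1", "invoices": [{"number": 101, "total": 250.0},
                                            {"number": 102, "total": 75.5}]},
        {"account_id": "A-2", "invoices": [{"number": 201, "total": 10.0}]},
    ],
}


def flatten(record):
    """Yield one flat row per invoice, repeating the parent fields each time."""
    for account in record["accounts"]:
        for invoice in account["invoices"]:
            yield {
                "name": record["name"],
                "account_id": account["account_id"],
                "invoice_number": invoice["number"],
                "invoice_total": invoice["total"],
            }


def to_json_line(record):
    """Serialize a nested record as a single line of JSON (no embedded newlines)."""
    return json.dumps(record, separators=(",", ":"))


flat_rows = list(flatten(customer))   # the old way: parent fields duplicated per invoice
nested_line = to_json_line(customer)  # the new way: hierarchy kept intact
```

The flattened form repeats the customer and account fields on every invoice row, while the nested form keeps a single record per customer.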
via: http://googleenterprise.blogspot.com/2012/10/google-bigquery-updates-faster-easier.html
Monday, 24 September 2012
Todd Hoff on Google Spanner's
Todd Hoff of Highscalability.com:
What struck me most in the paper was a deeply buried section essentially describing Google’s motivation for shifting away from NoSQL and to NewSQL. The money quote:
We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.
That’s one piece of the Spanner paper that’s catching everyone’s attention. I’m wondering how much of this reference to transactions refers to:
- multi-operation transactions
- synchronous replication
- strong data consistency
Building Spanner Presentation
Alex Lloyd’s talk from Berlin Buzzwords 2012 about Google’s Spanner.
Cloudant's Mike Miller on Google Spanner
Cloudant’s Mike Miller sharing his thoughts about Google’s Spanner paper:
Spanner’s key innovation is around time. It includes a novel system using GPS and Atomic Clocks to distribute a globally synchronized “proper time.” The previous dogma in distributed systems was that synchronizing time within and between datacenters is insurmountably hard and uncertain. Ergo, serialization of requests is impossible at global scale. Google’s key innovation is to accept uncertainty, keep it small (via atomic clocks and GPS), quantify the uncertainty and operate around it. In retrospect this is obvious, but it doesn’t make it any less brilliant.
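The "accept uncertainty, quantify it and operate around it" idea can be sketched in a few lines of Python. This is my own toy model, not Google's code: `UncertainClock`, `TimeInterval`, `commit` and the fixed `epsilon_seconds` bound are hypothetical names and simplifications. The clock returns an interval rather than an instant, and a commit is acknowledged only after its timestamp is guaranteed to be in the past.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float
    latest: float


class UncertainClock:
    """Hypothetical clock that reports an interval instead of a single instant."""

    def __init__(self, epsilon_seconds, source=time.time):
        self.epsilon = epsilon_seconds  # assumed fixed bound on clock error
        self.source = source

    def now(self):
        t = self.source()
        return TimeInterval(t - self.epsilon, t + self.epsilon)

    def after(self, timestamp):
        """True only once `timestamp` has definitely passed on every clock."""
        return self.now().earliest > timestamp


def commit(clock, apply_write, sleep=time.sleep):
    """Pick a timestamp, apply the write, then wait out the uncertainty."""
    commit_ts = clock.now().latest       # no correct clock can be ahead of this
    apply_write(commit_ts)
    while not clock.after(commit_ts):    # the waiting step
        sleep(clock.epsilon / 10)
    return commit_ts
```

The wait costs roughly twice the uncertainty bound per commit, which is why keeping that bound small (the GPS and atomic clocks part) matters so much.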
via: https://cloudant.com/blog/cloudant-labs-on-google-spanner/
Monday, 9 July 2012
Google Cloud Platform Is the Biggest Deal in IT Since Amazon Launched EC2
Remember what I wrote in my post on the state of the Hadoop market about having a second option for on-demand cloud-based Hadoop services? Benjamin Black compares Google Cloud Platform with Amazon's services:
- Cloud Engine is a lot like EC2 & EBS
- Cloud Storage is a lot like S3
- Cloud SQL is a lot like RDS
- Analytics can be used like CloudWatch (and I know of people putting billions of their own data points in Analytics)
- BigQuery has no AWS equivalent, but maybe you could build it with EMR?
- PageSpeed has no AWS equivalent
Hadoop and MapR are already listed as possible use cases for Google Cloud Platform.
I don’t think I could write a better conclusion than Black did in his post:
This is big, planetary scale infrastructure. This is cloud legitimized and super-sized. In the words of the prophet: Shit just got real.
via: http://blog.b3k.us/2012/07/04/cloud-independence-day.html
Friday, 8 June 2012
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Google’s paper about Dapper, their large-scale distributed systems tracing solution, which inspired Twitter’s Zipkin:
Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
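The sampling idea mentioned in the abstract can be illustrated with a small Python sketch. None of this is Dapper's actual API: `Tracer`, `TraceContext`, `sample_rate` and the in-memory `recorded` list are hypothetical stand-ins. The point it shows is that the sampling decision is made once at the root of a trace and inherited by every child span, so a trace is either recorded whole or not at all.

```python
import random
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool


class Tracer:
    """Hypothetical tracer: samples per trace, not per span."""

    def __init__(self, sample_rate, rng=None):
        self.sample_rate = sample_rate      # assumed probability in [0, 1]
        self.rng = rng or random.Random()
        self.recorded = []                  # stand-in for a real trace store

    def start_trace(self):
        # The sampling decision is made exactly once, at the root.
        sampled = self.rng.random() < self.sample_rate
        return TraceContext(uuid.uuid4().hex, uuid.uuid4().hex, None, sampled)

    def start_child(self, parent):
        # Children inherit the trace id and the root's sampling decision.
        return TraceContext(parent.trace_id, uuid.uuid4().hex,
                            parent.span_id, parent.sampled)

    def record(self, ctx, name):
        # Unsampled traces cost almost nothing: the span is simply dropped.
        if ctx.sampled:
            self.recorded.append(
                (ctx.trace_id, ctx.span_id, ctx.parent_span_id, name))
```

In a real system the context would travel inside RPC metadata, which is where restricting instrumentation to a few common libraries comes in.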
Download or read the paper after the break.
Wednesday, 2 May 2012
Google BigQuery: Running SQL-like Queries Against Very Large Datasets
At the GigaOm Structure Data event, Google announced a new big data service named BigQuery:
BigQuery enables businesses and developers to gain real-time business insights from massive amounts of data without any upfront hardware or software investments.
A quick bullet point list of BigQuery features and limitations:
- BigQuery is ideal for running queries over vast amounts of data—up to billions of rows—with great speed.
- BigQuery is good for analyzing vast quantities of data quickly, but not for modifying it. In data analysis terms, BigQuery is an OLAP (online analytical processing) system.
- You can import data into BigQuery as CSV data, where it is stored in the cloud in a relatively small number of tables with no explicit relationship to each other.
- BigQuery isn’t a database system: it doesn’t support table indexes or other database management features.
- BigQuery supports a specialized subset of SQL; it doesn’t support update or delete requests.
- BigQuery supports joins only when one side of the join is much smaller than the other.
- BigQuery can be used by any client able to send REST commands over the Internet.
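The list doesn't say why joins need one side to be much smaller. A common reason for this kind of restriction in distributed query engines is that the small side is loaded into memory on every worker while the large side is streamed past it. The Python sketch below shows that general technique; it is my guess at the mechanics, not a description of BigQuery's internals, and `broadcast_hash_join` is a hypothetical name.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, key):
    """Join a streamed large side against a small side held fully in memory."""
    # Build phase: the whole small side must fit in memory.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: the large side is scanned once and never materialized.
    for row in large_rows:
        for match in lookup.get(row[key], ()):
            # On a field-name clash, the large-side value wins.
            yield {**match, **row}


countries = [{"country": "DE", "name": "Germany"},
             {"country": "FR", "name": "France"}]
page_hits = [{"country": "DE", "hits": 3},
             {"country": "US", "hits": 1},
             {"country": "FR", "hits": 2},
             {"country": "DE", "hits": 5}]

joined = list(broadcast_hash_join(page_hits, countries, "country"))
```

Memory use grows with the small side only, which is why the approach stops working once both sides are large.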
After the break you can watch the 15-minute video recorded at the GigaOm event.
Monday, 13 February 2012
How Web giants store big data
A not very technical Ars Technica overview of the storage engines developed and used by Google (Google File System, BigTable), Amazon (Dynamo), and Microsoft (Azure DFS), plus the Hadoop Distributed File System (HDFS).
Sunday, 5 February 2012
Research in the MapReduce Space
Over the weekend I read two papers presenting products or research aimed at improving or adding new capabilities to the MapReduce data processing approach. The first comes from a team at Microsoft and describes TiMR, a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great to learn that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.