PostgreSQL: All content tagged as PostgreSQL in NoSQL databases and polyglot persistence

The Durable Document Store You Didn't Know You Had, but Did

As it turns out, PostgreSQL has a number of ways of storing loosely structured data/documents in a column on a table.

  • hstore is a data type available as a contrib package that allows you to store key/value structures just like a dictionary or hash.
  • You can store data in JSON format on a text field, and then use PLV8 to JSON.parse() it right in the database.
  • There is a native xml data type, along with a few interesting query functions that allow you to extract and operate on data that sits deep in an XML structure.

I concur. Not knowing your database *must not* be the reason for adopting a NoSQL database.
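As a rough illustration of the second and third options above (JSON kept in a text column and parsed inside the database, and path-style extraction from XML), here is a minimal sketch. It is not PostgreSQL code: it uses Python's built-in SQLite as a stand-in engine, and the helper names `json_get` and `xml_get`, the `docs` table, and the sample data are all assumptions made up for this example. The point is only the shape of the idea: the document lives in an ordinary column, and a function registered with the engine pulls values out of it at query time.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET


def json_get(document: str, key: str):
    """Parse a JSON text value and return one top-level key (None if absent)."""
    value = json.loads(document).get(key)
    # SQLite functions return scalars, so nested values are re-serialized.
    return json.dumps(value) if isinstance(value, (dict, list)) else value


def xml_get(document: str, path: str):
    """Return the text of the first element matching an ElementTree path."""
    node = ET.fromstring(document).find(path)
    return None if node is None else node.text


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with the two hypothetical helpers registered."""
    conn = sqlite3.connect(":memory:")
    # Registering the helpers lets SQL call them, i.e. parsing happens in the engine.
    conn.create_function("json_get", 2, json_get)
    conn.create_function("xml_get", 2, xml_get)
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, meta TEXT)")
    return conn


conn = open_store()
conn.execute(
    "INSERT INTO docs (body, meta) VALUES (?, ?)",
    (
        json.dumps({"title": "hello", "tags": ["a", "b"]}),
        "<post><author><name>Ann</name></author></post>",
    ),
)
row = conn.execute(
    "SELECT json_get(body, 'title'), xml_get(meta, 'author/name') FROM docs"
).fetchone()
print(row)
```

In PostgreSQL itself the equivalent pieces would be the features named in the quote (PLV8 for the JSON case, the native xml type and its query functions for the XML case); the sketch only mirrors the pattern.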

Original title and link: The Durable Document Store You Didn’t Know You Had, but Did (NoSQL database©myNoSQL)

via: http://robots.thoughtbot.com/post/13829210385/the-durable-document-store-you-didnt-know-you-had-but


Postgres Plus Connector for Hadoop in Private Beta

There is not much information available yet on the project page, but it looks like a bidirectional integration of PostgreSQL and Hadoop.

The Postgres Plus Connector for Hadoop provides developers easy access to massive amounts of SQL data for integration with or analysis in Hadoop processing clusters.  Now large amounts of data managed by PostgreSQL or Postgres Plus Advanced Server can be accessed by Hadoop for analysis and manipulation using Map-Reduce constructs.

Postgres Plus Hadoop

When it comes to PostgreSQL and Hadoop, the first thing that comes to my mind is Daniel Abadi’s HadoopDB, which recently became the technology behind his startup; that startup has already raised $9.5 million.

Original title and link: Postgres Plus Connector for Hadoop in Private Beta (NoSQL database©myNoSQL)


The Story of Etsy's Architecture

Ars Technica’s Sean Gallagher summarizes a presentation given at the Surge conference covering the evolution of Etsy’s architecture from a centralized solution based on PostgreSQL stored procedures, through a failed service-oriented-like architecture, to sharded MySQL:

And the team started to shift feature by feature away from a semi-monolithic Postgres back-end to sharded MySQL databases. “It’s a battle-tested approach,” Snyder said. “Flickr is using it on an enormous scale. It scales horizontally, basically, to near infinity, and there’s no single point of failure—it’s all master to master replication.”

Original title and link: The Story of Etsy’s Architecture (NoSQL database©myNoSQL)

via: http://arstechnica.com/business/news/2011/10/when-clever-goes-wrong-how-etsy-overcame-poor-architectural-choices.ars


Tutorial: Building Interactive Maps With Polymaps, TileStache, and MongoDB

A three-part tutorial by Hans Kuder on using MongoDB, PostgreSQL/PostGIS, and JavaScript libraries to build interactive maps:

  • part 1: goals and building blocks
  • part 2: geo data, PostGIS, and TileStache
  • part 3: client side and MongoDB

Original title and link: Tutorial: Building Interactive Maps With Polymaps, TileStache, and MongoDB (NoSQL database©myNoSQL)


Beyond NoSQL: Using RRD to Store Temporal Data

Patrick Schless describes the advantages of using RRDTool to collect write-once data over time and graph the results.

The projects collect very different data, but this task was painful enough in postgres that I ended up switching to a temporal database for the second go, and it made the data collection & querying much easier. What follows are a brief discussion of the problems I faced with postgres, and how moving to RRD solved them.

Also check the Hacker News thread for a couple of other RRDTool tricks.

In the NoSQL space, this sort of quick analytics use case was associated with MongoDB:

Other larger platforms have developed their own solutions:

But using a specialized solution has its own benefits… where did we hear that before?

Original title and link: Beyond NoSQL: Using RRD to Store Temporal Data (NoSQL database©myNoSQL)

via: http://www.plainlystated.com/2011/07/beyond-nosql-using-rrd-to-store-temporal-data/


Building an Ad Network Ready for Failure

The architecture of a fault-tolerant ad network built on top of HAProxy, Apache with mod_wsgi and Python, Redis, a bit of PostgreSQL and ActiveMQ deployed on AWS:

The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple “slaveof” calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but had quite a bit more thought regarding ad targeting. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.
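To make the "set unions/intersections" idea in the quote concrete, here is a minimal sketch of the set algebra only. It uses plain Python sets as a stand-in for Redis sets (where the corresponding operations are the SUNION and SINTER family of commands). The index layout, the key names such as `country:us`, and the `match_ads` helper are assumptions invented for this example; the original post does not show its data model or how the single round trip was achieved.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical index: targeting attribute -> ids of ads that accept it.
INDEX: Dict[str, Set[int]] = {
    "country:us": {1, 2, 3},
    "country:ca": {3, 4},
    "device:mobile": {2, 3, 4},
    "topic:sports": {1, 3},
}


def match_ads(groups: Iterable[Iterable[str]], index: Dict[str, Set[int]]) -> Set[int]:
    """Union the keys inside each group, then intersect the groups.

    Each group is an OR over attributes (any of these countries);
    the groups together are an AND (country AND device AND topic).
    """
    result: Optional[Set[int]] = None
    for group in groups:
        union: Set[int] = set()
        for key in group:
            union |= index.get(key, set())
        result = union if result is None else result & union
    return result if result is not None else set()


request = [["country:us", "country:ca"], ["device:mobile"], ["topic:sports"]]
print(match_ads(request, INDEX))
```

With a real Redis deployment the same unions and intersections would run on the server, which is what keeps the client down to a small number of network calls.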

Original title and link: Building an Ad Network Ready for Failure (NoSQL database©myNoSQL)

via: http://dr-josiah.blogspot.com/2011/06/building-ad-network-ready-for-failure.html


Reddit's Story of Running Cassandra & PostgreSQL on Amazon EBS

I’m still distilling what happened at Reddit the other day, when failures of EBS in a single availability zone took Reddit down for many hours:

Unfortunately, EBS also has reliability issues. Even before the serious outage last night, we suffered random disks degrading multiple times a week. While we do have protections in place to mitigate latency on a small set of disks by using raid-0 stripes, the frequency of degradation has become highly unpalatable.

[…] we have been working to completely move Cassandra off of EBS and onto the local storage which is directly attached to the EC2 instances. […] While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of EBS.

After the outage today, we are going to be investigating doing the same for our Postgres clusters.

One mistake we made was using a single EBS disk to back some of our older master databases

These may sound like truisms to those working on highly available systems, but not to everybody else:

  • when talking high availability, running your application from a single Amazon availability zone is not enough

  • even if EBS promises “highly available, highly reliable storage volumes”, a solution relying on it will have to account for: 1) failures; 2) unreliable performance.

    An ex-Reddit engineer posted details about the serious issues Reddit noticed while using Amazon EBS.

  • Dynamo-style NoSQL databases, where all nodes in a cluster are equal, are able to tolerate failures more easily than a traditional RDBMS.

    Reddit is working on moving Cassandra off the EBS and onto the local ephemeral EC2 storage.

  • A master/slave replication model combined with the out-of-order commits issue makes me think that the cloud and RDBMS are not yet perfect together.

    Data which had been committed to the slaves was not committed to the masters. In a normal replication scenario, this should never, ever happen. The master commits the data, then tells the slave it is safe to commit the same data.

  • One mistake we made was using a single EBS disk to back some of our older master databases

  • remember the Amazon EBS vs SSD: Price, Performance, QoS?

What else can we learn from Reddit’s experience?

Original title and link: Reddit’s Story of Running Cassandra & PostgreSQL on Amazon EBS (NoSQL databases © myNoSQL)

via: http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24.html


dbShards: MPP DBMS on top of MySQL or PostgreSQL

Three articles about dbShards:

  1. highscalability.com: Product: DbShards - Share Nothing. Shard Everything

    What Kind Of Customer Are You Targeting With DbShards? Who Ends Up Using Your Product And Why?

    The primary customers for dbShards fit into two categories:

    1. fast-growing Web or online applications (e.g., Gaming, Facebook apps, social network sites, analytics)
    2. any application involved in high volume data collection and analysis (e.g., Device Measurement). Any application that requires high rates of read/write transaction volumes with a growing data set is a good candidate for the technology.

    I’ve checked the customers page and I don’t see any company listed there that corresponds to the first point above. As regards the second category, read on.

  2. dbms2.com: dbShards — a lot like an MPP OLTP DBMS based on MySQL or PostgreSQL

    insert performance with dbShards + MySQL + InnoDB is 1500-3000 inserts per shard per second, scaling almost linearly with the number of shards. I forgot to ask how many shards this had been tested for.

    I assume you are aware of some numbers for NoSQL databases. Not to mention the 750k qps NoSQLized MySQL.

    dbShards has good join performance when – you guessed it! – everything being joined is co-located shard-by-shard, because the tables were distributed on the same shard key and/or replicated across each shard. Cory can’t imagine why you’d want to do an inner join under any other circumstances.

    While there’s no surprise in the above quote, I’m not sure how to correlate it with the fact that dbShards targets data analysis clients.

  3. dbms2.com: dbShards update

    dbShards’ replication scheme works like this:

    • A write initially goes to two places at once — to the DBMS and a dbShards agent, both running on the same server.
    • The dbShards agent streams to the dbShards agent on the replica server, and receipt of the streamed write is acknowledged.
    • At that point the commits start. (Cory seemed to say that the commit on the primary server happens first, but I’m not sure why.)

    In essence, two-phase database commit is replaced by two-phase log synchronization.

    Could anyone explain how these are different?

I know all this may come across as too negative. But while I think dbShards has a decent set of features, some of the statements out there are not doing it any favors.
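The co-located join mentioned in the dbms2.com quote above can be sketched in a few lines. This is an illustration of the general technique, not of dbShards itself: the shard count, the crc32-based `shard_for` function, and the `distribute` and `colocated_join` helpers are all assumptions made for the example. When both tables are distributed on the same shard key, matching rows always land on the same shard, so each shard can compute its part of the join without talking to any other shard.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_SHARDS = 4  # assumed shard count, for illustration only

Row = Tuple[str, str]  # (shard_key, payload)


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard with a stable hash (crc32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def distribute(rows: List[Row], num_shards: int = NUM_SHARDS) -> Dict[int, List[Row]]:
    """Place rows on shards according to their shard key."""
    shards: Dict[int, List[Row]] = defaultdict(list)
    for key, payload in rows:
        shards[shard_for(key, num_shards)].append((key, payload))
    return shards


def colocated_join(left: List[Row], right: List[Row],
                   num_shards: int = NUM_SHARDS) -> List[Tuple[str, str, str]]:
    """Inner join on the shard key, computed independently per shard."""
    left_shards = distribute(left, num_shards)
    right_shards = distribute(right, num_shards)
    joined = []
    for shard_id in range(num_shards):
        # Build a local hash table from this shard's slice of the right table.
        by_key: Dict[str, List[str]] = defaultdict(list)
        for key, payload in right_shards.get(shard_id, []):
            by_key[key].append(payload)
        # Probe it with this shard's slice of the left table only.
        for key, payload in left_shards.get(shard_id, []):
            for other in by_key.get(key, []):
                joined.append((key, payload, other))
    return joined


users = [("u1", "Ann"), ("u2", "Bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u3", "cup")]
print(sorted(colocated_join(users, orders)))
```

A join on any column other than the shard key would break this property, since matching rows could sit on different shards and would have to be moved across the network first.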

Original title and link: dbShards: MPP DBMS on top of MySQL or PostgreSQL (NoSQL databases © myNoSQL)


PostgreSQL: How to Make it Faster

I usually cross the line and post about RDBMS when there are interesting things to be learned. Robert Haas explains some PostgreSQL knobs that can be turned to make things faster. More interesting is what these knobs do: some disable fsync, others disable the write-ahead log, and others make commits asynchronous. Even more interesting is that all of these knobs ultimately trade durability for speed.

I think that, in the future, we may be able to provide more options to allow people to relax the data integrity guarantees that PostgreSQL provides in controlled ways. For example, I can imagine a “dirty read” table, where transactions are not used; instead, rows become visible as soon as they’re inserted, and disappear as soon as they’re deleted. Such a table would be unsuitable for many business applications, but if your application only does single-row operations indexed by primary key, it might work just fine; and it would open up a number of interesting optimization opportunities that aren’t available for ordinary tables. Or, you might have a “no snapshot” table, where rows don’t become visible until the inserting transaction commits, but we make no attempt to guarantee serializability: rows appear pop into existence the instant they’re committed, and disappear out from under you if a deleting transaction commits.
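The durability-for-speed trade-off behind these knobs can be shown with a toy append-only log. This is a sketch of the general idea, not of PostgreSQL internals: the `append_records` and `read_records` helpers and the file names are assumptions for the example. With `durable=True` every record is forced to the storage device before the next one is written; with `durable=False` the operating system is free to buffer writes, which is faster but can lose recent records on a crash or power failure.

```python
import os
import tempfile
from typing import Iterable, List


def append_records(path: str, records: Iterable[str], durable: bool) -> None:
    """Append records to a log file, optionally forcing each one to disk."""
    with open(path, "ab") as log:
        for record in records:
            log.write(record.encode("utf-8") + b"\n")
            if durable:
                log.flush()             # push Python's buffer to the OS
                os.fsync(log.fileno())  # ask the OS to push it to the device


def read_records(path: str) -> List[str]:
    """Read the log back as a list of records."""
    with open(path, "rb") as log:
        return [line.decode("utf-8").rstrip("\n") for line in log]


with tempfile.TemporaryDirectory() as tmp:
    safe = os.path.join(tmp, "safe.log")
    fast = os.path.join(tmp, "fast.log")
    rows = [f"row-{i}" for i in range(100)]
    append_records(safe, rows, durable=True)
    append_records(fast, rows, durable=False)
    # Without a crash both logs hold the same data; they differ only in
    # what is guaranteed to survive one.
    same = read_records(safe) == read_records(fast)
    print(same)
```

In PostgreSQL the corresponding settings include `fsync` and `synchronous_commit`; which of them is safe to relax depends on how much recent data an application can afford to lose.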

Original title and link: PostgreSQL: How to Make it Faster (NoSQL databases © myNoSQL)

via: http://rhaas.blogspot.com/2010/11/when-your-data-isnt-made-of-gold.html


Disqus: Scaling the World’s Largest Django Application

Good lessons on building high availability services from Disqus, the commenting service:

It is interesting to note that MongoDB is not mentioned anywhere in the talk, even though Disqus is powered by MongoDB. Either MongoDB scaling and high availability weren’t a concern (nb no pun intended, but I doubt that), or MongoDB is not a central piece of the Disqus architecture.

Original title and link: Disqus: Scaling the World’s Largest Django Application (NoSQL databases © myNoSQL)


MySQL is Not ACID Compliant

This is becoming a “trend”:

That’s because you are basically taking your data and vomiting it on the hard drive without any consideration as to if your data you are writing is sensible or simply dreamed up by magic pixies.

If you missed it, make sure you watch MongoDB is Web Scale.

Original title and link for this post: MySQL is Not ACID Compliant (published on the NoSQL blog: myNoSQL)


Oracle impact on the Open Source Relational Databases

Cheap:

Oracle has shut down servers Sun Microsystems was contributing to the build farm for open source database software, PostgreSQL, forcing enthusiasts to scramble to find new hosts to test updates to their software on the Solaris operating system.

Keep in mind that these were 3 (three) servers. Not 300, not even 30.

The fact that I cover NoSQL databases doesn’t mean that I don’t care about relational databases or that we will not need them. Having a healthy open source relational database ecosystem is essential. And please don’t say or even think that this will help the NoSQL community in any way!

via: http://www.itnews.com.au/News/221051,oracle-shuts-down-open-source-test-servers.aspx