


hive: All content tagged as hive in NoSQL databases and polyglot persistence

What Is Informatica HParser for Hadoop?

Sifting through the PRish announcements related to Informatica HParser, what I’ve figured out so far is:

  • it is the T in ETL
  • a visual tool for creating parsing definitions for formats like web logs, XML, JSON, FIX, SWIFT, HL7, CDR, WORD, PDF, XLS, etc.
  • transformations can be accessed from Hadoop MapReduce, Hive, or Pig
  • the benefits of using HParser come from being able to share the same parsing definitions/transformations in the context of the Hadoop distributed environment
  • HParser tries to provide an optimal transformation solution when streaming, splitting, and processing large files
  • HParser is available in two licensing formats: community and commercial

Original title and link: What Is Informatica HParser for Hadoop? (NoSQL database©myNoSQL)

Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie

The architecture for offline processing of biodiversity data, based on Sqoop, Hadoop, Oozie, and Hive:

Hadoop Sqoop Oozie Hive Biodiversity Indexing

And its future:

Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of coprocessors to HBase is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.

Many companies working with large datasets have to deal with multiple systems and duplicate data between the online services and offline processors. While the infrastructure costs are going down, the costs of complexity are not. The HBase + Hadoop and Cassandra + Brisk combos are starting to address this problem.



Choosing Technologies: The Library of Congress and the Twitter Archive

Remember when everyone was suggesting solutions for Twitter’s architecture? Now the Library of Congress is trying to figure out what technologies to use to store the Twitter archive:

The project is still very much under construction, and the team is weighing a number of different open source technologies in order to build out the storage, management and querying of the Twitter archive. While the decision hasn’t been made yet on which tools to use, the library is testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop.

Note that in terms of storage only HBase is mentioned, although Twitter’s main tweet storage is MySQL.



Experimenting with Hadoop using Cloudera VirtualBox Demo

CDH Mac OS X VirtualBox VM

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hue, Flume, and Whirr all configured and ready to experiment with.

Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.



Apache Hive 0.7.0: Security and Performance

A long, impressive list of new features (notably authorization and authentication support) and improvements in Apache Hive 0.7.0, released at the end of March.


DataStax Hadoop on Cassandra Brisk Released

DataStax kept its promise and released Brisk: the Hadoop and Hive distribution using Cassandra, also known as Brangelina.

According to the official documentation, Brisk’s key advantages are:

  • No single point of failure
  • streamlined setup and operations
  • analytics without ETL
  • full integration with DataStax OpsCenter

Brisk Architecture



Adopting Apache Hadoop and Hive

Moving Federal Gov analytics from MySQL to Hadoop and Hive:

HDFS offered us a distributed, resilient, and scalable filesystem while Hadoop promised to bring the work to where the data resided so we could make efficient use of local disk on multiple nodes. Hive, however, really pushed our decision in favor of a Hadoop-based system. Our data is just unstructured enough to make traditional RDBMS schemas a bit brittle and restrictive, but has enough structure to make a schema-less NoSQL system unnecessarily vague. Hive let us compromise between the two — it’s sort of a “SomeSQL” system.



How Digg is Built? Using a Bunch of NoSQL technologies

The picture should speak for Digg’s polyglot persistence approach:

Digg Data Storage Architecture

But here is also a description of the data stores in use:

Digg stores data in multiple types system depending on the type of data and the access patterns, and also for historical reasons in some cases :)

  • Cassandra: The primary store for “Object-like” access patterns for such things as Items (stories), Users, Diggs and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here. […]

  • HDFS: Logs from site and API events, user activity. Data source and destination for batch jobs run with Map-Reduce and Hive in Hadoop. Big Data and Big Compute!

  • MySQL: This is mainly the current store for the story promotion algorithm and calculations, because it requires lots of JOIN heavy operations which is not a natural fit for the other data stores at this time. However… HBase looks interesting.

  • Redis: The primary store for the personalized news data because it needs to be different for every user and quick to access and update. We use Redis to provide the Digg Streaming API and also for the real time view and click counts since it provides super low latency as a memory-based data storage system.

  • Scribe: the log collecting service. Although this is a primary store, the logs are rotated out of this system regularly and summaries written to HDFS.

I know this will sound strange, but isn’t that too much in there?
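The Cassandra bullet above mentions secondary indexes that are computed by application logic and stored alongside the data. A minimal sketch of that general pattern follows; it uses plain in-memory dictionaries as a stand-in for the store, and the `UserStore` class, the `city` field, and the method names are hypothetical illustrations, not Digg’s actual schema or code.

```python
from collections import defaultdict


class UserStore:
    """In-memory stand-in for a key-value store without secondary indexes."""

    def __init__(self):
        self.rows = {}                    # primary data: user_id -> row dict
        self.by_city = defaultdict(set)   # hand-maintained index: city -> user_ids

    def put(self, user_id, row):
        """Write a row and keep the hand-maintained index in sync."""
        old = self.rows.get(user_id)
        if old is not None:
            # Drop the stale index entry before adding the new one.
            self.by_city[old["city"]].discard(user_id)
        self.rows[user_id] = row
        self.by_city[row["city"]].add(user_id)

    def find_by_city(self, city):
        """Look up user ids through the index instead of scanning all rows."""
        return sorted(self.by_city.get(city, ()))


store = UserStore()
store.put("u1", {"name": "Ann", "city": "SF"})
store.put("u2", {"name": "Bob", "city": "SF"})
store.put("u1", {"name": "Ann", "city": "NYC"})  # update moves u1 in the index
print(store.find_by_city("SF"))
```

The cost of this approach is that every write path has to update the index too, which is the kind of complexity the question above is hinting at.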




Cassandra + Hadoop = Brisk by DataStax

I just heard the announcement that DataStax, the company offering Cassandra services, made about Brisk, a Hadoop and Hive distribution built on top of Cassandra:

Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by Cassandra.

Brisk was announced officially during the MapReduce panel at the Structure Big Data event. But it looks like others have already had a chance to hear about Brisk. Is there something that I should be doing to hear the “unofficial” announcements?

DataStax has also made available a whitepaper: “Evolving Hadoop into a Low-Latency Data Infrastructure: Unifying Hadoop, Hive and Apache Cassandra for Real-time and Analytics” that you can download from here


Hadoop, Hive and Redis for Foursquare Analytics

Foursquare’s move from querying the production databases to a data analytics system using Hadoop and Hive with Redis playing the role of a cache:

  • Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
  • Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
  • Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
  • Add new data in a simple way (just put it in Amazon S3!).
  • Analyse data from several data sources (mongodb, postgres, log-files).

Foursquare Analytics Architecture

One of the most often heard complaints about NoSQL databases is about their reduced querying capabilities. Running reports and analysis against the production servers is only going to work when you have little data and the set of queries is limited and stable over time. Otherwise you’ll want to run these against a copy of your data to avoid bringing down production databases and avoid corrupting data.
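The idea of caching query results so that reports don’t hit the cluster on every view can be sketched roughly as follows. This is only an illustration under stated assumptions: the cache is an in-memory dictionary standing in for Redis, `run_query` is a hypothetical callable standing in for a Hive query, and the `QueryCache` class and its TTL policy are not taken from Foursquare’s setup.

```python
import hashlib
import time


class QueryCache:
    """Cache query results keyed by a hash of the query text, with a TTL."""

    def __init__(self, run_query, ttl_seconds=3600, clock=time.time):
        self.run_query = run_query   # hypothetical: executes SQL, returns rows
        self.ttl = ttl_seconds
        self.clock = clock           # injectable clock, handy for testing
        self.entries = {}            # key -> (expires_at, result)

    def _key(self, sql):
        # Hash the trimmed query text so keys have a fixed, short length.
        return hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()

    def get(self, sql):
        key = self._key(sql)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # fresh cached result, no cluster needed
        result = self.run_query(sql)  # slow path: run the query
        self.entries[key] = (now + self.ttl, result)
        return result


def fake_hive(sql):
    """Stand-in for a slow analytics query (assumption for this example)."""
    return [("checkins", 42)]


cache = QueryCache(fake_hive, ttl_seconds=600)
print(cache.get("SELECT COUNT(*) FROM checkins"))
print(cache.get("SELECT COUNT(*) FROM checkins"))  # served from the cache
```

With a cache like this in front of the reports, the Hadoop cluster only needs to be up when a result is missing or has expired.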



Cloudera’s Distribution for Apache Hadoop version 3 Beta 4

New version of Cloudera’s Hadoop distro — complete release notes available here:

CDH3 Beta 4 also includes new versions of many components. Highlights include:

  • HBase 0.90.1, including much improved stability and operability.
  • Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
  • Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
  • Flume 0.9.3, including support for Windows and improved monitoring capabilities.
  • Sqoop 1.2, including improvements to usability and Oracle integration.
  • Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.

Plus many scalability improvements contributed by Yahoo!.

Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.



The Backstory of Yahoo and Hadoop

We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we’ve invested nearly 300 person-years into these projects. […] Today Yahoo runs on over 40,000 Hadoop machines (>300k cores). They are used by over a thousand regular users from our science and development teams. Hadoop is at the center of our research in search, advertising, spam detection, personalization and many other topics.

I assume it’s no surprise to anyone that I’m a big fan of Yahoo!’s open source initiatives.
