The architecture for offline processing of biodiversity data, based on Sqoop, Hadoop, Oozie, and Hive:
And its future:
Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to incrementally grow the richness of the terms indexed in the Portal, while integrating nicely into Hadoop-based workflows. The addition of coprocessors to HBase is of particular interest, as it could further reduce processing latency by eliminating batch processing altogether.
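To make the "grow incrementally" point concrete, here is a minimal sketch of the wide-column idea behind it: column families are declared up front, but each row can carry any set of qualifiers within a family, so a newly indexed term needs no schema migration. This is a plain in-memory stand-in written for illustration, not the HBase client API; the `SparseTable` class, the `term` family, the row keys, and the sample values are all hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Hypothetical in-memory stand-in for a wide-column table.

    Maps row key -> {"family:qualifier": value}. Not an HBase client.
    """

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families or not qualifier:
            raise KeyError(f"unknown column family or missing qualifier: {column!r}")
        # Qualifiers are free-form: any row may introduce a new one.
        self.rows[row_key][column] = value

    def get(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self.rows.get(row_key, {}))


table = SparseTable(families=["term"])

# First harvest: only a couple of terms are indexed (made-up sample values).
table.put("occ-1", "term:scientificName", "Puma concolor")
table.put("occ-1", "term:country", "AR")

# Later harvest: a new term appears; earlier rows are left untouched.
table.put("occ-2", "term:scientificName", "Panthera onca")
table.put("occ-2", "term:elevation", "850")

print(sorted(table.get("occ-1")))
print(sorted(table.get("occ-2")))
```

Rows written before the new term was introduced simply lack that column, which is what lets the set of indexed terms expand one harvest at a time.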
Many companies working with large datasets have to deal with multiple systems and duplicate data between the online services and offline processors. While the infrastructure costs are going down, the costs of complexity are not. The HBase + Hadoop and Cassandra + Brisk combos are starting to address this problem.
Original title and link: Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie (©myNoSQL)