nutch: All content tagged as nutch in NoSQL databases and polyglot persistence
Wednesday, 14 July 2010
Nutch Now Using HBase
Just another example of NoSQL database adoption: an almost two-year-old patch providing Nutch[1] integration with HBase just went into Nutch trunk. The ☞ ticket page details the reasons for having Nutch work with HBase:
- All your data in a central location
- No more segment/crawldb/linkdb merges.
- No more “missing” data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don’t have access to a URL’s fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
- A much simpler data model. If you want to update a small part in a single record, now you have to write a MR job that reads the relevant directory, change the single record, remove old directory and rename new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language may be easier for people to understand and use.
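
To make the last two points concrete, here is a toy sketch in Python. It is not the HBase client API and not Nutch code: the table is just a dictionary keyed by URL, and the names `put`, `rewrite_with_change`, `parse` and the column names such as `fetch:status` are made up for illustration. It only contrasts rewriting a whole structure to change one record with updating that record in place, and shows the parse step reading the fetch status from the same row as the content.

```python
from typing import Dict

# Toy model only: row key (a URL) -> column name -> value.
# This is not the HBase client API; every name here is hypothetical.
Row = Dict[str, str]
Table = Dict[str, Row]


def put(table: Table, url: str, column: str, value: str) -> None:
    """Update one cell of one row in place, creating the row if needed."""
    table.setdefault(url, {})[column] = value


def rewrite_with_change(old: Table, url: str, column: str, value: str) -> Table:
    """File-style update: copy every record into a new structure,
    changing only the one record we care about."""
    new: Table = {}
    for key, row in old.items():
        copied = dict(row)
        if key == url:
            copied[column] = value
        new[key] = copied
    return new  # the caller would then swap the old structure for this one


def parse(table: Table, url: str) -> bool:
    """Parse a page only if it was fetched; the status is read from the
    same row as the content, so nothing has to be copied between jobs."""
    row = table.get(url, {})
    if row.get("fetch:status") != "fetched":
        return False
    # Upper-casing stands in for real parsing work.
    put(table, url, "parse:text", row.get("content:raw", "").upper())
    return True


pages: Table = {}
put(pages, "http://example.com/a", "fetch:status", "fetched")
put(pages, "http://example.com/a", "content:raw", "hello")
put(pages, "http://example.com/b", "fetch:status", "failed")

parsed_a = parse(pages, "http://example.com/a")
parsed_b = parse(pages, "http://example.com/b")
```

In this sketch `put` touches exactly one row, while `rewrite_with_change` has to walk and copy every record to change one, which is roughly the cost the ticket attributes to the directory-based layout.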