Some notes and questions about ☞ ReadPath’s usage of HBase and Hadoop.
The dictionary processing went from a system that was having trouble keeping up with the incoming stream of content (ReadPath adds ~1,500 new items / minute) to one that could completely rebuild a dictionary from 250 Million content items in under 3 hours (this equates to ~1,400,000 items / minute).
This sounds extremely cool, but I’d have appreciated more details about the hardware involved. The only mention is that they are currently using an 8-node HBase/Hadoop cluster. More useful stats would be: initial items/node vs. current items/node, lines of code for the initial algorithm vs. the HBase/Hadoop version, etc.
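As a rough sanity check on the quoted numbers, here is a back-of-the-envelope sketch in Python. The item count, rebuild time, and incoming rate come from the quote. The per-node figure assumes the rebuild ran evenly across all 8 nodes, which the original post does not confirm, and the variable names are mine.

```python
# Figures quoted in the ReadPath post
total_items = 250_000_000      # content items in the dictionary rebuild
rebuild_hours = 3              # "under 3 hours", so rates below are lower bounds
incoming_per_minute = 1_500    # new items ReadPath adds per minute

# Assumption (not stated in the post): the rebuild was spread
# evenly over the whole 8-node HBase/Hadoop cluster.
nodes = 8

# Overall rebuild throughput in items per minute
rebuild_per_minute = total_items / (rebuild_hours * 60)

# Throughput per node under the even-spread assumption
per_node_per_minute = rebuild_per_minute / nodes

# How many times faster the rebuild is than the incoming stream
headroom = rebuild_per_minute / incoming_per_minute

print(f"rebuild throughput: {rebuild_per_minute:,.0f} items/minute")
print(f"per node (assumed): {per_node_per_minute:,.0f} items/minute")
print(f"headroom over incoming stream: {headroom:,.0f}x")
```

That works out to roughly 1.39 million items/minute overall (consistent with the ~1,400,000 in the quote), about 174,000 items/minute per node if the even-spread assumption holds, and a bit over 900 times the incoming rate. Without the pre-HBase hardware numbers, though, there is nothing to compare the per-node figure against.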
One of the main items that was keeping me from pulling the trigger on porting to HBase was concerns about data loss. In my first day of playing with HBase, I had a bad server take out the .META. table and result in complete loss of HBase tables. I pulled that server and haven’t had any data loss since […]
If you thought using a NoSQL solution would automatically take care of backup and restore for you, you were wrong.
Next steps include […] looking at using HBase for the link graph system that ReadPath needs to sort items. The link graph is a much more difficult system, the read/write pattern is completely random which blows away any caching. In preliminary tests, the system ends up being disk bound.
I wonder whether a graph database wouldn’t be a better fit for this scenario. But then, I am not sure how well MapReduce/Hadoop is supported in graph databases.