A while ago I’ve read in a PR announcement about Infolinks’, an in-text advertising company, usage of Hadoop and HBase. Lior Schachter, Infolinks Software Architect, has been kind to answer my questions.
What was Infolinks using before going the NoSQL route with HBase and Hadoop?
Our architecture uses a proprietary solution for logs collection and aggregation (similar to Scribe/Flume). This architecture is OSGI based and currently handles billions of lines of logs (per day) from our distributed ad network. On top of this we have 2 kinds of statistics:
- Well-defined real-time statistics: MySQL based.
- Reports: data-mining on the raw logs.
In the past we used a single machine with OSGI server to process these kinds of statistics.
The bottlenecks we identified were in the report generation:
- The amount of data we needed to process constantly grew (longer time to process).
- The number of reports requested by our marketing/support/R&D teams also increased.
We reached the point where generating a single report generation took hours. We needed to find a solution for the scalability problem which would allow us to run hundreds of reports per day on fine grain data (URL, sentence). The Hadoop/HBase framework fits these needs perfectly.
Could you describe how Infolinks is using HBase and Hadoop?
We use Hadoop as our report generation engine - running M/R jobs on raw logs. We have a dedicated GUI for defining reports which are then translated to M/R configuration. We also use M/R to insert data into HBase.
We use HBase to hold aggregation data per day on various aspects of our system (e.g. URLs, Sentences). This data is used as feedback to our front-end servers (which analyze the text and serves the advertisements). This feedback has proven to be very effective. We also allow business experts to query HBase in order to get deeper insights on the system.
In contrast to the Hadoop reports which are asynchronous (you order a report and get it by mail), the HBase dashboard is very fast and supports synchronous results. Our business team loves it. No waiting!
Have you evaluated other solutions? Why picking up Hadoop and HBase?
We started exploring the NoSQL solutions more than a year ago. We did some research on the available solutions and chose Hadoop/HBase for few reasons:
- Java based
- Open source
- Hadoop - quite mature compared to other Java based solutions. Hadoop is also used by many web companies.
- HBase - using Hadoop (so you get for free Hadoop stability, APIs etc.), like BigTable
We tested this solution for 6 months (as a small cluster) and were very happy with it.
Thanks a lot Lior!
Lior Schachter is a Software Architect with extensive hands-on experience in building and evolving large-scale software systems in the Telco and Internet industries. Lior, joined infolinks in 2009 and serves as a R&D team manager, with the following responsibilities: 1) Real-Time advertising engine; 2) Hadoop and HBase infrastructures. Lior holds a BSC degree in Computer Science and Electrical Engineering with honors, from Tel-Aviv University. ↩
Original title and link: HBase and Hadoop at Infolinks ( ©myNoSQL)