Starting from the architecture of Facebook’s realtime analytics presented in the paper “Apache Hadoop Goes Realtime at Facebook” and Dhruba Borthakur’s excellent posts “HDFS: Realtime Hadoop” and “HBase Usage at Facebook”, Nati Shalom describes an alternative approach to real-time analytics built on data grids. The starting point is the set of assumptions behind Facebook’s design:
They had some assumptions in design that centered around the reliability of in-memory systems and database neutrality that affected what they did: for memory, that transactional memory was unreliable, and for the database, that HBase was the only targeted data store.
What if those assumptions are changed? We can see reliable transactional memory in the field, as a requirement for any in-memory data grid, and certainly there are more databases than HBase; given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?
While it’s a great read, I get the feeling there’s something wrong. Maybe this:
There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook’s working system: […] We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.
One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.
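To make the mechanism in that last quote concrete, here is a minimal sketch of a grid that answers reads from memory and persists updates in the background. Asynchronous persistence of this kind is more commonly called write-behind; write-through usually means the backend write completes before the update returns. Everything here is an assumption for illustration only: `WriteBehindGrid`, `DictStore`, and the `page:42:likes` key are hypothetical names, not the API of any real data grid or of Facebook’s system.

```python
import queue
import threading


class DictStore:
    """Hypothetical stand-in for a backend store (HBase, Cassandra, MySQL, ...)."""

    def __init__(self):
        self.rows = {}

    def save(self, key, value):
        self.rows[key] = value


class WriteBehindGrid:
    """Toy in-memory grid: reads hit memory, writes reach the store later."""

    def __init__(self, store):
        self._data = {}
        self._store = store
        self._pending = queue.Queue()
        # Background worker that drains queued updates into the backend store.
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        # Update memory first, then queue the write; the caller does not
        # wait for the backend store.
        self._data[key] = value
        self._pending.put((key, value))

    def get(self, key):
        return self._data.get(key)

    def flush(self):
        # Block until every queued update has been handed to the store.
        self._pending.join()

    def _drain(self):
        # No error handling or retries: a real system would need both.
        while True:
            key, value = self._pending.get()
            self._store.save(key, value)
            self._pending.task_done()


store = DictStore()
grid = WriteBehindGrid(store)
grid.put("page:42:likes", 1)
grid.put("page:42:likes", 2)
grid.flush()
```

The catch with the asynchronous variant is the window between `put` and the background write: if the process dies in that window, the backend never sees the update, which is why the reliability assumptions quoted above matter so much.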
Original title and link: An Alternative Approach for Big Data Real Time Analytics (©myNoSQL)