The naive approach of fetching all the required data at the Client and then processing it locally should, in a distributed setting, be limited to trivial tasks operating on a tiny subset. There are two fundamental reasons for that. First, it generates a lot of network exchanges, needlessly consuming resources and sometimes leading to unacceptable response times. Second, centralizing all the information and then processing it simply misses all the advantages brought by a powerful cluster of hundreds or even thousands of machines. The lesson is simple: when you deal with Big Data, the data center is your computer.
Great and concise explanation of the pre-packaged HBase filters and their advantages by Philippe Rigaux:
Compare with the well-known SQL world. When you express a SELECT-FROM-WHERE query, you restrict the number of rows (with the “WHERE” clause) and the number of columns for each row (with the “SELECT” clause). Filters in HBase let you do both: fully ignore some rows, and for those rows that pass, restrict the family, columns, or timestamps. This ties back to the underlying motivation: limit as much as possible the network bandwidth used to communicate with the client application.
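To make the SELECT/WHERE analogy concrete, here is a minimal sketch in plain Python. It is a toy model, not the HBase client API: the `scan` function, the `TABLE` contents, and the `family:qualifier` column names are all made up for illustration. It only shows how filtering rows and restricting columns on the server side shrinks what has to cross the network, compared with shipping everything to the client.

```python
from typing import Callable, Dict, Optional, Set

# Toy stand-ins (assumptions, not HBase types): a row maps
# "family:qualifier" column names to values; a table maps row keys to rows.
Row = Dict[str, str]
Table = Dict[str, Row]
RowFilter = Callable[[str, Row], bool]


def scan(table: Table,
         row_filter: Optional[RowFilter] = None,
         columns: Optional[Set[str]] = None) -> Table:
    """Simulated server-side scan.

    Rows rejected by `row_filter` are ignored entirely (the WHERE part);
    for rows that pass, only `columns` are returned (the SELECT part).
    """
    result: Table = {}
    for key, row in table.items():
        if row_filter is not None and not row_filter(key, row):
            continue  # the whole row is skipped, nothing is shipped for it
        kept = {c: v for c, v in row.items() if columns is None or c in columns}
        if kept:
            result[key] = kept
    return result


def cells_shipped(rows: Table) -> int:
    """Count the cells a result would send to the client."""
    return sum(len(row) for row in rows.values())


def in_paris(key: str, row: Row) -> bool:
    return row.get("info:city") == "Paris"


# Hypothetical sample data.
TABLE: Table = {
    "user1": {"info:name": "Ann", "info:city": "Paris", "stats:visits": "12"},
    "user2": {"info:name": "Bob", "info:city": "Lyon", "stats:visits": "3"},
    "user3": {"info:name": "Cy", "info:city": "Paris", "stats:visits": "40"},
}

# Naive approach: ship every cell, then select at the client.
everything = scan(TABLE)
naive = {k: {"info:name": r["info:name"]}
         for k, r in everything.items() if in_paris(k, r)}

# Filtered approach: the selection happens where the data lives.
filtered = scan(TABLE, row_filter=in_paris, columns={"info:name"})

print("naive cells shipped:   ", cells_shipped(everything))
print("filtered cells shipped:", cells_shipped(filtered))
print("same answer:", naive == filtered)
```

Both approaches produce the same answer; the difference is only in how many cells leave the "server". In a real HBase deployment the same idea is expressed by attaching a filter to a scan rather than by a Python callback.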
Original title and link: HBase Filters Explained: Let HBase Do the Data Selection Job for You (©myNoSQL)