search: All content tagged as search in NoSQL databases and polyglot persistence
Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:
- offers a unique architectural approach that significantly reduces the total computations required to query
- creates term and pattern indexes (basically combinations of terms built at indexing time)
- uses jump tables and bloom filters
- heavily optimizes disk I/O
- doesn’t require indexes in memory
- “can often do same query with less than 1% computations”
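For readers who haven't met Bloom filters before, here is a minimal Python sketch of the general technique. It is not Perfect Search's code: the `BloomFilter` class, the segment names, and the sizes are all hypothetical, and I'm only assuming that filters like this are used to decide which parts of an on-disk index are worth reading at all.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(term)
        )


# Hypothetical on-disk index segments, each summarized by one small filter.
segments = {
    "segment-a": ["nosql", "couchdb", "lucene"],
    "segment-b": ["hadoop", "hbase", "solr"],
}
filters = {}
for name, terms in segments.items():
    bf = BloomFilter()
    for term in terms:
        bf.add(term)
    filters[name] = bf


def segments_to_read(term):
    """Return only the segments whose filter says the term might be present."""
    return [name for name, bf in filters.items() if bf.might_contain(term)]
```

A query for `"hbase"` will always include `segment-b` in `segments_to_read("hbase")`. A segment that doesn't contain the term is usually skipped, but can occasionally slip through as a false positive, which costs an extra read and never a wrong result.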
“when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
- according to the chart, the significant speed improvements are for cached results, while for first-time queries I see numbers ranging from 2 to 59
- if Perfect Search is a search engine, why compare it with relational databases?
“Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
- Google uses 100 servers for reliability and to guarantee the speed of results
- “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
With this preamble, you can watch the video after the break:
HSearch features include:
- Multi-XML formats
- Record and document level search access control
- Continuous index updates
- Parallel indexing across multiple machines
- Embeddable inside an application
- A RESTful Web service gateway that supports XML
- Auto sharding
- Auto replication
In the last week, I’ve seen 3 articles or presentations on Hadoop-based search:
- Hadoop and HBase Optimization for read intensive search applications
- Real-Time Searching of Big Data with Solr and Hadoop
and, embedded below, Sematext’s Search Analytics with Flume and HBase.
Meanwhile, Google went Caffeine to deal with more timely index updates.
As its name implies, couchdb-lucene is based on the well-known Lucene library. While I think that such a solution provides a lot of features and flexibility, my concern is that it also brings additional complexity in terms of scalability, as you’ll need to take care of scaling not only CouchDB, but also your Lucene indexes. On the other hand, indexer uses a much simpler approach and stores the indexes directly in CouchDB, but it is still a prototype.
But that’s just my opinion, so I’m wondering which one of these you would favor.
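To make the second approach a bit more concrete, here is a toy Python sketch of what "storing the indexes directly in the database" can look like. It is not how indexer or couchdb-lucene is implemented: a plain dict stands in for the database, and the `index:` key prefix, `put_document`, and `search` are names I made up for illustration.

```python
import re

# A plain dict stands in for a document database; keys are document IDs.
store = {}


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def put_document(doc_id, text):
    """Save a document and update index entries kept in the same store."""
    store[doc_id] = {"type": "doc", "text": text}
    for term in tokenize(text):
        entry = store.setdefault(
            f"index:{term}", {"type": "index", "doc_ids": []}
        )
        if doc_id not in entry["doc_ids"]:
            entry["doc_ids"].append(doc_id)


def search(term):
    """Look up a single term; the index entry is just another document."""
    entry = store.get(f"index:{term.lower()}")
    return sorted(entry["doc_ids"]) if entry else []


put_document("post-1", "Scaling CouchDB with Lucene")
put_document("post-2", "CouchDB views explained")
```

Because the index entries are ordinary documents, whatever you do to scale or replicate the store covers the index as well. The price is that you get only what you build yourself: this sketch has no ranking, no phrase queries, and no cleanup of stale entries when a document changes.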
To learn more about these projects you can check the following resources:
And for more libraries and projects make sure you check the NoSQL Libraries.