Lucene: All content tagged as Lucene in NoSQL databases and polyglot persistence
Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:
- offers a unique architectural approach that significantly reduces the total computations required to query
- creates terms and pattern indexes (basically combinations of terms at indexing time)
- uses jump tables and bloom filters
- heavily optimizes disk I/O
- doesn’t require indexes in memory
- “can often do same query with less than 1% computations”
“when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
- according to the chart, the largest speed improvements are for cached results, while for first-time queries I see numbers from 2x to 59x
- if Perfect Search is a search engine, why compare it with relational databases?
“Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
- Google is using 100 servers for reliability and guaranteeing the speed of results
- “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
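The talk does not show how Perfect Search uses its bloom filters, so the following is only a minimal sketch of the general technique, under stated assumptions: the `BloomFilter` class, the per-block layout, and the sizes are all hypothetical. It shows how a small in-memory filter can rule out index blocks that definitely do not contain a term, which is one way to avoid disk reads.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, term):
        # Derive num_hashes positions by salting the term with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = 1

    def might_contain(self, term):
        return all(self.bits[pos] for pos in self._positions(term))


# Hypothetical layout: one filter per index block. A query skips any block
# whose filter says the term is definitely absent.
blocks = {
    "block-0": ["lucene", "solr", "index"],
    "block-1": ["hadoop", "hdfs", "email"],
}
filters = {}
for name, terms in blocks.items():
    bf = BloomFilter()
    for term in terms:
        bf.add(term)
    filters[name] = bf

candidates = [name for name, bf in filters.items() if bf.might_contain("hdfs")]
print(candidates)
```

A block that really holds the term is always among the candidates. A block that does not hold it is usually, but not always, skipped.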
With this preamble, you can watch the video after the break:
Sunil Sitaula posted two articles on Cloudera’s blog about archiving emails on Hadoop: part 1 and part 2. But even after reading the posts twice, I couldn’t find a clear answer to the question: why would one do it this way?
Sunil provides a general explanation, but the two articles fail to present the real advantages of using Hadoop for solving this problem.
Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.
Let’s look again at the problem and see what the requirements are:
- store a large and continuously growing number of messages
- retrieve messages either directly (key-based access) or by searches (full text indexing)
HDFS, the underlying storage of Hadoop, would bring to the table a reliable, scalable, and cost-effective storage solution. But using HDFS would also require a custom ETL process to transform email messages into something that can be stored in HDFS. The first post describes one approach:
If you are dealing with millions of files, one way of sharing (partitioning them) would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This will limit the number of message files you need to put into HDFS to something that is more suitable, 1-2 million at a time given the NameNode memory footprint of each file.
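The posts do not include the batching code, so here is a minimal sketch of the grouping step, assuming each message is available as a hypothetical `(message_id, received, body)` tuple. Each resulting batch would then be written as one sequence file of key/value records, so HDFS tracks one file per period instead of one per message.

```python
from collections import defaultdict
from datetime import datetime


def batch_key(received, granularity="month"):
    """Return the name of the batch a message belongs to."""
    if granularity == "day":
        return received.strftime("%Y-%m-%d")
    if granularity == "week":
        year, week, _ = received.isocalendar()
        return f"{year}-W{week:02d}"
    if granularity == "month":
        return received.strftime("%Y-%m")
    raise ValueError(f"unknown granularity: {granularity}")


def group_messages(messages, granularity="month"):
    """Group (message_id, received, body) tuples into per-period batches."""
    batches = defaultdict(list)
    for message_id, received, body in messages:
        batches[batch_key(received, granularity)].append((message_id, body))
    return dict(batches)


# Made-up sample messages, for illustration only.
messages = [
    ("m1", datetime(2011, 9, 28, 9, 15), "Quarterly report attached"),
    ("m2", datetime(2011, 9, 30, 17, 2), "Lunch on Friday?"),
    ("m3", datetime(2011, 10, 1, 8, 0), "Updated quarterly numbers"),
]

monthly = group_messages(messages, "month")
for period, batch in sorted(monthly.items()):
    print(period, [message_id for message_id, _ in batch])
```

Choosing day, week, or month is the knob the quote refers to: a coarser period means fewer, larger files.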
Nonetheless, a completely different system would be needed to provide access to the stored messages. The second post introduces Lucene and Solr for message retrieval, but setting them up to take advantage of the same infrastructure can get complicated:
Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.
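The "write a new index, then merge" option can be illustrated with a toy inverted index. This is a sketch of the idea only, not Lucene code: `build_index` and `merge_indexes` are hypothetical helpers, and a real deployment would rely on Lucene's own merge support (for example `IndexWriter.addIndexes`) rather than Python dictionaries.

```python
from collections import defaultdict


def build_index(documents):
    """Build a toy inverted index: term -> sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(existing, new):
    """Merge two toy indexes into a fresh one, leaving both inputs untouched."""
    merged = {term: set(ids) for term, ids in existing.items()}
    for term, ids in new.items():
        merged.setdefault(term, set()).update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


# The existing index covers older mail; the new one is built separately
# from a fresh batch, as if written to a new directory.
existing_index = build_index({1: "quarterly report attached", 2: "lunch on friday"})
new_index = build_index({3: "updated quarterly numbers"})

merged_index = merge_indexes(existing_index, new_index)
print(merged_index["quarterly"])  # [1, 3]
```

Because the merge produces a new index instead of modifying the old one in place, it matches a storage layer where files are written once and then replaced.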
Bottom line: the articles suggest using two almost separate tools to solve the initial problem. And that makes me think that a better solution exists.
Original title and link: Hadoop and Solr for Archiving Emails ( ©myNoSQL)
Today LinkedIn announced that it is open sourcing the technology behind IndexTank, a company it acquired back in October. IndexTank offered a hosted, scalable full-text search API.
When reading the announcement, I asked myself two questions: what is IndexTank, and how does IndexTank compare to Lucene and Solr?
The answer to the first one is provided in the post.
What is Index Tank? IndexTank is mainly three things:
- IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
- API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
- Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.
For the second, I turned to the old IndexTank FAQ.
How does IndexTank compare to Lucene and Solr?
- IndexTank was a hosted, scalable service
- IndexTank can add documents to the index
- IndexTank supports updating document variables without re-indexing
- IndexTank supports geolocation functions
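Updating document variables without re-indexing follows from keeping relevance signals apart from the document text. The sketch below is not IndexTank's API: `ToySearchIndex` and its methods are hypothetical and only illustrate the separation. Text goes into an inverted index once, while variables such as a like count live in a separate store that can change without touching the postings.

```python
class ToySearchIndex:
    """Toy index that stores text postings and scoring variables separately."""

    def __init__(self):
        self.postings = {}     # term -> set of document ids
        self.variables = {}    # document id -> {variable name: value}
        self.index_writes = 0  # how many times document text was indexed

    def add_document(self, doc_id, text, variables=None):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        self.variables[doc_id] = dict(variables or {})
        self.index_writes += 1

    def update_variables(self, doc_id, variables):
        # Only the variable store changes; the postings are not rebuilt.
        self.variables[doc_id].update(variables)

    def search(self, term, boost_variable):
        matches = self.postings.get(term.lower(), set())
        return sorted(
            matches,
            key=lambda doc_id: self.variables[doc_id].get(boost_variable, 0.0),
            reverse=True,
        )


index = ToySearchIndex()
index.add_document("a", "open source search engine", {"likes": 5})
index.add_document("b", "hosted search service", {"likes": 1})

before = index.search("search", "likes")
index.update_variables("b", {"likes": 10})  # a burst of likes arrives
after = index.search("search", "likes")

print(before, after, index.index_writes)
```

The ranking changes after the update, but the count of text indexing operations stays the same.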
For more details, there’s a paper by Alejandro Perez covering IndexTank and other search solutions.
Original title and link: LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr ( ©myNoSQL)