


full text indexing: All content tagged as full text indexing in NoSQL databases and polyglot persistence

Big Data Search: Perfect Search

Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:

  • offers a unique architectural approach that significantly reduces the total computations required to query
  • creates terms and pattern indexes (basically combinations of terms at indexing time)
  • uses jump tables and bloom filters
  • heavily optimizes disk I/O
  • doesn’t require indexes in memory
  • “can often do same query with less than 1% computations”
  • “when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
    • according to the chart, the significant speed improvements are for cached results, while for first-time queries I see speedups of 2x to 59x
    • if Perfect Search is a search engine, why compare it with relational databases?
  • “Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
    • Google is using 100 servers for reliability and guaranteeing the speed of results
  • “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
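
The talk doesn't show how Perfect Search uses bloom filters, so the following is only a generic sketch of the data structure itself: a small bit array that can say "this term is definitely not in this index block" without reading the block from disk. The class name, sizes, and the salted SHA-256 hashing scheme are my own assumptions, not anything taken from Perfect Search.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash probes per item (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True only means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical usage: one filter per on-disk index block.
terms_in_block = BloomFilter()
for term in ["nosql", "search", "index"]:
    terms_in_block.add(term)
```

A query would check `terms_in_block.might_contain(term)` first and skip the disk read whenever the answer is False. False positives are possible, false negatives are not.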

With this preamble, you can watch the video after the break:

Fulltext search your CouchDB in Ruby

When choosing a library for full text indexing of CouchDB data in a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian, and decided to go with Xapian and Xapit. Although Xapian with Xapit offers a clean interface and a customizable indexing process, there seem to be quite a few important limitations:

  • Xapit is still under active development
  • You need to trigger index updates manually
  • It doesn’t support incremental index updates at the moment

I know some are afraid of managing a Java stack, but in the land of indexing, Lucene, Solr, ElasticSearch, and IndexTank are the most powerful tools.

Original title and link: Fulltext search your CouchDB in Ruby (NoSQL database©myNoSQL)


Lucene & Solr Year 2011 in Review

I much prefer reviews to predictions. Even more so when there are so many things worth mentioning among what Lucene and Solr accomplished in 2011:

  • Near Real-Time search (freshly added documents can be immediately made visible in search results)
  • Field collapsing or result grouping
  • faceting module
  • language support

Plus the promise of the SolrCloud:

In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  

On the other hand, last December LinkedIn open sourced IndexTank, a real-time fulltext search-and-indexing system. Some of its features will definitely sound interesting to Lucene and Solr users.



LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr

Today LinkedIn has announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank was offering a hosted, scalable full-text search API.

The projects can already be found on GitHub: indextank-engine (the indexing engine) and the API, BackOffice, Storefront, and Nebulizer.

When reading the announcement, I asked myself two questions: what is IndexTank, and how does IndexTank compare to Lucene and Solr?

The answer to the first one is provided in the post.

What is IndexTank? IndexTank is mainly three things:

  • IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
  • API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
  • Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.

For the second, I turned to the old IndexTank FAQ.

How does IndexTank compare to Lucene and Solr?

  1. IndexTank was a hosted, scalable service
  2. IndexTank can add documents to the index
  3. IndexTank supports updating document variables without re-indexing
  4. IndexTank supports geolocation functions
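
To make the "separate relevance signals from document text" idea a bit more concrete, here is a minimal sketch of what updating document variables without re-indexing could look like. This is not IndexTank's implementation or API: the class, the method names, and the `likes` variable are all made up for illustration.

```python
from collections import defaultdict


class SignalAwareIndex:
    """Toy index that stores text postings and relevance signals separately."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.variables = {}               # doc id -> {signal name: value}

    def add_document(self, doc_id, text):
        # Tokenizing and indexing the text is the expensive part.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def update_variables(self, doc_id, **values):
        # Touches only the signal store; the postings are left alone.
        self.variables.setdefault(doc_id, {}).update(values)

    def search(self, term, boost_by="likes"):
        hits = self.postings.get(term.lower(), set())

        def sort_key(doc_id):
            signal = self.variables.get(doc_id, {}).get(boost_by, 0)
            return (-signal, doc_id)  # highest signal first, id breaks ties

        return sorted(hits, key=sort_key)


index = SignalAwareIndex()
index.add_document("a", "full text search for CouchDB")
index.add_document("b", "hosted search API")
before = index.search("search")      # no signals yet, so ordered by id
index.update_variables("b", likes=12)
after = index.search("search")       # "b" now ranks first, no re-indexing needed
```

The point of the split is that a burst of likes or shares only rewrites a small per-document record, while the text index stays untouched.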

For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.

Happy hacking!


MarkLogic, LexisNexis, XML, and Search

The lessons to be learned from the story about LexisNexis and MarkLogic—GigaOm and PR announcement—are quite simple:

  • Put XML into an XML database, objects into an Object Database, JSON into a document database, relational data into a relational database and you’ll get the best results
  • the better the data store understands the structure of your data, the better search results should be


Is MarkLogic a Search Engine?

Praise for MarkLogic:

What sets MarkLogic apart is that it is not just a search engine.  MarkLogic combines some of the best features of search with a fast performing XML database.  This combination allows MarkLogic to offer features that traditional search engines lack.  Four of the most important differentiators are:

  • multi-level searching,
  • editable search results,
  • schema flexibility,
  • and simplified architectures.

I can see how all these features can be useful for some use cases. But as a side comment, none of these features clarifies why MarkLogic is not only a search engine.



Groonga: Open-Source Fulltext Search Engine and Column Store


Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that require fulltext search.

Most of the documentation is in Japanese; the only thing I’ve found is a set of slides:

Full Text Search: What to Use?

A problem everyone using a NoSQL database faces (nb: actually I think this applies to most storage engines that don’t support full text indexing):

The problem now is: what to use? Currently I’m toying with 3 options:

  1. Use Sphinx Search; it’s pretty powerful, pretty damn fast, but requires me to feed it data through XML, but only when the indexer runs. Basically it’s quite hard to get real-time indexes going, and the delta updates are something I’d rather not mess with. 
  2. Use Solr; I’d go for this if it wasn’t for the fact it’s Java and requires Tomcat to work. Our entire application infrastructure is basically MongoDB and Perl, and I don’t want to go and set up a Tomcat instance just for Solr; on top of which I have a pathologically deep hatred for Java, but that aside…
  3. Roll my own. Full text search the way we need it doesn’t actually require things like stemming or fancy analysis of things. What it does need is the ability to search a schema-less database… Solr and Sphinx both suffer from the fact you need to tell them what to index, and even then you run into the fact that it’ll need a double pass. First pass is getting the search results, and the second pass entails the checking to see whether the user doing the search can actually see the document. 

Couple of thoughts:

  1. There are a couple of solutions out there, among both relational and NoSQL databases, that support different degrees of full text indexing (e.g. Riak Search, MarkLogic).
  2. Even if your database supports some form of full text search, the implementation might not be complete or optimal.
  3. Initially it may sound like building your own reverse index is the best solution. Twitter’s story of migrating from their own reverse indexes in MySQL to a Lucene-based solution should change your mind.
  4. Some NoSQL databases provide good mechanisms for enabling full text indexing: Riak has post-commit hooks, CouchDB has a _changes feed.
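
As a rough illustration of that last point, here is a sketch of an index kept up to date from a change feed instead of being rebuilt by a batch indexer. The rows only mimic the general shape of what CouchDB's _changes feed returns when documents are included (`seq`, `id`, `doc`, and `deleted` for removals). The `body` field and the `IncrementalIndex` class are hypothetical, and nothing here talks to a real database.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index updated one change at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> doc ids
        self.doc_terms = {}               # doc id -> terms currently indexed
        self.last_seq = None              # checkpoint to resume the feed from

    def _remove(self, doc_id):
        # Drop whatever was previously indexed for this document.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def apply_change(self, change):
        doc_id = change["id"]
        self._remove(doc_id)
        if not change.get("deleted"):
            terms = set(change["doc"].get("body", "").lower().split())
            self.doc_terms[doc_id] = terms
            for term in terms:
                self.postings[term].add(doc_id)
        self.last_seq = change["seq"]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


# Simulated feed rows; a real consumer would read these over HTTP.
changes = [
    {"seq": 1, "id": "post-1", "doc": {"body": "Riak Search explained"}},
    {"seq": 2, "id": "post-2", "doc": {"body": "Search your CouchDB"}},
    {"seq": 3, "id": "post-1", "doc": {"body": "Riak hooks explained"}},
    {"seq": 4, "id": "post-2", "deleted": True},
]

index = IncrementalIndex()
for change in changes:
    index.apply_change(change)
```

Storing `last_seq` is what lets the indexer restart after a crash and continue from where it left off, which is the part a nightly batch indexer doesn't give you.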



ThriftDB: The Amazon Web Services of Search

ThriftDB presented today at TechCrunch Disrupt:

Technically speaking, ThriftDB is a flexible key-value datastore with search built in that has the flexibility, scalability, and performance of a NoSQL datastore with the capabilities of full-text search. Essentially, what this means is that, by combining the datastore and the search engine, ThriftDB is offering a service that makes it easy for developers to build fast, horizontally-scalable applications with integrated search.

The website says ThriftDB is a document database built on top of Thrift with full-text search support. I’m not really sure about the Amazon Web Services for Search, but it sounds like it would go against Marklogic, ElasticSearch, Solr, and so on.



Riak Search Explained

35 minutes of Riak Search with Dan Reverri which will walk you from the Riak Search basics to running a sample application:

Mark Phillips


The Joy of Indexing

Patrick Durusau:

Ask yourself, what is the index in Kyle’s examples indexing?

Kyle says the examples are indexing recipes but is that really true?

Or is it the case that the index is indexing the occurrence of a string at a location in the text?

Not exactly the same thing.

That is to say there is a difference between a token that appears in a text and a subject we think about when we see that token.

Interesting point. Basic indexing discards both contextual and semantic attributes; all that is left is keywords. Adding back any contextual or semantic metadata will increase the relevance of results.
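
A tiny sketch makes Patrick's distinction visible. The documents and ids below are invented; the index is the most basic positional inverted index I could write.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to the (doc id, position) pairs where the string occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index


docs = {
    "recipe-1": "apple pie with cinnamon",
    "note-7": "the apple keynote was long",
}
index = build_positional_index(docs)
apple_hits = index["apple"]
```

Here `apple_hits` lists an occurrence in both documents, even though only one of them is about the fruit. The index knows where the string "apple" appears, not which subject it refers to.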



HSearch: NoSQL Search Engine Built on HBase

Cassandra has Solandra (formerly Lucandra), Riak has Riak Search, and HBase has HSearch.

HSearch features include:

  • Multi-XML formats
  • Record and document level search access control
  • Continuous index updates
  • Parallel indexing using multi-machines
  • Embeddable inside application
  • A REST-ful Web service gateway that supports XML
  • Auto sharding
  • Auto replication
