search: All content tagged as search in NoSQL databases and polyglot persistence

eBay, Wal-Mart Search for Revved-Up Search Engines

Reuters reporting about eBay and Wal-Mart’s work to improve their search engines:

The search engine project takes time because eBay’s online marketplace has so much variable information from millions of listings that are described differently by each seller - something known as unstructured data in the tech world.

This is not much of a NoSQL story, but there’s something worth reading between the lines: when talking about building better search solutions, making search work at scale is never mentioned, implying it is considered a solved problem. The focus is on handling unstructured data and creating better relevancy algorithms.

I have no details about the architecture of the new version of eBay search, but I have found this diagram of eBay’s Voyager in a slidedeck by Dan Pritchett from around 2007:

Scaling Search Voyager

Original title and link: eBay, Wal-Mart Search for Revved-Up Search Engines (NoSQL databases © myNoSQL)

via: http://www.reuters.com/assets/print?aid=USBRE84319420120504


Big Data Search: Perfect Search

Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:

  • offers a unique architectural approach that significantly reduces the total computations required to query
  • creates term and pattern indexes (basically combinations of terms built at indexing time)
  • uses jump tables and Bloom filters (a toy Bloom filter sketch follows this list)
  • heavily optimizes disk I/O
  • doesn’t require indexes in memory
  • “can often do same query with less than 1% computations”
  • “when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
    • according to the chart, the significant speed improvements are for cached results, while for first-time queries I see numbers from 2 to 59
    • if Perfect Search is a search engine, why compare it with relational databases?
  • “Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
    • Google uses 100 servers for reliability and to guarantee the speed of results
  • “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
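
Since the list leans on index-level tricks, here is a toy Bloom filter in Python to make one of them concrete: a per-block filter lets a query skip index blocks that definitely don’t contain a term, which is the kind of disk I/O saving hinted at above. This is a generic sketch with arbitrary sizes and hash choices, not Perfect Search’s implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, term):
        # Derive several bit positions from a single MD5 digest (arbitrary choice).
        digest = hashlib.md5(term.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 3:i * 3 + 3], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(term))

# One filter per on-disk index block: a query consults the filter first and
# only reads blocks that might contain the term.
block_filter = BloomFilter()
for term in ("nosql", "search", "hbase"):
    block_filter.add(term)

print(block_filter.might_contain("search"))    # True
print(block_filter.might_contain("postgres"))  # False (barring a false positive)
```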

With this preamble, you can watch the video after the break:


Is MarkLogic a Search Engine?

A praise to MarkLogic:

What sets MarkLogic apart is that it is not just a search engine.  MarkLogic combines some of the best features of search with a fast performing XML database.  This combination allows MarkLogic to offer features that traditional search engines lack.  Four of the most important differentiators are:

  • multi-level searching,
  • editable search results,
  • schema flexibility,
  • and simplified architectures.

I can see how all these features can be useful for some use cases. But as a side comment, none of these features clarifies why MarkLogic is more than just a search engine.

Original title and link: Is MarkLogic a Search Engine? (NoSQL databases © myNoSQL)

via: http://blogs.avalonconsult.com/blog/search/is-marklogic-a-search-engine/


Groonga: Open-Source Fulltext Search Engine and Column Store

Groonga:

Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that require fulltext search.
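
Not from the slides, but to give a rough idea of what talking to it looks like: Groonga also ships with an HTTP server, and a full-text query can be issued as a plain HTTP request. The table and column names below are made up, and the exact parameters may differ between versions, so treat this as a sketch.

```python
import json
import urllib.request

# Hypothetical query against a local Groonga HTTP server (default port 10041);
# the "Entries" table and "content" column exist only for this sketch.
url = (
    "http://localhost:10041/d/select"
    "?table=Entries"
    "&match_columns=content"
    "&query=fulltext+search"
    "&output_columns=_key,content"
)

with urllib.request.urlopen(url) as response:
    result = json.load(response)

# Groonga responds with JSON: a status/timing header followed by the result body.
print(json.dumps(result, ensure_ascii=False, indent=2))
```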

Most of the documentation is in Japanese; the only thing I’ve found is a set of slides:


Recipe for a Distributed Realtime Tweet Search System

Ingredients:

  • Voldemort
  • Kafka
  • Sensei
  • a couple of servers
  • a large quantity of tweets

Method:

  1. Place Voldemort, Kafka, and Sensei on a couple of servers.
  2. Arrange them with taste:

    Chirp Architecture

  3. Spray a large quantity of tweets on the system.

Preparation time:

24 hours

Notes:

For more servings, add the appropriate number of servers.

Result:

Chirper on Github

Reviews

  • One design choice was whether to let the process that writes to Voldemort also be just another Kafka consumer. Although that would be cleaner, we would risk a data race where search returns hits before the corresponding tweets have been added to Voldemort. By making sure a tweet is first added to Voldemort, we can rely on it being the authoritative storage for our tweets (a minimal sketch of this write ordering follows this list).
  • You may have already realized that Kafka is acting as a proxy for the Twitter stream, and that we could have streamed tweets directly into the search system, bypassing the Kafka layer. What we would be missing is the ability to play back tweet events from a specific checkpoint. One really nice feature of Kafka is that you can keep a consumption point and have data replayed from it, which makes reindexing possible in cases such as data corruption, schema changes, etc. Furthermore, to scale search, we would have a growing number of search nodes consuming from the same Kafka stream; the fact that adding consumers does not affect Kafka’s throughput really helps in scaling the entire system.
  • Another important design decision was using Voldemort for storage. An alternative would be to store tweets in the search index instead, e.g. as Lucene stored fields. The benefits of that approach would be stronger consistency between search and store, and the stored data would follow the retention policy defined by the search system. However, besides the fact that Lucene stored fields are nowhere near as optimal as a Voldemort cluster (an implementation issue), there are more convincing reasons:
    • We can first see that the consistency benefit of keeping search and store together is negligible. Actually, if we follow our assumption that tweets are append-only and we always write to Voldemort first, we really wouldn’t have consistency issues. Yet, having data storage reside on the same search system would introduce contention for I/O bandwidth and OS cache; as data volume increases, search performance can be negatively impacted.
    • The point about retention is rather valid. While the search index guarantees older tweets are expired, the Voldemort store would continue to grow. Our decision ultimately came down to two points: 1) Voldemort’s growth characteristics are very different, e.g. adding new records to the system is much cheaper, so it is feasible to have a much longer data retention policy; 2) having a separate cluster for tweet storage allows us to integrate with other systems if desired, for analytics, display, etc.
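
To make the first review note concrete, here is a minimal sketch of that write ordering: persist the tweet in the key-value store first, then publish the event to Kafka for the Sensei indexers. The key-value client is stubbed out rather than being a real Voldemort client, and the broker address and topic name are assumptions; only the kafka-python producer calls reflect a real library.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

class KeyValueStore:
    """Stand-in for the Voldemort client; a real deployment would use a Voldemort store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

store = KeyValueStore()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest_tweet(tweet):
    # 1. Authoritative storage first: a search hit can always be resolved here.
    store.put(tweet["id"], tweet)
    # 2. Only then publish the event that the search nodes will consume and index.
    producer.send("tweets", tweet)

ingest_tweet({"id": "42", "user": "alice", "text": "hello, realtime search"})
producer.flush()
```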

Original title and link: Recipe for a Distributed Realtime Tweet Search System (NoSQL databases © myNoSQL)

via: http://sna-projects.com/blog/2011/02/build-a-distributed-realtime-tweet-search-system-in-no-time-part-12/


HSearch: NoSQL Search Engine Built on HBase

Cassandra has Lucandra (now Solandra), Riak has Riak Search, and HBase has HSearch.

HSearch features include:

  • Multi-XML formats
  • Record and document level search access control
  • Continuous index updates
  • Parallel indexing using multi-machines
  • Embeddable inside application
  • A REST-ful Web service gateway that supports XML
  • Auto sharding
  • Auto replication

Original title and link: HSearch: NoSQL Search Engine Built on HBase (NoSQL databases © myNoSQL)


Search Analytics with Flume and HBase

In the last week, I’ve seen three articles or presentations on Hadoop-based search:

and now, embedded below, Sematext’s Search Analytics with Flume and HBase.

Meanwhile, Google moved to Caffeine to deal with more timely index updates.

Original title and link: Search Analytics with Flume and HBase (NoSQL databases © myNoSQL)


Large-scale Incremental Processing Using Distributed Transactions and Notifications

From Daniel Peng and Frank Dabek’s paper (☞ PDF):

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.
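
As a toy illustration of the class of workload described above (small, independent mutations triggering incremental work instead of batch reprocessing), here is a conceptual sketch of the observer idea in miniature; it is not Percolator’s actual API:

```python
# Conceptual sketch: writing a cell triggers incremental work on just that row,
# instead of re-running a batch job over the whole repository.
observers = []   # (column, function) pairs; functions run on each write to that column
table = {}       # {(row, column): value}

def observe(column):
    def register(fn):
        observers.append((column, fn))
        return fn
    return register

def write(row, column, value):
    table[(row, column)] = value
    for observed_column, fn in observers:
        if observed_column == column:
            fn(row, value)   # incremental work for this single mutation

@observe("raw_document")
def update_index(row, document):
    # Refresh the index entry for this one document only.
    table[(row, "indexed_terms")] = sorted(set(document.lower().split()))

write("doc-1", "raw_document", "large-scale incremental processing of crawled documents")
print(table[("doc-1", "indexed_terms")])
```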

Is this paper at the origin of Google Caffeine?

Original title and link: Large-scale Incremental Processing Using Distributed Transactions and Notifications (NoSQL databases © myNoSQL)

via: http://research.google.com/pubs/pub36726.html


Hadoop and HBase Optimization for Read Intensive Search Applications

Kind of what Google was doing prior to Caffeine:

Bizosys Technologies has built a search engine whose index lives on Hadoop and HBase and is deployed in a cluster environment. Search applications by nature involve read-intensive operations. Bizosys experimented with its search engine using the latest hardware options, software configurations, and cluster deployment provisioning.

Bizosys Hadoop HBase

Original title and link: Hadoop and HBase Optimization for Read Intensive Search Applications (NoSQL databases © myNoSQL)

via: http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/


Neo4j: Advanced Indexes Using Multiple Keys

There’s a prototype implementation of a new index which solves this (and some other issues as well, f.ex. indexing for relationships). The code is at https://svn.neo4j.org/laboratory/components/lucene-index/ and it’s built and deployed over at http://m2.neo4j.org/org/neo4j/neo4j-lucene-index/

The new index isn’t compatible with the old one so you’ll have to index your data with the new index framework to be able to use it.

Before, you were only able to search by a single property.

Original title and link for this post: Neo4j: Advanced Indexes Using Multiple Keys (published on the NoSQL blog: myNoSQL)

via: http://lists.neo4j.org/pipermail/user/2010-August/004781.html


CouchDB Full Text Indexing

Currently there seem to be two approaches to getting full-text indexing in CouchDB: couchdb-lucene [1] and indexer [2].

As its name implies, couchdb-lucene is based on the well-known Lucene library. While such a solution provides a lot of features and flexibility, my concern is that it also brings additional complexity in terms of scalability, as you’ll need to take care of scaling not only CouchDB but also your Lucene indexes. On the other hand, indexer uses a much simpler approach and stores the indexes directly in CouchDB, but it is still a prototype.
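
To make the couchdb-lucene option slightly more concrete: once an index is defined in a design document, searching it is just an HTTP call. The endpoint layout below is an assumption (it varies with the couchdb-lucene version and whether it runs standalone or proxied through CouchDB), and the database, design document, and index names are made up:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical couchdb-lucene search endpoint; adjust to your deployment.
base = "http://localhost:5984/mydb/_fti/_design/search/by_text"
params = urllib.parse.urlencode({"q": "nosql AND indexing", "limit": 10})

with urllib.request.urlopen(f"{base}?{params}") as response:
    results = json.load(response)

for row in results.get("rows", []):
    print(row.get("id"), row.get("score"))
```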

But that’s just my opinion, so I’m wondering: which one of these would you favor?

To learn more about these projects you can check the following resources:

And for more libraries and projects make sure you check the NoSQL Libraries.