lucene: All content tagged as lucene in NoSQL databases and polyglot persistence

LinkedIn's new search platform

In this post introducing the new search solution implemented at LinkedIn, you can find a pretty good list of the requirements for a good search tool, framed as the showstoppers hit with the previous solution:

  • Rebuilding a complete index is extremely difficult
  • Live updates are at an entity granularity
  • Scoring is inflexible
  • Too many small open source components

On top of these, add flexibility and extensibility, which matter for every critical component, but much more so for search, which depends so heavily on format, behavior, and fine-tuning.

The rest of the post dives into some details of the new solution, which is a distributed layer of extensions on top of Lucene, code-named Galene.

Original title and link: LinkedIn’s new search platform (NoSQL database©myNoSQL)


Architecture of HBase-based Lucene Implementation

Boris Lublinsky and Mike Segel:

The implementation tries to balance two conflicting requirements - performance: in memory cache can drastically improve performance by minimizing the amount of HBase reads for search and documents retrieval; and scalability: ability to run as many Lucene instances as required to support growing search clients population. The latter requires minimizing of the cache life time to synchronize content with the HBase instance (a single copy of truth). A compromise is achieved through implementing configurable cache time to live parameter, limiting cache presence in each Lucene instance.

Architecture of HBase-based Lucene implementation
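The compromise described in the quote, a configurable cache time to live, can be illustrated with a small read-through cache. This is only a sketch under my own assumptions: the `TTLCache` class, the `load` callback (standing in for an HBase read), and the `backend_reads` counter are hypothetical names, not part of the implementation the authors describe.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Read-through cache whose entries expire after ttl_seconds."""

    def __init__(self, load: Callable[[str], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load            # hypothetical stand-in for an HBase read
        self._ttl = ttl_seconds      # the configurable time-to-live parameter
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}
        self.backend_reads = 0       # how often we had to go to the backend

    def get(self, key: str) -> bytes:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh: no backend read needed
        value = self._load(key)      # missing or expired: reload from backend
        self.backend_reads += 1
        self._entries[key] = (now, value)
        return value
```

A short TTL keeps each instance close to the single copy of truth at the cost of more backend reads; a long TTL does the opposite. That is the performance versus scalability dial the quote talks about.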

Besides existing Solr scaling approaches and the work to make Solr scalable, there’s also the recently released DataStax Enterprise which integrates Solr on top of Cassandra.



Big Data Search: Perfect Search

Tim Stay (CEO) talks about Perfect Search, a solution for searching Big Data that:

  • offers a unique architectural approach that significantly reduces the total computations required to query
  • creates terms and pattern indexes (basically combinations of terms at indexing time)
  • uses jump tables and bloom filters
  • heavily optimizes disk I/O
  • doesn’t require indexes in memory
  • “can often do same query with less than 1% computations”
  • “when compared to Oracle/MS SQL, Perfect Search can be from 10x to over 1000x faster”
    • according to the chart, the significant speed improvements are for cached results, while for first time queries I see numbers from 2 to 59
    • if Perfect Search is a search engine, why compare it with relational databases?
  • “Google takes over 100 servers to search 1 billion documents. Perfect Search can do it with 1 server”
    • Google is using 100 servers for reliability and guaranteeing the speed of results
  • “Lucene: 0.1 billion documents per server; CPU maxing at 100%. Perfect Search 1.6 billion documents per server; CPU idling at 15%”
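For readers unfamiliar with the bloom filters mentioned in the list, here is a generic sketch of the data structure. It says nothing about how Perfect Search actually uses them; the class name, the sizes, and the choice of SHA-256 for hashing are all my own assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The appeal for a search engine is that a negative answer is definitive, so a lookup that would otherwise touch the disk can be skipped entirely.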

With this preamble, you can watch the video after the break:

Fulltext search your CouchDB in Ruby

When choosing a library for full-text indexing of CouchDB data in a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian, and decided to go with Xapian with Xapit. Although Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations:

  • Xapit is still under active development
  • You need to trigger index updates manually
  • It doesn’t support incremental index updates at the moment

I know some are afraid of managing a Java stack, but in the land of indexing, Lucene, Solr, ElasticSearch, and IndexTank are the most powerful tools.



Getting off the CouchDB... or Lessons Learned while Experimenting in Production

The move to CouchDB went well. Pages in our web application that would occasionally time out were now loading in a couple of seconds. And, our MySQL database was much, much happier. We liked CouchDB so much that we started planning a feature that would make heavy use of CouchDB’s schema-less nature.

And that’s when the wheels came off.

Word of caution: this is not the “CouchDB sucks so we went with MongoDB” type of post. It’s more of “we thought CouchDB can solve one of our problems, but then got confused and thought it can solve world hunger. So we decided to throw a bunch of data to it to see if it sticks. Surprise! It didn’t.”

Just to be clear, I’m not defending CouchDB and everything John Wood writes about it is correct. It’s just that experimenting with CouchDB in a non-production environment or at least reading myNoSQL would have already offered all those answers.



Hadoop and Solr for Archiving Emails

Sunil Sitaula posted two articles on Cloudera’s blog about archiving emails on Hadoop: part 1 and part 2. But even though I read the posts twice, I couldn’t find a clear answer to the question: why would one do it this way?

Sunil provides a general explanation, but the two articles fail to present the real advantages of using Hadoop for solving this problem.

Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.

Let’s look again at the problem and see what the requirements are:

  • store a large and continuously growing amount of messages
  • retrieve messages either directly (key-based access) or by searches (full text indexing)

HDFS, the underlying storage of Hadoop, would bring to the table a reliable, scalable, and cost-effective storage solution. But using HDFS would also require a custom ETL process. Transforming email messages into something that can be stored in HDFS is described in the first post:

If you are dealing with millions of files, one way of sharding (partitioning) them would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This will limit the number of message files you need to put into HDFS to something that is more suitable, 1-2 million at a time given the NameNode memory footprint of each file.
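The partitioning idea in the quote boils down to assigning each message to a time bucket and writing one container file per bucket. Below is a minimal sketch of the bucketing step only. The function names and the `(message_id, received)` tuple shape are hypothetical, and actually writing Hadoop sequence files is left out.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def bucket_key(received: datetime, granularity: str) -> str:
    """Name of the bucket (one container file) a message belongs to."""
    if granularity == "day":
        return received.strftime("%Y-%m-%d")
    if granularity == "week":
        iso = received.isocalendar()      # (ISO year, ISO week, weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if granularity == "month":
        return received.strftime("%Y-%m")
    raise ValueError(f"unsupported granularity: {granularity}")


def group_messages(messages: Iterable[Tuple[str, datetime]],
                   granularity: str = "month") -> Dict[str, List[str]]:
    """Group message ids by bucket so each bucket becomes one large file."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for message_id, received in messages:
        buckets[bucket_key(received, granularity)].append(message_id)
    return dict(buckets)
```

Picking the granularity is the tuning knob: a busier organization would bucket by day, a smaller one by month, so the number of files HDFS has to track stays manageable.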

Nonetheless, a completely different system would be needed to provide access to the stored messages. The second post introduces Lucene and Solr for dealing with message retrieval, but setting them up to take advantage of the same infrastructure can get complicated:

Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.
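The "write a new index, then merge with the existing one" option can be pictured with a toy inverted index. This deliberately avoids the Lucene and HDFS APIs; the `build_index` and `merge_indexes` helpers are hypothetical and only model the shape of the operation.

```python
from typing import Dict, List

Index = Dict[str, List[int]]  # term -> sorted list of document ids


def build_index(docs: Dict[int, str]) -> Index:
    """Build a fresh index for a batch of documents (the 'new directory')."""
    index: Index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def merge_indexes(existing: Index, new: Index) -> Index:
    """Return a merged index without modifying either input."""
    merged: Index = {term: list(postings) for term, postings in existing.items()}
    for term, postings in new.items():
        merged[term] = sorted(set(merged.get(term, [])) | set(postings))
    return merged
```

Because the existing index is never modified in place, readers can keep using it until the merged copy is ready, which is what makes this approach workable on an append-unfriendly file system.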

Bottom line, it looks like the article suggests using two almost separate tools to solve the initial problem. And that makes me think that a better solution exists.


Lucene & Solr Year 2011 in Review

I much prefer reviews to predictions, even more so when there are as many noteworthy things to mention as what Lucene and Solr accomplished in 2011:

  • Near Real-Time search (freshly added documents can be immediately made visible in search results)
  • Field collapsing or result grouping
  • faceting module
  • language support

Plus the promise of the SolrCloud:

In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  

On the other hand, last December LinkedIn open sourced IndexTank, a real-time fulltext search-and-indexing system. Some of its features will definitely sound interesting to Lucene and Solr users.



LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr

Today LinkedIn announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank was offering a hosted, scalable full-text search API.

The projects can already be found on GitHub: indextank-engine (the indexing engine) and the API, BackOffice, Storefront, and Nebulizer.

When reading the announcement, I asked myself two questions: what is IndexTank, and how does IndexTank compare to Lucene and Solr?

The answer to the first one is provided in the post.

What is IndexTank? IndexTank is mainly three things:

  • IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
  • API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
  • Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.

For the second, I turned to the old IndexTank FAQ.

How does IndexTank compare to Lucene and Solr?

  1. IndexTank was a hosted, scalable service
  2. IndexTank can add documents to the index
  3. IndexTank supports updating document variables without re-indexing
  4. IndexTank supports geolocation functions
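To make the third point concrete (and the IndexEngine's separation of relevance signals from document text), here is a toy index in which variables live apart from the text postings. Everything in it is hypothetical: the `SignalAwareIndex` class, its method names, and the `text_index_writes` counter are my own illustration, not the IndexTank API.

```python
from typing import Dict, List, Optional, Set


class SignalAwareIndex:
    """Toy index keeping relevance signals separate from text postings."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}         # term -> doc ids
        self._variables: Dict[str, Dict[str, float]] = {}  # doc id -> signals
        self.text_index_writes = 0                       # counts text (re)indexing

    def add_document(self, doc_id: str, text: str,
                     variables: Optional[Dict[str, float]] = None) -> None:
        for term in set(text.lower().split()):
            self._postings.setdefault(term, set()).add(doc_id)
        self._variables[doc_id] = dict(variables or {})
        self.text_index_writes += 1

    def update_variables(self, doc_id: str, **variables: float) -> None:
        # Only the signal store changes; the text postings are untouched.
        self._variables[doc_id].update(variables)

    def search(self, term: str, rank_by: str) -> List[str]:
        matches = self._postings.get(term.lower(), set())
        # Highest signal first; ties broken by doc id for a stable order.
        return sorted(matches,
                      key=lambda d: (-self._variables[d].get(rank_by, 0), d))
```

Since likes, shares, and similar signals change far more often than the text they are attached to, keeping them outside the text index means a new "like" costs a dictionary update instead of a re-index.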

For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.

Happy hacking!


Neo4j 1.4 “Kiruna Stol” Released With Many Notable Improvements

Releasing often has too many advantages to list them all, but I think the major ones are: capturing the interest of new users (generating buzz), showing a healthy project velocity, and, probably the most important one, delivering the features and improvements users were asking for in a timely manner. Neo4j has learned these lessons[1], and since Neo4j 1.2 the team at Neo Technology has been trying a very frequent release plan which also includes milestone releases. The other day, Neo4j 1.4, a.k.a. Kiruna Stol, was released:

Over the last three months, we’ve released 6 milestones in our 1.4 series. Today we’re releasing the final Neo4j 1.4 General Availability (GA) package. We’ve seen a whole host of new features going into the product during this time, along with numerous performance and stability improvements. We think this is our best release yet, and we hope you like the direction in which the product is heading.

There are some notable new features and improvements in this release:

  1. a new query language called Cypher[2]
  2. automatic indexing
  3. a Lucene upgrade leading to faster indexing
  4. self relationships
  5. REST API improvements: exposing batch execution API, paging mechanism for traversers
  6. webadmin, performance, and new server management scripts

  1. In the NoSQL space, they are not alone. 10gen follows a similarly aggressive release plan for MongoDB. Redis, even if supported by a two-person team, has always enjoyed frequent releases. DataStax has also started to push out Cassandra updates more often.

  2. At first glance the query language looks odd, but I haven’t looked beyond some basic examples to understand its syntax and strength. Neo4j also supports Gremlin.



Seven Java Projects That Changed the World

Over the last decade, several projects have traveled beyond mere adoption and had effects dominating the Java world, into software development in general, and some even further into the daily lives of users.

Not sure how Edd Dumbill came up with the list[1], but it includes Solr (and implicitly Lucene) and Hadoop. I concur.

  1. The list looks good to me.  



Using Solr and Hadoop as a NoSQL database

The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS.

Using Solr and Hadoop as a NoSQL database

I think the first time I heard Solr and Lucene mentioned as NoSQL-like storage was from Grant Ingersoll and from the guys.

From a NoSQL perspective:

  • there’s no fixed schema
  • there’s key-value access — hopefully that’s very fast and scalable
  • even if not standardized, there’s an advanced querying language

But as the original article points out, some characteristics are missing:

  • Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance.
  • Current support for replication, fail-over, and other attributes that you’d want in a production-grade solution aren’t yet there in SolrCloud. If this matters to you, consider Katta instead.
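The first point, that every commit produces a new segment, is easy to see in a toy model. The sketch below is not the Lucene API. `ToyIndexWriter` is a hypothetical class that only counts segments, and it ignores the background segment merging a real Lucene index performs.

```python
from typing import List


class ToyIndexWriter:
    """Toy writer: each commit flushes pending documents into one new segment."""

    def __init__(self) -> None:
        self.segments: List[List[str]] = []
        self._pending: List[str] = []

    def add(self, doc: str) -> None:
        self._pending.append(doc)

    def commit(self) -> None:
        if self._pending:                  # an empty commit creates no segment
            self.segments.append(self._pending)
            self._pending = []


def index_one_by_one(docs: List[str]) -> ToyIndexWriter:
    writer = ToyIndexWriter()
    for doc in docs:
        writer.add(doc)
        writer.commit()                    # commit per record: one segment each
    return writer


def index_as_batch(docs: List[str]) -> ToyIndexWriter:
    writer = ToyIndexWriter()
    for doc in docs:
        writer.add(doc)
    writer.commit()                        # single commit: one segment in total
    return writer
```

With 100 documents the first approach leaves 100 tiny segments for every query to visit, while the batch approach leaves one. That is why the article recommends treating index updates as a batch operation.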



Lucene and Solr Development Merged

Full text indexing in NoSQL databases has been addressed so far only by Riak search, the others relying on integrations with Lucene, Solr, or ElasticSearch.

With merged dev, there is now a single set of committers across both projects. Everyone in both communities can now drive releases – so when Solr releases, Lucene will also release – easing concerns about releasing Solr on a development version of Lucene. So now, Solr will always be on the latest trunk version of Lucene and code can be easily shared between projects – Lucene will likely benefit from Analyzers and QueryParsers that were only available to Solr users in the past. Lucene will also benefit from greater test coverage, as now you can make a single change in Lucene and run tests for both projects – getting immediate feedback on the change by testing an application that extensively uses the Lucene libraries. Both projects will also gain from a wider development community, as this change will foster more cross pollination between Lucene and Solr devs (now just Lucene/Solr devs).

Hopefully NoSQL databases will benefit from this merge too, by having a more solid product to rely on.
