solr: All content tagged as solr in NoSQL databases and polyglot persistence

Fulltext search your CouchDB in Ruby

When choosing a library for full-text indexing of CouchDB data in a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian, and decided to go with Xapian with Xapit. Although Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations:

  • Xapit is still under active development
  • You need to trigger index updates manually
  • It doesn’t support incremental index updates at the moment

I know some are afraid of managing a Java stack, but in the land of indexing, Lucene, Solr, ElasticSearch, and IndexTank are the most powerful tools.

Original title and link: Fulltext search your CouchDB in Ruby (NoSQL database©myNoSQL)

via: http://taylorluk.com/post/17255656638/fulltext-search-your-couchdb-in-ruby


Getting off the CouchDB... or Lessons Learned while Experimenting in Production

The move to CouchDB went well. Pages in our web application that would occasionally time out were now loading in a couple of seconds. And, our MySQL database was much, much happier. We liked CouchDB so much that we started planning a feature that would make heavy use of CouchDB’s schema-less nature.

And that’s when the wheels came off.

Word of caution: this is not the “CouchDB sucks so we went with MongoDB” type of post. It’s more of “we thought CouchDB can solve one of our problems, but then got confused and thought it can solve world hunger. So we decided to throw a bunch of data to it to see if it sticks. Surprise! It didn’t.”

Just to be clear, I’m not defending CouchDB, and everything John Wood writes about it is correct. It’s just that experimenting with CouchDB in a non-production environment, or at least reading myNoSQL, would have already offered all those answers.

Original title and link: Getting off the CouchDB… or Lessons Learned while Experimenting in Production (NoSQL database©myNoSQL)

via: http://blog.signalhq.com/2012/01/24/getting-off-the-couchdb/


Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud

Just a quick roundup of the latest releases and announcements.

Hortonworks Data Platform (HDP) version 2

HDP v2 will include:

  • NextGen MapReduce architecture
  • HDFS NameNode HA
  • HDFS Federation
  • up-to-date HCatalog, HBase, Hive, Pig

According to the announcement:

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.

SolrCloud Completes Phase 2

Mark Miller about the completion of phase 2:

The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.

Not there yet, but it’s coming.

DataStax Community Server 1.0.7

A new release of DataStax’s distribution of Cassandra incorporating Cassandra 1.0.7

HBase 0.92

Don’t let the version number trick you. This is an important release for HBase featuring:

  • coprocessors
  • security
  • new (self-migrating) file format
  • AWS improvements: EBS support, building a HA cluster

The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is, in my opinion, HBase coprocessors (Jira entry HBASE-2000).

I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:


Solr Index Replication at Etsy: From HTTP to BitTorrent

Etsy went from using HTTP to BitTorrent for replicating Solr indexes:

By integrating BitTorrent protocol into Solr we could replace HTTP replication. BitTorrent supports updating and continuation of downloads, which works well for incremental index updates. When we use BitTorrent for replication, all of the slave servers seed index files allowing us to bring up new slaves (or update stale slaves) very quickly.

[…]

Our Ops team started experimenting with a BitTorrent package herd, which sits on top of BitTornado. Using herd they transferred our largest search index in 15 minutes. They spent 8 hours tweaking all the variables and making the transfer faster and faster. Using pigz for compression and herd for transfer, they cut the replication time for the biggest index from 60 minutes to just 6 minutes!

Make sure you don’t miss the part where they were experimenting with multicast UDP rsync.

Original title and link: Solr Index Replication at Etsy: From HTTP to BitTorrent (NoSQL database©myNoSQL)

via: http://codeascraft.etsy.com/2012/01/23/solr-bittorrent-index-replication/


Hadoop and Solr for Archiving Emails

Sunil Sitaula posted two articles on Cloudera’s blog about archiving emails on Hadoop: part 1 and part 2. But even though I read the posts twice, I couldn’t find a clear answer to the question: why would one do it this way?

Sunil provides a general explanation, but the two articles fail to present the real advantages of using Hadoop for solving this problem.

Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.

Let’s look again at the problem and see what the requirements are:

  • store a large and continuously growing amount of messages
  • retrieve messages either directly (key-based access) or by searches (full text indexing)

HDFS, the underlying storage of Hadoop, would bring to the table a reliable, scalable, and cost-effective storage solution. But using HDFS would also require a custom ETL process; transforming email messages into something that can be stored in HDFS is described in the first post:

If you are dealing with millions of files, one way of sharing (partitioning them) would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This will limit the number of message files you need to put into HDFS to something that is more suitable, 1-2 million at a time given the NameNode memory footprint of each file.
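The partitioning idea in the quote can be sketched in a few lines. This is only an illustration of grouping messages into day/week/month buckets before packing each bucket into one container file; it does not write Hadoop sequence files, and the names `bucket_key` and `partition` are hypothetical, not taken from the Cloudera posts.

```python
from collections import defaultdict
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def bucket_key(raw_message: str, granularity: str = "month") -> str:
    """Return the bucket name for one raw RFC 822 message (hypothetical helper)."""
    msg = message_from_string(raw_message)
    # Normalize the Date header to UTC so every message lands in a stable bucket.
    sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    if granularity == "day":
        return sent.strftime("%Y-%m-%d")
    if granularity == "week":
        year, week, _ = sent.isocalendar()
        return f"{year}-W{week:02d}"
    if granularity == "month":
        return sent.strftime("%Y-%m")
    raise ValueError(f"unsupported granularity: {granularity}")


def partition(messages, granularity: str = "month"):
    """Group raw messages by bucket; each bucket would become one container file."""
    buckets = defaultdict(list)
    for raw in messages:
        buckets[bucket_key(raw, granularity)].append(raw)
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        "Date: Tue, 24 Jan 2012 10:00:00 +0000\nSubject: a\n\nfirst",
        "Date: Wed, 25 Jan 2012 09:30:00 +0000\nSubject: b\n\nsecond",
        "Date: Mon, 06 Feb 2012 08:00:00 +0000\nSubject: c\n\nthird",
    ]
    for name, group in sorted(partition(sample).items()):
        print(name, len(group))
```

Choosing a coarser granularity means fewer, larger files, which is the point of the quoted advice about the NameNode memory footprint of each file.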

Nonetheless, a completely different system would be needed for providing access to the stored messages. The second post introduces Lucene and Solr for dealing with message retrieval, but setting them up to take advantage of the same infrastructure can get complicated:

Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.

Bottom line: the article seems to suggest using two almost separate tools to solve the initial problem. And that makes me think that a better solution exists.

Original title and link: Hadoop and Solr for Archiving Emails (NoSQL database©myNoSQL)


Lucene & Solr Year 2011 in Review

I much prefer reviews to predictions, even more so when there are as many things worth mentioning as what Lucene and Solr accomplished in 2011:

  • Near Real-Time search (freshly added documents can be immediately made visible in search results)
  • Field collapsing or result grouping
  • faceting module
  • language support

Plus the promise of the SolrCloud:

In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  

On the other hand, last December LinkedIn open sourced IndexTank, a real-time fulltext search-and-indexing system. Some of its features will definitely sound interesting to Lucene and Solr users.

Original title and link: Lucene & Solr Year 2011 in Review (NoSQL database©myNoSQL)

via: http://blog.sematext.com/2011/12/21/lucene-solr-year-2011-in-review/


IndexTank vs Thinking Sphinx vs WebSolr

In light of IndexTank being open sourced by LinkedIn, here is a post in which Gautam Rege compares IndexTank with Thinking Sphinx and WebSolr. Feature-wise, IndexTank has some advantages over Solr and almost none when compared with Thinking Sphinx.

When I first set out needing full text searching, I used Solr. It was pretty good though re-indexing took ages and to ensure consistency, I had to re-index every day via cron. Then I found Thinking Sphinx – and loved it because it managed delta indexes! Wow – no more daily re-index cron jobs. Even the re-indexing was way quicker.

The big issue with both Solr and TS was that it required tight integration with models and my database. For example – in TS, if a relationship was changed, I had to ensure to trigger the parent / child delta index in order to ensure it gets indexed too.  Both TS and Solr add methods to ActiveRecord, which I find a little annoying. These nuances gets my code too dependent on TS or Solr and switching from them to something else becomes a big pain!

Original title and link: IndexTank vs Thinking Sphinx vs WebSolr (NoSQL database©myNoSQL)

via: http://blog.joshsoftware.com/2011/10/17/indextank-so-long-and-thanks-for-all-the-fish/


LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr

Today LinkedIn announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank offered a hosted, scalable full-text search API.

The projects can already be found on GitHub: indextank-engine (the indexing engine) and the API, BackOffice, Storefront, and Nebulizer.

When reading the announcement, I asked myself two questions: what is IndexTank, and how does IndexTank compare to Lucene and Solr?

The answer to the first one is provided in the post.

What is Index Tank? IndexTank is mainly three things:

  • IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
  • API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
  • Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.

For the second, I turned to the old IndexTank FAQ.

How does IndexTank compare to Lucene and Solr?

  1. IndexTank was a hosted, scalable service
  2. IndexTank can add documents to the index
  3. IndexTank supports updating document variables without re-indexing
  4. IndexTank supports geolocation functions
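To make the "updating document variables without re-indexing" point concrete, here is a toy sketch of the idea of keeping relevance signals apart from the text index. It is not IndexTank's implementation or API; the class `ToyIndex` and all of its methods are hypothetical and exist only for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Toy index that stores text postings and per-document variables separately."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.variables = {}               # document id -> {variable name: value}
        self.text_index_writes = 0        # counts how often text was (re)indexed

    def add_document(self, doc_id, text, variables=None):
        # Indexing the text is the expensive step in a real engine.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.variables[doc_id] = dict(variables or {})
        self.text_index_writes += 1

    def update_variables(self, doc_id, **changes):
        # Signals such as likes or shares change often; the postings stay untouched.
        self.variables[doc_id].update(changes)

    def search(self, term, rank_by):
        # Match on text, then order the hits by the chosen variable (highest first).
        hits = self.postings.get(term.lower(), set())
        return sorted(hits, key=lambda d: self.variables[d].get(rank_by, 0), reverse=True)


if __name__ == "__main__":
    index = ToyIndex()
    index.add_document("a", "Solr index replication", {"likes": 5})
    index.add_document("b", "Solr and Lucene review", {"likes": 2})
    print(index.search("solr", rank_by="likes"))
    index.update_variables("b", likes=40)
    print(index.search("solr", rank_by="likes"))
    print(index.text_index_writes)
```

The design choice illustrated here is that ranking reads the variables at query time, so a change in a social signal alters the result order without touching the text postings.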

For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.

Happy hacking!

Original title and link: LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr (NoSQL database©myNoSQL)


Factual API Powered by Node.js and Redis

Continuing my search for non-trivial Node.js + NoSQL database applications, here’s Factual’s stack for serving their API:

Factual API Stack

Factual architectural components:

  • Varnish
  • HAProxy
  • Node.js
  • Redis
  • Solr

Why Node.js?

We chose Node because of three F’s: it’s fast, flexible, and familiar. In particular, the flexibility is  what allowed us to use our Node layer to handle things like caching logic and load balancing, in addition to the aforementioned authentication and authorization. To make our Node layer scalable, we use multiple instances of Node tied together with Redis to keep things in sync.

Also worth mentioning is that data served through the Factual API is always JSON, so having a server-side JavaScript engine also reduces the need for converting data to different formats.

Original title and link: Factual API Powered by Node.js and Redis (NoSQL database©myNoSQL)

via: http://blog.factual.com/v3-api-stack-faster-data


Seven Java Projects That Changed the World

Over the last decade, several projects have traveled beyond mere adoption and had effects dominating the Java world, into software development in general, and some even further into the daily lives of users.

Not sure how Edd Dumbill came up with the list[1], but it includes Solr (and implicitly Lucene) and Hadoop. I concur.


  1. The list looks good to me.  

Original title and link: Seven Java Projects That Changed the World (NoSQL database©myNoSQL)

via: http://radar.oreilly.com/2011/07/7-java-projects.html


ThriftDB: The Amazon Web Services of Search

ThriftDB presented today at TechCrunch Disrupt:

Technically speaking, ThriftDB is a flexible key-value datastore with search built in that has the flexibility, scalability, and performance of a NoSQL datastore with the capabilities of full-text search. Essentially, what this means is that, by combining the datastore and the search engine, ThriftDB is offering a service that makes it easy for developers to build fast, horizontally-scalable applications with integrated search.

The website says ThriftDB is a document database built on top of Thrift with full-text search support. I’m not really sure about the “Amazon Web Services of Search” part, but it sounds like it would compete against MarkLogic, ElasticSearch, Solr, and so on.

Original title and link: ThriftDB: The Amazon Web Services of Search (NoSQL databases © myNoSQL)

via: http://techcrunch.com/2011/05/24/thriftdb-wants-to-be-the-amazon-web-services-of-search/


The HBase+Solr CMS Lily Reaches 1.0

Lily, the only CMS built on top of HBase and using Solr as its search engine, has reached the 1.0 version.

Lily is dead serious about Scale. The Lily repository has been tested to scale beyond any common content repository technology out there, due to its inherently distributed architecture, providing economically affordable, robust, and high-performing data management services for any kind of enterprise application.

Outerthought has talked in the past about their technical choices:

Original title and link: The HBase+Solr CMS Lily Reaches 1.0 (NoSQL databases © myNoSQL)

via: http://outerthought.org/blog/468-ot.html