solr: All content tagged as solr in NoSQL databases and polyglot persistence
Wednesday, 8 February 2012
Fulltext search your CouchDB in Ruby
When choosing a library for full-text indexing of CouchDB data in a Ruby application, Taylor Luk looked at Sphinx, Lucene, Ferret, and Xapian and decided to go with Xapian with Xapit. Although Xapian with Xapit offers a clean interface and customization of the indexing process, there seem to be quite a few important limitations:
- Xapit is still under active development
- You need to trigger index updates manually
- It doesn’t support incremental index updates at the moment
I know some are afraid of managing a Java stack, but in the land of indexing, Lucene, Solr, ElasticSearch, and IndexTank are the most powerful tools.
Original title and link: Fulltext search your CouchDB in Ruby (©myNoSQL)
via: http://taylorluk.com/post/17255656638/fulltext-search-your-couchdb-in-ruby
Thursday, 2 February 2012
Getting off the CouchDB... or Lessons Learned while Experimenting in Production
The move to CouchDB went well. Pages in our web application that would occasionally time out were now loading in a couple of seconds. And, our MySQL database was much, much happier. We liked CouchDB so much that we started planning a feature that would make heavy use of CouchDB’s schema-less nature.
And that’s when the wheels came off.
Word of caution: this is not the “CouchDB sucks so we went with MongoDB” type of post. It’s more of “we thought CouchDB can solve one of our problems, but then got confused and thought it can solve world hunger. So we decided to throw a bunch of data to it to see if it sticks. Surprise! It didn’t.”
Just to be clear, I’m not defending CouchDB, and everything John Wood writes about it is correct. It’s just that experimenting with CouchDB in a non-production environment, or at least reading myNoSQL, would have already provided all those answers.
via: http://blog.signalhq.com/2012/01/24/getting-off-the-couchdb/
Tuesday, 24 January 2012
Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud
Just a quick roundup of the latest releases and announcements.
Hortonworks Data Platform (HDP) version 2
HDP v2 will include:
- NextGen MapReduce architecture
- HDFS NameNode HA
- HDFS Federation
- up-to-date HCatalog, HBase, Hive, Pig
According to the announcement:
In order to avoid confusion, let me explain the two versions of HDP:
- HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
- HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.
SolrCloud Completes Phase 2
Mark Miller about the completion of phase 2:
The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.
Not there yet, but it’s coming.
DataStax Community Server 1.0.7
A new release of DataStax’s distribution of Cassandra incorporating Cassandra 1.0.7
HBase 0.92
Don’t let the version number trick you. This is an important release for HBase featuring:
- coprocessors
- security
- new (self-migrating) file format
- AWS improvements: EBS support, building a HA cluster
The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is in my opinion HBase coprocessors (Jira entry HBASE-200).
I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:
Monday, 23 January 2012
Solr Index Replication at Etsy: From HTTP to BitTorrent
Etsy went from using HTTP to BitTorrent for replicating Solr indexes:
By integrating BitTorrent protocol into Solr we could replace HTTP replication. BitTorrent supports updating and continuation of downloads, which works well for incremental index updates. When we use BitTorrent for replication, all of the slave servers seed index files allowing us to bring up new slaves (or update stale slaves) very quickly.
[…]
Our Ops team started experimenting with a BitTorrent package herd, which sits on top of BitTornado. Using herd they transferred our largest search index in 15 minutes. They spent 8 hours tweaking all the variables and making the transfer faster and faster. Using pigz for compression and herd for transfer, they cut the replication time for the biggest index from 60 minutes to just 6 minutes!
Make sure you don’t miss the part where they were experimenting with multicast UDP rsync.
via: http://codeascraft.etsy.com/2012/01/23/solr-bittorrent-index-replication/
Tuesday, 3 January 2012
Hadoop and Solr for Archiving Emails
Sunil Sitaula posted two articles on Cloudera’s blog about archiving emails on Hadoop: part 1 and part 2. But even though I read the posts twice, I couldn’t find a clear answer to the question: why would one do it this way?
Sunil provides a general explanation, but the two articles fail to present the real advantages of using Hadoop for solving this problem.
Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.
Let’s look again at the problem and see what the requirements are:
- store a large and continuously growing amount of messages
- retrieve messages either directly (key-based access) or by searches (full text indexing)
Hadoop’s underlying storage, HDFS, would bring to the table a reliable, scalable, and cost-effective storage solution. But using HDFS would also require a custom ETL process. The first post describes how to transform email messages into something that can be stored in HDFS:
If you are dealing with millions of files, one way of sharding (partitioning) them would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This will limit the number of message files you need to put into HDFS to something that is more suitable, 1-2 million at a time given the NameNode memory footprint of each file.
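The partitioning idea in the quote above can be sketched in a few lines. This is only an illustration of grouping messages into day/week/month buckets, not the ETL code from the Cloudera posts: the function names, the bucket naming scheme, and the sample messages are all assumptions, and a real pipeline would write each bucket out as a Hadoop sequence file rather than keep it in memory.

```python
from collections import defaultdict
from datetime import date


def bucket_key(day: date, granularity: str) -> str:
    """Return the (hypothetical) bucket name a message sent on `day` falls into."""
    if granularity == "day":
        return day.isoformat()
    if granularity == "week":
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if granularity == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


def partition_messages(messages, granularity="month"):
    """Group (message_id, sent_on) pairs so each bucket can become one file."""
    buckets = defaultdict(list)
    for message_id, sent_on in messages:
        buckets[bucket_key(sent_on, granularity)].append(message_id)
    return dict(buckets)


# Made-up sample data for illustration only.
messages = [
    ("msg-1", date(2011, 12, 30)),
    ("msg-2", date(2011, 12, 31)),
    ("msg-3", date(2012, 1, 2)),
]
by_month = partition_messages(messages, "month")
by_day = partition_messages(messages, "day")
```

Choosing a coarser granularity means fewer, larger files, which is the point of the quote: the NameNode keeps metadata for every file in memory, so fewer files means less pressure on it.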
Nonetheless, a completely different system would be needed to provide access to the stored messages. The second post introduces Lucene and Solr for dealing with message retrieval, but setting them up to take advantage of the same infrastructure can get complicated:
Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.
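To make the "write a new index, then merge" option more concrete, here is a toy sketch using a plain in-memory inverted index. It is not Lucene's API and says nothing about how Lucene stores segments on disk or in HDFS; `build_index` and `merge_indexes` are hypothetical helpers that only show the shape of the operation.

```python
def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def merge_indexes(existing, new):
    """Return a merged index without modifying either input."""
    merged = {term: set(ids) for term, ids in existing.items()}
    for term, ids in new.items():
        merged.setdefault(term, set()).update(ids)
    return merged


# The existing index stays untouched while the new batch is indexed separately.
existing = build_index({"m1": "quarterly budget report"})
new_batch = build_index({"m2": "budget meeting"})
merged = merge_indexes(existing, new_batch)
```

Building the new batch as a separate index means the existing one can keep serving searches until the merged result is ready to replace it.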
Bottom line, it looks like the article suggests using two almost separate tools to solve the initial problem. And that makes me think that a better solution exists.
Lucene & Solr Year 2011 in Review
I much prefer reviews to predictions. All the more so when there are so many things worth mentioning, as with what Lucene and Solr accomplished in 2011:
- Near Real-Time search (freshly added documents can be immediately made visible in search results)
- Field collapsing or result grouping
- faceting module
- language support
Plus the promise of the SolrCloud:
In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier. Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.
On the other hand, last December LinkedIn open sourced IndexTank, a real-time fulltext search-and-indexing system. Some of its features will definitely sound interesting to Lucene and Solr users.
via: http://blog.sematext.com/2011/12/21/lucene-solr-year-2011-in-review/
Thursday, 22 December 2011
IndexTank vs Thinking Sphinx vs WebSolr
In light of IndexTank being open sourced by LinkedIn, here is a post in which Gautam Rege compares IndexTank with Thinking Sphinx and WebSolr. Feature-wise, IndexTank has some advantages over Solr and almost none when compared with Thinking Sphinx.
When I first set out needing full text searching, I used Solr. It was pretty good though re-indexing took ages and to ensure consistency, I had to re-index every day via cron. Then I found Thinking Sphinx – and loved it because it managed delta indexes! Wow – no more daily re-index cron jobs. Even the re-indexing was way quicker.
The big issue with both Solr and TS was that it required tight integration with models and my database. For example – in TS, if a relationship was changed, I had to ensure to trigger the parent / child delta index in order to ensure it gets indexed too. Both TS and Solr add methods to ActiveRecord, which I find a little annoying. These nuances get my code too dependent on TS or Solr and switching from them to something else becomes a big pain!
via: http://blog.joshsoftware.com/2011/10/17/indextank-so-long-and-thanks-for-all-the-fish/
LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr
Today LinkedIn has announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank was offering a hosted, scalable full-text search API.
The projects can already be found on GitHub: indextank-engine (the indexing engine) and the API, BackOffice, Storefront, and Nebulizer.
When reading the announcement, I asked myself two questions: what is IndexTank, and how does IndexTank compare to Lucene and Solr?
The answer to the first one is provided in the post.
What is Index Tank? IndexTank is mainly three things:
- IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
- API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
- Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.
For the second, I turned to the old IndexTank FAQ.
How does IndexTank compare to Lucene and Solr?
- IndexTank was a hosted, scalable service
- IndexTank can add documents to the index
- IndexTank supports updating document variables without re-indexing
- IndexTank supports geolocation functions
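The idea of keeping relevance signals apart from document text, so that variables can change without re-indexing, can be illustrated with a toy model. This is a sketch under stated assumptions, not IndexTank's engine or its client API: the `ToyIndex` class, its method names, and the `likes` variable are all hypothetical.

```python
class ToyIndex:
    """Toy index that stores text postings and per-document variables separately."""

    def __init__(self):
        self.postings = {}   # term -> set of document ids (built from text)
        self.variables = {}  # document id -> {variable name: numeric value}

    def add_document(self, doc_id, text, variables=None):
        # Indexing the text is the expensive step and happens once per document.
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        self.variables[doc_id] = dict(variables or {})

    def update_variables(self, doc_id, **changes):
        # Only the signal store changes; the text postings are left untouched.
        self.variables[doc_id].update(changes)

    def search(self, term, boost_by):
        # Match on text, then rank by the current value of a signal.
        hits = self.postings.get(term.lower(), set())
        return sorted(
            hits,
            key=lambda doc_id: self.variables[doc_id].get(boost_by, 0),
            reverse=True,
        )


index = ToyIndex()
index.add_document("a", "solr replication tips", {"likes": 1})
index.add_document("b", "solr faceting guide", {"likes": 5})
before = index.search("solr", boost_by="likes")

index.update_variables("a", likes=10)  # e.g. the post was shared many times
after = index.search("solr", boost_by="likes")
```

Because the ranking reads the variables at query time, a change in likes or shares shows up in the next search without touching the postings built from the text.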
For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.
Happy hacking!
Friday, 9 December 2011
Factual API Powered by Node.js and Redis
Continuing my search for non-trivial Node.js + NoSQL database applications, here’s Factual’s stack for serving their API:

Factual architectural components:
- Varnish
- HAProxy
- Node.js
- Redis
- Solr
Why Node.js?
We chose Node because of three F’s: it’s fast, flexible, and familiar. In particular, the flexibility is what allowed us to use our Node layer to handle things like caching logic and load balancing, in addition to the aforementioned authentication and authorization. To make our Node layer scalable, we use multiple instances of Node tied together with Redis to keep things in sync.
Also worth mentioning is that data served through the Factual API is always JSON, so having a server-side JavaScript engine also reduces the need to convert data between formats.
Monday, 11 July 2011
Seven Java Projects That Changed the World
Over the last decade, several projects have traveled beyond mere adoption and had effects dominating the Java world, into software development in general, and some even further into the daily lives of users.
Not sure how Edd Dumbill came up with the list[1], but it includes Solr (and implicitly Lucene) and Hadoop. I concur.
[1] The list looks good to me.
Tuesday, 24 May 2011
ThriftDB: The Amazon Web Services of Search
ThriftDB presented today at TechCrunch Disrupt:
Technically speaking, ThriftDB is a flexible key-value datastore with search built in that has the flexibility, scalability, and performance of a NoSQL datastore with the capabilities of full-text search. Essentially, what this means is that, by combining the datastore and the search engine, ThriftDB is offering a service that makes it easy for developers to build fast, horizontally-scalable applications with integrated search.
The website says ThriftDB is a document database built on top of Thrift with full-text search support. I’m not really sure about the “Amazon Web Services of Search” part, but it sounds like it would compete with MarkLogic, ElasticSearch, Solr, and so on.
via: http://techcrunch.com/2011/05/24/thriftdb-wants-to-be-the-amazon-web-services-of-search/
Tuesday, 10 May 2011
The HBase+Solr CMS Lily Reaches 1.0
Lily, the only CMS built on top of HBase and using Solr as its search engine, has reached the 1.0 version.
Lily is dead serious about Scale. The Lily repository has been tested to scale beyond any common content repository technology out there, due to its inherently distributed architecture, providing economically affordable, robust, and high-performing data management services for any kind of enterprise application.
Outerthought has talked in the past about their technical choices: