SOLR: All content tagged as SOLR in NoSQL databases and polyglot persistence
It’s unfortunate that the post focuses mostly on the usage of Spring and RabitMQ and the slidedeck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistentency architecture:
The slide deck presenting architecture principles and numbers about the platform after the break.
The tl;dr version is: DataStax has announced
Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0
For the longer version, there are a couple of new things worth emphasizing in this release:
- Fully integrated enterprise search
- RDBMS data migration
- Snap-in application log ingestion
- improvements to OpsCenter
- Elastic workload provisioning
Let’s take these one by one:
Fully integrated enterprise search or Solr on top of Cassandra
Cassandra distribution model is strongly inspired by Amazon Dynamo being characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from being simple or offering an out of the box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.
Actually this has already been done in the form of Solandra (nb Solr integration in DataStax Enter. 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.
DataStax has already showed this direction with what was called initially Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster that became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will be the 3.0?
There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I still need to see a SQL-like language translation.
As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers so there’ll be a follow up with more details.
RDBMS data migration: it must be Sqoop
Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop at rescue.
Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j
When I read this bullet point my first thought was this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exist in every Java project, but it still feels like an odd choice. On the other hand there’s a Cassandra plugin for Flume.
OpsCenter Enterprise 2.0
The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:
- multi-cluster monitoring
- visual backup
- search monitoring
Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.
DataStax OpsCenter Enterprise:
Elastic workload provisioning
I’ve left at the end the feature that got me most interested into: elastic workload provisioning.
To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).
Workload management is a feature present in most of the commercial data warehouse solutions. Even if in the very early days, DataStax Enterprise’s workload provisioning is the first take towards workload management in the NoSQL space.
Original title and link: Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0 ( ©myNoSQL)
Just a quick roundup of the latest releases and announcements.
Hortonworks Data Platform (HDP) version 2
HDP v2 will include:
- NextGen MapReduce architecture
- HDFS NameNode HA
- HDFS Federation
- up-to-date HCatalog, HBase, Hive, Pig
According to the announcement:
In order to avoid confusion, let me explain the two versions of HDP:
- HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
- HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.
SolrCloud Completes Phase 2
Mark Miller about the completion of phase 2:
The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.
Not there yet, but it’s coming.
DataStax Community Server 1.0.7
A new release of DataStax’s distribution of Cassandra incorporating Cassandra 1.0.7
Don’t let the version number trick you. This is an important release for HBase featuring:
- new (self-migrating) file format
- AWS improvements: EBS support, building a HA cluster
I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:
Sunil Sitaula posted two articles on Cloudera’s blog about archiving emails on Hadoop: part 1 and part 2. But even if I read the posts twice I couldn’t find a clear answer to the question: why would one do it this way.
Sunil provides a general explanation, but the two articles fail to present the real advantages of using Hadoop for solving this problem.
Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.
Let’s look again at the problem and see what the requirements are:
- store a large and continuously growing amount of messages
- retrieve messages either directly (key-based access) or by searches (full text indexing)
The underlying storage of Hadoop, HDFS would bring to the table a reliable, scalable, and cost effective storage solution. But using HDFS would also require having a custom ETL process—transforming email messages into something to be stored in HDFS is described in the first post:
If you are dealing with millions of files, one way of sharing (partitioning them) would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This will limit the number of message files you need to put into HDFS to something that is more suitable, 1-2 million at a time given the NameNode memory footprint of each file.
Nonetheless a completely different system would be needed for providing access to the stored messages. The second post introduces Lucene and Solr for dealing with message retrieval, but setting them up to take advantage of the same infrastructure can get complicated:
Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.
Bottom line, it looks like the article suggests using two almost separated tools to solve the initial problem. And that makes me think that another better solution exists.
Original title and link: Hadoop and Solr for Archiving Emails ( ©myNoSQL)
Today LinkedIn has announced that they are open sourcing the technology behind IndexTank, a company they acquired back in October. IndexTank was offering a hosted, scalable full-text search API.
When reading the announcement, I’ve asked myself two questions: what is IndexTank and how does IndexTank compare to Lucene and Solr.
The answer to the the first one is provided in the post.
What is Index Tank? IndexTank is mainly three things:
- IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs).
- API: a RESTful interface that handles authentication, validation, and communication with the IndexEngine(s). It allows users of IndexTank to access the service from different technology platforms (Java, Python, .NET, Ruby and PHP clients are already developed) via HTTP.
- Nebulizer: a multitenant framework to host and manage an unlimited number of indexes running over a layer of Infrastructure-as-a-Service. This component of IndexTank will instantiate new virtual instances as needed, move indexes as they need more resources, and try to be reasonably efficient about it.
For the second, I’ve reached out the the old IndexTank FAQ.
How does IndexTank compare to Lucene and Solr?
- IndexTank was a hosted, scalable service
- IndexTank can add documents to the index
- IndexTank supports updating document variables without re-indexing
- IndexTank supports geolocation functions
For more details there’s a paper by Alejandro Perez covering IndexTank and other search solutions.
Original title and link: LinkedIn Open Sources IndexTank: What Is IndexTank and How Does It Compare to Lucene and Solr ( ©myNoSQL)