XML database: All content tagged as XML database in NoSQL databases and polyglot persistence
- Put XML into an XML database, objects into an Object Database, JSON into a document database, relational data into a relational database and you’ll get the best results
- the better the data store understands the structure of your data, the better search results should be
Original title and link: MarkLogic, LexisNexis, XML, and Search (©myNoSQL)
I rarely write about MarkLogic, but the amount of information that hit me about the newly released MarkLogic 5 made me curious. Below are quotes and commentary about MarkLogic 5, the new MarkLogic Express, and MarkLogic and Hadoop integration.
MarkLogic is a next generation database for Big Data and unstructured information. MarkLogic empowers organizations to make high stakes decisions on Big Data in real time.
Until now, I thought of MarkLogic as an XML database with powerful search capabilities. This new message makes it sound like MarkLogic is a Big Data analytics or BI tool, which I don’t think is the most accurate description.
MarkLogic Confidence at Scale
There are a couple of new features falling into this category, as presented in the press release:
Database Replication – protect your mission-critical information from site-wide disasters and reduce the cost of downtime
MarkLogic 5 features the ability to keep a “hot copy” of the database in another data center for quick failover in the event of a disaster, as well as a journal-archiving function that allows a database to be restored to a particular point in time.
Point-in-Time Recovery – recover from backups to a specific point-in-time then roll forward using the transaction log to a specific point-in-time, minimizing the window for lost data between the occurrence of a disaster and the time the last backup was taken.
I’m not sure how it works or what the requirements are for getting it to work, but point-in-time recovery sounds like a very interesting feature.
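The general idea behind point-in-time recovery can be sketched independently of any product: start from the last backup, then replay journaled changes up to the chosen moment. The snippet below is a minimal illustration of that idea only. It is not MarkLogic's implementation or API, and the names `JournalEntry` and `restore_to_point_in_time`, the integer timestamps, and the key/value model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class JournalEntry:
    """One journaled change (hypothetical shape, for illustration only)."""
    timestamp: int          # logical commit time
    key: str                # document identifier
    value: Optional[str]    # new content, or None for a delete


def restore_to_point_in_time(
    backup: Dict[str, str],
    backup_time: int,
    journal: List[JournalEntry],
    target_time: int,
) -> Dict[str, str]:
    """Rebuild the state as of target_time from a backup plus a journal."""
    if target_time < backup_time:
        raise ValueError("target_time is older than the backup")

    state = dict(backup)  # never mutate the backup itself
    for entry in sorted(journal, key=lambda e: e.timestamp):
        if entry.timestamp <= backup_time:
            continue      # already reflected in the backup
        if entry.timestamp > target_time:
            break         # stop rolling forward at the requested point
        if entry.value is None:
            state.pop(entry.key, None)
        else:
            state[entry.key] = entry.value
    return state


# Example data: a backup taken at time 10 and a journal that continues past it.
backup = {"doc1": "v1"}
journal = [
    JournalEntry(5, "doc1", "v1"),
    JournalEntry(12, "doc2", "a"),
    JournalEntry(15, "doc1", "v2"),
    JournalEntry(20, "doc2", None),
]
state_at_15 = restore_to_point_in_time(backup, 10, journal, 15)
state_at_25 = restore_to_point_in_time(backup, 10, journal, 25)
```

In this sketch, the window for lost data is bounded by how far the archived journal reaches rather than by when the last backup was taken, which is the benefit the press release describes.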
Enterprise Big Data
- Simplified Monitoring — new monitoring and management features enable organizations to see system status at a glance with real-time charts of metrics such as I/O rates and loads, request activity, and disk usage.
- Monitoring Plug-Ins — integration with HP Operations Manager and Nagios
- Tiered Storage – expand Big Data performance by implementing a solid state disk (SSD) tier between memory and disk
This last feature prepares MarkLogic for the future by letting it work intelligently with different kinds of storage. Ron Avnur (CTO, MarkLogic), interviewed by Chris Kanaracus:
We realized people have rotational drives and network-attached storage, and are starting to play more seriously with solid-state. These have different performance profiles.
System administrators will tell MarkLogic where and what the options for storage are, and the system will “do all the optimization.” In this way, more frequently used data can be kept in flash and older or less frequently accessed information held elsewhere.
I’m not aware of other solutions being able to play smart with heterogeneous storage deployments.
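To make the tiering idea concrete, here is a rough sketch of one possible placement policy: rank items by how often they are accessed and fill the fastest tier first. This is an assumption-laden illustration, not how MarkLogic decides placement. The function name `place_on_tiers`, the tier names, and the item-count capacities are all hypothetical.

```python
from typing import Dict, List, Tuple


def place_on_tiers(
    access_counts: Dict[str, int],
    tiers: List[Tuple[str, int]],
) -> Dict[str, str]:
    """Assign each item to a storage tier by access frequency.

    `tiers` is ordered fastest first as (name, capacity_in_items).
    The last tier is treated as unbounded and absorbs whatever is left,
    so its capacity value is ignored.
    """
    if not tiers:
        raise ValueError("at least one tier is required")

    # Hottest items first; ties are broken by key for a stable result.
    ranked = sorted(access_counts, key=lambda k: (-access_counts[k], k))

    placement: Dict[str, str] = {}
    start = 0
    for name, capacity in tiers[:-1]:
        for key in ranked[start:start + capacity]:
            placement[key] = name
        start += capacity
    for key in ranked[start:]:
        placement[key] = tiers[-1][0]
    return placement


# Example: one slot on SSD, two on local disk, the rest on network storage.
access_counts = {"a": 100, "b": 50, "c": 5, "d": 1}
tiers = [("ssd", 1), ("disk", 2), ("nas", 0)]
placement = place_on_tiers(access_counts, tiers)
```

A real system would also have to account for data size, recency, and the cost of moving data between tiers. The sketch only shows the core "hot data on fast storage" decision described in the interview.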
MarkLogic Connector for Hadoop
The MarkLogic Connector for Hadoop powers large-scale batch processing for Big Data Analytics on the structured, semi-structured, and unstructured data residing inside MarkLogic. Using MarkLogic for real time analytics with Hadoop for batch processing brings the best of Big Data to companies that need real time, secure, enterprise applications that are cost effective with high performance. With simple drop-in installation, organizations can run MapReduce on data inside MarkLogic and take advantage of Hadoop’s development and management tools, all while being able to leverage MarkLogic’s indexes and distributed architecture for performance. This combination results in enhanced search, analytics, and delivery in MarkLogic, and enables organizations to progressively enhance data without having to remove it from the database.
MarkLogic sees Hadoop as being able to support MarkLogic for various uses. For example, an intelligence-gathering organization could collect data that is into hundreds of petabytes, not understanding what exactly is there, but then decide to investigate a particular topic in-depth. In such a scenario, users would want to use MarkLogic for interaction with this content, asking questions and getting answers in sub-second time, and then asking other questions and exploring the data for insights. However, Hunter explains, because the data is so large it would probably not be cost-effective to load hundreds of petabytes of data into MarkLogic if they don’t have to, and so they can load the data into Hadoop and run a Hadoop job to select the portion of the content that it makes sense to do real-time analytics against and load that into MarkLogic for interactive queries. “So you go from hundreds of petabytes down to one petabyte, or half a petabyte, do bulk load and do interactive queries against it.”
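The workflow in that scenario (a batch job narrows a huge corpus to the relevant slice, which is then bulk-loaded for interactive queries) can be sketched in a few lines. The code below is a toy, in-memory stand-in. It uses neither Hadoop nor the MarkLogic connector, and `select_topic` and `InteractiveStore` are hypothetical names introduced only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Record = Tuple[str, str]  # (document id, text)


def select_topic(records: Iterable[Record], topic: str) -> Iterator[Record]:
    """Batch step: stream through the full corpus and keep matching records."""
    needle = topic.lower()
    for doc_id, text in records:
        if needle in text.lower():
            yield doc_id, text


class InteractiveStore:
    """Stand-in for the interactive database: a tiny inverted index."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def bulk_load(self, records: Iterable[Record]) -> None:
        for doc_id, text in records:
            self._docs[doc_id] = text
            for word in text.lower().split():
                self._index[word].add(doc_id)

    def search(self, word: str) -> List[str]:
        return sorted(self._index.get(word.lower(), set()))

    def __len__(self) -> int:
        return len(self._docs)


# Example: only the "pipeline" slice of the corpus is loaded for querying.
corpus = [
    ("d1", "Pipeline outage in the north region"),
    ("d2", "Quarterly earnings report"),
    ("d3", "North pipeline repair schedule"),
]
store = InteractiveStore()
store.bulk_load(select_topic(corpus, "pipeline"))
north_hits = store.search("north")
earnings_hits = store.search("earnings")
```

The design point is that the expensive full scan happens once, in batch, while follow-up questions run against the much smaller indexed subset.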
MarkLogic Express, a new MarkLogic 5 license that allows students and developers to download and take MarkLogic into production immediately.
MarkLogic Express includes geospatial capabilities, alerting, and can be used in production environments. That means a developer can take a MarkLogic implementation that leverages a 2 CPU node and up to 40 GB of data live.
Josette Rigsby points out some more limitations of the Express version:
- Can’t combine with another licensed install of MarkLogic
- Can’t be used for work on behalf of the U.S. Federal Government
- No clustering
- Can’t run multiple production copies of Express for the same application
- Cannot be used by development teams — note: this point is very confusing.
It looks like MarkLogic is acknowledging the power developers hold in today’s organizations and has decided to give them access to the product. While I don’t think the current restrictions would allow someone to go into production with MarkLogic Express, I still believe it is better than nothing. I’ve also read that students and researchers could get access to a less restrictive version, which is easy to appreciate.
MarkLogic 5 also includes some features that will probably appeal to their users (rich media support, document filters, query console, REST-based API, distributed transaction support, geo-support).
I’m leaving you with Curt Monash’s comments:
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the getgo for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
and a short video of MarkLogic CTO, Ron Avnur summarizing the release:
In case it wasn’t obvious, I don’t like XML as a storage format, nor did I like XML databases.
Original title and link: MarkLogic 5: Confidence at Scale, Enterprise Big Data, Hadoop Connector, Express Edition (©myNoSQL)