


analytics: All content tagged as analytics in NoSQL databases and polyglot persistence

Examples of analytics applications across industries

A great matrix of the different analytics use cases across industries in Hortonworks’s post “Enterprise Hadoop and the Journey to a Data Lake”:

Analytics use cases

The data type columns cover multiple dimensions of data, and the authors took a conservative approach to the structured and unstructured categories (in the sense that they marked very few of them as unstructured).

A couple of interesting exercises that can be done using this matrix as an input:

  1. figure out how adding data from a different category to a specific use case would benefit it. One obvious example: how would telecom companies benefit from adding social data to their infrastructure analysis?

    Building on the above, determine what tools exist to support this extra scenario.

  2. can one use case from an industry be applied to a different industry to disrupt it?

    What would be the quickest road to accomplish it?

Original title and link: Examples of analytics applications across industries (NoSQL database©myNoSQL)

Instant in-memory analytics via drag & drop


Simply drag & drop text files or spreadsheets into TARGIT Xbone, and immediately use TARGIT’s high-performance analytics on the data within. The power of TARGIT Xbone lies in its simplicity: artificial intelligence automatically detects the potential hierarchies (e.g., time, product, and customer relationships). Dimension detection is lightning-fast when the operations are in-memory. And automating this process incorporates some quality control: users can’t create “dirty dimensions” (many-to-many relationships), and by the same token, users receive the same results when collaborating on data synchronized or shared across multiple devices.
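The hierarchy detection described above can be reduced to a simple rule: a column pair can form a clean drill-down level only if each child value maps to exactly one parent value; otherwise it is a many-to-many (“dirty”) relationship. A minimal Python sketch of that check — the function name and sample data are illustrative, not TARGIT’s implementation:

```python
from collections import defaultdict

def is_clean_hierarchy(rows, child, parent):
    """Return True if every value of `child` maps to exactly one value
    of `parent` -- i.e. the pair can form a drill-down level without a
    many-to-many ("dirty") relationship."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return all(len(p) == 1 for p in parents.values())

rows = [
    {"city": "Aarhus", "country": "Denmark"},
    {"city": "Copenhagen", "country": "Denmark"},
    {"city": "Portland", "country": "USA"},
]
print(is_clean_hierarchy(rows, "city", "country"))   # True: each city has one country
print(is_clean_hierarchy(rows, "country", "city"))   # False: one country, many cities
```

Running this check over every column pair (and keeping only the clean ones) yields the candidate hierarchies automatically.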

At a time when even month-old startups talk about millions and millions of rows, while the big guys are already processing petabytes, it’s good to know that you can have instant, almost magical data analysis by dragging and dropping text files and spreadsheets.

Original title and link: Instant in-memory analytics via drag & drop (NoSQL database©myNoSQL)


Exploring Google Analytics Data With Clojure, Incanter, and MongoDB

Arnold Matyasi posted 4 articles (with Clojure code, charts, and explanations) on how to analyze Google Analytics data locally with Clojure, Incanter, and MongoDB:

  1. Part 1: exporting data, setup, Clojure helper functions
  2. Part 2: first charts
  3. Part 3: grouping data
  4. Part 4: implementing weighted sort
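The weighted sort in Part 4 addresses a common analytics problem: sorting by a raw rate lets low-traffic outliers dominate the top of the list. One simple way to implement it is to shrink each row’s metric toward the overall mean in proportion to its volume. A hedged Python sketch — the smoothing formula and the `k` parameter are one possible interpretation, not necessarily the article’s exact implementation:

```python
def weighted_sort(rows, metric, volume, k=100.0):
    """Rank rows by `metric`, shrunk toward the volume-weighted mean in
    proportion to `volume`, so low-traffic outliers don't dominate."""
    total = sum(r[volume] for r in rows)
    mean = sum(r[metric] * r[volume] for r in rows) / total
    def score(r):
        n = r[volume]
        return (n * r[metric] + k * mean) / (n + k)
    return sorted(rows, key=score, reverse=True)

pages = [
    {"page": "/a", "bounce": 0.9, "visits": 3},      # high rate, tiny sample
    {"page": "/b", "bounce": 0.7, "visits": 5000},   # high rate, large sample
    {"page": "/c", "bounce": 0.2, "visits": 4000},
]
for r in weighted_sort(pages, "bounce", "visits"):
    print(r["page"], r["bounce"])
```

With the raw sort, `/a` would rank first on three visits; after shrinkage, the well-sampled `/b` correctly takes the top spot.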

Original title and link: Exploring Google Analytics Data With Clojure, Incanter, and MongoDB (NoSQL database©myNoSQL)

Big Data, Unstructured Data, and In-Memory Analytics

Two interesting quotes from Teradata’s CTO Stephen Brobst interview with Vinita Gupta (InformationWeek):

Structured vs unstructured data:

I don’t believe that any data is unstructured. We have to overcome this myth that anything that is not in rows or columns is unstructured. The blogs and videos are structured, but non-traditional data.

I think of unstructured data as:

  1. data from which various different structured data can be extracted

    The simplest example is web logs. They contain various bits of information that could be each used for different investigations.

  2. data about the same entities taking various forms

    The simplest example is click streams coming from different sources (e.g. a shared video on YouTube/Vimeo/Twitter etc.). All this data is needed for analysis, but it comes back in different forms.
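The web-log example above can be made concrete: a single raw log line yields several independent structured views, each feeding a different investigation. A minimal Python sketch, assuming an Apache-style common log format — the field names and sample line are illustrative:

```python
import re

# Hypothetical Apache-style common-log line (illustrative, not from the interview).
LOG = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /pricing HTTP/1.1" 200 2326'
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d+) (?P<size>\d+)')

rec = PATTERN.match(LOG).groupdict()

# The same raw line feeds several different structured extractions:
traffic_fact = (rec["path"], int(rec["status"]))   # content/error analysis
client_fact = (rec["ip"], rec["ts"])               # audience/session analysis
volume_fact = (rec["path"], int(rec["size"]))      # capacity planning
print(traffic_fact, client_fact, volume_fact)
```

The point of calling the log “unstructured” is not that it lacks structure, but that no single extraction exhausts it.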

In-memory analytics:

Some of our competitors, who talk about in-memory analytics in India, do not understand analytics because the cost per terabyte of in-memory is at least 50 times the cost of mechanical disk drives. […] From the massive data available, we frequently access only 20 percent of the data. So, customers want that 20 percent of data to be in high-performance storage and the remaining 80 percent of the data to be in low-cost storage. CIOs want an environment that allows both — optimization for price and performance and optimization for price and storage.
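The arithmetic behind the 20/80 argument is easy to check. A back-of-the-envelope Python sketch using the interview’s “at least 50 times” multiplier — the disk $/TB figure is a made-up placeholder, not a quoted price:

```python
# Back-of-the-envelope check of the 20/80 tiering argument.
disk_cost_per_tb = 30.0                  # hypothetical $/TB for mechanical disk
memory_cost_per_tb = 50 * disk_cost_per_tb  # the interview's "at least 50x"
total_tb = 100.0
hot_fraction = 0.20                      # the ~20% of data actually accessed often

all_in_memory = total_tb * memory_cost_per_tb
tiered = (total_tb * hot_fraction * memory_cost_per_tb
          + total_tb * (1 - hot_fraction) * disk_cost_per_tb)

print(f"all in memory: ${all_in_memory:,.0f}")       # $150,000
print(f"tiered 20/80:  ${tiered:,.0f}")              # $32,400
print(f"savings: {1 - tiered / all_in_memory:.0%}")  # 78%
```

Whatever the actual disk price, the ratio is what matters: keeping only the hot 20% in memory cuts the storage bill by roughly three quarters under these assumptions.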

This sounds extremely familiar.

Original title and link: Big Data, Unstructured Data, and In-Memory Analytics (NoSQL database©myNoSQL)


GridGain and Hadoop: About Fundamental Flaws

Would you run your analytics today off the tape drives? That’s what you do when you use Hadoop MapReduce.

The fundamental flaw in Hadoop MapReduce is an assumption that a) storing data and b) acting upon data should be based off the same underlying storage.

What Hadoop offers is an approach for problems where keeping all data in memory is close to impossible and definitely not cost effective. What the GridGain data grid offers is an approach for cases where keeping data in memory is cost effective. Neither of these assumptions is a fundamental flaw.

The only fundamental flaw is positioning a product by making wrong assumptions about the alternative solutions. We’ve seen this before: NoSQL Wants To Be Elastic Caching When It Grows Up… Does It Really? and In-Memory Elastic Databases.

Original title and link: GridGain and Hadoop: About Fundamental Flaws (NoSQL database©myNoSQL)


Hadoop, HBase and R: Will Open Source Software Challenge BI & Analytics Software Vendors?

Harish Kotadia:

Predictive Analytics has been billed as the next big thing for almost fifteen years, but hasn’t gained mass acceptance so far the way ERP and CRM solutions have. One of the main reason for this is the high upfront investment required in Software, Hardware and Talent for implementing a Predictive Analytics solution.

Well, this is about to change – […] Using R, HBase and Hadoop, it is possible to build cost-effective and scalable Big Data Analytics solutions that match or even exceed the functionality offered by costly proprietary solutions from leading BI/Analytics software vendors at a fraction of the cost.

Vendors will argue that software licensing represents just a small fraction of the costs of implementing BI or data analytics. What they’ll leave out are the costs of acquiring know-how and, more importantly, of maintaining and modernizing their solutions.

Original title and link: Hadoop, HBase and R: Will Open Source Software Challenge BI & Analytics Software Vendors? (NoSQL database©myNoSQL)


Hadoop and NoSQL in a Big Data Environment with Ron Bodkin

Ron Bodkin, interviewed by Michael Floyd on InfoQ, describes the growing adoption of Hadoop:

People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop complement the traditional data warehouses I just mentioned, where the goal is to take some of the pressure off the data warehouse, to process less structured data more effectively, and to do transformations and build summaries and aggregates without having to load all that data into the data warehouse. But then the next thing that happens is that once people have started doing that level of processing, they realize there is a power in being able to ask questions of the data they never thought of before: they can store all the data, not small samples, and they can go back with a powerful query engine, a cluster of commodity machines, that lets them dig into that raw data and analyze it in new ways, ultimately leading to data science: being able to do machine learning, to discover patterns in data, and to keep improving and refining the results.

The interview is only 16 minutes long, and a full transcript is available.

Original title and link: Hadoop and NoSQL in a Big Data Environment with Ron Bodkin (NoSQL database©myNoSQL)

Statistical Advances: The Maximal Information Coefficient, a New Method to Uncover Hidden Data Relationships

Yakir Reshef (main researcher):

“If you have a data set with 22 million relationships, the 500 relationships in there that you care about are effectively invisible to a human.”

The statistical method that Reshef and his colleagues have devised aims to crack those problems. It can spot many superimposed correlations between variables and measure exactly how tight each relationship is, on the basis of a quantity that the team calls the maximal information coefficient (MIC). The MIC is calculated by plotting data on a graph and looking for all ways of dividing up the graph into blocks or grids that capture the largest possible number of data points. MIC can then be deduced from the grids that do the best job.
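The grid-search idea can be illustrated with a toy version: bin the points into grids at a few resolutions, compute the mutual information of each grid, normalize it, and keep the best score. This Python sketch captures only the spirit of MIC; the published estimator searches far more grid shapes and placements, much more cleverly:

```python
import math
from itertools import product

def mutual_info(xs, ys, bx, by):
    """Mutual information (nats) of points binned into a bx-by-by grid."""
    n = len(xs)
    def bins(vals, b):
        lo, hi = min(vals), max(vals)
        w = (hi - lo) / b or 1.0
        return [min(int((v - lo) / w), b - 1) for v in vals]
    cx, cy = bins(xs, bx), bins(ys, by)
    joint = {}
    for i, j in zip(cx, cy):
        joint[(i, j)] = joint.get((i, j), 0) + 1
    px = [cx.count(i) / n for i in range(bx)]
    py = [cy.count(j) / n for j in range(by)]
    return sum((c / n) * math.log((c / n) / (px[i] * py[j]))
               for (i, j), c in joint.items())

def mic_sketch(xs, ys, max_bins=4):
    """Toy MIC: best normalized mutual information over a few grid
    resolutions. MI <= log(min(bx, by)), so scores fall in [0, 1]."""
    best = 0.0
    for bx, by in product(range(2, max_bins + 1), repeat=2):
        score = mutual_info(xs, ys, bx, by) / math.log(min(bx, by))
        best = max(best, score)
    return best

xs = [i / 10 for i in range(20)]
noisy_line = [x + 0.01 * ((i * 7) % 5 - 2) for i, x in enumerate(xs)]
print(mic_sketch(xs, noisy_line))  # close to 1 for a near-perfect relationship
```

A strong relationship of any shape concentrates points into few grid cells, driving the best normalized score toward 1; independent variables spread across cells and score near 0.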

The original article, Detecting Novel Associations in Large Data Sets, was published in Science, but is behind a paywall.

Original title and link: Statistical Advances: The Maximal Information Coefficient a New Method to Uncover Hidden Data Relationships (NoSQL database©myNoSQL)


Big Data Focus Shifting to Analytics and Visualization

Jeff Kelly:

To reiterate, there’s still plenty of work to do on the infrastructure layer of Hadoop and other Big Data approaches. But the focus of the Big Data industry is — and should be — moving to include analytics and visualization.

Put differently, data is not the end goal.

Original title and link: Big Data Focus Shifting to Analytics and Visualization (NoSQL database©myNoSQL)


R: the Leading Statistics Language and Key Weapon in Advanced Analytics Today

David Smith (Revolution Analytics):

Of course, this isn’t the first time that R has been embedded into a data warehousing appliance. IBM Netezza’s iClass device integrates with Revolution R, and AsterData, the Teradata Data Warehouse Appliance, and Greenplum all provide connections to R as well. Here at Revolution Analytics, we think that such enterprise-level integrations with R serve to grow the R ecosystem and serve as validation of R as a key platform for advanced analytics. As CEO Norman Nie said to GigaOm this weekend:

“Oracle’s announcement to embed R demonstrates validation for the leading statistics language and offers further evidence that R is a key weapon in advanced analytics today”

And let’s not leave aside the strategic partnership between Revolution Analytics and Cloudera to include RevoConnectR in the CDH.

Original title and link: R: the Leading Statistics Language and Key Weapon in Advanced Analytics Today (NoSQL database©myNoSQL)


An Alternative Approach for Big Data Real Time Analytics

Starting from the architecture of Facebook’s realtime analytics presented in the paper Apache Hadoop Goes Realtime at Facebook and Dhruba Borthakur’s excellent posts HDFS: Realtime Hadoop and HBase Usage at Facebook, Nati Shalom describes an alternative approach to real-time analytics using data grids, based on the following assumptions:

They had some assumptions in design that centered around the reliability of in-memory systems and database neutrality that affected what they did: for memory, that transactional memory was unreliable, and for the database, that HBase was the only targeted data store.

What if those assumptions are changed? We can see reliable transactional memory in the field, as a requirement for any in-memory data grid, and certainly there are more databases than HBase; given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?

While it’s a great read, I get the feeling something is off. Maybe this:

There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook’s working system: […] We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.

and then:

One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.
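The pattern the quote describes — acknowledging writes in memory and persisting them to the backend asynchronously — is commonly called write-behind caching. A minimal Python sketch, with the backend store abstracted to a callable; the class and method names are illustrative, not any data grid’s actual API:

```python
import queue
import threading

class WriteBehindGrid:
    """Minimal write-behind cache: reads and writes hit an in-memory
    dict, while a background worker flushes updates to a pluggable
    backend (HBase, Cassandra, MySQL, ... -- here just a callable)."""
    def __init__(self, persist):
        self.data = {}
        self._queue = queue.Queue()
        self._persist = persist
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        self.data[key] = value          # acknowledged immediately
        self._queue.put((key, value))   # persisted asynchronously

    def get(self, key):
        return self.data.get(key)       # always served from memory

    def flush(self):
        self._queue.join()              # wait until the backend catches up

    def _drain(self):
        while True:
            key, value = self._queue.get()
            self._persist(key, value)   # slow durable write happens here
            self._queue.task_done()

backend = {}
grid = WriteBehindGrid(persist=lambda k, v: backend.__setitem__(k, v))
grid.put("clicks:/home", 42)
print(grid.get("clicks:/home"))  # 42, served from memory right away
grid.flush()
print(backend)                   # the update has reached the backend store
```

The write path never blocks on the durable store, which is exactly the latency advantage claimed; the trade-off is a window during which acknowledged writes exist only in memory.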

Original title and link: An Alternative Approach for Big Data Real Time Analytics (NoSQL database©myNoSQL)


Splunk Wants to Webify Big Data

IT analytics company Splunk has received a patent for its method of organizing and presenting big data to mirror the experience of browsing links on the web. The patent validates Splunk’s unique approach to the problem of analyzing mountains of machine-generated data and hints at a future where writing big data applications doesn’t require a Ph.D.

So someone takes the philosophy of the WWW and the Semantic Web, Sir Tim Berners-Lee’s linked open data star scheme, adds the Big Data term, and gets a patent? What’s next?

Original title and link: Splunk Wants to Webify Big Data (NoSQL database©myNoSQL)