Unstructured Data: What Is It?

Paige Roberts writes in a post about integrating predictive analytics with Hadoop:

Unstructured is really a misnomer. I think it was Curt Monash who coined the term polystructured. That makes a lot more sense, since if data was truly without structure, even humans wouldn’t be able to make sense of it. In every seemingly unstructured dataset, there is some form of structure. An email has structure. A web page has structure. A Twitter stream has structure. Facebook interactions have structure. Machine generated log files have structure. But none of those structures are remotely alike. Nor are they remotely similar to the structure of a standard transactional record.

I don’t think many people think of unstructured data as data with a completely random structure. As I understand it, the term unstructured refers to three dimensions:

  1. variability: data representing the same entities can take different forms and contain different details. The simplest example I could think of is the information about a video shared on two different platforms.
  2. multi-purpose: data is not representing a single entity, but rather a set of related entities in an aggregated or compo
  3. data closer to natural language than to mathematical structure: take some ordinary English text, for example. According to the rules of grammar it has structure, but machines cannot easily interpret it (nb: maybe machine descriptiveness would be a better name for this dimension)
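
To make the variability dimension concrete, here is a minimal sketch of the video example: the same video described by two platforms in two different shapes. The platforms, field names, and values are all invented for illustration and do not come from any real API. The second record also nests comments alongside the video, which hints at the multi-purpose dimension.

```python
# Hypothetical records: every field name and value here is an assumption.
video_on_platform_a = {
    "title": "Intro to NoSQL",
    "duration_seconds": 754,
    "uploader": {"name": "alice", "subscribers": 1200},
    "tags": ["nosql", "databases"],
}

video_on_platform_b = {
    "name": "Intro to NoSQL",
    "length": "12:34",
    "owner": "alice",
    "comments": [{"user": "bob", "text": "Nice talk"}],
}


def duration_to_seconds(value):
    """Accept either an integer number of seconds or an 'MM:SS' string."""
    if isinstance(value, int):
        return value
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


def normalize(record):
    """Map either hypothetical shape onto one common structure."""
    title = record.get("title") or record.get("name")
    raw_duration = record.get("duration_seconds", record.get("length"))
    owner = record.get("uploader") or record.get("owner")
    if isinstance(owner, dict):
        # Platform A nests the uploader; keep only the name.
        owner = owner["name"]
    return {
        "title": title,
        "duration_seconds": duration_to_seconds(raw_duration),
        "owner": owner,
    }
```

Both records have structure, but a consumer has to know each shape in advance before it can treat them as the same entity. That per-source mapping work is what the term unstructured tends to stand for in practice.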

Original title and link: Unstructured Data: What Is It? (NoSQL database©myNoSQL)