NoSQL Event: All content tagged as NoSQL Event in NoSQL databases and polyglot persistence
For those of us who weren’t at Hadoop Summit 2011:
The main takeaway from Hadoop Summit 2010 was Cascalog. I predict the main takeaway from Hadoop Summit 2011 is Spark.
My essential points are that the “birthers” (where hadoop has been born) and “adopters” (where hadoop will be used in enterprises) have a strong intersection today, modulo some extras on both sides…
However, at t = 3 years from now, we can either go separate ways because of different demands… or come together […]
[Hadoop] No longer a West Coast early adopter phenomenon. Hadoop isn’t quite mainstream, but almost, not quite at enterprise level purchasing but getting close.
A 4-minute interview with Eric Baldeschwieler, CEO of Hortonworks, the Yahoo! Hadoop spin-off:
- Cloudera Enterprise 3.5: full lifecycle management of Apache Hadoop deployments, featuring the Service and Configuration Manager, the Activity Monitor, and enhancements to the Resource Manager and Authorization Manager
- Karmasphere Studio Community Hadoop Virtual Appliance for developers: a free virtual machine image including Apache Hadoop, Ubuntu Linux, the Eclipse IDE, and Karmasphere Studio Community.
Last but not least, you can read Derrick Harris’ overview post.
Original title and link: Hadoop Summit 2011 in Review (©myNoSQL)
After seeing the excerpt from Jonathan Harris’ talk at Data Scientist Summit, I really wanted to post a link to some of the videos. But they are all behind a registration gateway. In case you want to watch them (there are indeed some interesting titles), you’ll find them here.
Ryan Rosario summarizing a panel from Data Scientist Summit, featuring Pete Skomoroch (LinkedIn), Sharon Franks Chiarella (Amazon Mechanical Turk), Gil Elbaz (Factual), and Toby Segaran (Google):
you can’t turn data into a story without joining the data with, well, other data.
10gen continued its MongoDB popularization tour around the world with three events in Europe: London, Paris, and Berlin. SkillsMatter, the organizers of MongoUK, have recorded all the sessions and made them available here.
Here is the list of the talks:
- Welcome by Eliot Horowitz
- Nosh Petigara: Building your 1st MongoDB application
- Richard Kreuter: Mastering the MongoDB shell
- Meghan Gill: MongoDB community resources
- Richard Kreuter: Schema design: data as documents
- Mathias Stearn: MongoDB Internals: Storage Engine
- Graham Tackley: MongoDB at the Guardian
- Russell Smith: Geo & Capped collections with MongoDB
- Richard Kreuter: Indexing and Query Optimizer
- Geoff Watts: BSON and ZMQ
- Mathias Stearn: Administration
- Eliot Horowitz: Open Q&A with Eliot Horowitz
- Ashok Subramanian & Stephen Rose: Project Phoenix
- Philipp Krenn: Morphia: MongoDB for Java Developers
- Eliot Horowitz: Scaling with MongoDB
- Neil Bartlett: MongoDB as a backing store of Eclipse MF
- Nosh Petigara: Deployment strategies
- David Mytton: Monitoring MongoDB
- Eliot Horowitz: MongoDB Project Roadmap
If you asked me whether big data needs big budgets, I’m sure my initial answer would be: “absolutely”. And I guess I would not be alone. But is that the right answer?
While watching GigaOm’s Structure Big Data event, I came across two talks that gave me a different perspective on this question.
The first was the interview with Kevin Krim, the Global Head of Bloomberg Digital, who told the story of adopting, mining, and materializing Big Data inside a corporation that neither believed in it nor allocated large budgets to it. The result: collecting more than a terabyte of data every day from 100 data points for every pageview and running 15 different parallel algorithms to make recommendations that sometimes led to 10x clickthrough rates. The interview is embedded at the end of this post.
The second story, coming from Pete Warden, founder of OpenHeatMap, is even more exciting. Pete used a combination of the right tools deployed in the cloud to mine Facebook data: 500 million pages for $100. That was the cost before being sued by Facebook.
Pete Warden distilled his experience with these tools into a collection of data tools and open APIs, available at datasciencetoolkit.org both as an Amazon AMI to run in the cloud and as a VMware image to run locally. I highly recommend watching Pete’s talk, which I’ve embedded below.
While it depends on which definition of Big Data we use, both talks lead to a simple conclusion:
- you need imagination to get started with Big Data
- you need to use the right tools for getting good results
Is this going to work at the scale of Twitter, LinkedIn, Facebook, or Google? Probably not. But before getting to that size, you need to start somewhere. And both these talks suggest a clear answer to the question “does big data need big budgets?”: not always.
Three presentations covering the various NoSQL usages at Twitter:
Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ
Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ
Dmitriy Ryaboy on Hadoop from Devoxx 2010:
- Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
- Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
- Netflix: Amazon SimpleDB, Cassandra
- Digg: Cassandra
- SimpleGeo: Cassandra
- StumbleUpon: HBase, OpenTSDB
- Yahoo!: Hadoop, HBase, PNUTS
- Rackspace: Cassandra
And probably many more missing from the list. But that could change if you leave a comment.
ReadWriteWeb has published a very interesting story about a project presented at last week’s Strata conference that aims to reconstruct linked data from public data sources like Flickr and OpenStreetMap using a somewhat classical “fuzzy matching” approach.
build a detailed database of information about places in Afghanistan, using only public sources on the Web. The goal is to describe in detail the towns and cities including everything from names, locations and populations, as well as lists and coordinates for schools, mosques, banks and hotels.
My gut feeling is that mixing in a graph database would not necessarily make this problem easier to address, but it would offer a different angle from which to tackle it. Fuzzy matching is a search-based approach with an inductive flavor, while using a graph database could bring in a deductive approach.
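To make the “fuzzy matching” idea concrete, here is a minimal Python sketch of how place names coming from two public sources might be reconciled. It is not the project’s actual method, which the article does not detail: the `normalize`, `similarity`, and `best_match` helpers, the 0.8 threshold, and the sample spellings are all assumptions chosen for illustration, and the scoring simply relies on `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial variants line up."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0 for how alike two names are."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing reaches the threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    scored = [(similarity(name, candidate), candidate) for candidate in candidates]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical spellings, as they might appear in two different sources.
    source_a = ["Mazar-i-Sharif", "Kandahar", "Herat"]
    for name in ["Mazar-e Sharif", "Kabul"]:
        print(name, "->", best_match(name, source_a))
```

A name with no sufficiently similar counterpart yields `None`, which is where the inductive flavor shows: the match is a guess based on surface similarity, not a fact derived from known relationships between the records.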
A panel discussion on NoSQL, NoSQL databases, and relational databases, featuring Salvatore Sanfilippo, Lenz Grimmer, Filipe David Borba Manana, and a fourth person from SAPO whose name I couldn’t spell: