Extracting and Tokenizing 30TB of Web Crawl Data
All code for this 5-step process of extracting and tokenizing Common Crawl's 30TB of data is available on GitHub:
- Distributed copy to get data into a Hadoop cluster
- Filtering text/html
- Using boilerpipe for extracting visible text
- Using Apache Tika LanguageIdentifier for filtering English content
- Tokenizing using the Stanford parser
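
To make the shape of the pipeline concrete, here is a minimal single-machine sketch in Python of the last four steps (the distributed copy into Hadoop is left out). It is not the code from the linked repository, and every piece is a labeled stand-in: the standard library's `html.parser` takes the place of boilerpipe, a stopword-ratio check takes the place of Tika's LanguageIdentifier, and a regular expression takes the place of the Stanford parser. The names `is_html`, `visible_text`, `looks_english`, `tokenize`, and `process`, as well as the `0.2` threshold, are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: a tiny stopword list used as a crude English detector.
# This is NOT how Tika's LanguageIdentifier works; it is only a stand-in.
STOPWORDS = {"the", "and", "of", "to", "in", "is", "a", "that", "it", "for"}


class VisibleText(HTMLParser):
    """Collects text outside <script>/<style> (a rough stand-in for boilerpipe)."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def is_html(content_type):
    """Step: keep only records whose content type is text/html."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def visible_text(html):
    """Step: reduce a page to its visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_english(text, threshold=0.2):
    """Step: keep text where enough words are common English stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in STOPWORDS)
    return hits / len(words) >= threshold


def tokenize(text):
    """Step: split text into lowercase word tokens (regex stand-in)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def process(records):
    """records: iterable of (content_type, html) pairs; yields token lists."""
    for content_type, html in records:
        if not is_html(content_type):
            continue
        text = visible_text(html)
        if looks_english(text):
            yield tokenize(text)
```

Each function maps to one bullet, so in a Hadoop setting each could plausibly become its own map-only job over the crawl records, with the real libraries swapped in for the stand-ins.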
via: http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/