All code for this five-step process of extracting and tokenizing Common Crawl's 30TB of data is available on GitHub:
- Distributed copy to get data into a Hadoop cluster
- Filtering for text/html content (see the sketch after this list)
- Using boilerpipe to extract the visible text (sketched below)
- Using Apache Tika's LanguageIdentifier to filter for English content (sketched below)
- Tokenizing with the Stanford parser (sketched below).
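
The post doesn't spell out how the text/html filtering is done; one plausible approach, if you don't want to trust the Content-Type recorded in the crawl metadata, is Tika's detection facade. The `HtmlFilter` class below is an illustrative sketch, not the repository's actual code:

```java
import org.apache.tika.Tika;

public class HtmlFilter {
    private static final Tika TIKA = new Tika();

    /** Returns true when the payload looks like an HTML document. */
    public static boolean isHtml(byte[] payload) {
        // detect() returns a MIME type string, e.g. "text/html"
        String mimeType = TIKA.detect(payload);
        return mimeType.equals("text/html") || mimeType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        byte[] sample = "<html><body><p>Hello crawl</p></body></html>".getBytes();
        System.out.println(isHtml(sample)); // prints: true
    }
}
```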
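For the boilerpipe step, extracting the visible (main) text of a page typically goes through one of its extractors. A minimal sketch follows; the choice of `ArticleExtractor` is an assumption, as the pipeline may use a different extractor:

```java
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class VisibleTextExtractor {
    /** Strips boilerplate (navigation, ads, footers) and returns the main text. */
    public static String extract(String html) throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(html);
    }

    public static void main(String[] args) throws BoilerpipeProcessingException {
        String html = "<html><body><div id='nav'>Home | About</div>"
                    + "<p>Common Crawl makes web crawl data freely available.</p></body></html>";
        System.out.println(extract(html));
    }
}
```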
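The language filter is named explicitly: Tika's `LanguageIdentifier`. A minimal sketch of keeping English-only text; the `isReasonablyCertain()` guard is an extra check I'm adding, not necessarily what the repository does:

```java
import org.apache.tika.language.LanguageIdentifier;

public class EnglishFilter {
    /** Returns true when Tika identifies the text as English. */
    public static boolean isEnglish(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        // getLanguage() returns an ISO 639-1 code such as "en", "de", "fr", ...
        return "en".equals(identifier.getLanguage()) && identifier.isReasonablyCertain();
    }

    public static void main(String[] args) {
        System.out.println(isEnglish("Common Crawl builds and maintains an open crawl of the web."));
        System.out.println(isEnglish("Der schnelle braune Fuchs springt über den faulen Hund."));
    }
}
```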
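Finally, tokenization with the Stanford parser. The sketch below uses `PTBTokenizer`, the Penn Treebank tokenizer that ships with the parser; the exact entry point the repository uses may differ:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class Tokenizer {
    /** Splits text into Penn Treebank style tokens. */
    public static List<String> tokenize(String text) {
        PTBTokenizer<CoreLabel> tokenizer =
                new PTBTokenizer<>(new StringReader(text), new CoreLabelTokenFactory(), "");
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasNext()) {
            tokens.add(tokenizer.next().word());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // PTB conventions split the clitic: [They, 're, tokenizing, 30TB, of, web, crawl, data, .]
        System.out.println(tokenize("They're tokenizing 30TB of web crawl data."));
    }
}
```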
Original title and link: Extracting and Tokenizing 30TB of Web Crawl Data (©myNoSQL)