Mining Common Crawl

Here's an interesting article posted to Bellingcat a few years ago:


Common Crawl is a gigantic dataset that is created by crawling the web. They provide the data in both downloadable format (gigantic) or you can query against their indices and only retrieve back the information you are after. It is also 100% free, which makes it even more awesome.

That's a massive amount of internet data you can search through if you know how to write Python programs (and Python is easy to learn).
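To give a feel for what "querying the indices" looks like, here's a rough sketch that asks the Common Crawl index server which captures it holds for a URL pattern. The crawl ID below is just one example I picked (an assumption, not something from the article), and the helper names are my own. The server replies with one JSON object per line.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: one example crawl ID. The list of available crawls is
# published at https://index.commoncrawl.org/collinfo.json
CRAWL_ID = "CC-MAIN-2024-10"


def build_index_query(url_pattern, crawl_id=CRAWL_ID):
    """Return the index-server URL that lists captures matching url_pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_index_response(body):
    """Turn the newline-delimited JSON reply into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def search_index(url_pattern, crawl_id=CRAWL_ID):
    """Fetch and parse matching captures (this one needs network access)."""
    with urlopen(build_index_query(url_pattern, crawl_id), timeout=30) as resp:
        return parse_index_response(resp.read().decode("utf-8"))
```

Calling something like search_index("example.com/*") should hand back a list of records, each pointing at where the actual page content sits in the archive, so you only download the bits you care about.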

Deep searches are hard to do on Google. With Python, you can search for combinations of keywords, match on semantics (word vectors), and confine the results to a particular location and time. If you can write a Python program that specifies what you are looking for, you can find it. Google offers some of these features (keyword and time filters, for example), but it doesn't let you write a program to find exactly what you want.
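Here's a toy version of the keyword-plus-time idea, just to show how little code it takes. It assumes you've already pulled pages out of the crawl into dicts with "url", "timestamp" and "text" keys (that layout, the function names and the sample pages are all made up for illustration). It leaves out word vectors and location, which need extra libraries.

```python
from datetime import datetime


def parse_capture_time(timestamp):
    """Parse a 14-digit capture timestamp such as '20230620120000'."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def matches(page, keywords, start, end):
    """True if the page was captured in [start, end) and mentions every keyword."""
    captured = parse_capture_time(page["timestamp"])
    text = page["text"].lower()
    return start <= captured < end and all(k.lower() in text for k in keywords)


def search(pages, keywords, start, end):
    """Return the URLs of all pages that satisfy the keyword and time filters."""
    return [p["url"] for p in pages if matches(p, keywords, start, end)]


# Hypothetical sample data standing in for pages extracted from a crawl.
pages = [
    {"url": "https://example.com/a", "timestamp": "20230115080000",
     "text": "Acme Corp layoffs announced today"},
    {"url": "https://example.com/b", "timestamp": "20230620120000",
     "text": "Rumours of ACME CORP layoffs"},
    {"url": "https://example.com/c", "timestamp": "20230701090000",
     "text": "Acme Corp opens new office"},
]

hits = search(pages, ["acme corp", "layoffs"],
              datetime(2023, 6, 1), datetime(2024, 1, 1))
```

The point is that the filter is ordinary code, so you can swap in any condition you like: regexes, a list of usernames, a similarity score, whatever.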

A sufficiently determined geek can use Common Crawl to ferret out all kinds of hidden information. For example, let's say you are a billionaire who wants to know what people are saying about your company on Reddit. You can hire a geek to search through Common Crawl and find just about everything the crawl captured -- things that are impossible to find on Google. So you'd better be careful about what you write on social media! (Don't feed the trolls!)
