Common Crawl is a gigantic dataset created by crawling the web. The data is available either as a full download (which is enormous) or through indices that you can query to retrieve only the information you are after. It is also 100% free, which makes it even more awesome.
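To give a feel for the "query the indices" route, here is a minimal sketch in Python. It assumes the public index server at index.commoncrawl.org, which answers with one JSON object per line, and the data host at data.commoncrawl.org, which serves byte ranges of the archive files. The crawl ID below is only an example (a new one is published with every crawl), and the function names are made up for illustration. The sketch only builds the requests, so it runs without touching the network.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2024-10"  # assumption: substitute whichever crawl you want


def build_index_query(url_pattern, crawl_id=CRAWL_ID):
    """Build the index URL that lists captures matching url_pattern."""
    params = {"url": url_pattern, "output": "json"}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(text):
    """The index answers with one JSON object per line; parse them all."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def record_request(record):
    """Return the URL and Range header that fetch just this one capture.

    Each index record names the archive file plus the byte offset and
    length of the capture inside it, so there is no need to download
    the whole (very large) file.
    """
    offset = int(record["offset"])
    length = int(record["length"])
    url = f"{DATA_HOST}/{record['filename']}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers
```

In practice you would fetch the URL from build_index_query("reddit.com/*") with any HTTP client, pass the response text to parse_index_lines, and then request each record with the Range header from record_request. The bytes that come back are a gzip-compressed WARC record, which you can decompress with Python's gzip module.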
It gives you massive amounts of internet data that you can search through if you know how to write Python programs (Python is easy to learn).
Deep searches are hard to do on Google. With Python, you can search for a combination of keywords and semantics (word vectors), confined to a particular location and time, and so on. If you can write a Python program that specifies what you are looking for, you can find it. Google offers some of these features (like keyword and time filters), but it doesn't let you write Python programs to find exactly what you're looking for.
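As a rough illustration of what "write a program to specify what you are looking for" can mean, here is a small sketch that combines a keyword filter, a time window, and a site restriction. It assumes each page is a dict with "url", "timestamp" (the 14-digit YYYYMMDDhhmmss form used in the Common Crawl index), and "text" (the already-extracted page text). The sample pages are invented for the example, and semantic matching with word vectors is left out to keep it short.

```python
from datetime import datetime
from urllib.parse import urlparse


def parse_timestamp(ts):
    """Turn a 14-digit capture timestamp into a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def on_domain(url, domain):
    """True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def matches(page, keywords, start, end, domain=None):
    """True if the page has every keyword, falls in [start, end), and is on domain."""
    text = page["text"].lower()
    if not all(word.lower() in text for word in keywords):
        return False
    if not (start <= parse_timestamp(page["timestamp"]) < end):
        return False
    if domain is not None and not on_domain(page["url"], domain):
        return False
    return True


# Invented sample data, standing in for pages pulled out of a crawl.
pages = [
    {"url": "https://www.reddit.com/r/example/1",
     "timestamp": "20240215120000",
     "text": "Acme Corp support is slow"},
    {"url": "https://www.reddit.com/r/example/2",
     "timestamp": "20230101000000",
     "text": "Acme Corp support is great"},
    {"url": "https://blog.example.org/post",
     "timestamp": "20240220090000",
     "text": "Acme Corp support review"},
]

hits = [
    page["url"]
    for page in pages
    if matches(page, ["acme", "support"],
               datetime(2024, 1, 1), datetime(2025, 1, 1),
               domain="reddit.com")
]
print(hits)
```

Only the first page passes all three tests: the second is outside the time window and the third is on a different site. Any extra condition you can express in Python can be added to matches in the same way.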
A sufficiently determined geek can use Common Crawl to ferret out all kinds of hidden information. For example, let's say you are a billionaire who wants to know what people are saying about your company on Reddit. You can hire a geek to search through Common Crawl and find just about everything written, including things that are impossible to find on Google. So you'd better be careful about what you write on social media! (Don't feed the trolls!)