Given 10 billion URLs, how do you detect duplicate documents? Assume duplicate means that the URLs are identical.
10 billion URLs is too much data to hold on a single machine, but to build a simple version, assume for a moment that we can. We can create a hash table that maps each URL to true once it has been seen; any URL already in the table is a duplicate. How much space would this take? If we assume each URL averages around 100 characters at 4 bytes per character, 10 billion URLs come to roughly 4 TB, far more than fits in memory on one machine. When we can't store everything in memory, there are two options: store the data on disk or split it across machines.
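A minimal sketch of the in-memory version, assuming the full list fits in RAM. The function name `find_duplicates` is hypothetical, chosen only for illustration:

```python
def find_duplicates(urls):
    # Track URLs we've already seen; anything seen a second time is a duplicate.
    seen = set()
    duplicates = set()
    for url in urls:
        if url in seen:
            duplicates.add(url)
        else:
            seen.add(url)
    return duplicates
```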
The first option is to store the data on disk and make two passes over the list of URLs. In the first pass, we split the list into 4000 chunks of roughly 1 GB each by writing each URL u to a file named x.txt, where x = hash(u) % 4000. Every copy of a given URL hashes to the same value, so all duplicates of a URL end up in the same file. In the second pass, we apply the in-memory solution from before to each chunk: load the file, build a hash table, and report the duplicates. A sketch of both passes follows.
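A sketch of the two-pass, disk-backed approach under the 4 TB / 4000-chunk assumption above. The helper names, the use of MD5 as the bucketing hash, and the per-URL file writes are illustrative choices, not part of the original description:

```python
import hashlib
import os

NUM_CHUNKS = 4000  # ~4 TB of URLs split into ~1 GB chunks

def chunk_index(url):
    # Stable hash so every copy of the same URL maps to the same chunk file.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CHUNKS

def first_pass(urls, chunk_dir):
    # Pass 1: scatter URLs into files named <x>.txt where x = hash(u) % 4000.
    # (A real implementation would batch writes instead of reopening files per URL.)
    for url in urls:
        path = os.path.join(chunk_dir, f"{chunk_index(url)}.txt")
        with open(path, "a", encoding="utf-8") as f:
            f.write(url + "\n")

def second_pass(chunk_dir):
    # Pass 2: each ~1 GB chunk fits in memory, so dedupe it with a hash set.
    duplicates = []
    for i in range(NUM_CHUNKS):
        path = os.path.join(chunk_dir, f"{i}.txt")
        if not os.path.exists(path):
            continue
        seen = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                if url in seen:
                    duplicates.append(url)
                else:
                    seen.add(url)
    return duplicates
```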
The second option is to split the data across machines: instead of writing URL u to file x.txt, we send it to machine x. The advantage is parallelism, since all 4000 chunks can be processed at the same time. The drawback is that we now depend on 4000 machines operating correctly, which may not be realistic, and with that many machines we have to consider how to handle failure.
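A minimal sketch of the routing step for the distributed variant, reusing the same bucketing idea. `NUM_MACHINES` and `machine_for` are hypothetical names; fault tolerance is deliberately left out, since that is the open question noted above:

```python
import hashlib

NUM_MACHINES = 4000  # assumed cluster size, matching the chunk count

def machine_for(url):
    # The bucket index now selects a machine instead of a file; each machine
    # dedupes its own bucket locally with an in-memory hash set.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES
```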