9.4. Duplicate URLs

Given 10 billion URLs, how do you detect duplicate documents? Assume duplicate means that the URLs are identical.

10 billion URLs is far too much data to keep on a single machine, but as a simple starting point, assume we can. We create a hash table where each URL maps to true once it has been seen; any URL already in the table is a duplicate. When the data doesn't fit in memory, there are two solutions.
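A minimal sketch of this in-memory version, assuming the full list of URLs fits on one machine; the function and parameter names here are illustrative, not from the original text:

```python
def find_duplicates(urls):
    """Return the URLs that appear more than once, assuming everything fits in memory."""
    seen = set()          # plays the role of the hash table (url -> true)
    duplicates = []
    for url in urls:
        if url in seen:   # already found earlier, so it's a duplicate
            duplicates.append(url)
        else:
            seen.add(url)
    return duplicates
```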

Solution 1: Disk storage

We make two passes over the data. In the first pass, we split the list of URLs into 4000 chunks of roughly 1 GB each: we store each URL u in a file named x.txt, where x = hash(u) % 4000. All URLs with the same hash value land in the same file, so two copies of the same URL can never end up in different chunks. In the second pass, we apply the solution from before to each chunk: load the file into memory, build a hash table, and look for duplicates.
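A rough sketch of the two passes, assuming the URLs arrive as an iterable of strings and using MD5 as a stand-in for a stable hash(u); the directory layout and function names are illustrative only:

```python
import hashlib
import os

NUM_CHUNKS = 4000  # ~1 GB per chunk for the 10 billion URLs described above

def bucket(url, num_chunks=NUM_CHUNKS):
    # Stable hash so the same URL always maps to the same chunk file.
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % num_chunks

def find_duplicates_on_disk(urls, work_dir):
    # Pass 1: scatter each URL into the file named after its bucket.
    # (A real implementation would keep file handles open or buffer writes.)
    for url in urls:
        path = os.path.join(work_dir, f"{bucket(url)}.txt")
        with open(path, "a", encoding="utf-8") as f:
            f.write(url + "\n")

    # Pass 2: each chunk is small enough to dedupe with an in-memory hash table.
    duplicates = []
    for x in range(NUM_CHUNKS):
        path = os.path.join(work_dir, f"{x}.txt")
        if not os.path.exists(path):
            continue
        seen = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                if url in seen:
                    duplicates.append(url)
                else:
                    seen.add(url)
    return duplicates
```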

Solution 2: Multiple machines

Instead of storing chunk x in a file, we send each URL to machine x and let that machine run the in-memory check. The advantage is that we can parallelize the operation, so all 4000 chunks are processed at the same time. The drawback is that we now need 4000 machines, which may not be realistic, and we have to consider how to handle machine failures.
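A sketch of just the routing step, under the assumption that some transport send_to(machine_id, url) exists (it is a hypothetical stand-in for an RPC call or message queue); each machine then runs the in-memory hash-table check from the first sketch on the URLs it receives:

```python
import hashlib

NUM_MACHINES = 4000

def machine_for(url, num_machines=NUM_MACHINES):
    # Same stable-hash idea as the disk version, but the bucket now
    # selects a machine instead of a file.
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % num_machines

def distribute(urls, send_to):
    # send_to(machine_id, url) is assumed, not defined here.
    for url in urls:
        send_to(machine_for(url), url)
```

Because the hash fully determines the destination, every copy of a given URL reaches the same machine, so no cross-machine coordination is needed to detect the duplicates themselves.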