Given 10 billion URLs, how do you detect duplicate documents? Assume duplicate means that the URLs are identical.
10 billion URLs is too much data to hold on a single machine, but to build a simple version, assume for a moment that we can. We can create a hash table that maps each URL to true once it has been seen; any URL already in the table is a duplicate. How much space would this take? If we assume each URL averages around 100 characters at 4 bytes per character, 10 billion URLs come to roughly 4 TB, far more than fits in memory on one machine. When we can't store everything in memory, there are two options: store the data on disk or split it across machines.
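A minimal sketch of the in-memory version, assuming the full list fits in RAM. The function name `find_duplicates` is hypothetical, chosen only for illustration:

```python
def find_duplicates(urls):
    # Track URLs we've already seen; anything seen a second time is a duplicate.
    seen = set()
    duplicates = set()
    for url in urls:
        if url in seen:
            duplicates.add(url)
        else:
            seen.add(url)
    return duplicates
```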
The first option is to store the data on disk and make two passes over the list of URLs. In the first pass, we split the list into 4000 chunks of roughly 1 GB each by writing each URL u to a file named x.txt, where x = hash(u) % 4000. Every copy of a given URL hashes to the same value, so all duplicates of a URL end up in the same file. In the second pass, we apply the in-memory solution from before to each chunk: load the file, build a hash table, and report the duplicates. A sketch of both passes follows.
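A sketch of the two-pass, disk-backed approach under the 4 TB / 4000-chunk assumption above. The helper names, the use of MD5 as the bucketing hash, and the per-URL file writes are illustrative choices, not part of the original description:

```python
import hashlib
import os

NUM_CHUNKS = 4000  # ~4 TB of URLs split into ~1 GB chunks

def chunk_index(url):
    # Stable hash so every copy of the same URL maps to the same chunk file.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CHUNKS

def first_pass(urls, chunk_dir):
    # Pass 1: scatter URLs into files named <x>.txt where x = hash(u) % 4000.
    # (A real implementation would batch writes instead of reopening files per URL.)
    for url in urls:
        path = os.path.join(chunk_dir, f"{chunk_index(url)}.txt")
        with open(path, "a", encoding="utf-8") as f:
            f.write(url + "\n")

def second_pass(chunk_dir):
    # Pass 2: each ~1 GB chunk fits in memory, so dedupe it with a hash set.
    duplicates = []
    for i in range(NUM_CHUNKS):
        path = os.path.join(chunk_dir, f"{i}.txt")
        if not os.path.exists(path):
            continue
        seen = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                if url in seen:
                    duplicates.append(url)
                else:
                    seen.add(url)
    return duplicates
```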
The second option is to split the data across machines: instead of writing URL u to file x.txt, we send it to machine x. The advantage is parallelism, since all 4000 chunks can be processed at the same time. The drawback is that we now depend on 4000 machines operating correctly, which may not be realistic, and with that many machines we have to consider how to handle failure.
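A minimal sketch of the routing step for the distributed variant, reusing the same bucketing idea. `NUM_MACHINES` and `machine_for` are hypothetical names; fault tolerance is deliberately left out, since that is the open question noted above:

```python
import hashlib

NUM_MACHINES = 4000  # assumed cluster size, matching the chunk count

def machine_for(url):
    # The bucket index now selects a machine instead of a file; each machine
    # dedupes its own bucket locally with an in-memory hash set.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES
```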