Sunday, December 19, 2004
Detecting duplicate and near-duplicate files
This web page describes research I did for Google from 2000 through 2003,
although mostly in 2000. This work resulted in US Patent 6658423, by William
Pugh and Monika Henzinger, assigned to Google.
The information here does not reflect any information
about Google business practices or technology, other than that described in the
patent. I have no knowledge as to whether or how Google is currently applying
the techniques described in the patent. This information is not approved or
sanctioned by Google, other than by giving me permission to discuss the research
I did for them that is described in the patent.
The patent describes
techniques to find near-duplicate documents in a collection. Google is obviously
considering applying these techniques to web pages, but they could be applied to
other documents as well. It might even be possible to apply them to sequences that are not
documents (such as DNA sequences), although that raises some questions that
aren't covered here.
I'll get more information up shortly, but for now, more information on Google's patent for detecting duplicate files on the web is here: www.cs.umd.edu/~pugh/google/
