Title: similar pages implementation.
Post by puzzlecracker on Jan 30th, 2005, 10:43am
Does anyone have any ideas on how "similar pages" is implemented by Google? Any suggestions you might have in mind?

Title: Re: similar pages implementation.
Post by towr on Jan 30th, 2005, 10:49am
You could look at author, keywords, images (some websites do simply copy each other's images, but I suppose it's not frequent enough to be useful).
And I suppose, most notably, linkage. If there are a lot of pages that link to the same two pages, those two are probably similar/related.
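
Just to make the linkage idea concrete, here's a rough sketch in Python of how you could score it; the inlinks map is made-up data for illustration, not anything Google actually exposes:

[code]
def cocitation_score(page_a, page_b, inlinks):
    """Count how many pages link to both page_a and page_b."""
    return len(inlinks.get(page_a, set()) & inlinks.get(page_b, set()))

# inlinks maps each page to the set of pages that link to it
inlinks = {
    "pageX": {"blog1", "blog2", "uni.edu"},
    "pageY": {"blog1", "blog2", "news.com"},
}

print(cocitation_score("pageX", "pageY", inlinks))  # 2 shared in-links
[/code]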

Title: Re: similar pages implementation.
Post by Grimbal on Jan 30th, 2005, 4:31pm
I think Google simply gets the keywords it knows for the reference page and looks for other pages with the same keywords.  Rare keywords are probably weighted more heavily, perhaps much more.  Linking to the same pages would also indicate similarity.
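
If you wanted to value rare keywords more, one natural sketch is an IDF-style weight. Here document_frequency and total_docs are assumed inputs that would have to come from the search engine's own index; the names are just for illustration:

[code]
import math

def keyword_similarity(keywords_a, keywords_b, document_frequency, total_docs):
    """Sum weights over the keywords two pages share; rarer keywords count for more."""
    score = 0.0
    for word in keywords_a & keywords_b:
        df = document_frequency.get(word, 1)   # pages containing the word
        score += math.log(total_docs / df)     # rare word -> large weight
    return score
[/code]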

Title: Re: similar pages implementation.
Post by puzzlecracker on Jan 30th, 2005, 7:58pm
I want to extend towr's idea. The way it might be implemented is by comparing the in-links to and the out-links from a page: for similar pages, the same sites 'usually' point to them (there is probably a more precise mathematical way to say that), and similarly they point out to comparable sets of links.
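
One way to replace 'usually' with something mathematical is the Jaccard coefficient over the link sets. A quick sketch; the 50/50 weights are an arbitrary guess:

[code]
def jaccard(s, t):
    """|intersection| / |union|; 0.0 if both sets are empty."""
    return len(s & t) / len(s | t) if (s or t) else 0.0

def link_similarity(a, b, inlinks, outlinks, w_in=0.5, w_out=0.5):
    # who points at the pages (in-links) vs. what the pages point at (out-links)
    return (w_in  * jaccard(inlinks.get(a, set()),  inlinks.get(b, set())) +
            w_out * jaccard(outlinks.get(a, set()), outlinks.get(b, set())))
[/code]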

Any thoughts?

Title: Re: similar pages implementation.
Post by eviltoylet on Jan 31st, 2005, 12:46am
This is a pretty interesting question. I want to say that Google spiders the web: upon arriving at some arbitrary web page X, it records all the links on that page. Then it assumes those pages could be related to each other. It spiders those pages too, extracts keywords or even metadata, and if they look alike, keys them as similar.

Perhaps a way to find out for sure is for us to make a few websites and link them together, with different keywords or with the same keywords.
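
For the spidering step, here's a toy sketch of extracting the out-links and meta keywords from one fetched page, using only the standard library (fetching, politeness, and deduplication are all left out):

[code]
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []      # hrefs found on the page
        self.keywords = []   # contents of <meta name="keywords">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name", "").lower() == "keywords":
            self.keywords = [k.strip() for k in attrs.get("content", "").split(",")]

scanner = PageScanner()
scanner.feed('<a href="http://x.example">x</a>'
             '<meta name="keywords" content="math, puzzles">')
print(scanner.links)     # ['http://x.example']
print(scanner.keywords)  # ['math', 'puzzles']
[/code]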

Title: Re: similar pages implementation.
Post by towr on Jan 31st, 2005, 1:08am
Google also looks at the text with which pages link to each other. That's how googlebombing works.
So if the link text for two pages is the same, they are probably also related.
So math (http://mathworld.wolfram.com/) and more math (http://en.wikipedia.org/wiki/Math) probably make Google think these pages are similar. (And of course it helps that I'm linking to them from the same page, and in proximity to each other on this page.)
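
A small sketch of how that could be scored: collect the words other pages use when linking to each URL, then count the overlap. The anchor_texts list of (target URL, link text) pairs is invented data, something a crawler would have to gather:

[code]
from collections import Counter

def anchor_profile(url, anchor_texts):
    """Bag of words used in links pointing at url."""
    words = Counter()
    for target, text in anchor_texts:
        if target == url:
            words.update(text.lower().split())
    return words

def anchor_overlap(url_a, url_b, anchor_texts):
    pa = anchor_profile(url_a, anchor_texts)
    pb = anchor_profile(url_b, anchor_texts)
    return sum(min(pa[w], pb[w]) for w in pa if w in pb)

anchor_texts = [
    ("http://mathworld.wolfram.com/", "math"),
    ("http://en.wikipedia.org/wiki/Math", "more math"),
]
print(anchor_overlap("http://mathworld.wolfram.com/",
                     "http://en.wikipedia.org/wiki/Math", anchor_texts))  # 1
[/code]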

Title: Re: similar pages implementation.
Post by amichail on Jan 31st, 2005, 1:51am
I think this paper on SimRank goes further than what Google does:

http://ideas.web.cse.unsw.edu.au/index.php?module=articles&func=display&ptid=1&aid=20

My guess is that Google uses a simple cocitation algorithm.  SimRank takes this idea further in much the same way that PageRank takes an indegree link count further.
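
For anyone curious, the core SimRank recurrence from the Jeh & Widom paper is short enough to sketch directly: two pages are similar to the extent that similar pages point to them. The decay factor C=0.8 follows the paper; the graph below is a toy example:

[code]
def simrank(nodes, inlinks, C=0.8, iterations=10):
    """Iteratively compute SimRank scores for all node pairs."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = inlinks.get(a, []), inlinks.get(b, [])
                if not ia or not ib:
                    new[(a, b)] = 0.0   # a page with no in-links matches nothing
                    continue
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new
    return sim

# tiny example: x and y are both linked to by a and b
inlinks = {"x": ["a", "b"], "y": ["a", "b"]}
scores = simrank(["a", "b", "x", "y"], inlinks)
print(scores[("x", "y")])  # 0.4
[/code]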

Title: Re: similar pages implementation.
Post by Terps.Go on Feb 3rd, 2005, 5:22pm
Hehe,
Dr. Broder when worked for Ditigal (acquired by Compaq) developed a method to compare two webpages using randomized algorithm. The method is called min-wise independent. This method is used in Altavista and I think google too. Try to google min-wise independent.
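
If anyone wants to try it, here's a rough sketch of Broder's resemblance idea: turn each page into a set of word shingles and estimate their Jaccard similarity from min-hash signatures. Real implementations use carefully chosen hash families; seeded hashes stand in for the min-wise independent permutations here:

[code]
import random

def shingles(text, k=4):
    """All k-word windows of the text, as a set."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, seeds):
    """One minimum per seeded hash function."""
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def estimated_resemblance(sig_a, sig_b):
    # fraction of hash functions whose minima agree ~ Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

seeds = [random.random() for _ in range(128)]
a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"), seeds)
b = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"), seeds)
print(estimated_resemblance(a, b))  # roughly the shingle overlap, about 0.2 here
[/code]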


