wu :: forums
« wu :: forums - similar pages implementation. »

   Author  Topic: similar pages implementation.  (Read 949 times)
puzzlecracker
Senior Riddler
****



Men have become the tools of their tools

   


Gender: male
Posts: 319
similar pages implementation.  
« on: Jan 30th, 2005, 10:43am »

Does anyone have any idea how Google implements its "similar pages" feature? Any suggestions?

While we are postponing, life speeds by
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: similar pages implementation.  
« Reply #1 on: Jan 30th, 2005, 10:49am »

You could look at author, keywords, images (some websites do simply copy each other's images, but I suppose that's not frequent enough to be useful).
And, I suppose, most notably, linkage. If a lot of pages link to the same two pages, those two are probably similar or related.
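
The linkage idea above is usually called co-citation counting. Here is a minimal sketch of it, assuming the link graph is already available as a dictionary from each page to the pages it links to; the function name and the toy graph are made up for illustration and say nothing about what Google actually does.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_counts(outlinks):
    """For every pair of pages, count how many pages link to both of them."""
    counts = defaultdict(int)
    for source, targets in outlinks.items():
        # Each unordered pair of distinct targets is co-cited once by `source`.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy link graph: page -> pages it links to.
toy_links = {
    "p1": ["mathworld", "wikipedia"],
    "p2": ["mathworld", "wikipedia", "oeis"],
    "p3": ["mathworld", "oeis"],
}

counts = cocitation_counts(toy_links)
for pair, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(pair, n)
```

Pairs with a high count are candidates for "similar pages". A real system would have to normalize the counts, since very popular pages are co-cited with almost everything.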

Wikipedia, Google, Mathworld, Integer sequence DB
Grimbal
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 7527
Re: similar pages implementation.  
« Reply #2 on: Jan 30th, 2005, 4:31pm »

I think Google simply gets the keywords it knows for the reference page and looks for other pages with the same keywords.  Rare keywords are probably valued more or even much more.  Linking to the same pages also would indicate similarity.
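
One way to read "rare keywords are valued more" is inverse document frequency (IDF) weighting. The sketch below is only an illustration under that assumption: the keyword sets, the function names, and the weighted-overlap score are all hypothetical choices, not a description of Google's method.

```python
import math


def idf_weights(page_keywords):
    """Weight each keyword by log(N / number of pages containing it)."""
    n = len(page_keywords)
    doc_freq = {}
    for keywords in page_keywords.values():
        for kw in set(keywords):
            doc_freq[kw] = doc_freq.get(kw, 0) + 1
    return {kw: math.log(n / df) for kw, df in doc_freq.items()}


def keyword_similarity(a, b, weights):
    """Weighted overlap: weight of shared keywords / weight of all keywords."""
    a, b = set(a), set(b)
    shared = sum(weights.get(k, 0.0) for k in a & b)
    total = sum(weights.get(k, 0.0) for k in a | b)
    return shared / total if total else 0.0


# Hypothetical keyword sets for four pages.
pages = {
    "a": ["the", "riddle", "hanoi"],
    "b": ["the", "hanoi", "tower"],
    "c": ["the", "riddle", "cooking"],
    "d": ["the", "cooking"],
}

weights = idf_weights(pages)
for other in ("b", "c", "d"):
    print("a vs", other, keyword_similarity(pages["a"], pages[other], weights))
```

A keyword that appears on every page (here "the") gets weight zero, so sharing it contributes nothing to the score.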
puzzlecracker
Senior Riddler
****



Men have become the tools of their tools

   


Gender: male
Posts: 319
Re: similar pages implementation.  
« Reply #3 on: Jan 30th, 2005, 7:58pm »

I want to extend towr's idea. One way it might be implemented is by comparing the in-links to and out-links from each page: similar pages 'usually' (I should probably use more mathematical terminology) have the same sites pointing to them and, likewise, link out to comparable pages.
 
Any thoughts?
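
One candidate for the "more mathematical terminology" is the Jaccard coefficient (size of the intersection divided by size of the union) applied to the in-link sets and to the out-link sets. The following is a rough sketch under that assumption; averaging the two scores with equal weight is an arbitrary choice, and the toy graph is hypothetical.

```python
def jaccard(x, y):
    """Jaccard coefficient of two collections, treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def invert(outlinks):
    """Build the in-link map (page -> pages linking to it) from out-links."""
    inlinks = {}
    for source, targets in outlinks.items():
        for target in targets:
            inlinks.setdefault(target, set()).add(source)
    return inlinks


def link_similarity(a, b, outlinks, inlinks):
    """Average of in-link overlap and out-link overlap (equal weights assumed)."""
    in_sim = jaccard(inlinks.get(a, ()), inlinks.get(b, ()))
    out_sim = jaccard(outlinks.get(a, ()), outlinks.get(b, ()))
    return (in_sim + out_sim) / 2


# Hypothetical toy graph: page -> set of pages it links to.
outlinks = {
    "hub1": {"a", "b"},
    "hub2": {"a", "b", "c"},
    "a": {"x", "y"},
    "b": {"x", "z"},
    "c": {"q"},
}
inlinks = invert(outlinks)

print(link_similarity("a", "b", outlinks, inlinks))
print(link_similarity("a", "c", outlinks, inlinks))
```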

While we are postponing, life speeds by
eviltoylet
Guest


Re: similar pages implementation.  
« Reply #4 on: Jan 31st, 2005, 12:46am »

This is a pretty interesting question. I want to say that Google spiders the web: upon arriving at some arbitrary web page X, it records all the links on that page. Then it assumes that the linked pages could be related to each other. It spiders those pages and extracts keywords, or even metadata, and if they are similar, keys the pages as similar.
 
Perhaps a way to find out for sure is for us to make a few websites and link them, some with different keywords and some with the same keywords.
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: similar pages implementation.  
« Reply #5 on: Jan 31st, 2005, 1:08am »

Google also looks at the text that pages use to link to each other. That's how googlebombing works.
So if the link text for two pages is the same, they are probably also related.
So links with the text "math" and "more math" probably make Google think the linked pages are similar. (And of course it helps that I'm linking to them from the same page, and in proximity to each other on this page.)

Wikipedia, Google, Mathworld, Integer sequence DB
amichail
Senior Riddler
****





   


Posts: 450
Re: similar pages implementation.  
« Reply #6 on: Jan 31st, 2005, 1:51am »

I think this paper on SimRank goes further than what Google does:
 
http://ideas.web.cse.unsw.edu.au/index.php?module=articles&func=display&ptid=1&aid=20
 
My guess is that Google uses a simple cocitation algorithm.  SimRank takes this idea further in much the same way that PageRank takes an indegree link count further.
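
To make the comparison concrete, here is a naive sketch of the basic SimRank recurrence: a page is fully similar to itself, and two different pages are similar to the extent that the pages linking to them are similar, scaled by a decay constant. The decay value of 0.8, the fixed iteration count, and the toy graph are assumptions for illustration. The loop costs time quadratic in the number of pages per iteration, so it is only suitable for small examples.

```python
def simrank(inlinks, nodes, decay=0.8, iterations=10):
    """Naive iterative SimRank over an in-link map (page -> list of citers)."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = inlinks.get(a, ()), inlinks.get(b, ())
                if not ia or not ib:
                    # A page with no in-links is similar only to itself.
                    new[(a, b)] = 0.0
                    continue
                # Average similarity over all pairs of citing pages.
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical toy graph: root links to a and b; a links to c; b links to d.
nodes = ["root", "a", "b", "c", "d"]
inlinks = {"a": ["root"], "b": ["root"], "c": ["a"], "d": ["b"]}

scores = simrank(inlinks, nodes)
print(scores[("a", "b")], scores[("c", "d")])
```

In this toy graph no page links to both c and d, so a plain co-citation count for that pair is zero, yet SimRank still gives them a positive score because their citers a and b are themselves similar.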
« Last Edit: Jan 31st, 2005, 1:53am by amichail »

DropZap - a new kind of block elimination game
Terps.Go
Newbie
*





   


Gender: male
Posts: 13
Re: similar pages implementation.  
« Reply #7 on: Feb 3rd, 2005, 5:22pm »

Hehe,
Dr. Broder, when he worked for Digital (later acquired by Compaq), developed a method to compare two web pages using a randomized algorithm. The method is called min-wise independent permutations. It was used in AltaVista, and I think Google uses it too. Try googling "min-wise independent".
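
The min-wise idea is commonly known as MinHash: break each page into overlapping word sequences (shingles), apply many hash functions, and keep only the minimum hash value per function. The fraction of positions where two signatures agree estimates the Jaccard similarity of the shingle sets. The sketch below is a simplified illustration, not the AltaVista implementation: it uses seeded SHA-1 hashes as a stand-in for min-wise independent permutations, and the shingle size and signature length are arbitrary.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(seed, item):
    """64-bit hash of `item` under hash function number `seed`."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value of the set under each hash function."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical page texts that differ in a single word.
text_a = "the quick brown fox jumps over the lazy dog near the river bank"
text_b = "the quick brown fox jumps over the lazy cat near the river bank"

sig_a = minhash_signature(shingles(text_a))
sig_b = minhash_signature(shingles(text_b))
print(estimate_similarity(sig_a, sig_b))
```

The appeal of the method is that signatures are small and fixed in size, so pages can be compared without keeping their full text around.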