wu :: forums
« wu :: forums - links »

Welcome, Guest. Please Login or Register.
Mar 28th, 2024, 6:34pm

RIDDLES SITE WRITE MATH! Home Home Help Help Search Search Members Members Login Login Register Register
   wu :: forums
   riddles
   cs
(Moderators: william wu, towr, Icarus, SMQ, Eigenray, Grimbal, ThudnBlunder)
   links
« Previous topic | Next topic »
Pages: 1  Reply Reply Notify of replies Notify of replies Send Topic Send Topic Print Print
   Author  Topic: links  (Read 8429 times)
jhngalt
Newbie
*





   


Gender: male
Posts: 36
links  
« on: Mar 5th, 2005, 4:22pm »
Quote Quote Modify Modify


How would you design a system which spiders the links from a webpage - specifically, how would you check if a link has already been hit?
 
thanks
IP Logged
puzzlecracker
Senior Riddler
****



Men have become the tools of their tools

   


Gender: male
Posts: 319
Re: links  
« Reply #1 on: Mar 5th, 2005, 7:46pm »
Quote Quote Modify Modify

put it in some sort of hash table to determine the collision (a link has been hit). and put all links you encounter in a proirity  queue - so remove on at the time and go it.
IP Logged

While we are postponing, life speeds by
jhngalt
Newbie
*





   


Gender: male
Posts: 36
Re: links  
« Reply #2 on: Mar 5th, 2005, 8:28pm »
Quote Quote Modify Modify

on Mar 5th, 2005, 7:46pm, puzzlecracker wrote:
put it in some sort of hash table to determine the collision (a link has been hit). and put all links you encounter in a proirity  queue - so remove on at the time and go it.

 
hi.. why do you put all links you encounter into a priority queue?
thanks,
IP Logged
puzzlecracker
Senior Riddler
****



Men have become the tools of their tools

   


Gender: male
Posts: 319
Re: links  
« Reply #3 on: Mar 5th, 2005, 9:15pm »
Quote Quote Modify Modify

a queue would also be sufficient  - so that you visit all wepages in order
 
for ex.  
 
Page z has links a,b,c,d
 
page a: e, f, g, d
 :
and so on....
 
 
queue would be initially filled with a b c d  then you remove 'a'
 
and it would be filled with  b c d e f g .. then you go you go to 'b' and so on.    Also, every single jump verify with hash to avoid cycle and repeats.
 
IP Logged

While we are postponing, life speeds by
jhngalt
Newbie
*





   


Gender: male
Posts: 36
Re: links  
« Reply #4 on: Mar 6th, 2005, 6:00am »
Quote Quote Modify Modify

on Mar 5th, 2005, 9:15pm, puzzlecracker wrote:
a queue would also be sufficient  - so that you visit all wepages in order
 
for ex.  
 
Page z has links a,b,c,d
 
page a: e, f, g, d
 :
and so on....
 
 
queue would be initially filled with a b c d  then you remove 'a'
 
and it would be filled with  b c d e f g .. then you go you go to 'b' and so on.    Also, every single jump verify with hash to avoid cycle and repeats.
 

 
You said in every single jump verify with hash to avoid cycle and repeats, meaning when you get a link you hash it and see if it already had been previously hit, what hash function do you think we could use, i read somewhere that you could use the checksum of the link, i'm not sure what that is, any ideas?
 
also some webpages have different urls pointing to it, how could you tell if you have already been there but under a different link..
thanks,
IP Logged
jhngalt
Newbie
*





   


Gender: male
Posts: 36
Re: links  
« Reply #5 on: Mar 16th, 2005, 8:09am »
Quote Quote Modify Modify

on Mar 6th, 2005, 6:00am, jhngalt wrote:

 
You said in every single jump verify with hash to avoid cycle and repeats, meaning when you get a link you hash it and see if it already had been previously hit, what hash function do you think we could use, i read somewhere that you could use the checksum of the link, i'm not sure what that is, any ideas?
 
also some webpages have different urls pointing to it, how could you tell if you have already been there but under a different link..
thanks,

 
nobody has any ideas?
 
IP Logged
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: links  
« Reply #6 on: Mar 17th, 2005, 1:00am »
Quote Quote Modify Modify

Well, for the latter part. Visiting the link should help distinguish wether different urls point to the same website. (Just resolve what ip and path or content they point to)
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
TenaliRaman
Uberpuzzler
*****



I am no special. I am only passionately curious.

   


Gender: male
Posts: 1001
Re: links  
« Reply #7 on: Mar 21st, 2005, 10:52pm »
Quote Quote Modify Modify

The checksum mentioned could be as simple as adding up the letters in the url. Lucky for us, the urls are all small letters and even luckier since addition of two letters doesnt give us a third letter. (I think!).
 
If the above is not correct(tho i think it should be), we may use an MD5 checksum.
 
-- AI
 
IP Logged

Self discovery comes when a man measures himself against an obstacle - Antoine de Saint Exupery
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: links  
« Reply #8 on: Mar 22nd, 2005, 4:32am »
Quote Quote Modify Modify

on Mar 21st, 2005, 10:52pm, TenaliRaman wrote:
The checksum mentioned could be as simple as adding up the letters in the url.
Addition does not make a good hashfunction, since it is communative.
Essentially that means any anagrams will be confused, among many other things (the hash for "ac" would also be the same as "bb"). And I just don't think you would want to confuse urls like "www.dicktraceysucks.com" with "www.traceysucksdick.com".. Tongue
 
Quote:
Lucky for us, the urls are all small letters and even luckier since addition of two letters doesnt give us a third letter. (I think!).
But three or four modulo the hastable-size might, not that that matters much, I think.. (Just as long as you get mostly distinct numbers for distinct urls)
« Last Edit: Mar 22nd, 2005, 4:36am by towr » IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
Grimbal
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 7526
Re: links  
« Reply #9 on: Jun 14th, 2013, 7:54am »
Quote Quote Modify Modify

The problem with a simple sum is that you never get values over a few thousand.  If you have millions of URLs, you get many collisions.
 
Java String's hash function is basically this:
 
  char val[] = the string's chars
  int hash = 0;
  for( int i=0 ; i<val.length ; i++ ){
    hash = hash*31 + val[i];
  }
IP Logged
TenaliRaman
Uberpuzzler
*****



I am no special. I am only passionately curious.

   


Gender: male
Posts: 1001
Re: links  
« Reply #10 on: Jun 16th, 2013, 4:47pm »
Quote Quote Modify Modify

Yeah, the addition is a bad idea  Grin
IP Logged

Self discovery comes when a man measures himself against an obstacle - Antoine de Saint Exupery
gitanas
Junior Member
**





   


Posts: 55
Re: links  
« Reply #11 on: Jan 27th, 2016, 6:33am »
Quote Quote Modify Modify

You could store links in a database.
Moreover add index on that column and search will work very fast.
IP Logged

Dummy Frog - my blog about interesting and funny things in our World
Pages: 1  Reply Reply Notify of replies Notify of replies Send Topic Send Topic Print Print

« Previous topic | Next topic »

Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board