« wu :: forums - Interview Questions »
vodangkhoa (Newbie, Posts: 20)
Interview Questions
« on: Jan 23rd, 2007, 8:54pm »
1) How do you find the most common word in billions of documents? What if you have just one machine? What if you have multiple machines? What's the bottleneck? What if you have just one machine and limited memory?

2) Given a source word, a target word, and a dictionary, how do you transform the source word into the target word by changing only one letter at each step? The word you get at each step must be in the dictionary.
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #1 on: Jan 24th, 2007, 1:30am »
I'd be surprised if you found more than a million distinct words, and if the documents are English, the most common word is most likely "the" or "a".

The biggest problem is reading all the texts; disk access is slow. Processing them after that shouldn't be much of a problem. A hash would work, but a tree wouldn't be much worse (just a factor of 10-20 or so). Actually, the most common words are just a set of about 10,000.
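A minimal sketch of that hash-count approach in Python (the file names and the crude tokenization are placeholders, not part of the question):

[code]
import re
from collections import Counter

def most_common_word(paths):
    """Count words across documents with a hash table; return (word, count)."""
    counts = Counter()  # hash-backed multiset
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Lowercase and split on non-letters; crude but serviceable.
                counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts.most_common(1)[0]

# Hypothetical usage:
# print(most_common_word(["doc1.txt", "doc2.txt"]))
[/code]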
 
For multiple machines, you could have each machine process a share of the documents, then merge the results (although, given the distribution, you could just compare the top word from each machine; it's bound to be the same on every machine, which shows it's the most frequent overall).
 
For the second question, transforming a word into a target word one step at a time, you can look to graph theory: connect all words that differ by one character, then find a path between the source and target nodes.
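One standard way to build that graph without comparing every pair of words is to bucket them on wildcard patterns (an implementation detail assumed here, not stated above):

[code]
from collections import defaultdict

def build_graph(dictionary):
    """Connect words that differ in exactly one letter.

    Words sharing a wildcard pattern (e.g. "c_t" for cat/cot/cut)
    differ only at the wildcard position, so each bucket is a clique.
    """
    buckets = defaultdict(list)
    for word in dictionary:
        for i in range(len(word)):
            buckets[word[:i] + "_" + word[i+1:]].append(word)

    graph = defaultdict(set)
    for bucket in buckets.values():
        for w1 in bucket:
            for w2 in bucket:
                if w1 != w2:
                    graph[w1].add(w2)
    return graph
[/code]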
Grimbal (wu::riddles Moderator, Uberpuzzler, Posts: 7527)
Re: Interview Questions
« Reply #2 on: Jan 24th, 2007, 2:04am »
on Jan 24th, 2007, 1:30am, towr wrote:
    Actually, the most common words are just a set of about 10,000.

You could substitute any value for 10,000. The top ten of the most common words is just a set of 10. :P
 
I think one question is where the documents are stored in the first place. If you have many machines but all the documents are on one server, the bottleneck will be access to the documents. In that case, the first step would be to set up an efficient distribution of the documents.
 
For the second part, it is a form of labyrinth, or rather, searching for a path in a graph. In this case it is a highly connected one, so a depth-first search would be suicidal. The fastest would be a breadth-first search. Make a first list with all the valid words with one letter changed. Then make a second list by changing one letter in each word of the first list, ignoring any word from a previous list. Continue until you reach the target word. If you start simultaneously from the start and the end and look for an intersection, you will save time and memory.
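A minimal sketch of that bidirectional search, assuming a precomputed adjacency map like the one above; it only reports whether a ladder exists (path reconstruction is omitted):

[code]
def ladder_exists(source, target, graph):
    """Bidirectional BFS: grow a frontier from each end until they meet."""
    if source == target:
        return True
    front, back = {source}, {target}
    seen = {source, target}
    while front and back:
        if len(front) > len(back):       # always expand the smaller side
            front, back = back, front
        next_front = set()
        for word in front:
            for neighbour in graph.get(word, ()):
                if neighbour in back:    # the two searches intersect
                    return True
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_front.add(neighbour)
        front = next_front
    return False
[/code]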
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #3 on: Jan 24th, 2007, 3:41am »
on Jan 24th, 2007, 2:04am, Grimbal wrote:
    You could substitute any value for 10,000. The top ten of the most common words is just a set of 10. :P

OK, let's put it this way: 99% of the text can be accounted for by a list of just 10,000 different words. Let's see you try that with 10.
Grimbal (wu::riddles Moderator, Uberpuzzler, Posts: 7527)
Re: Interview Questions
« Reply #4 on: Jan 25th, 2007, 2:22am »
Like that, no objection.

And I investigated. The top ten still represent some 20% of 29,213,800 words (from TV and movie scripts and transcripts). It is a rough figure, based mostly on spoken English, but still impressive. Imagine: learn 10 words and you master 20% of the English language! :)

The words according to Wikipedia are:
    you I to the a and that it of me
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #5 on: Jan 25th, 2007, 3:45am »
on Jan 25th, 2007, 2:22am, Grimbal wrote:
    The words according to Wikipedia are: you I to the a and that it of me

Based on a dozen books (in plain txt format), I get only two differences: "was" and "in" instead of "you" and "me".

And the top ten words account for a whopping 25.6%. In fact, "the" by itself accounts for nearly 6%! Really, you start to wonder how much better English is than just grunting like cavemen :P (of course, it's pretty much the same in any language).
 
26021 that
26370 it
29759 was
34993 in
43451 i
48379 a
53562 to
60862 of
67132 and
119285 the
(from a total of 2018283 words)
 
[e]Oh, and in fairness, after checking, the top 1000 words only account for 78%, not the 99% I guesstimated. But it's enough for a decent conversation. Besides, it's arguable whether am/are/is should be counted separately or together (even though it sounds bad, "I are here" is just as intelligible as "I am here").[/e]
vodangkhoa (Newbie, Posts: 20)
Re: Interview Questions
« Reply #6 on: Feb 9th, 2007, 11:58pm »
For question #2, is it possible to use a modified version of the Edit Distance DP algorithm to solve it?
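For reference, the standard edit-distance DP being referred to looks roughly like this (a sketch, not an answer from the thread):

[code]
def edit_distance(a, b):
    """Classic Levenshtein DP: dp[i][j] = edits to turn a[:i] into b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1,       # deletion
                           dp[i][j-1] + 1,       # insertion
                           dp[i-1][j-1] + cost)  # substitution
    return dp[m][n]
[/code]

Note, though, that plain edit distance allows arbitrary intermediate strings, whereas here every intermediate step must itself be a dictionary word, which is why the graph-search view above fits more naturally.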
Grimbal (wu::riddles Moderator, Uberpuzzler, Posts: 7527)
Re: Interview Questions
« Reply #7 on: Feb 12th, 2007, 5:02am »
on Jan 25th, 2007, 3:45am, towr wrote:
    [e]Oh, and in fairness, after checking, the top 1000 words only account for 78%, not the 99% I guesstimated.[/e]

But you guesstimated for 10,000.
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #8 on: Feb 12th, 2007, 5:30am »
on Feb 12th, 2007, 5:02am, Grimbal wrote:
    But you guesstimated for 10,000.

Ah. Well, it probably still doesn't cover 99%, but I'll check.
 
Top 1 word:      6%
Top 10 words:    25%
Top 100 words:   51%
Top 1000 words:  78%
Top 10000 words: 96%
balanagireddy.tcs (Newbie, Posts: 28)
Re: Interview Questions
« Reply #9 on: Feb 12th, 2007, 10:19pm »
Quote:
    1) How do you find the most common word in billions of documents? What if you have just one machine? What if you have multiple machines? What's the bottleneck? What if you have just one machine and limited memory?

    2) Given a source word, a target word, and a dictionary, how do you transform the source word into the target word by changing only one letter at each step? The word you get at each step must be in the dictionary.
 
My take on the above questions:

1) Keep a hash of all the dictionary words (with a separate hash for proper nouns) and increment the count as you read each word. This works for a single machine.

If you have multiple machines, you can assign the words starting with a particular letter to each machine: say, 26 machines for the 26 letters, plus one machine exclusively for non-dictionary words.

If there is one machine with limited memory, then the best you can do is take one word at a time, scan through and count its occurrences, then move on to the next word, keeping whichever count is greater. But this method is very slow considering billions of words.

Let me know if you want the code for the above solution.

PS: Correct me if I am wrong.
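For illustration, a minimal Python sketch of the letter-partitioning idea (the routing function and machine ids are hypothetical, not the poster's code):

[code]
from collections import Counter

def partition(word):
    """Route a word to one of 27 machines: 26 for a-z, one for everything else."""
    c = word[0].lower()
    return ord(c) - ord('a') if 'a' <= c <= 'z' else 26

def count_on_machine(words, machine_id):
    """Each machine counts only the words routed to it."""
    return Counter(w for w in words if partition(w) == machine_id)

def merge_winners(per_machine_counts):
    """Partitions are disjoint, so each word's full count lives on exactly
    one machine; comparing the per-machine winners gives the global winner."""
    tops = [c.most_common(1)[0] for c in per_machine_counts if c]
    return max(tops, key=lambda pair: pair[1])
[/code]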