« wu :: forums - Interview Questions »
vodangkhoa (Newbie, Posts: 20)
Interview Questions
« on: Jan 23rd, 2007, 8:54pm »
1) How do you find the most common word in billions of documents? What if you have just one machine? What if you have multiple machines? What's the bottleneck? What if you have just one machine and limited memory?

2) Given a source word, a target word, and a dictionary, how do you transform the source word into the target word by changing only one letter at each step? The word you get at each step must be in the dictionary.
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #1 on: Jan 24th, 2007, 1:30am »
I'd be surprised if you found more than a million distinct words, and if the documents are English, the most common word is most likely "the" or "a".

The biggest problem is reading all the texts; disk access is slow. Processing them after that shouldn't be much of a problem. A hash would work, but a tree wouldn't be much worse (just a factor of 10-20 or so). Actually, the most common words are just a set of about 10,000.
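A minimal sketch of that hash-count approach in Python (the file names and the crude tokenization are placeholders, not part of the question):

[code]
import re
from collections import Counter

def most_common_word(paths):
    """Count words across documents with a hash table; return (word, count)."""
    counts = Counter()  # hash-backed multiset
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Lowercase and split on non-letters; crude but serviceable.
                counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts.most_common(1)[0]

# Hypothetical usage:
# print(most_common_word(["doc1.txt", "doc2.txt"]))
[/code]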
 
For multiple machines, you could have each machine process a share of the documents, then merge the results (although, given the distribution, you could just compare the top word from each machine; it's bound to be the same on every machine, which shows it's the most frequent overall).
 
For the second question, transforming a word into a target word one step at a time, you can look to graph theory: connect all words that differ by one character, then find a path between the source and target nodes.
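One standard way to build that graph without comparing every pair of words is to bucket them on wildcard patterns (an implementation detail assumed here, not stated above):

[code]
from collections import defaultdict

def build_graph(dictionary):
    """Connect words that differ in exactly one letter.

    Words sharing a wildcard pattern (e.g. "c_t" for cat/cot/cut)
    differ only at the wildcard position, so each bucket is a clique.
    """
    buckets = defaultdict(list)
    for word in dictionary:
        for i in range(len(word)):
            buckets[word[:i] + "_" + word[i+1:]].append(word)

    graph = defaultdict(set)
    for bucket in buckets.values():
        for w1 in bucket:
            for w2 in bucket:
                if w1 != w2:
                    graph[w1].add(w2)
    return graph
[/code]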
Grimbal (wu::riddles Moderator, Uberpuzzler, Posts: 7527)
Re: Interview Questions
« Reply #2 on: Jan 24th, 2007, 2:04am »
on Jan 24th, 2007, 1:30am, towr wrote:
    Actually, the most common words are just a set of about 10,000.

You could substitute any value for 10,000. The top ten of the most common words is just a set of 10. :P
 
I think one question is where the documents are stored in the first place. If you have many machines but all the documents are on one server, the bottleneck will be access to the documents. In that case, the first step would be to set up an efficient distribution of the documents.
 
For the second part, it is a form of labyrinth, or rather, searching for a path in a graph. In this case it is a highly connected one, so a depth-first search would be suicidal. The fastest would be a breadth-first search. Make a first list with all the valid words with one letter changed. Then make a second list by changing one letter in each word of the first list, ignoring any word from a previous list. Continue until you reach the target word. If you start simultaneously from the start and the end and look for an intersection, you will save time and memory.
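A minimal sketch of that bidirectional search, assuming a precomputed adjacency map like the one above; it only reports whether a ladder exists (path reconstruction is omitted):

[code]
def ladder_exists(source, target, graph):
    """Bidirectional BFS: grow a frontier from each end until they meet."""
    if source == target:
        return True
    front, back = {source}, {target}
    seen = {source, target}
    while front and back:
        if len(front) > len(back):       # always expand the smaller side
            front, back = back, front
        next_front = set()
        for word in front:
            for neighbour in graph.get(word, ()):
                if neighbour in back:    # the two searches intersect
                    return True
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_front.add(neighbour)
        front = next_front
    return False
[/code]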
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #3 on: Jan 24th, 2007, 3:41am »
on Jan 24th, 2007, 2:04am, Grimbal wrote:
    You could substitute any value for 10,000. The top ten of the most common words is just a set of 10. :P

OK, let's put it this way: 99% of the text can be accounted for by a list of just 10,000 different words. Let's see you try that with 10.
Grimbal (wu::riddles Moderator, Uberpuzzler, Posts: 7527)
Re: Interview Questions
« Reply #4 on: Jan 25th, 2007, 2:22am »
Like that, no objection.

And I investigated. The top ten still represent some 20% of 29,213,800 words (from TV and movie scripts and transcripts). It is a rough figure, based mostly on spoken English, but still impressive. Imagine: learn 10 words and you master 20% of the English language! :)

The words according to Wikipedia are:
    you I to the a and that it of me
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #5 on: Jan 25th, 2007, 3:45am »
on Jan 25th, 2007, 2:22am, Grimbal wrote:
    The words according to Wikipedia are: you I to the a and that it of me

Based on a dozen books (in plain txt format), I get only two differences: "was" and "in" instead of "you" and "me".

And the top ten words account for a whopping 25.6%. In fact, "the" by itself accounts for nearly 6%! Really, you start to wonder how much better English is than just grunting like cavemen :P (of course, it's pretty much the same in any language).
 
26021 that
26370 it
29759 was
34993 in
43451 i
48379 a
53562 to
60862 of
67132 and
119285 the
(from a total of 2018283 words)
 
[e]Oh, and in fairness, after checking, the top 1000 words only account for 78%, not the 99% I guesstimated. But it's enough for a decent conversation. Besides, it's arguable whether am/are/is should be counted separately or together (even though it sounds bad, "I are here" is just as intelligible as "I am here").[/e]
vodangkhoa (Newbie, Posts: 20)
Re: Interview Questions
« Reply #6 on: Feb 9th, 2007, 11:58pm »
For question #2, is it possible to use a modified version of the Edit Distance DP algorithm to solve it?
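For reference, the standard edit-distance DP being referred to looks roughly like this (a sketch, not an answer from the thread):

[code]
def edit_distance(a, b):
    """Classic Levenshtein DP: dp[i][j] = edits to turn a[:i] into b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1,       # deletion
                           dp[i][j-1] + 1,       # insertion
                           dp[i-1][j-1] + cost)  # substitution
    return dp[m][n]
[/code]

Note, though, that plain edit distance allows arbitrary intermediate strings, whereas here every intermediate step must itself be a dictionary word, which is why the graph-search view above fits more naturally.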
Grimbal (wu::riddles Moderator, Uberpuzzler, Posts: 7527)
Re: Interview Questions
« Reply #7 on: Feb 12th, 2007, 5:02am »
on Jan 25th, 2007, 3:45am, towr wrote:
    [e]Oh, and in fairness, after checking, the top 1000 words only account for 78%, not the 99% I guesstimated.[/e]

But you guesstimated for 10,000.
towr (wu::riddles Moderator, Uberpuzzler, Posts: 13730)
Re: Interview Questions
« Reply #8 on: Feb 12th, 2007, 5:30am »
on Feb 12th, 2007, 5:02am, Grimbal wrote:
    But you guesstimated for 10,000.

Ah. Well, it probably still doesn't cover 99%, but I'll check.
 
Top 1 word:      6%
Top 10 words:    25%
Top 100 words:   51%
Top 1000 words:  78%
Top 10000 words: 96%
balanagireddy.tcs (Newbie, Posts: 28)
Re: Interview Questions
« Reply #9 on: Feb 12th, 2007, 10:19pm »
Quote:
    1) How do you find the most common word in billions of documents? What if you have just one machine? What if you have multiple machines? What's the bottleneck? What if you have just one machine and limited memory?

    2) Given a source word, a target word, and a dictionary, how do you transform the source word into the target word by changing only one letter at each step? The word you get at each step must be in the dictionary.
 
My take on the above questions:

1) Keep a hash of all the dictionary words (with a separate hash for proper nouns) and increment the count as you read each word. This works for a single machine.

If you have multiple machines, you can assign the words starting with a particular letter to each machine: say, 26 machines for the 26 letters, plus one machine exclusively for non-dictionary words.

If there is one machine with limited memory, then the best you can do is take one word at a time, scan through and count its occurrences, then move on to the next word, keeping whichever count is greater. But this method is very slow considering billions of words.

Let me know if you want the code for the above solution.

PS: Correct me if I am wrong.
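For illustration, a minimal Python sketch of the letter-partitioning idea (the routing function and machine ids are hypothetical, not the poster's code):

[code]
from collections import Counter

def partition(word):
    """Route a word to one of 27 machines: 26 for a-z, one for everything else."""
    c = word[0].lower()
    return ord(c) - ord('a') if 'a' <= c <= 'z' else 26

def count_on_machine(words, machine_id):
    """Each machine counts only the words routed to it."""
    return Counter(w for w in words if partition(w) == machine_id)

def merge_winners(per_machine_counts):
    """Partitions are disjoint, so each word's full count lives on exactly
    one machine; comparing the per-machine winners gives the global winner."""
    tops = [c.most_common(1)[0] for c in per_machine_counts if c]
    return max(tops, key=lambda pair: pair[1])
[/code]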