wu :: forums
« wu :: forums - Discovering words in a random sentence »

   wu :: forums
   riddles
   hard
(Moderators: SMQ, towr, Icarus, ThudnBlunder, Grimbal, Eigenray, william wu)
   Discovering words in a random sentence
   Author  Topic: Discovering words in a random sentence  (Read 658 times)
Earendil
Newbie
Gender: male
Posts: 46
Discovering words in a random sentence  
« on: Jun 15th, 2005, 6:48am »

A finite set of words (over a finite alphabet) is selected without your knowledge. After that, a sentence is constructed by repeatedly appending a word, each chosen independently of the words already chosen, according to some unknown probability function.
 
Given an infinitely long sentence, how do you discover what the words are, and the probability of choosing each one?
 
Example:
words can be:
 
(2,0,1)
(2,1)
(3,0,0)
 
A sentence could be:
 
30030030020121201212120130021201300...
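The setup can be sketched in a few lines — here using the example's lexicon, with made-up selection probabilities (in the puzzle these are hidden from the solver):

```python
import random

# Hypothetical setup: the example's three words, with assumed
# (made-up) selection weights -- the solver never sees these.
lexicon = ["201", "21", "300"]
weights = [0.3, 0.4, 0.3]

random.seed(42)
# A finite prefix of the "infinite" sentence: 20 i.i.d. word draws.
sentence = "".join(random.choices(lexicon, weights=weights, k=20))
print(sentence)
```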
Barukh
Uberpuzzler
Gender: male
Posts: 2276
Re: Discovering words in a random sentence  
« Reply #1 on: Jun 15th, 2005, 7:13am »

Some clarifications are required... I have one question:
 
May a word be a prefix of another word?
towr
wu::riddles Moderator
Uberpuzzler
Some people are average, some are just mean.
Gender: male
Posts: 13730
Re: Discovering words in a random sentence  
« Reply #2 on: Jun 15th, 2005, 8:43am »

It's easy enough to find pieces of words that are repeated. But there is no way to tell where the word boundaries are, unless there are extra constraints (e.g. that no word is a prefix of another).
 
There is no reason not to just say that each word consists of exactly one letter. Or just take the lexicon to be every combination of two letters, or three, etc.
 
You get a different probability function for each assumption, but you always get one. And since the right one is unknown, that can't help you distinguish which lexicon is the right one anyway.
 
Some context, or meaning, to the sentence might be nice.
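The point about arbitrary lexicons can be illustrated with a quick sketch: chopping the sample sentence into fixed-length blocks of any size k always yields *some* valid probability function (the sample string is the one from the opening post):

```python
from collections import Counter

def empirical_probs(sentence, k):
    """Treat every length-k block as a 'word' and estimate its
    probability from the sample -- any choice of k produces a
    perfectly valid distribution."""
    blocks = [sentence[i:i + k] for i in range(0, len(sentence) - k + 1, k)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

s = "30030030020121201212120130021201300"  # the example sentence
for k in (1, 2, 3):
    p = empirical_probs(s, k)
    print(k, round(sum(p.values()), 10))  # each sums to 1
```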

Wikipedia, Google, Mathworld, Integer sequence DB
Earendil
Newbie
Re: Discovering words in a random sentence  
« Reply #3 on: Jun 15th, 2005, 9:22am »

on Jun 15th, 2005, 7:13am, Barukh wrote:
Some clarifications are required... I have one question:
 
May a word be a prefix of another word?

 
A word may not be the suffix of another word.
 
on Jun 15th, 2005, 8:43am, towr wrote:

There is no reason not to just say that each word consists of exactly one letter. Or just take the lexicon to be every combination of two letters, or three, etc.

 
You can guide yourself by the Principle of Maximum Likelihood...
 
(The parameters that maximize the probability of obtaining the observed sample are the best ones.)
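A sketch of what that principle looks like here, under assumptions not stated in the thread: two hypothetical lexicons, each with guessed word probabilities, are scored by the total probability (summed over all possible segmentations) that they assign to an observed prefix, computed with a simple dynamic program:

```python
import math

def log_likelihood(sentence, model):
    """Log of the total probability that `model` (dict word -> prob)
    generates `sentence` as a concatenation of its words, summed
    over every possible segmentation."""
    n = len(sentence)
    like = [0.0] * (n + 1)   # like[i] = P(model generates sentence[:i])
    like[0] = 1.0
    for i in range(1, n + 1):
        for w, p in model.items():
            if i >= len(w) and sentence[i - len(w):i] == w:
                like[i] += like[i - len(w)] * p
    return math.log(like[n]) if like[n] > 0 else float("-inf")

sample = "212012120121"                    # a short observed prefix
h1 = {"201": 1/3, "21": 1/3, "300": 1/3}  # the example lexicon, guessed probs
h2 = {"0": 0.2, "1": 0.4, "2": 0.4}       # a one-letter lexicon, guessed probs
print(log_likelihood(sample, h1) > log_likelihood(sample, h2))  # True: h1 wins
```

For this sample the only segmentation under h1 is 21|201|21|201|21, so its likelihood is (1/3)^5, which beats the one-letter model's product of twelve letter probabilities.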
« Last Edit: Jun 15th, 2005, 9:24am by Earendil »
towr
wu::riddles Moderator
Re: Discovering words in a random sentence  
« Reply #4 on: Jun 15th, 2005, 10:12am »

You're suggesting a hidden Markov model?
Earendil
Newbie
Re: Discovering words in a random sentence  
« Reply #5 on: Jun 15th, 2005, 12:14pm »

on Jun 15th, 2005, 10:12am, towr wrote:
You're suggesting a hidden markov model?

 
The words don't necessarily all have the same length...
JocK
Uberpuzzler
Gender: male
Posts: 877
Re: Discovering words in a random sentence  
« Reply #6 on: Jun 15th, 2005, 1:24pm »

The remark that subsequent words are chosen independently of previous words is crucial here. It means that the whole sequence should fit a Markov chain model whose transition probabilities satisfy, for each word y and every preceding word x: P(y|x) = P(y).
 
A simple example:  
 
Suppose I construct an infinite sequence by randomly selecting the words ''ab'' and ''ba'' with equal probabilities.
 
The assumption that the words are ''a'' and ''b'' leads to a Markov chain with transition probabilities P(a|a) = 1/4 ≠ 3/4 = P(a|b), and P(b|b) = 1/4 ≠ 3/4 = P(b|a). This result is clearly in conflict with the given statistical independence of subsequent words.
 
However, starting from the assumption that the words are ''ab'' and ''ba'' leads to the Markov chain actually used in the construction of the data, i.e. the chain with P(ab|ab) = P(ab|ba) = P(ab) = 1/2, and P(ba|ba) = P(ba|ab) = P(ba) = 1/2, satisfying the requirement of statistical independence of subsequent words.
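Those letter-level transition probabilities are easy to confirm by simulation (a sketch — the 1/4 and 3/4 follow from whether the current letter sits in the first or second half of a word):

```python
import random
from collections import Counter

random.seed(0)
# Generate the sequence exactly as in the example:
# i.i.d. draws of "ab" / "ba" with equal probability.
seq = "".join(random.choice(["ab", "ba"]) for _ in range(200_000))

# Estimate letter-to-letter transition probabilities.
pairs = Counter(zip(seq, seq[1:]))
p_a_given_a = pairs[("a", "a")] / (pairs[("a", "a")] + pairs[("a", "b")])
p_a_given_b = pairs[("b", "a")] / (pairs[("b", "a")] + pairs[("b", "b")])
print(round(p_a_given_a, 2), round(p_a_given_b, 2))  # ~0.25 and ~0.75
```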
 
Of course, in this example, one could also construct a valid Markov model based on the words ''abab'', ''abba'', ''baab'' and ''baba''.  
 
The choice of the most likely Markov model explaining the data could be based on the entropy of the model:
 
H = - sum_x P(x) ln P(x)
 
For the 4-word model this leads to an entropy of ln(4), whilst for the 2-word model the entropy is ln(2).
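That entropy computation, sketched for the two candidate models with the probabilities given above:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum_x P(x) ln P(x), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

two_word = [1/2, 1/2]             # words "ab", "ba"
four_word = [1/4, 1/4, 1/4, 1/4]  # words "abab", "abba", "baab", "baba"

print(round(entropy(two_word), 4), round(entropy(four_word), 4))  # 0.6931 1.3863
```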
 
I assume Earendil wants us to pick the Markov chain that satisfies all requirements and has the smallest entropy.
 
How to translate this into a constructive algorithm for deriving the words is not yet clear to me.
 
 
« Last Edit: Jun 15th, 2005, 1:26pm by JocK »

solving abstract problems is like sex: it may occasionally have some practical use, but that is not why we do it.

x^y - y^x = x^5 - y^4 - y^3 = 20; x>0, y>0.
Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board