wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)

Title: research problem involving convergence
Post by Amir Michail on Dec 21st, 2003, 5:01pm
I have come across the following convergence problem in my research.  However, the mathematical analysis is beyond my abilities.

Although the problem comes up in reverse engineering code, one can formulate it within the context of information retrieval.

Namely, the question is how to find important words in a document/web page.

A simple method would look for words that occur in few documents/web pages, since such words are more likely to be relevant to any document/web page in which they do occur. In fact, the information retrieval idf (inverse document frequency) measure is based on this simple idea.

However, this approach does not take into account similarity between documents/web pages.  If a word occurs in many documents/web pages, but all those documents/web pages are similar to each other in content, then maybe the word should be considered important after all.

We have considered the following approach in our research:

* represent each document/web page using a vector of size n, where n is the total number of words across all documents

* use a 1 to indicate the presence of a word and 0 to indicate its absence (the word order is fixed across all vectors)

* to determine the "prevalence" of a word w, find all documents D1, ..., Dm containing w

* consider the normalized vectors V1, ..., Vm for those documents as points on the unit hypersphere in n-dimensional space

* find the smallest axis-aligned n-orthotope that just encloses these m points and the origin (an n-orthotope generalizes the concept of a rectangle to n dimensions)

* the prevalence of w is the sum of the squares of the dimensions of this n-orthotope

* intuitively, if w's documents are really different, then their points on the hypersphere will be spread apart, forcing the n-orthotope's dimensions to be large (a small code sketch follows this list)
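
To make this concrete, here is a rough Python/NumPy sketch of the prevalence computation (the function name and the matrix representation are arbitrary choices of mine; the per-coordinate max works because all entries are non-negative, so the enclosing box runs from 0 to the largest coordinate in each dimension):

import numpy as np

def prevalence(word_index, doc_vectors):
    # doc_vectors: (num_docs, num_words) array of non-negative word weights
    # (binary in the simplest case); every document is assumed non-empty.
    # Normalize each document vector to unit length (a point on the hypersphere).
    unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)

    # Keep only the documents D1, ..., Dm that contain the word.
    containing = unit[doc_vectors[:, word_index] > 0]

    # Smallest axis-aligned box enclosing these points and the origin:
    # its side length in coordinate j is the largest j-th coordinate.
    sides = containing.max(axis=0)

    # Prevalence of the word is the sum of squared side lengths.
    return float((sides ** 2).sum())

For example, with two identical two-word documents every word gets prevalence (1/sqrt(2))^2 + (1/sqrt(2))^2 = 1, matching the first claim below.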

It is easy to show that:

* if D1, ..., Dm are the same, then the prevalence of w is 1.0

* if D1, ..., Dm are completely different (e.g., they share no words other than w itself, and w carries negligible weight in each), then the prevalence of w is approximately m, provided m <= n.

The problem with this method is that it uses binary vectors to represent a document's words.  However, not all words are equally important and so we may wish to use different values in those vectors to indicate importance (much as is done in information retrieval).

However, the problem now becomes circular.  The whole point of this exercise is to determine word importance.

So we might consider an iterative algorithm (in the spirit of PageRank, say).

We could start with binary vectors and compute word prevalences from them. We could then take the reciprocal, 1/prevalence, as the importance of a word (since prevalence is inversely related to importance).

We could then use these new importance values as the vector entries in a new iteration of the algorithm, and so on.
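
As a rough sketch of the iteration loop (building on the prevalence function sketched above; the iteration count is arbitrary, and convergence is exactly what we would like to establish):

def iterate_importance(binary_docs, iterations=50):
    # binary_docs: (num_docs, num_words) 0/1 matrix; every word is assumed
    # to occur in at least one document, so every prevalence stays positive.
    num_words = binary_docs.shape[1]
    importance = np.ones(num_words)  # start from the binary case
    for _ in range(iterations):
        # Re-weight each document's words by the current importance values.
        weighted = binary_docs * importance
        prevalences = np.array([prevalence(w, weighted)
                                for w in range(num_words)])
        # Importance is taken as the reciprocal of prevalence.
        importance = 1.0 / prevalences
    return prevalences, importance

In practice we stop when successive importance vectors agree to within some tolerance.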

In our experimentation, this algorithm has always converged to a unique solution of prevalence/importance values.  Moreover, the solution is the same even if we start with random values for non-zero entries in the initial vectors.

We have no idea how to even attempt to prove this, however.

Any pointers?  We have received advice that proving convergence is usually hard because there is no "standard" approach; each problem requires its own tricks.

Amir

P.S.  You can find out more about this approach in the technical report for the DRT design recovery tool: http://www.cse.unsw.edu.au/~amichail/drt_techreport.pdf. There is an extensive section on computing function prevalence, and the iterative approach is discussed under future work.

Title: Re: research problem involving convergence
Post by towr on Dec 22nd, 2003, 4:35am
could you give a small example of how it works? Because I'm not sure what happens in each iteration..

Title: Re: research problem involving convergence
Post by Amir Michail on Dec 22nd, 2003, 4:58am

on 12/22/03 at 04:35:49, towr wrote:
could you give a small example of how it works? Because I'm not sure what happens in each iteration..


Suppose D1={x,y} and D2={y}.

Starting off with binary vectors, we have:

V1=(1,1) and V2=(0,1)

Normalizing gives:

V1'=(1/sqrt(2), 1/sqrt(2)) and V2'=(0,1)

Computing the prevalence of word x (that occurs only in D1):

 max( 1/sqrt(2) )^2 + max( 1/sqrt(2) )^2 = 1/2 + 1/2 = 1

Computing the prevalence of word y (that occurs in both D1 & D2):

 max( 1/sqrt(2), 0 )^2 + max( 1/sqrt(2), 1 )^2 = 1/2 + 1 = 1.5

Now, for the next iteration, we use the reciprocals of these prevalence values as the importance weights in the vectors:

V1=( 1/1, 1/1.5 ), V2=(0, 1/1.5)

and we repeat the steps performed earlier....
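
For anyone who wants to check the arithmetic, this first iteration can be reproduced in a few lines of NumPy (D1 and D2 as rows):

import numpy as np

# D1 = {x, y}, D2 = {y}, written as rows (x-weight, y-weight).
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
V = V / np.linalg.norm(V, axis=1, keepdims=True)

prev_x = (V[[0]].max(axis=0) ** 2).sum()     # only D1 contains x
prev_y = (V[[0, 1]].max(axis=0) ** 2).sum()  # both D1 and D2 contain y
print(prev_x, prev_y)                        # 1.0 1.5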

Amir

Title: Re: research problem involving convergence
Post by James Fingas on Dec 22nd, 2003, 12:17pm

on 12/21/03 at 17:01:26, Amir Michail wrote:
The problem with this method is that it uses binary vectors to represent a document's words.  However, not all words are equally important and so we may wish to use different values in those vectors to indicate importance (much as is done in information retrieval).


I think that the convergence can be shown by showing the two following properties:

1) There exists a final importance measure that the algorithm maps to itself, i.e. applying the algorithm to it leaves it unchanged (a fixed point).

2) Applying the algorithm to any importance vector gives a result closer to that final importance vector than the input was. The distance measure used here could be non-intuitive.
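
One standard way to formalize (1) and (2) together is the Banach fixed-point theorem. Writing F for the map that takes one importance vector to the next, it would suffice to find a metric d and a constant 0 <= c < 1 with

 d( F(M), F(M') ) <= c * d( M, M' )   for all importance vectors M, M';

then F has a unique fixed point Mf, and the iterates F^k(M0) converge to Mf from any starting point M0. Finding a metric that actually makes F a contraction is, of course, the hard part.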

I think it would also be useful to express some of the ideas in different mathematical forms.

First of all, we take the lengths d1, d2, ..., dn of the initial binary document vectors, so that the vector Di has length di and the normalized vector is Vi = Di/di.

We represent the importance of the words as a vector M, with a corresponding length m. The normalized importance-weighted vectors are then MDi/(m di), where MDi denotes the componentwise product. The values in the normalized importance-weighted vectors are:

-larger if the document is small (i.e. doesn't have many important words in it)
-larger if the word is important

The question is then reduced to this:
does there exist a unit vector Mf such that, for all 1 <= i <= n:

1/Mf_i = [sum over 1 <= j <= n, j != i] ( max over { k : V_ki != 0 } of Mf_j V_kj )^2

Since the normalized importance-weighted vectors MVi/m are all computed using the same M, the maximum computation does not change from iteration to iteration. We can therefore collect the relevant V_kj values into a length-n vector Q_i for each word i, computed from the documents before word-importance weighting; it depends only on document length (number of distinct words). We have thus reduced the problem to the following:

Given n length-n vectors Q_i with Q_ii = 0, find a vector Mf such that:

1/Mf_i = [sum over 1 <= j <= n] ( Mf_j Q_ij )^2
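
A small sketch of what iterating this reduced system might look like in NumPy (the helper name is mine, the unit-norm constraint mentioned above is omitted for simplicity, and whether the iteration converges is exactly the open question):

import numpy as np

def iterate_reduced(Q, iterations=100):
    # Q: (n, n) non-negative matrix with zero diagonal; every row is assumed
    # to have at least one positive entry so the update stays finite.
    M = np.ones(Q.shape[0])
    for _ in range(iterations):
        # Update rule from above: M_i <- 1 / sum_j (M_j * Q_ij)^2
        M = 1.0 / ((Q * M) ** 2).sum(axis=1)
    return M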


