wu :: forums
« wu :: forums - research problem involving convergence »

   Author  Topic: research problem involving convergence  (Read 1100 times)
Amir Michail
Guest


research problem involving convergence  
« on: Dec 21st, 2003, 5:01pm »

I have come across the following convergence problem in my research.  However, the mathematical analysis is beyond my abilities.
 
Although the problem comes up in reverse engineering code, one can formulate it within the context of information retrieval.
 
Namely, the question is how to find important words in a document/web page.
 
A simple method would look for words that occur in few documents/web pages, on the grounds that such words are more likely to be relevant to any document/web page in which they do occur. In fact, the information retrieval idf measure is based on this simple idea.
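For reference, a minimal sketch of that simple idea in code (Python; the log-based formula below is the common textbook form of idf, and the function name and toy documents are mine, not taken from the report):

import math

def idf(word, documents):
    # documents: a list of sets of words; rarer words get a higher score
    # (assumes the word occurs in at least one document)
    df = sum(1 for d in documents if word in d)   # number of documents containing the word
    return math.log(len(documents) / df)

docs = [{"cat", "sat"}, {"cat", "ran"}, {"dog", "ran"}]
print(idf("dog", docs))   # in 1 of 3 documents -> log(3)   ~ 1.10
print(idf("cat", docs))   # in 2 of 3 documents -> log(1.5) ~ 0.41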
 
However, this approach does not take into account similarity between documents/web pages.  If a word occurs in many documents/web pages, but all those documents/web pages are similar to each other in content, then maybe the word should be considered important after all.
 
We have considered the following approach in our research:
 
* represent each document/web page using a vector of size n, where n is the total number of distinct words across all documents
 
* use a 1 to indicate the presence of a word and 0 to indicate its absence (the word order is fixed across all vectors)
 
* to determine the "prevalence" of a word w, find all documents D1, ..., Dm containing w
 
* consider the normalized vectors V1, ..., Vm for those documents as points on a hypersphere of dimension n
 
* find the smallest axis-aligned n-orthotope that just encloses these m points and the origin (an n-orthotope generalizes the concept of a rectangle to n dimensions)
 
* the prevalence of w is the sum of the squares of the side lengths of this n-orthotope
 
* intuitively, if w's documents are really different, then their points on the hypersphere will be spread apart, thus forcing the n-orthotope's side lengths to be large (a small code sketch of this computation follows below)
 
It is easy to show that:
 
* if D1, ..., Dm are the same, then the prevalence of w is 1.0  
 
* if D1, ..., Dm are completely different (e.g., they share no words), then the prevalence of w is m provided m <= n.  
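Here is a minimal sketch of the prevalence computation described above (Python; the helper names and the toy documents are mine). Since all coordinates are non-negative, the smallest axis-aligned box enclosing the points and the origin has, along each axis, a side length equal to the largest coordinate among the points, so prevalence reduces to a sum of squared coordinate-wise maxima:

import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def prevalence(word_index, vectors):
    # vectors: one weighted vector per document (binary in the first iteration);
    # assumes at least one document contains the word
    points = [normalize(v) for v in vectors if v[word_index] != 0]
    n = len(points[0])
    # side length along axis j of the box enclosing the points and the origin
    sides = [max(p[j] for p in points) for j in range(n)]
    return sum(s * s for s in sides)

# identical documents give prevalence 1.0 (the first fact above)
print(prevalence(0, [[1, 1, 0], [1, 1, 0], [1, 1, 0]]))   # ~1.0

# two rather different documents sharing word 1 give a larger value
print(prevalence(1, [[1, 1, 0], [0, 1, 1]]))              # ~1.5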
 
The problem with this method is that it uses binary vectors to represent a document's words.  However, not all words are equally important and so we may wish to use different values in those vectors to indicate importance (much as is done in information retrieval).
 
However, the problem now becomes circular.  The whole point of this exercise is to determine word importance.
 
So we might consider an iterative algorithm (like PageRank, say).
 
We could start with binary vectors and compute word prevalences from them. We could then take, say, 1/prevalence as the importance of a word (since prevalence is inversely proportional to importance).
 
We could then use these new importance values in the vectors for a new iteration of the algorithm, and so on.
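A sketch of that loop, reusing the normalize/prevalence helpers from the sketch above (the stopping tolerance, the iteration cap, and the dict-based bookkeeping are my own choices; the binary start and the 1/prevalence weighting are as described):

def iterate_importance(docs, vocab, iterations=100, tol=1e-9):
    # docs: list of sets of words; vocab: fixed word ordering shared by all vectors
    importance = {w: 1.0 for w in vocab}               # binary vectors to start
    for _ in range(iterations):
        vectors = [[importance[w] if w in d else 0.0 for w in vocab] for d in docs]
        prevalences = {w: prevalence(j, vectors)
                       for j, w in enumerate(vocab) if any(v[j] != 0 for v in vectors)}
        new_importance = {w: 1.0 / p for w, p in prevalences.items()}
        if all(abs(new_importance[w] - importance[w]) < tol for w in new_importance):
            return new_importance
        importance.update(new_importance)
    return importance

print(iterate_importance([{"x", "y"}, {"y"}], ["x", "y"]))
# settles at roughly {'x': 1.0, 'y': 0.57} for this tiny example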
 
In our experimentation, this algorithm has always converged to a unique solution of prevalence/importance values.  Moreover, the solution is the same even if we start with random values for non-zero entries in the initial vectors.
 
We have no idea how to even attempt to prove this, however.
 
Any pointers?  We have received advice that proving convergence is usually hard because there is no "standard" approach.  Each problem requires its own tricks.
 
Amir
 
P.S.  You can find out more about this approach, as used in the DRT design recovery tool, at http://www.cse.unsw.edu.au/~amichail/drt_techreport.pdf. There is an extensive section on computing function prevalence and a discussion of the iterative approach under future work.

towr
wu::riddles Moderator

Re: research problem involving convergence  
« Reply #1 on: Dec 22nd, 2003, 4:35am »

Could you give a small example of how it works? Because I'm not sure what happens in each iteration...

Amir Michail
Guest

Re: research problem involving convergence  
« Reply #2 on: Dec 22nd, 2003, 4:58am »

on Dec 22nd, 2003, 4:35am, towr wrote:
Could you give a small example of how it works? Because I'm not sure what happens in each iteration...

 
Suppose D1={x,y} and D2={y}.
 
Starting off with binary vectors, we have:
 
V1=(1,1) and V2=(0,1)
 
Normalizing gives:
 
V1'=(1/sqrt(2), 1/sqrt(2)) and V2'=(0,1)
 
Computing the prevalence of word x (that occurs only in D1):
 
  max( 1/sqrt(2) )^2 + max( 1/sqrt(2) )^2 = 1/2 + 1/2 = 1
 
Computing the prevalence of word y (that occurs in both D1 & D2):
 
  max( 1/sqrt(2), 0 )^2 + max( 1/sqrt(2), 1 )^2 = 1/2 + 1 = 1.5
 
Now, for the next iteration, we use the reciprocal of these prevalence values for importance in the vectors:
 
V1=( 1/1, 1/1.5 ), V2=(0, 1/1.5)
 
and we repeat the steps performed earlier.
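For what it's worth, these numbers can be reproduced with the prevalence helper sketched earlier in the thread (Python; assumes that helper is in scope):

# word order is (x, y); D1 = {x, y}, D2 = {y}
vectors = [[1, 1], [0, 1]]

print(prevalence(0, vectors))   # prevalence of x: ~1.0
print(prevalence(1, vectors))   # prevalence of y: ~1.5

# reciprocal prevalences become the entries for the next iteration
weights = [1 / 1.0, 1 / 1.5]
print([[weights[0], weights[1]], [0.0, weights[1]]])   # [[1.0, 0.666...], [0.0, 0.666...]]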
 
Amir

James Fingas
Uberpuzzler

Re: research problem involving convergence  
« Reply #3 on: Dec 22nd, 2003, 12:17pm »

on Dec 21st, 2003, 5:01pm, Amir Michail wrote:
The problem with this method is that it uses binary vectors to represent a document's words.  However, not all words are equally important and so we may wish to use different values in those vectors to indicate importance (much as is done in information retrieval).

 
I think that convergence can be shown by establishing the following two properties:
 
1) There exists a final importance measure such that applying the algorithm to it leaves it unchanged (i.e., it is a fixed point).
 
2) Applying the algorithm to any importance vector gives a result closer to that final importance vector than the original vector was. (The distance measure used here could be non-intuitive.)
 
I think it would also be useful to express some of the ideas in different mathematical forms.
 
First of all, we determine the lengths d1, d2, ..., dn of the initial binary vectors, one per document, so that the vector Di has length di, and set Vi = Di/di.
 
We represent the importance of the words as a vector M, with a corresponding length m. Then the normalized importance-weighted vectors are MDi/mdi. The values in these vectors are:
 
-larger if the document is small (i.e. doesn't have many important words in it)
-larger if the word is important
 
The question is then reduced to this:
does there exist a unit vector Mf such that, for all 1 <= i <= n:
 
1/Mfi = sum over (1 <= j <= n, j != i) of ( max over {k : Vki != 0} of Mfj*Vkj )^2
 
Since the normalized importance-weighted vectors MVi/m are all computed using the same M, the maximum computation does not change from iteration to iteration. We can therefore represent the Vkj elements as a length-n vector Qi which can be computed from the documents before word importance weighting. It will be based solely on the document length (number of distinct words). We have thus reduced the problem to the following:
 
Given n length-n vectors Qi with Qii = 0, find a vector Mf such that:
 
1/Mfi = sum over (1 <= j <= n) of (Mfj*Qij)^2
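One crude way to poke at property 2) numerically, on the tiny two-document example from earlier in the thread: sample pairs of importance vectors, apply one step of the update to each, and see whether the images end up closer together than the originals. This reuses the prevalence helper sketched earlier, uses plain Euclidean distance, and restricts the samples to a box away from zero; it is evidence, not a proof:

import math
import random

def step(importance, docs, vocab):
    # one pass of the update: weighted vectors -> prevalences -> reciprocal prevalences
    vectors = [[importance[j] if w in d else 0.0 for j, w in enumerate(vocab)] for d in docs]
    return [1.0 / prevalence(j, vectors) for j in range(len(vocab))]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

docs, vocab = [{"x", "y"}, {"y"}], ["x", "y"]
random.seed(1)
for _ in range(5):
    a = [random.uniform(0.5, 2.0) for _ in vocab]
    b = [random.uniform(0.5, 2.0) for _ in vocab]
    ratio = dist(step(a, docs, vocab), step(b, docs, vocab)) / dist(a, b)
    print(round(ratio, 3))   # values below 1 are what a contraction would predict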