1. 226 proteins.
2. Similarity metric: evolutionary distance, i.e. the probability of transforming one amino acid sequence into the other via point mutations. From the Hoffman paper: 'the dissimilarity values between pairs of protein sequences are determined by a sequence alignment program which takes biochem and structural info into account. In essence, the alignment program measures the number of amino acids which have to be exchanged to transform the first sequence into the second.'
3. Number of classes: 4 (HA, HB, M, GH). [Issue #1: there is no GH in the data, but there are GG and GP.]
4. http://ni.cs.tu-berlin.de/publications/psvm_sup/protein/bashford.tri contains DISSIMILARITIES.
5. Max value of dissimilarity = 13.64; min value of dissimilarity = 3.61.
6. Domain of similarity values = 0 to length of sequence? Not exactly, because the value is an average over all possible alignments, so the values are not simply the natural numbers between 0 and the length of the sequences.
7. The way they ran it: don't know. Why? In their results they only use 72 + 72 + 39 + 30 = 213 proteins, but there are 226 overall. Where did the other 13 go? Still unclear.
8. What is '10-fold cross-validation'? (Wikipedia) K-fold cross-validation: the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as test data, and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times, with each subsample used exactly once as the test data. The K results are averaged to produce a single estimate.
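A minimal Python sketch of the K-fold procedure described in item 8, to make the partitioning concrete. This is not how the paper ran its experiments (item 7 says that is unknown). The names `k_fold_splits`, `cross_validate`, and the `evaluate` callback are hypothetical, and the random shuffle with a fixed seed is an assumption.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Assumption: shuffle once so folds are not tied to the file order.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds.

    `evaluate` is a hypothetical callback: it should train on the `train`
    indices and return a score (e.g. accuracy) measured on the `test` indices.
    """
    scores = [evaluate(train, test)
              for train, test in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

With the 213 proteins from item 7 and K = 10, this gives three folds of 22 proteins and seven folds of 21, and each protein appears in the test set exactly once.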