function [dataMat, desiredColumns] = readNewerDataFromCsv
%readNewerDataFromCsv - Read newer data provided by Lars.
%  Reads the newer data from Lars.
%%% REDO or DELETE THIS TEXT
%  Reads the comma-separated values file luca.newer.csv
%  provided by Lars and eliminates self matches.
%  The data file is about 700,000 rows by 50 columns: HUGE!
%  We don't want to store it all in memory and then pick out the
%  desired columns corresponding to the features. So, we must
%  read only the columns of interest. This is done in 2 steps:
%  1) This step needs to be done only once.
%     a) Create separate ASCII files for each column in the original
%        file, corresponding to the features we want to read in;
%     b) Put together the features in a matrix, one column per
%        feature and one row per data point, and save it to file.
%  2) Load the matrix of features in memory.
%
%  USAGE: [dataMat, desiredColumns] = readNewerDataFromCsv
%    dataMat        - matrix of features, one row per data point
%                     (self matches removed), one column per feature.
%    desiredColumns - cell array of feature names, in column order.
%  The function takes no input arguments. It expects the per-column
%  files from step (1a), including self_match.txt, to exist already
%  in the directory 'newer_csv_data'; it does not create them and
%  does not save the matrix to file.

%Written by Luca Cazzanti
%Copyright 2005
%$Id$

location = 'newer_csv_data';
desiredColumns = {'ratio', 'zscore', ...
                  'experiment_contact_order', 'prediction_contact_order', ...
                  'experiment_percent_alpha', 'prediction_percent_alpha', ...
                  'experiment_percent_beta', 'prediction_percent_beta', ...
                  'correct_superfamily'};
nColumns = length(desiredColumns);

% First find the self matches: they must be taken out of each feature.
% Each per-column file has one header line, which is skipped.
fname = fullfile(location, 'self_match.txt');
self_match = textread(fname, '%n', 'headerlines', 1);
idx = find(self_match == 1);

% Preallocate one row per remaining data point, one column per feature.
dataMat = zeros(length(self_match) - length(idx), nColumns);
clear self_match;

% Read the remaining features, dropping the self-match rows from each.
for iCol = 1:nColumns
    fname = fullfile(location, [desiredColumns{iCol} '.txt']);
    tmp = textread(fname, '%n', 'headerlines', 1);
    tmp(idx) = [];
    dataMat(:,iCol) = tmp;
end
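
% -------------------------------------------------------------------------
% Step (1a) of the header comment is described but not implemented in this
% file. What follows is only a minimal sketch of how it might be done; it
% is NOT the original author's code. The name splitCsvColumns and its
% arguments are hypothetical. It assumes that the CSV file has a single
% header row of column names, that fields are separated by plain commas
% (no quoted fields containing commas), and that strsplit is available.
% The file is streamed line by line, so it is never held in memory whole.
%
% Possible usage (the path to the CSV file is an assumption):
%   splitCsvColumns('luca.newer.csv', 'newer_csv_data', ...
%                   [{'self_match'}, desiredColumns]);
% -------------------------------------------------------------------------
function splitCsvColumns(csvFile, location, columnNames)
fin = fopen(csvFile, 'r');
if fin == -1
    error('splitCsvColumns:open', 'Cannot open %s', csvFile);
end

% Locate the requested columns in the header row.
header = strtrim(strsplit(fgetl(fin), ',', 'CollapseDelimiters', false));
[found, colIdx] = ismember(columnNames, header);
if ~all(found)
    fclose(fin);
    error('splitCsvColumns:missing', 'Requested column(s) not in header.');
end

% Open one output file per column. Each starts with a one-line header,
% matching the 'headerlines', 1 option used when the files are read back.
nOut = numel(columnNames);
fout = zeros(1, nOut);
for k = 1:nOut
    outName = fullfile(location, [columnNames{k} '.txt']);
    fout(k) = fopen(outName, 'w');
    if fout(k) == -1
        error('splitCsvColumns:create', 'Cannot create %s', outName);
    end
    fprintf(fout(k), '%s\n', columnNames{k});
end

% Stream the data rows, copying only the requested fields.
row = fgetl(fin);
while ischar(row)
    if ~isempty(strtrim(row))          % skip blank lines
        fields = strsplit(row, ',', 'CollapseDelimiters', false);
        for k = 1:nOut
            fprintf(fout(k), '%s\n', strtrim(fields{colIdx(k)}));
        end
    end
    row = fgetl(fin);
end

fclose(fin);
for k = 1:nOut
    fclose(fout(k));
end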