Data Gathering, Visualization, Digital Mapping and Palladio: A Critique

DATA GATHERING:

Challenges:

  1. Collecting data in a specified format. Each humanist figure is being researched for the first time in this class, and the data must be collected in a specific format, so the greatest challenges arise when the required data is not available. For example, research may indicate that Copernicus traveled to Germany to collaborate with his peers on his manuscript De revolutionibus orbium coelestium. However, the Location_name column in the locations spreadsheet expects city names, not countries. In this case, either more research must be done to identify the city, or the trip should be left out for lack of specificity.
  2. Articles provide conflicting data. Information about historical figures is not always well documented, and different articles may offer different speculations about the same figure. For example, even the year in which a humanist was born may differ across sources because it was not properly recorded.
  3. Errors in recording Location_LatLong. Depending on how the lat/long data is collected, whether through a mapping application like Google Maps or another online source, the same Location_name (e.g., Venice) may have slightly different lat/long values across different spreadsheets. This becomes a problem when combining the spreadsheets for each humanist group, because assigning the Locations_Link in the travels spreadsheet requires that each location be recorded consistently.

Lessons learned in data gathering:

  1. Make sure that everyone gathering data for the humanist spreadsheets knows the format that each column requires in each of the three spreadsheets (people, travels, and locations).
  2. Consult reliable articles and multiple sources in order to make the best judgment about which information to use when articles provide conflicting information about a humanist.
  3. Make sure that Location_LatLong is recorded without the degree symbol, and that everyone who collects lat/long information uses the same website or mapping application (e.g., Google Maps). Doing so also prevents the same location from being recorded with different lat/long values.
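
Lesson 3 could also be checked with a short script rather than by convention alone. The following is a minimal Python sketch, assuming coordinates are stored as signed decimal degrees in a single "lat, long" cell. The function names, the four-decimal precision, and the sample coordinates are illustrative assumptions, not part of the class spreadsheets.

```python
import re


def normalize_latlong(raw, places=4):
    """Strip degree symbols and return 'lat,long' at a fixed precision.

    Assumes signed decimal degrees (no N/S/E/W suffixes).
    """
    cleaned = raw.replace("°", " ")
    parts = [p for p in re.split(r"[,;\s]+", cleaned) if p]
    if len(parts) != 2:
        raise ValueError(f"expected two numbers in {raw!r}")
    lat, lon = float(parts[0]), float(parts[1])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range in {raw!r}")
    return f"{lat:.{places}f},{lon:.{places}f}"


def canonical_coordinates(rows):
    """Keep the first coordinates seen for each place name.

    rows is an iterable of (location_name, raw_latlong) pairs.
    Returns (canonical, conflicts), where conflicts lists every later
    row whose coordinates disagree with the first one recorded.
    """
    canonical = {}
    conflicts = []
    for name, raw in rows:
        key = name.strip().casefold()
        value = normalize_latlong(raw)
        if key not in canonical:
            canonical[key] = value
        elif canonical[key] != value:
            conflicts.append((name, canonical[key], value))
    return canonical, conflicts


# Illustrative rows, as if taken from two different group spreadsheets.
rows = [
    ("Venice", "45.4408°, 12.3155°"),
    ("venice ", "45.4371, 12.3326"),
    ("Padua", "45.4064 11.8768"),
]
canonical, conflicts = canonical_coordinates(rows)
for name, kept, other in conflicts:
    print(f"{name!r}: keeping {kept}, ignoring {other}")
```

Running a check like this before merging would surface conflicting coordinates for the same place while they are still easy to fix.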

DATA ORGANIZATION & MERGING OF SPREADSHEETS:

Challenges:

  1. Format inconsistencies and spelling errors. If the spreadsheets are to be merged, all of the data must be recorded in one unified format. Our “scientists” group had inconsistencies in the Location_Name column of the locations spreadsheet because not all of the entries were cities. Sometimes a place was also misspelled, or two different humanists referred to the same place by different names (e.g., the historic name vs. the modern name). These problems created extra, unnecessary rows in the merged spreadsheet, and those rows were difficult to delete without disturbing other data rows, especially once a large amount of data had been collected.
  2. Errors with using the “find/replace” tool. When a column had inconsistencies, such as commas used instead of semicolons in the occupation column of the people spreadsheet, our group used the find/replace tool in Excel to replace the commas with semicolons, or to remove certain characters altogether, such as the degree symbols in Location_LatLong. This caused problems because, if the row(s) we were modifying were not explicitly selected, the find/replace tool applied the change to ALL of the data rows by default. This altered other rows and removed necessary characters from data that had already been recorded correctly, leading to more issues.
  3. Issues with data text alignment. Some humanist groups center-aligned their data, whereas others left it in the default left alignment. As a result, the alignment of the data in the final merged spreadsheet varied from one humanist group to the next.
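
The duplicate rows described in challenge 1 could be reduced by mapping every place name to one agreed spelling before merging. The following is a minimal Python sketch under that assumption. The alias table is hypothetical and would have to be agreed on by the groups; the two entries shown are historic German names for cities now known as Frombork and Kaliningrad.

```python
# Hypothetical alias table: lower-cased variant -> agreed canonical name.
ALIASES = {
    "frauenburg": "Frombork",
    "königsberg": "Kaliningrad",
}


def canonical_name(name, aliases):
    """Collapse stray whitespace, then look the name up in the alias table."""
    tidy = " ".join(name.split())
    return aliases.get(tidy.casefold(), tidy)


def merge_locations(sheets, aliases):
    """Merge lists of place names into unique canonical names.

    Names are compared case-insensitively and returned in first-seen order.
    """
    seen = {}
    for sheet in sheets:
        for name in sheet:
            canon = canonical_name(name, aliases)
            seen.setdefault(canon.casefold(), canon)
    return list(seen.values())


# Two illustrative group sheets that refer to the same two places.
group_a = ["Frauenburg", "Venice"]
group_b = ["Frombork", " venice"]
merged = merge_locations([group_a, group_b], ALIASES)
print(merged)
```

A lookup table of this kind does not catch misspellings it has never seen, so it complements the manual spell check rather than replacing it.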

Lessons learned in data organization & merging of spreadsheets:

  1. Double-check the spelling of cities in the Location_Name column, and agree on which name (e.g., historic or modern) is used to denote each location.
  2. Carefully select the rows to be changed before applying the find/replace tool, so that data in other rows is not altered.
  3. Agree on a text alignment (such as the default left alignment setting) that everyone uses consistently.
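
One way to avoid the find/replace problem is to apply the replacement in code, where the target column has to be named explicitly. The following is a minimal Python sketch, assuming each spreadsheet row has been loaded as a dictionary (for example with csv.DictReader). The column names "Name" and "Occupation" and the sample row are assumptions for illustration.

```python
def replace_in_column(rows, column, old, new):
    """Return copies of the rows with old -> new applied in one column only.

    Every other column, and the original rows, are left untouched.
    """
    fixed_rows = []
    for row in rows:
        fixed = dict(row)  # copy, so the input data is never modified
        fixed[column] = fixed[column].replace(old, new)
        fixed_rows.append(fixed)
    return fixed_rows


# Illustrative row: the comma in "Name" must survive the replacement.
people = [
    {"Name": "Copernicus, Nicolaus", "Occupation": "astronomer, mathematician"},
]
fixed_people = replace_in_column(people, "Occupation", ", ", "; ")
print(fixed_people[0])
```

Because the function returns new rows, the original data stays available for comparison if the replacement turns out to be wrong.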

PALLADIO:

Benefits in using Palladio:

  1. Great tool for data visualization that readily accepts and parses the data without requiring any programming knowledge from the user.
  2. The user can directly copy/paste the data from the Excel spreadsheet into the text box, provided that the data is in the right format.
  3. The user can specify what graph nodes represent, and how links between nodes are created through the data columns that they input into Palladio.
  4. The zoom in/out feature for the graph created in Palladio is a neat way to explore the graph network without overwhelming the user.

Challenges:

  1. Data formatting errors and spreadsheet copy/paste order. Catching formatting errors in Palladio is difficult, especially when there is a lot of data. In particular, when inputting the travels, locations, and people spreadsheets, pasting them in the wrong order (e.g., people first instead of travels) caused problems.
  2. Missing Header Issue. Our group also initially left out the row of column names, which resulted in the missing header error.
  3. Special Characters Error. Our group had data containing “-” instead of “_”, which resulted in the special characters error. A benefit of Palladio, however, was that it clearly showed the line number where the error had occurred, marked by a red icon.
  4. Debugging data errors in Palladio. While Palladio indicates the line number of a data row that has an error, it often does not indicate what the error is, and it is up to the user to figure this out. This can be frustrating and time-consuming.
  5. Overwhelming Functionality. While Palladio has a lot of useful functionality for creating the nodes, adding specified labels to the nodes, and allowing the user to specify how links between the nodes are created and visualized, the Palladio user interface presents too much information to the user all at once. With so many options, I found Palladio very confusing to use because it was not clear which option did what.
  6. Problem with too much data. Palladio does not work well when there is a large amount of data, especially when it is spread across multiple spreadsheets. The user has to copy/paste each spreadsheet into Palladio by hand, and checking the data for errors then becomes very difficult.
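
Several of these errors could be caught before the data is pasted into Palladio. The following is a minimal Python sketch of such a pre-check, assuming the data is tab-separated text as copied from a spreadsheet. It does not reproduce Palladio's own validation rules; the checks, messages, and column names are assumptions based on the problems our group ran into.

```python
import csv
import io


def check_sheet(text, expected_headers, delimiter="\t"):
    """Return a list of (line_number, message) problems found in the text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [(1, "sheet is empty")]

    problems = []
    header = [h.strip() for h in rows[0]]

    # Missing header row, or misspelled column names.
    missing = [h for h in expected_headers if h not in header]
    if missing:
        problems.append((1, "missing header(s): " + ", ".join(missing)))

    # Hyphens where the agreed format uses underscores.
    for name in header:
        if "-" in name:
            problems.append((1, f"hyphen in column name {name!r}"))

    # Rows with too few or too many fields, reported with their line number.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                (number, f"expected {len(header)} fields, found {len(row)}")
            )
    return problems


# Illustrative locations sheet with a bad column name and a short row.
sample = "Location_Name\tLocation-LatLong\nVenice\t45.4408,12.3155\nPadua\n"
for line, message in check_sheet(sample, ["Location_Name", "Location_LatLong"]):
    print(f"line {line}: {message}")
```

Unlike the red icon in Palladio, each reported problem here states what is wrong as well as where, which is the kind of feedback that would have saved our group time.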

Improvements for Palladio:

  1. Have a tutorial/video summarizing all of the options Palladio offers for data visualization, with concrete data embedding examples.
  2. Describe the specific error that is causing the issue with the data, as opposed to simply indicating that there is an error.
  3. Offer a preview of the graph network immediately after the user sets each option for creating it.
  4. Give the user the ability to adjust the size of each node based on the population size it represents.
  5. Let the user re-orient the graph in order to view it from different angles.
  6. Once the user creates the graph, it would be useful to be able to run various functions to analyze it, similar to the graph analysis functions that Gephi provides. More specifically, a partition tool that colors, and thereby ranks, the nodes by a specified color scheme according to node values would be a good feature to have.
  7. It would also be helpful to give the user the option to calculate the average node degree, assign weights to edges based on a user-specified column, and compute the shortest path, the number of edges, the number of nodes, and the clustering coefficient of the graph.
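
To make the measures named in items 6 and 7 concrete, the following is a minimal Python sketch that computes them for a small undirected, unweighted graph. It is not how Palladio or Gephi implement these functions; the function names and the sample travel links between cities are assumptions for illustration.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def edge_count(graph):
    """Each undirected edge appears in two neighbour sets."""
    return sum(len(n) for n in graph.values()) // 2


def average_degree(graph):
    """Mean number of links per node."""
    if not graph:
        return 0.0
    return sum(len(n) for n in graph.values()) / len(graph)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Each link between two neighbours is counted once from each end.
    links = sum(1 for a in neighbours for b in graph[a] if b in neighbours) / 2
    return 2 * links / (k * (k - 1))


# Illustrative travel links between cities.
edges = [
    ("Venice", "Padua"),
    ("Padua", "Bologna"),
    ("Bologna", "Venice"),
    ("Bologna", "Rome"),
]
graph = build_graph(edges)
print(len(graph), edge_count(graph), average_degree(graph))
print(shortest_path(graph, "Venice", "Rome"))
print(clustering_coefficient(graph, "Bologna"))
```

Edge weights from a spreadsheet column are left out of this sketch; supporting them would mean replacing the breadth-first search with a weighted shortest-path method such as Dijkstra's algorithm.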

Next Steps: Building Upon Palladio: Once users can visualize the data in the spreadsheets, it would be interesting to let them combine different graph models into one large graph. The ability to compare multiple graphs on measures such as node degree, node size, and edge size would also be helpful. Another long-term goal is finding similarities or common sub-graphs across graphs. For example, if the graphs share nodes, the ability to identify the common nodes, or the maximum common sub-graph, would help in understanding how the graphs are similar.
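
When nodes carry the same labels in both graphs (for example, city names), the shared nodes and shared links can be found with plain set operations. The following is a minimal Python sketch of that labeled case only, with assumed function names and sample data. Finding a maximum common sub-graph between graphs whose nodes do not share labels is a much harder (NP-hard) problem and is not attempted here.

```python
def edge_set(edges):
    """Represent undirected edges as frozensets so (a, b) equals (b, a)."""
    return {frozenset(edge) for edge in edges if edge[0] != edge[1]}


def nodes_of(edges):
    """All nodes that appear in at least one edge."""
    return set().union(*edge_set(edges))


def common_subgraph(edges_a, edges_b):
    """Return (shared_nodes, shared_edges) for two labeled graphs."""
    shared_nodes = nodes_of(edges_a) & nodes_of(edges_b)
    shared_edges = edge_set(edges_a) & edge_set(edges_b)
    return shared_nodes, shared_edges


# Illustrative travel graphs for two different humanist groups.
scientists = [("Venice", "Padua"), ("Padua", "Bologna")]
artists = [("Padua", "Venice"), ("Venice", "Rome")]
shared_nodes, shared_edges = common_subgraph(scientists, artists)
print(sorted(shared_nodes))
print([sorted(edge) for edge in shared_edges])
```

This only works if both graphs spell each place the same way, which is another reason to agree on location names during data gathering.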