Visualizing Data

Yi
Nov 2, 2023

Summary

This is a process post documenting my thinking for the Communication Design Studio project at Carnegie Mellon University, taught by Stacie and Vikki. I was tasked with visualizing data about Artificial Intelligence. In this post, I write about how to extract information from data and how to convey the ideas behind the data through physical and digital methods.

Introduction to Data Visualization

Charles Joseph Minard was a French civil engineer recognized for his significant contribution to the field of information graphics in civil engineering and statistics, especially flow diagrams. One of his most famous works is his map of Napoleon’s disastrous Russian campaign of 1812.

Napoleon’s disastrous Russian campaign of 1812

When we dive deep into this map, we can see examples and techniques for successful data visualization. Edward Tufte summarized the techniques into the following visualization principles:

  1. The representation of numbers, as physically measured on the surface of the graph itself, should be directly proportional to the numerical quantities represented.
  2. Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graph itself. Label important events in the data.
  3. Show data variation, not design variation.
  4. In time series, displays of money deflated and standardized units of monetary measurement are nearly always better than nominal units.
  5. The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data. Graphics must not quote data out of context.
  6. Documentation.

Minard’s documentation for the making of the map of Napoleon’s Russian campaign

When I first looked at the Napoleon map, the legend was in French. I had no idea what the graphics meant. It’s very information-dense and hard to grasp for first-time readers without proper legends. This made me start to think about how the reader consumes information in the graphics.

In Minard’s time, graphics were hand-drawn on paper. The map is a static graphic meant for repeated, detailed examination.

With today’s digital technology, I can choreograph the sequence in which people are exposed to different information. For the printed piece, I can think about how people interact with it: for example, the scale of the paper, the impact of viewing distance on the viewer’s experience, and the information provided at different stopping points.

Questions for this phase

  • When deciding on the visualization focus, designers make a conscious decision about what to include and what to highlight. Can one discover new things during the data analysis and representation process?
  • Absolute vs. relative information: which one to choose?

Cognitively connect the content: once you know the legend’s logic, you don’t need to reference it again.

Transitions and storytelling in interactive data visualization

Identify Topic & Collect Data

Updated Nov 6, 2023

I was tasked with investigating data about AI.

I believe the development of AI, especially how it differs across countries, will shape the geopolitical power dynamics of the near future. Therefore, I looked at data on investment in AI in academia and business, AI-related laws, and employment. I downloaded data from the Our World in Data website. Using the map view to compare countries for each data set, I noticed patterns in which countries have the most investment, academic publications, patent registrations, or bills: namely, the United States, the European Union, and China.

Map view for data on scholarly publications (1st column), patents & private investment (2nd column), employment (3rd column), and bills & strategies (4th column). Visualization from the Our World in Data website.

Correlation doesn’t mean causation.

Although it is tempting to draw conclusions, I must be very cautious about the proposition or perspective my data visualization conveys. There’s a clear correlation between academic, business, and legal progress in AI across countries, but the causality is not clear yet.

Updated Nov 9, 2023

I started to think about what I wanted to do with the data sets. To help me clear my mind, I listed the following three points:

  1. I am interested in the reason behind and the implications of different development phases in AI globally.
  2. I currently have data to indicate the different phases of AI development, mainly through academic publications, business investment, patent registration, bills passed, and political strategy.
  3. What I am missing now is to make the connection between these data and their implication.

To organize and anchor the data at hand, I drew inspiration from Wurman’s Information Anxiety, where I learned about 5 ways of organizing information: Location, Alphabetical, Time, Categorical, and Hierarchical (LATCH). Location can be proximity. Time can be timelines, seasons, durations, months/years/hours, etc. Categorical means information is sorted by commonality. Hierarchical can mean sequence, level of importance, nesting/subset, parts to whole, or commonality as an anchor.

To find my anchoring system for the data, I would need to decide on a coordinate system. The following are the categories of data coordinate systems:

  • Cartesian (x/y/z)
  • Polar (amount, time>circular, proximity)
  • Geographical (absolute/relative)

Reflecting on my interests, I decided to start with China vs. the US vs. the world. I therefore selected only the data for China and the US from ourworldindata.org and compiled them into a single Excel sheet, as shown below.

These data are organized following the LATCH structure, mainly by category. I looked at data in the following categories:

  1. Private investment in AI
  2. Scholarly publications on AI
  3. AI-related patent applications and grants
  4. Newly founded AI companies

During the collection process, I found a lot of data specific to China and the US, but not much at the world level. I wonder what the best way is to collect world data for each category. Since the databases do not cover every country, summing all the listed countries’ data doesn’t necessarily equal the world statistics.

Identifying the Goal

To better extract insights, I need to reflect on the following questions:

  1. What is my orienting question?

Compare the investment in AI between the USA and China.

  2. Organization structure (LATCH)?

  3. What’s the scale/range/buckets (e.g., even intervals for numeric data)?

Data Pattern Detection

With a better understanding of my goal, I coded each data set accordingly to visualize and detect patterns. Below are the pattern detection strategies I used for this process.

Types: Visual, Temporal, Aural, Tactile

Number + Hierarchy: see part-to-whole relationships, differences, and similarities.

Temporal Building: Some patterns are more easily perceived than others, such as sound, motion, or vibration.

I color-coded the data based on 4 categories: private investment, scholarly publications, patent applications, and number of AI companies founded. I highlighted the highest number in the past 10 years. To discover trends over the past decade, I calculated the change rate (grey column) and highlighted the largest change rate and the negative change rates.
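The spreadsheet steps above can also be sketched in a few lines of Python. This is only an illustration under assumptions: the numbers are made up, the helper name `change_rates` is hypothetical, and I assume the change rate is (current - previous) / previous, with "largest" meaning the largest positive rate.

```python
def change_rates(values):
    """Year-over-year change rate: (current - previous) / previous."""
    rates = [None]  # the first year has no previous year to compare against
    for prev, curr in zip(values, values[1:]):
        # A previous value of 0 makes the rate undefined, so record None.
        rates.append(None if prev == 0 else (curr - prev) / prev)
    return rates

# Hypothetical numbers for illustration only, not the project's data.
years = [2018, 2019, 2020, 2021]
investment = [100.0, 150.0, 120.0, 210.0]

rates = change_rates(investment)
known = [(year, rate) for year, rate in zip(years, rates) if rate is not None]

# The three things highlighted in the sheet:
peak_year = years[investment.index(max(investment))]          # highest amount
largest_change_year = max(known, key=lambda pair: pair[1])[0]  # largest change rate
negative_years = [year for year, rate in known if rate < 0]    # negative change rates
```

With these sample values, the peak amount and the largest change both fall in 2021, and 2020 is the only year with a negative change rate.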

Color coding the data

I then started to think about how the number of buckets for the data would impact the final pattern. Usually, 7 buckets is a sweet spot that balances detail against the high-level concept.

Representing Data

Below are the tools I used to represent data.

Representation Strategies

  • Categorization + Appropriateness
  • Pacing + Simultaneity
  • Narrative + Indexical Structure
  • Expectations + Perceptions

Visual Representation Strategies

  • Position (location, correlation, proximity)
  • Length (distance, amounts, time)
  • Angle (proportion, parts to whole)
  • Direction (must have two points)
  • Shape (category)
  • Area/Size (quantity)
  • Volume
  • Color Saturation/Value (hierarchy)
  • Color Hue (category)

Temporal Representation Strategies

  • Duration (time)
  • Pacing/pauses/Intervals

Aural Representation Strategies

  • Volume (Hierarchy)
  • Pitch (Hierarchy)
  • Duration (Time)
  • Rhythm (Category)
  • Tone (Hierarchy)
  • Channels (Location)

Tactile Representation Strategies

  • Fiber Density
  • Texture
  • Volume
  • Vibration

Data Narrative

Narrative Structure: journey, story, path, arc, support, framework, mental model, linear

Indexical Structure: categories, layers of narratives, list of content, library, legend, overview, search & find, hierarchy, determine relevance, filter

Physical Data Visualization Practice

Often, when I think about data visualization, digital graphs come to mind. However, I learned that physical data visualizations have unique advantages. For example, physical objects introduce new possibilities and limitations through their physicality, such as weight, texture, and hardness. Physical data visualization is also more accessible across differences in computer ownership, eyesight, digital literacy, age, etc.

During class, we made a physical data visualization for the US’s tomato imports from nearby countries.

Representing my data

The goals for my data visualization:

  1. Compare one country’s data within a category across 10 years
  2. Compare the data of a category across 2 countries
  3. Compare one country’s data across categories

Design Iterations

Iteration 1.0–1.1 | Digital Visualization

Since the overarching theme of my data visualization is to help users compare, I used the size of the color fill to represent the amount, and the color to represent countries. By overlapping the data from the two countries, users should be able to compare the absolute amount for each year’s or category’s data.

Iteration 1.0: I used circles of varied sizes to represent the amount of data and color for the country: red for China and blue for the USA. The circles are overlapped to show a comparison of absolute numbers. Each circle pair represents a category’s data in a year, for example, the amount of private investment in AI in 2009.

Reflection: I tested with 5 people. Some reported that they did not see the overlapping concept and instead read each pair as the area of the central circle vs. the area of the ring. Moreover, it’s hard to compare across categories and across years to see the data trend.
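Since testers compared areas, it matters whether a circle’s area or its radius carries the number. The sketch below shows one way to size circles so that area is proportional to the value, in line with Tufte’s first principle. It is an assumption for illustration: this post does not record which scaling the prototype used, and the function name and sample numbers are hypothetical.

```python
import math

def radius_for(value, max_value, max_radius):
    """Radius such that the circle's AREA is proportional to value.

    Area grows with the square of the radius, so the radius must scale
    with the square root of the value, not with the value itself.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)

# Hypothetical example: the largest value (100) gets a 40 mm radius.
r_large = radius_for(100, 100, 40.0)
r_small = radius_for(25, 100, 40.0)

# A quarter of the value gives half the radius and a quarter of the area.
area_ratio = (r_small / r_large) ** 2
```

Scaling the radius linearly instead would make a value that is 4 times larger look 16 times larger by area.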

Iteration 1.0 (left) and 1.1 (right)

Iteration 1.1: To address the data trend comparison, I decided to merge the circles into a river stream diagram (image above).

Reflection: The comparison across categories is weak if the river stream diagrams are simply laid side by side. This strategy also engages users less and has only a loose semantic connection with the categories. However, the size comparison for the amount is effective.

Iteration 2.0 | Physical Visualization

I summarized the different types of comparisons my project aims to achieve to help me clarify the strategy. For example, users should be able to:

  1. compare countries’ absolute numbers in a specific year from one category (e.g., amount of private investment in AI in 2009)
  2. compare the data of one category and one country across 10 years
  3. compare data trends across categories of a country
  4. compare 2 and 3 between the two countries.

In summary, the comparison exists across countries, amounts, years, and categories. Because I want the comparison across these 4 types of information to be perceived at the same time, I need to leverage the 3rd dimension and textures to overlay rich information.

So I conducted a series of experiments to test physical representations and material abilities.

Iteration 2.0

Iteration 2.0: Started by trying to build a vertical panel that holds small circles. Tried a long, wide panel (left image) and 2 narrow panels (right image).

Reflection: Circles are hard to fully overlap due to the structures.

Iteration 2.1

Iteration 2.1: Used inserted pieces and empty spaces to compare the circles.

Reflection: The result looks promising, but the supporting columns on the two ends are distracting and do not add to the data representations.

Iteration 2.2

Iteration 2.2: Tried changing the structure and extending the circles to compare.

Reflection: Good for comparison overall.

Iteration 2.3

Iteration 2.3: Overlaying both countries to compare them at the same time and in a different way.

Reflection: Hard to separate and align.

Iteration 2.4

Iteration 2.4: Carving the center and coloring. Side views to show the dimension change.

Reflection: Took advantage of the third dimension, but front and side views convey the same information.

Iteration 2.5

Iteration 2.5: Etching the acrylics. Same-sized acrylic sheets with varied shapes or area sizes for the shapes in the middle. I tested full etching and only the outline. I’ve also tested adding colors.

Reflection: Very beautiful visual styles. However, the shape comparison is not very strong.

Iteration 3.0 | Working with real data

After understanding the material capabilities and visualization styles, I went back to my narrative and data. I wanted to use 2.2 to build a matrix board that compares the category data between the USA and China, as follows. I hope the user can play around with each pole and compare the data across categories and years.

Assemble Idea

After talking with Stacie, I realized my narrative is too simple. There are opportunities for comparing across categories within a country as well. However, the data unit for each category is different. How do I compare them?

I decided to calculate the change rate from the previous year for each category. Because a change rate is a unitless percentage, it can be compared across categories.

Discussion with Stacie

I decided to combine prototype 2.2 with 2.5 for this new change. I will keep using the acrylic size to represent the amount of data for the category, and then the etching for the change rate.

Material Diagram for the new design

To find out how big the circles would be, I used Grasshopper to visualize the change rate. This made me realize that the original strategy has a flaw: the data sets span a wide range with strong outliers. The change rate circle (etch) sometimes lies within the amount circle (cut) and sometimes outside it, which cannot work with my proposed method.

Grasshopper Visualization
The change rate (green) and the actual amount of data (red)

To solve this problem, I reexamined the data and realized some categories were missing some years. Therefore, I pruned the data from those years for better consistency across categories. Luckily, many of the outliers were gone after pruning. Then, I scaled the amounts by putting them into 8 buckets to get better sizing for the cuts.
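The pruning and bucketing steps can be sketched as follows. This is a minimal illustration under assumptions: the data is made up, the helper names are hypothetical, and I assume equal-width buckets, which this post does not specify.

```python
def common_years(series_by_category):
    """Years that are present in every category."""
    year_sets = [set(series) for series in series_by_category.values()]
    return sorted(set.intersection(*year_sets))

def bucket_index(value, lo, hi, n_buckets=8):
    """Map a value in [lo, hi] to an equal-width bucket numbered 1..n_buckets."""
    if hi == lo:
        return 1
    position = (value - lo) / (hi - lo)  # 0.0 at lo, 1.0 at hi
    return min(n_buckets, int(position * n_buckets) + 1)

# Hypothetical data: the investment series is missing 2018.
data = {
    "investment": {2019: 10.0, 2020: 30.0, 2021: 90.0},
    "patents": {2018: 500, 2019: 700, 2020: 650, 2021: 900},
}

# Step 1: keep only the years that every category covers.
years = common_years(data)
pruned = {cat: {y: series[y] for y in years} for cat, series in data.items()}

# Step 2: turn one category's amounts into disk-size buckets.
inv = pruned["investment"]
lo, hi = min(inv.values()), max(inv.values())
sizes = {year: bucket_index(value, lo, hi) for year, value in inv.items()}
```

Each bucket number would then correspond to one disk diameter for the laser cutter.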

Data Edits

After some user testing and discussion with Vikki, I realized that etching circles on circles may be confusing because size would represent the exact amount and the change rate at the same time. So I changed the representation of the change rate to shape. I experimented with flowers, using the number of petals to represent the change rate (by bucket) and filled vs. outlined shapes to represent positive and negative percentages.

Change Rate Representation

Final Design

With the above explorations and iterations, here’s the video for the final design prototype.

In the visualization, I used the Cartesian system and color, shape, size, material texture, and value to represent the data. Specifically, I represent the data as follows:

  1. Country: the color of acrylics
  2. Category: shape of the acrylic disk
  3. Year: the order in which the disks are stacked + etching
  4. Amount of the category data for each year: the size of the disk
  5. Magnitude of the change rate from the previous year: number of petals in the etch
  6. Sign of the change rate (positive/negative): filled/outline

To enhance clarity and readability in the physical prototype, I am employing a bucketing strategy for categorizing data, specifically focusing on the rate of change. By adopting a 20% change rate as the standard for each bucket, this approach simplifies the design by reducing the number of petals in the representation. This project prioritizes comparative sizes over precise measurements, making the bucketing strategy an effective tool for visually distinguishing different categories.
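The petal encoding described above could be computed like this. It is a sketch under assumptions: the function name is hypothetical, and I assume the rate is rounded up to the next 20% bucket so that any nonzero change shows at least one petal. The exact bucket boundaries used in the prototype are not stated in this post.

```python
import math

BUCKET_WIDTH_PCT = 20  # one petal per 20% of change, per the text above

def petal_spec(rate_pct, bucket_width=BUCKET_WIDTH_PCT):
    """Return (petal_count, style) for a change rate given in percent.

    Assumption: round up to the next bucket, so 1% to 20% is 1 petal,
    21% to 40% is 2 petals, and so on. A rate of 0% has no petals.
    """
    petals = math.ceil(abs(rate_pct) / bucket_width)
    if rate_pct > 0:
        style = "filled"    # positive change
    elif rate_pct < 0:
        style = "outline"   # negative change
    else:
        style = "none"      # no change, so nothing to fill or outline
    return petals, style

# Hypothetical change rates in percent.
growth = petal_spec(45)     # (3, "filled")
decline = petal_spec(-10)   # (1, "outline")
flat = petal_spec(0)        # (0, "none")
```

Rounding up keeps small changes visible, at the cost of showing a 1% change and a 20% change identically.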

Assembled Prototype

Experience Path

In designing the physical prototype, I’ve structured it to guide the audience through a layered exploration of data. The journey begins with the smallest unit, a disk, which represents the change rate and amount for a specific data point. This focused view allows the audience to initially grasp the intricacies of individual data sets. Next, the design enables a broader comparison, where viewers can examine and contrast data from one country within a single category across multiple years. This progressive expansion of scope culminates in the final stage, where the audience is encouraged to freely compare data across different categories and countries, facilitating a comprehensive and comparative understanding of the data landscape.

Consideration of Change Rate Edge Case

Because the change rate is calculated based on data from the previous year, it’s etched on the previous year’s disk. Therefore, the disk from the newest year will not have any petals. To avoid confusing it with a zero change rate, I added a circular pistil to every disk that carries change rate data. The last year’s disk has no pistil, signaling that it carries no change rate etching at all.
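One way to express this edge case in data form is to attach each change rate to the earlier year’s disk and record a pistil flag. This is a sketch only: the function name, field names, and sample numbers are hypothetical.

```python
def disk_specs(years, values):
    """Build one spec per disk; each rate is stored on the earlier year's disk."""
    specs = []
    for i, (year, value) in enumerate(zip(years, values)):
        has_next_year = i + 1 < len(values)
        if has_next_year and value != 0:
            rate = (values[i + 1] - value) / value
        else:
            rate = None  # newest year (or undefined rate): nothing to etch
        specs.append({
            "year": year,
            "amount": value,
            "rate_to_next": rate,
            # The pistil marks "this disk carries a change rate", even if it is 0.
            "pistil": rate is not None,
        })
    return specs

# Hypothetical data: no change from 2019 to 2020, then growth to 2021.
specs = disk_specs([2019, 2020, 2021], [100.0, 100.0, 150.0])
```

Here the 2019 disk has a pistil but no petals (a true 0% change), while the 2021 disk has neither, so the two cases stay distinguishable.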

Reflection

I resorted to transparent and brown acrylic due to the limited acrylic supply nearby. Moreover, I found that the transparent acrylic reads less clearly in profile view than the brown one. I assume the darker the color, the clearer its side view. If I had more time, I would use red and blue acrylic to build a better semantic connection between colors and countries.

Side View Comparison

I’ve gained considerable insight into the limitations inherent in working with physical materials. There’s a point at which the human eye can’t perceive further differences, and manufacturing has its constraints. The size of the laser cutter bed, for example, limits how large we can make things. To ensure the prototype is interactable and user-friendly, it needs to be of a certain size. I chose to use data bucketing for a clearer representation, but this method has a drawback: when dealing with large outliers, it tends to make the rest of the data appear more uniform than it actually is.
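The bucketing drawback mentioned above is easy to reproduce with made-up numbers. The sketch below assumes equal-width buckets (this post does not specify how the buckets were drawn), and the function name is hypothetical.

```python
def equal_width_buckets(values, n_buckets=8):
    """Assign each value to one of n_buckets equal-width buckets (1..n_buckets)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1] * len(values)
    width = (hi - lo) / n_buckets
    return [min(n_buckets, int((v - lo) / width) + 1) for v in values]

# Hypothetical values: five similar numbers plus one large outlier.
with_outlier = equal_width_buckets([10, 12, 15, 18, 20, 400])
without_outlier = equal_width_buckets([10, 12, 15, 18, 20])
```

With the outlier, the five ordinary values all fall into bucket 1 and only the outlier reaches bucket 8. Without it, the same five values spread across buckets 1, 2, 5, 7, and 8.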

Additionally, the etching turned out to be much smaller than anticipated, which could pose a challenge for users. It might require them to closely examine the etchings to read them effectively, possibly hindering ease of use.

Physical data visualization is challenging, yet it offers the opportunity to use a broader range of dimensions for data representation, including texture, transparency, the z-axis, thickness, and other tangible qualities. These elements significantly enrich the data’s interpretation and user interaction, transforming physical data visualization from a mere tool for representation into an immersive experience that deepens the user’s connection to and understanding of the data. The absence of a technological interface also makes physical visualizations versatile in settings where technology might be limited or impractical.

Citation

https://ourworldindata.org/grapher/private-investment-in-artificial-intelligence-cset?tab=table

https://ourworldindata.org/grapher/annual-scholarly-publications-on-artificial-intelligence

https://ourworldindata.org/grapher/scholarly-publications-on-artificial-intelligence-per-million-people?tab=table

https://ourworldindata.org/grapher/national-strategies-on-artificial-intelligence?tab=table


Yi

Product Designer | MDes 25' @Carnegie Mellon University