Of Hypergraphs, Programming Pairs, and Personal Learning Networks, Oh My! (Part 1 of 2)
Exploring the Research and Data Visualization Designs of the GitHub Copilot for Disabled Developers Program
In the initial posts to this publication, I used a 2x2 matrix model to identify the composition of the Programming Pairs to be recruited to participate in the GitHub Copilot for Disabled Developers research project and support program. The axes anchoring this model are Physical and Coding Abilities. The Physical Ability dimension consists of Disabled and Able-bodied participants. The Coding Ability dimension consists of persons who are either Expert or Novice developers.
Given these two axes, the four quadrants of this 2x2 matrix design break into ten Programming Pair compositions. Four are homogeneous or matched pairs, and six are variously mixed pair combinations, as shown in the panels of Figure 1.
The Homogeneous/Matched Mentor Pairs, as shown in the top-left panel of Figure 1, are as follows:
- Disabled Experts
- Disabled Novices
- Able-bodied Experts
- Able-bodied Novices
The six Mixed Mentor Pairs, as shown in the other three panels of Figure 1, are the following:
- Disabled Expert and Able-bodied Novice
- Disabled Expert and Able-bodied Expert
- Disabled Novice and Able-bodied Expert
- Disabled Novice and Able-bodied Novice
- Disabled Expert and Disabled Novice
- Able-bodied Expert and Able-bodied Novice
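The count of ten compositions can be checked with a minimal Python sketch. It assumes that each quadrant of the 2x2 model is a (physical ability, coding ability) tuple and that a Programming Pair is an unordered pairing of two quadrants, with a quadrant allowed to pair with itself. The variable names are illustrative only and are not part of the program design.

```python
from itertools import combinations_with_replacement

# The four quadrants of the 2x2 model: (physical ability, coding ability).
QUADRANTS = [
    ("Disabled", "Expert"),
    ("Disabled", "Novice"),
    ("Able-bodied", "Expert"),
    ("Able-bodied", "Novice"),
]

# Every unordered pairing of quadrants, including a quadrant paired with itself.
pairs = list(combinations_with_replacement(QUADRANTS, 2))

# A pair is matched when both members come from the same quadrant.
matched = [p for p in pairs if p[0] == p[1]]
mixed = [p for p in pairs if p[0] != p[1]]

for first, second in pairs:
    kind = "matched" if first == second else "mixed"
    print(f"{' '.join(first)} + {' '.join(second)} ({kind})")

print(len(pairs), len(matched), len(mixed))
```

The final line prints 10, 4, and 6, which agrees with the four matched and six mixed pairs listed above.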
Each of these Programming Pairs will commit to making software development contributions to their choice of a participating Digital Humanities, Citizen Science, or Open Source Software project. Collectively, these disabled and able-bodied developers, together with their Programming Pair partners, will participate in a research and support program that will be managed and studied as a formal Personal Learning Network or PLN. This learning network, implemented as a weekly online meeting in Microsoft Teams, will extend and deepen the individual and pair-wise learning experience of all participants in the program.
If you have read the previous posts in this publication, these summary introductory comments are sufficient to provide context for the following exploration of the research and data visualization designs for the proposed GitHub Copilot for Disabled Developers project. If you have dropped into this article without reading my publication’s prior posts, you may want to take a few minutes to read them. This quick detour before going further will help you to understand the life-changing potential for the use of GitHub’s Copilot technology to enable software development by disabled developers.
Initial Thoughts on Data Design
The 2x2 design describing the composition of the Programming Pairs — as described above and in the initial posts of this publication — is useful for defining the combinations of physical and coding abilities of participants in the proposed research and support group project. To explore the range and diversity of potential impacts of GitHub’s Copilot programming assistance technology, we need to think about the kinds of quantitative and qualitative data that we might capture over the lifespan of the proposed research study.
A non-exhaustive list of such impacts, to be measured periodically over the course of the study, includes:
- the Degree of Satisfaction each member of a Programming Pair feels with regard to participation in the project
- the Pair member’s perceived Level of Value that GitHub Copilot contributes to the Pair’s productivity for a given period
- each Pair’s Delta between, and Mean of, its members’ individual Degrees of Satisfaction with participation in the project
- the Pair’s number of lines of code (or some other quantitative measure) of contribution to their chosen Digital Humanities, Citizen Science, or Open Source Software project
- the number and duration of Pair Programming sessions a Pair has during a reporting period of participation in the project, etc.
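To make the pair-level Delta and Mean measures concrete, here is a small sketch using only the Python standard library. It assumes a 1-to-5 satisfaction scale and reads Delta as the absolute difference between the two members' scores. Both are assumptions for illustration, as are the member names and the synthetic values.

```python
from statistics import mean

# Hypothetical weekly satisfaction scores (assumed 1-5 scale) for one
# Programming Pair. Names and values are synthetic placeholders.
weekly_scores = {
    "week_1": {"member_a": 4, "member_b": 2},
    "week_2": {"member_a": 5, "member_b": 4},
}

def pair_summary(scores):
    """Return the absolute delta and the mean of the two members' scores."""
    a, b = scores.values()
    return {"delta": abs(a - b), "mean": mean([a, b])}

summaries = {week: pair_summary(s) for week, s in weekly_scores.items()}
print(summaries)
```

A large delta with a middling mean would flag a pair whose members experience the program very differently, which a mean alone would hide.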
Given this initial list of the range and type of data to be periodically captured during this proposed project, it quickly becomes clear that the 2x2 matrix model is not useful for designing the research or, once data is captured, visualizing that time-series data. This realization set me on the path of thinking about the study’s data model presented in this post.
Dusting Off My Memories of Hypergraphs
One of the first challenges I thought about upon exploring the design for this proposed research project was that the data model would have to be capable of conveniently capturing, analyzing, and visualizing/presenting data that could reflect the dual nature of the study’s research subjects. On the one hand, there will be impacts that we want to explore that measure the perception and contribution of the individual Persons participating in the project. At a minimum, this would mean that at least 20 people — two people for each of the ten Programming Pair combinations — or multiples of 20 would be needed for this study. On the other hand, our research subjects are Person-dyads that, together, combine to form one of the ten different compositions of the Copilot-assisted Programming Pairs. To use a loose allusion, our study consists of Person atoms and Programming Pair molecules.
As I thought about this important characteristic of our proposed research, I remembered a few of my experiences with my wife and research partner, Timlynn Babitsky, when we were (non-graduating) students in the doctoral program in Mathematical Social Science at UC Irvine in the early 1980s. In addition to our studies that involved learning about graph theory, we moonlighted as technical writers of documentation for software products in the emerging Object-Oriented Programming industry. We consulted on two early Apple Macintosh products that seem relevant to my current design; one a Petri Net decision support system and the other an OOP-based visual programming language.
Both these systems had part-subpart features based on subgraph compositions that were examples of multidimensional hypergraphs… at least that was the case in my dusty recollections. So I turned to Wikipedia to refresh my understanding of the definition and nature of hypergraphs. My first reading of the Wikipedia reference page was a Wonderland tumble down the Rabbit-hole back to my Irvine student days, or rather, daze as I struggled a bit to reorient myself to the symbolic expressiveness of graph theory and mathematical notation. Fortunately the opening sentence succinctly defined a hypergraph:
To understand the implications of this summary statement, the visual preference of my cognitive functions was drawn to the insightful example diagrams to the right-side margin of this page captured in the screenshot of Figure 2:
What really caught my attention on the Hypergraph page of Wikipedia were the two images that showed two ways to visually represent a simple example hypergraph. Most of us are familiar with the classic Venn diagram notation, and it is easy to see how it can be used to represent the multi-node relationships among vertices in a hypergraph. Where a basic graph is represented by a line-edge between two nodes, the colored inter-related areas of the Venn diagram represent the more complex composition of a hypergraph’s multi-node edges.
Discovering PAOH (Parallel Aggregated Ordered Hypergraph) Diagrams
While the Venn diagram is intuitively understandable for relatively small example hypergraphs, real-world hypergraph models can quickly become colorful swirly messes of overlapping edge-representative areas that challenge our ability to draw and interpret the model of the hypergraph. Nevertheless, I felt that the Venn diagram approach would be my next step to explore and visualize the data model of the GitHub Copilot for Disabled Developers research project.
I quickly determined that there were non-trivial graphical challenges as I tried to draw the Programming Pair combinations using my relatively robust Photoshop drawing skills. These Venn diagram drawing frustrations led me to take a more careful look at the second example diagram on the Wikipedia Hypergraph page, highlighted here in the screenshot montage of Figure 3:
The PAOH diagram on the Wikipedia Hypergraph page is helpfully drawn to be isomorphic to the Venn diagram example above it. This relationship between the example diagrams produced an immediate Eureka Moment that put me on a path to learn more about the PAOH notation. What caught my attention was how the PAOH visualization suggested the grid-like structure of a ubiquitous spreadsheet such as Excel. I knew I might use Excel to prototype the data analysis of the proposed GitHub Copilot for Disabled Developers research project, exploring analyses there that I would then implement in Python using NumPy and pandas DataFrames. I also knew that modern spreadsheets had gone far beyond their roots in VisiCalc, when computation ruled and graphic visualization features were meager.
To investigate PAOH notation further, I followed Wikipedia’s link to the PDF of the paper cited in the first footnote of its Hypergraph page. The article is “Analyzing Dynamic Hypergraphs with Parallel Aggregated Ordered Hypergraph Visualization” (PDF, HAL-02264960) by Paola Valdivia, Paolo Buono, Catherine Plaisant, Nicole Dufournaud, and Jean-Daniel Fekete. These researchers are part of Aviz, the Visual Analytics Project, which is a multidisciplinary team affiliated with INRIA and the Université Paris-Saclay. The extensive 12-page PDF is an example-rich exploration and comprehensive explanation of the PAOH notation.
As the notation’s acronym name suggests, the emphasis of the PAOH design is centered on Parallel, Aggregated, Ordered visualization of Hypergraphs. As such, this notation is particularly well-suited to the exploration and visualization of research data that is both described by a hypergraph and aggregated for pattern-discovery within time-series data. Examples of this kind of data in the social network domain include business transactions among a multi-channeled supply chain network, the table partners sitting together over daily lunchtime in a school cafeteria, or article citations among a community of published researchers. Figure 4 includes two highlighted regions of the many examples from the Aviz researchers’ 12-page paper:
The reader is encouraged to access the full PAOH PDF paper to explore the many examples and insightful explanations of this powerful visualization format. In Figure 4, I have highlighted the essential features that show how effective this notation is for presenting time-series based hypergraph-modeled data. Note in particular how the vertical “drip lines” show the node/edge composition of subgraphs within the model’s hyperedges, as laid out on the grid of horizontal rows and vertical columns of the hypergraph model.
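The grid-like character of the notation can be illustrated with a rough text rendering. This is only a sketch of the basic layout, assuming one horizontal row per vertex and one vertical column per hyperedge. It ignores the time-slice grouping and ordering features of the full PAOH design. The example hypergraph and the function name render_paoh are hypothetical and are not taken from the Aviz paper or its tooling.

```python
# Hypothetical hypergraph: four vertices and four hyperedges (illustrative only).
vertices = ["v1", "v2", "v3", "v4"]
hyperedges = {
    "e1": {"v1", "v2", "v3"},
    "e2": {"v2", "v3"},
    "e3": {"v3", "v4"},
    "e4": {"v1", "v4"},
}

def render_paoh(vertices, hyperedges):
    """Draw one row per vertex and one column per hyperedge.

    'o' marks a vertex that belongs to the hyperedge, '|' continues the
    vertical drip line between its first and last member, '.' is empty.
    """
    row_of = {v: i for i, v in enumerate(vertices)}
    label_width = max(len(v) for v in vertices) + 1
    lines = [" " * label_width + " ".join(f"{name:<2}" for name in hyperedges)]
    for v in vertices:
        cells = []
        for members in hyperedges.values():
            rows = [row_of[m] for m in members]
            if v in members:
                cells.append("o")
            elif min(rows) < row_of[v] < max(rows):
                cells.append("|")
            else:
                cells.append(".")
        lines.append(f"{v:<{label_width}}" + " ".join(f"{c:<2}" for c in cells))
    return "\n".join(line.rstrip() for line in lines)

print(render_paoh(vertices, hyperedges))
```

Each column reads top to bottom as one hyperedge, which is what makes the layout map so naturally onto spreadsheet cells.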
Buoyed by the clarity and spreadsheet grid-like nature of the PAOH notation, I was reinvigorated to again try to visualize the design for the Programming Pairs of my proposed GitHub Copilot for Disabled Developers research project. This time I would turn to my Excel spreadsheet rather than to Photoshop for my data model prototyping exploration.
A PAOH View of the Programming Pairs Composition of the GitHub Copilot for Disabled Developers Research Project
With the PAOH PDF handy for reference in a browser view on my computer, I launched Excel with a renewed determination to diagram the Programming Pairs of my proposed research. In one frenzied morning exploratory prototyping session, I came up with the PAOH diagram shown in Figure 5.
My initial model for the ten Programming Pairs of the proposed research study turned out to be an Order 20, Size 16 undirected hypergraph. This model is Order 20 as this is the minimum number of Person subjects needed to populate the Programming Pairs composition. It is Size 16 as that is the total number of hyperedges in the model: the four Physical and Coding Abilities hyperedges, the two Homogeneous/Matched and Mixed Mentor edges, and the ten discrete Programming Pair combinations defined by the six edges just mentioned (4 + 2 + 10 = 16).
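The Order 20, Size 16 figures can be reproduced with a short sketch that builds the hypergraph as a mapping from hyperedge name to a set of Person ids. The person ids ("P01" and so on) and the edge names are placeholders chosen for illustration. The sketch assumes the minimum population of exactly two people per Programming Pair.

```python
from itertools import combinations_with_replacement

QUADRANTS = [
    ("Disabled", "Expert"),
    ("Disabled", "Novice"),
    ("Able-bodied", "Expert"),
    ("Able-bodied", "Novice"),
]

# Hyperedge name -> set of person ids. Start with the four ability edges
# and the two matched/mixed edges; the ten pair edges are added below.
hyperedges = {name: set() for name in
              ["Disabled", "Able-bodied", "Expert", "Novice", "Matched", "Mixed"]}

person_count = 0
for first, second in combinations_with_replacement(QUADRANTS, 2):
    pair_name = f"{' '.join(first)} + {' '.join(second)}"
    hyperedges[pair_name] = set()
    for physical, coding in (first, second):
        person_count += 1
        person = f"P{person_count:02d}"  # placeholder id
        # Each person sits on one physical-ability edge, one coding-ability
        # edge, one matched/mixed edge, and exactly one pair edge.
        hyperedges[physical].add(person)
        hyperedges[coding].add(person)
        hyperedges["Matched" if first == second else "Mixed"].add(person)
        hyperedges[pair_name].add(person)

vertices = set().union(*hyperedges.values())
order, size = len(vertices), len(hyperedges)
print(f"order={order}, size={size}")  # order=20, size=16
```

Under these assumptions each ability edge holds ten people, the Matched edge holds eight, and the Mixed edge holds twelve.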
With my Excel spreadsheet saved and a screenshot taken of what would become Figure 5 above, I ended my exploratory experience using insights inspired by the Aviz researchers’ PAOH hypergraph notation.
So Far So Good, and My First Gotcha
As I contemplated my next data design prototyping session, I kept thinking about how I might take the next step to factor into my visualization model a dataset of qualitative or quantitative measures that were not specific to the hyperedges needed to describe the Programming Pair compositions of my proposed research. How, for example, might I visualize adding a weekly measure of a Programming Pair’s individual teammate’s Degree of Satisfaction with regard to participating in the research project?
What became quickly clear was that my research design would not conveniently accommodate the “parallel, aggregated, ordered” nature of the PAOH notation semantics. There would be too many clusters of breadth-wise columns needed to add in additional data to provide the kind of longitudinal presentation of program impact measures using the PAOH method. At this point in my research design prototyping, the PAOH method gave me a good model for understanding and visualizing the Programming Pair composition of the proposed GitHub Copilot for Disabled Developers research project. But I needed a way to grow my understanding of this diagramming semantic to handle the analysis and visualization of the many ways I could imagine wanting to explore and document the impact of GitHub’s Copilot as a potentially life-changing productivity technology for disabled developers.
So my next data design prototyping session was focused on how I might use what I had so far to anchor the visualization of a simple participant’s Degree of Satisfaction measure during the proposed project lifecycle.
My First Faceted (3-D) PAOH Diagram
As I sat down to embark on my second data design prototyping session, I saw that the “playing field” of a PAOH diagram consisted of three distinct sections as shown in Figure 6.
These regions (1 and 2 in Figure 6) include two Subject-Hyperedge grid-cell blocks on the left and top edges of the PAOH diagram which frame the matrix of cells that form, in my case, the Programming Pair compositions represented by the “drip-line” dyads of the third highlighted region of the PAOH diagram. As I contemplated this structure, I saw that the ten Pair Naming columns of the top edge of my PAOH design could be equally represented by labels on the drip-line dyads. Furthermore, the matched/mixed nature of the Pair could be expressed as a second drip-line label or dyad-edge coloration. In effect these alternative semantics for the vertical drip-line columns could free up the grid-cell matrix of my data design to reflect data measures that were not exclusive to the definition of the Programming Pair compositions.
For example, using synthetic data, Figure 7 shows how we can visualize Pair members’ weekly measures of Degree of Satisfaction regarding participation in the GitHub Copilot for Disabled Developers program.
A sharp-eyed reader will note that Figure 7 does not use dyad edge labels or coloration to express the Programming Pair composition expressed in the top (region 2) of Figure 6. Figure 7 is, in effect, a semantically incomplete diagram that violates the notation standards — most notably the vertical column drip-lines — described by the author/creators of the PAOH notation. This labelling/coloration omission is done for two reasons:
- Labelling and coloring the dyad edges would be difficult to achieve in my prototyping Excel tool, and even if doable, would create a visually jumbled figure that would be difficult to interpret (a weak reason).
- This omission will emphasize that my research data design has evolved into a Faceted PAOH notation (a stronger reason) that I will describe in the following paragraphs and images that complete Part 1 of this data design post about the research method for the proposed GitHub Copilot for Disabled Developers program.
With the introduction of Figure 7 we have clearly moved from the classic definition of a PAOH diagram to a multi-dimensional faceted extension of the notation.
As the Figure 8 screenshot of a Bing search for the definition of facet shows, a facet presents one side or dimension to a many-sided or multi-dimensional object or subject of interest. With that definition in mind, we can then see the “broken” PAOH diagram of Figure 7 — showing Pair member Degree of Satisfaction data — as an additional aspect or feature of the proposed GitHub Copilot for Disabled Developers research design. What is different about this qualitative facet is that it is not part of the set of PAOH model hyperedges needed to define the composition of the Programming Pairs of this proposed research.
If we take the Programming Pair composition visualized in Figure 5 as the X-Y plane of a three-dimensional space, we can then project the Degree of Satisfaction model as the orthogonal Z plane of that space and visualize our data model design as shown in Figure 9.
Looked at from the face on the left of Figure 9, which is the X-Y plane of this faceted 3-D space, we see the PAOH model that prescribes the Programming Pair compositions of the proposed research. On the Z plane of this space, we see the visualization of synthetic data representing one weekly measure of the qualitative data reflecting Pair members’ Degree of Satisfaction regarding participation in the proposed research and support program.
From 3-D to 4-D Faceted PAOH Hypergraph Modeling
To measure the Degree of Satisfaction for Pair member participation in the proposed research and support program, we will want to gather this data weekly or biweekly. We can then visualize this time-series data as measurement events along the X axis of this 3-D space, as shown in Figure 10, adding a fourth dimension to our research data model.
To further refine our understanding of this research design data model, we can describe this multi-dimensional space in terms of two sets of hyperedges:
- One set is a subset of hyperedges that describe the invariant and bounded nature of the Programming Pair compositions — what we previously described as the Person-Atoms and Pair-Molecules of our research — and
- A multiplicity of variable and unbounded subsets of hyperedges that prescribe the qualitative and quantitative measures — to be gathered over time — that capture the behavior and state of the atoms and molecules of our study’s Programming Pairs.
That is, referencing the left-side facet of Figure 10’s diagram, we can understand the PAOH model of the Programming Pair compositions as a facet orthogonal to each of the time-slice “pages” of the behavior/state datasets, represented here by the Degree of Satisfaction dataset on the right side of Figure 10. By extension, the behavior/state portion of this model can contribute as many additional facets as are of interest to the overall design of the research study. Each of these behavior/state measures can be similarly projected onto our Pair-composition facet.
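One way to picture this faceted model in code is as two plain dictionaries: an invariant composition facet and a set of measure facets, each holding one page per time slice. This is only a sketch under assumed names. The pair labels, person ids, measure name, and values below are synthetic placeholders, and the functions project and series are hypothetical helpers.

```python
# Invariant composition facet: pair name -> member ids (placeholders).
composition = {
    "Disabled Expert + Able-bodied Novice": ("P01", "P02"),
    "Disabled Novice + Disabled Novice": ("P03", "P04"),
}

# Variable measure facets: measure -> week -> person -> value (synthetic).
facets = {
    "degree_of_satisfaction": {
        1: {"P01": 4, "P02": 3, "P03": 2, "P04": 5},
        2: {"P01": 5, "P02": 3, "P03": 3, "P04": 5},
    },
}

def project(measure, week):
    """Project one time-slice page of a measure onto the composition facet."""
    page = facets[measure][week]
    return {pair: tuple(page[p] for p in members)
            for pair, members in composition.items()}

def series(measure, pair):
    """Follow one pair through every time slice of a measure."""
    return {week: project(measure, week)[pair]
            for week in sorted(facets[measure])}

print(project("degree_of_satisfaction", 1))
print(series("degree_of_satisfaction", "Disabled Novice + Disabled Novice"))
```

Adding a new behavior/state measure means adding one more key to facets. The composition facet does not change, which mirrors the bounded versus unbounded distinction drawn above.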
In Conclusion for Part 1 and Preview of the Focus of Part 2
I believe that this first part of my two-part article exploring the data model for the GitHub Copilot for Disabled Developers program presents a sound and compelling case supporting the research design and impact analysis of this proposed research study and support program. This study’s first-order focus will investigate the potential impact of, and develop best practices for, the use of GitHub Copilot to provide a life-changing programming assistive technology to disabled developers. As a second-order contribution this research will explore and document the implementation of the Faceted PAOH research design and associated data analysis and visualization notation.
In the second part of this article I will present an additional subset aggregation design for the handling of Faceted PAOH hyperedges needed to explore and visualize the molecular composition and behavior/state analysis of the Programming Pair-level subjects of the GitHub Copilot for Disabled Developers program. I will also briefly show how this data design can be used for interactive Personal Learning Network mentoring and program improvement.
Jim Salmons is a seventy-one-year-old post-cancer Digital Humanities Citizen Scientist. His primary research is focused on the development of a Ground Truth Storage format providing an integrated complex document structure and content depiction model for the study of digitized collections of print era magazines and newspapers. A July 2020 fall at home resulted in a severe spinal cord injury that has dramatically compromised his manual dexterity and mobility.
Jim was fortunate to be provided access to the GitHub Copilot Technology Early Access Community during his initial efforts to get back to work on the Python-based tool development activities of his primary research interest. Upon experiencing the dramatic positive impact of GitHub Copilot on his own development productivity, he became passionately interested in designing a research and support program to investigate and document the use of this innovative programming assistive technology for use by disabled developers.