Re-emerging History (I)

Part I: A Data Treasure Hunt


/root

Gentoo Linux was love at first command prompt. Within three weeks of installing Gentoo, I was so enamoured with the experience that I became a contributor to the community.

At Gentoo, I’d learned to be a better programmer. I’d also learned about developer relations and community development (and customer support and user experience and product management). I learned about growing and leading a talented team to produce a high-quality and high-profile Linux distribution. And after 5 years with the project, it was time to step away.

Seven years later—or rather, 6 months ago—I left a job and needed my own computer again. Intuitively, I reached for Linux. Instinctively, I reached for the Gentoo community. In 2002, I was the 17th person to join Gentoo. By 2013, more than 800 people had gone through the developer community.

I had a lot of history to catch up on, so I set out to draw a picture of 800 people’s contributions over 14 years. First, then, I needed to find out who everyone was and when they were part of the project.

ls /devs

Gentoo’s Infrastructure team maintains their developer directory in an LDAP database. It serves as a means to authorize access to various services (code repositories, the bug tracker, forums, email, shell accounts, etc.), and provides a historical registry of everyone who has ever been part of the Gentoo project.

Robin Johnson, a longtime leader of the developer community, strives to ensure that the directory is comprehensive and accurate. He was certain that there were early developers who were missing from the registry (spoiler alert: he was right). Additionally, the records were unclear on when twenty or so developers had actually been part of the project. In return for the data, I agreed to help fill in some of those information holes.

Armed with the LDIF file that Robin provided, I went to work.
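
For readers who have not met LDIF before: it is a plain-text format in which records are separated by blank lines, each line reads "attribute: value", and a line that starts with a single space continues the previous one. The following is a minimal sketch of reading such a file, not the tooling actually used for this project. The attribute names (uid, cn, joinDate) and the sample records are hypothetical, and base64-encoded values ("attr:: ...") are left undecoded.

```python
def parse_ldif(text):
    """Parse LDIF text into a list of dicts mapping attribute -> list of values."""
    records, current, last_attr = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last_attr = {}, None
        elif line.startswith('#'):
            # Comment line: skip it.
            continue
        elif line.startswith(' ') and last_attr is not None:
            # Continuation: drop the single leading space and append.
            current[last_attr][-1] += line[1:]
        else:
            attr, _, value = line.partition(':')
            last_attr = attr.strip()
            current.setdefault(last_attr, []).append(value.strip())
    if current:
        records.append(current)
    return records


# Hypothetical sample data, for illustration only.
SAMPLE = (
    "dn: uid=alice,ou=devs,dc=example,dc=org\n"
    "uid: alice\n"
    "joinDate: 2002/04/01\n"
    "\n"
    "dn: uid=bob,ou=devs,dc=example,dc=org\n"
    "uid: bob\n"
    "cn: Bob\n"
    "  Builder\n"
)

for record in parse_ldif(SAMPLE):
    print(record['uid'][0], record.get('joinDate', ['unknown'])[0])
```

Every attribute maps to a list because LDIF allows an attribute to appear more than once in a record.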


/src/code/repos

The most obvious way for a developer to contribute is via the code repositories. Developers can contribute code, documentation, web content, and website design. Gentoo has historically used CVS to manage four main code repositories:

  • gentoo-x86 holds the portage tree (ebuilds);
  • gentoo-src is a legacy repository for tooling and experimental source code;
  • gentoo-projects is a legacy source code repository that replaced gentoo-src;
  • gentoo acts as a content management system for Gentoo’s website.

A repository’s log contains the entire history of actions that developers have performed: adding, changing and deleting files and directories. To make effective use of those logs they needed to be converted to a structured data format. Patrick Lauer, who has done some beautiful visualizations for Gentoo, pointed me to a Python script from the CodeSwarm project that does the job well.

However, it crashed for me because of my timezone. Lines 177-178 split the date only on “+”, so an offset behind UTC (for example, -0500) is left in the string and strptime fails:

date_without_plus = date_parts[1].split("+");
date = time.strptime(date_without_plus[0].strip(), '%Y-%m-%d %H:%M:%S')

The following changes enable it to handle all timezone offsets as well as two different date formats:

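# Keep only the date and time tokens; any trailing UTC offset is dropped
date_no_tz = ' '.join(date_parts[1].split()[0:2]).strip()
for fmt in ['%Y-%m-%d %H:%M:%S', '%Y/%m/%d %H:%M:%S']:
    try:
        date = time.strptime(date_no_tz, fmt)
        break
    except ValueError:
        # Wrong format for this line; try the next one
        pass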
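
To see the patched logic in isolation, here is a small self-contained sketch. The function name parse_cvs_date and the sample date strings are made up for illustration; they are not part of the CodeSwarm script.

```python
import time

FORMATS = ['%Y-%m-%d %H:%M:%S', '%Y/%m/%d %H:%M:%S']


def parse_cvs_date(raw):
    """Return a struct_time for a CVS log date, ignoring any UTC offset."""
    # Keep only the date and time tokens; "+0100" or "-0500" is dropped.
    date_no_tz = ' '.join(raw.split()[0:2])
    for fmt in FORMATS:
        try:
            return time.strptime(date_no_tz, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognized date: %r' % raw)


# Hypothetical inputs covering both formats and both offset signs.
for sample in ['2005-03-14 09:26:53 -0500',
               '2005-03-14 09:26:53 +0100',
               '2002/04/01 12:00:00']:
    print(time.strftime('%Y-%m-%d %H:%M:%S', parse_cvs_date(sample)))
```

Note that the offset is discarded rather than applied, so the parsed times are not normalized to UTC. For month-level summaries that imprecision should rarely matter.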

find ‘repo.logs’ | xargs grep ${missing_developers}

I ran the script on each repository, as follows:

for repo in gentoo*/; do
    # strip the trailing slash
    repo=${repo%/}
    cd "${repo}" || continue
    echo "Updating ${repo}..."
    cvs update
    echo "Generating log..."
    cvs log > "../${repo}.cvs.log"
    # Convert the CVS log that was just written
    python convert_logs.py -c "../${repo}.cvs.log"
    # Go back to the starting directory
    cd -
done

Out of the four XML logs emerged a manifest of committers:

grep 'author=' gentoo*.xml | \
    awk -F' ' '{print $4}' | \
    awk -F '"' '{print $2}' | \
    sort -u \
    > committers.txt

We discovered developers in the manifest who did not have a record in the developer directory.
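
Finding those developers amounts to a set difference between the committer manifest and the directory. A minimal sketch, assuming both lists are available as plain usernames; the function name, the sample names and the case-insensitive matching are all assumptions for illustration:

```python
def missing_from_directory(committers, directory_uids):
    """Return committer names that have no directory entry, sorted."""
    # Assumption: usernames are compared case-insensitively.
    known = {uid.strip().lower() for uid in directory_uids}
    names = {name.strip() for name in committers if name.strip()}
    return sorted(name for name in names if name.lower() not in known)


# Hypothetical data: committers as read from committers.txt,
# directory as the usernames extracted from the LDIF file.
committers = ['alice', 'bob', 'carol', 'bob', '']
directory = ['alice', 'Carol']

print(missing_from_directory(committers, directory))
```

Here only bob is reported, since alice and carol both match a directory entry.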

/src/code/activity -> /dev/tenure

Following Patrick’s lead, I then further distilled the logs into monthly commit summaries for each developer (and simplified the structure from XML to JSON):

# Create the 'devs' subdir if it doesn't exist
[[ -d devs ]] || mkdir devs
# Generate a consolidated JSON log for each developer
# showing all their commits in all the repositories.
for i in $(cat committers.txt); do
    # Delete the file if it exists. We'll create it anew.
    # (Done once per developer, before any repository is appended.)
    rm -f "devs/${i}.json"
    for j in gentoo*.cvs.xml; do
        # Turn each of this developer's <event .../> lines into a JSON object
        grep "author=\"${i}\"" "${j}" | cut -d' ' -f2-4 | \
        sed \
            -e 's:/var/cvsroot/::' \
            -e 's: :,:g' \
            -e 's:^:,:' \
            -e 's:$:},:' \
            -e 's|=|:|g' \
            -e 's|,\([^:]*\):"|,"\1":"|g' \
            -e 's:^,:{:' \
        >> "devs/${i}.json"
    done

    # Each JSON file should be an array of objects
    sed -i \
        -e '1 s:^:[:' \
        -e '$ s:,$:]:' \
        "devs/${i}.json"
done

From their individual repository logs we could reasonably determine when they were actively part of the project.
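
As a rough sketch of that last step, the per-developer JSON can be reduced to monthly commit counts, and the first and last active months taken as an approximate tenure. This assumes each object has a "date" field holding milliseconds since the Unix epoch, stored as a string; that is an assumption about the converted log format, and the sample events below are invented.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def monthly_commits(events):
    """Count commits per 'YYYY-MM' month from a list of event dicts."""
    counts = Counter()
    for event in events:
        # Assumption: 'date' is milliseconds since the Unix epoch.
        when = datetime.fromtimestamp(int(event['date']) / 1000,
                                      tz=timezone.utc)
        counts[when.strftime('%Y-%m')] += 1
    return dict(sorted(counts.items()))


def tenure(events):
    """Return (first_month, last_month) with commits, or None if empty."""
    months = sorted(monthly_commits(events))
    return (months[0], months[-1]) if months else None


# Hypothetical events in the shape produced by the shell script above.
SAMPLE = json.loads('''[
 {"date":"1017619200000","filename":"gentoo-x86/a","author":"alice"},
 {"date":"1017705600000","filename":"gentoo-x86/b","author":"alice"},
 {"date":"1104537600000","filename":"gentoo/c","author":"alice"}]''')

print(monthly_commits(SAMPLE))
print(tenure(SAMPLE))
```

For a real file, SAMPLE would be replaced by the result of json.load on something like devs/alice.json. Commit activity only approximates tenure: a developer may join before their first commit or stay on after their last one, which is why the other sources were still needed.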


/var/log/bugzilla

Gentoo’s issue tracking system is a valuable tool for workflow prioritization and end-user feedback. It enables users to file tickets in order to report issues, suggestions and contributions to the development team, and enables developers to prioritize their work by understanding the severity of a bug or the usefulness of a feature.

find /var/log/bugzilla | xargs grep ${missing_developers}

The Developer Relations team (DevRel) has been using tickets to track recruitment and retirement of developers since late 2002. Combing through their ticket list over the course of a week, I found a further set of people who were not in the LDAP database.

/dev/tenure -> /var/log/bugzilla/activity

Most DevRel tickets indicate recruitment and retirement dates. For those that don’t, we could guess based on the timestamps of the tickets themselves. Thus, we started to build a more complete picture of when people had been Gentoo developers.


/proc/dev/rel

DevRel announces most new additions to Gentoo’s developer community. One announcement goes out to the expanded development community via the gentoo-dev mailing list. A second announcement goes out to the user community via the monthly newsletter.

sudo -u dberkholz cat /var/mail/archives

Donnie Berkholz, also a longtime Gentoo leader, had a look at the results of my research and graciously offered to help complete the directory using his unmatched research skills and tenacity. He searched through the mailing list archives to find when developers had been announced.

sudo -u dberkholz cat /var/news/archives

The remaining bits of information were now esoteric. These were developers who’d had little interaction during their time, and so were hard to find. Donnie used his superpowers again to comb through the newsletter archives, and answered the last of the whens. At the end of the month-long project, it was Donnie who brought it home.


/home

History provides us with a context of who has come before and gives us an opportunity to be grateful for their work. My years as a Gentoo developer have underscored my personal and professional growth. And when I returned to Gentoo (as a user), I re-experienced the gravity toward the community. This project was fuelled by gratitude to the entire set of Gentoo developers.

In Part II we’ll explore drawing a picture from all this cleaned-up data.


Note: The developer registry data and utilities mentioned in this article can be found in my GitHub repository.

Note: The image is licensed under CC-by-SA-2.0 by Christopher Hsia.