An Introduction to Source Control

Bo Ngoh
Published in Jumpstart Coding
15 min read · Aug 9, 2018

Introduction

Source Control is an integral skill for anyone looking to get serious about programming. It is considered an early-to-intermediate-level skillset that applies to you no matter your

  • Programming language of choice,
  • Skill level, and
  • Size of project

This means that it can be used to varying degrees, from a single hobby coder running a free copy of Git, to large multi-office teams running it on Team Foundation Server as a backbone to their day-to-day running and deployments of Live systems. The uses for, and expectations of, a knowledge of this function are therefore high. It thus comes as quite a surprise that this isn’t a topic commonly taught in schools, and as a result it is often not well understood or harnessed to its full potential in many working environments.

Hopefully, through this high-level introduction, you can get a good understanding of this important concept, and of how to harness it to enhance your current and future coding endeavors.

Unlike other tutorials, however, this introduction will try to avoid focusing on examples from any one specific Source Control tool. Those exist everywhere on Google, and often drill so deep into the details that they give too narrow a view of the whole concept.

Credit: xkcd

What is Source Control?

Not (just) Git

If you’re new and curious enough to be reading this, you’ve probably heard of ‘Git’, the current in-fashion software that’s almost always mentioned in the same breath as Source Control, but haven’t the foggiest idea what either is.

While it is a popular choice, Git isn’t the only representation of what Source Control is. A proper understanding of the functions of and theories behind Source Control, however, will inform your decision on whether to use Git or any of the other alternatives available. So what is Source Control?

Version Control

As Wikipedia puts it, Source Control is a Software (‘Source Code’) implementation of the wider functions of ‘Version Control’, which exists in many other forms outside of just Software Development. So what is version control itself? It’s the practice of maintaining copies of work at various stages of progress (complete or otherwise), for the future possibility of going back to refer to those earlier versions.

In other words, with regards to coding, it is the practice of preserving copies of code (& documentation, and configurations, and anything similar that you can think of) that you can revert/refer to any time after those copies were created, no matter what your current state of work has become.
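To make the idea concrete, here is a minimal sketch in Python. It is not how any real tool stores its data; `SnapshotStore` and its methods are hypothetical names made up for illustration. It only shows the core promise: every saved copy stays retrievable, no matter what the current work looks like.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotStore:
    """Toy version store (hypothetical): keeps every saved copy of a text."""

    versions: list[str] = field(default_factory=list)

    def save(self, content: str) -> int:
        """Store a copy and return its version number (starting at 1)."""
        self.versions.append(content)
        return len(self.versions)

    def load(self, version: int) -> str:
        """Return the content exactly as it was at the given version."""
        return self.versions[version - 1]


store = SnapshotStore()
v1 = store.save("print('hello')")
v2 = store.save("print('hello, world')")

# The current work has moved on, but version 1 is still retrievable.
first_draft = store.load(v1)
```

Real tools are far smarter about storage, but the contract is the same: save a version, get it back later.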

That is, in a nutshell, what Source Control does. Why, however, do we need such a function?

Why use Source Control?

There are essentially three reasons, in varying degrees of complexity. The complexity of your own needs will determine to what extent you will need to master them, but eventually you should end up needing all three.

1. Backup of work

…Ever had one of those days?

This is the most basic use for Source Control: to ensure your hours of blood and sweat are preserved against the vagaries of bad luck. We all know that a trivial power outage will destroy whatever unsaved data is in your computer’s memory, which is perhaps a few minutes (maybe an hour) of work. A hard disk failure, on the other hand, will almost definitely ruin your entire progress since day one, which might easily stretch across months or even years. By backing up your work in regular, (hopefully) short intervals through a Source Control implementation, you mitigate the risk of such a disaster. It works like the ‘save checkpoint’ function in popular action games: just restore your last point, and you’re back in business.

Of course, this works best when Source Control saves your code to a separate location like a dedicated server, but it can also serve in a limited but similar fashion when your ‘repository’ (the location of your backups) is on the same machine you code on. This is one of the strengths of Git (more on that later), and its potential is often not immediately realized.

How so? Suppose you have a current version of your source code now that works perfectly, and you want to add on top of it. Suppose this new function that you want to add is risky: you might have to change or re-arrange a bulk of your existing code, plus change configurations to suit new, untested external libraries, and you have no idea if it’ll even work in the end.

Perhaps you’re also, as is often the case, trying out multiple alternatives to carry out the same function, and you want to be able to drop your current attempt if it proves to be fubar and then go back to a clean slate to try the other option(s) at any point. Oftentimes the code will have changed so much, over so long, that you have no practical way to go back to where you started. Having a previous reliable ‘save point’ for your code will be an invaluable asset for this purpose. You don’t need to go through all the complexity or trouble of setting up a networked server to do this.
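As a rough illustration of a local ‘save point’, the sketch below copies a project folder aside before a risky experiment and restores it afterwards. Everything here is an assumption made for the example: the helper names `save_checkpoint` and `restore_checkpoint`, the file names, and the idea of copying whole folders (real Source Control tools do this far more efficiently).

```python
import shutil
import tempfile
from pathlib import Path


def save_checkpoint(project: Path, checkpoint: Path) -> None:
    """Copy the whole project folder to a checkpoint folder (hypothetical helper)."""
    if checkpoint.exists():
        shutil.rmtree(checkpoint)
    shutil.copytree(project, checkpoint)


def restore_checkpoint(project: Path, checkpoint: Path) -> None:
    """Throw away the current project state and bring back the checkpoint."""
    shutil.rmtree(project)
    shutil.copytree(checkpoint, project)


# Set up a throwaway project in a temporary folder on the same machine.
workspace = Path(tempfile.mkdtemp())
project = workspace / "project"
checkpoint = workspace / "checkpoint"
project.mkdir()

main_file = project / "main.py"
main_file.write_text("print('works perfectly')\n")
save_checkpoint(project, checkpoint)

# A risky experiment that ends badly...
main_file.write_text("print('half-finished rewrite'\n")
(project / "untested_library_config.ini").write_text("[broken]\n")

# ...so go back to the clean slate and try the next option.
restore_checkpoint(project, checkpoint)
restored = main_file.read_text()
leftover_files = sorted(p.name for p in project.iterdir())
```

After the restore, both the broken edit and the stray configuration file are gone, and no server was involved at any point.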

If this sounds almost like a movie you’ve seen before, it is: Tom Cruise (just one example) used a single ‘save point’ to trial-and-error his way to beating an entire alien invasion in Edge of Tomorrow — how cool is that? Certainly not as much as the first time you manage to pull off a critical source code rescue, I’ll bet.

Edge of Tomorrow; or “Tom Cruise teaches us the virtues of a good restore point”

2. Version tracking

More code History

As you code longer and get to know your tools better, the next feature of Source Control becomes apparent: it doesn’t just save a version of your code that you need to refer to; it saves every version. It’s a veritable history of your work.

How is this useful? In the scope of a month-long project at school: not so much. In an actual company project that spans months over different editions of releases: quite a lot indeed. Suppose something that worked a month ago started acting up an hour or so after your newest version of it is released. We call that ‘breaking the code’ — making something that has worked all this time suddenly stop working — and no one likes to hear that phrase on Release Day. When that happens, you may have to spend a lot of time tracing the error messages, debugging your way to the code in question, which could take a very long time on a sprawling codebase that’s gone through countless revisions over a long time.

…Or, you could, through what is called a ‘Code Comparison’ tool — which usually comes packaged along with your Source Control tool — simply call up a comparison of the code that’s changed in between these two versions and zero in on a much narrower scope of suspect code to look at.

A sample Code Comparison tool highlighting the change between two editions

The illustration above is a simplified example of the result of such a comparison. In this case, the change is immediately obvious: the type of a variable as well as its value has now changed. In larger comparisons over code files that span hundreds of lines, such a focus on only the code that changed can save a lot of time when chasing specific changes. This is possible because a Source Control system never truly removes traces of changes to your code; it just adds version upon version on top of it over time.
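You can get a feel for what such a comparison produces with Python’s standard `difflib` module. This is only a sketch: the two versions of the `area` function and the labels `release_1` and `release_2` are invented for the example, and the comparison tools bundled with Source Control systems are usually graphical and more capable.

```python
import difflib

# Two invented versions of the same file, one month apart.
old_version = """\
def area(width, height):
    scale = 1
    return width * height * scale
""".splitlines(keepends=True)

new_version = """\
def area(width, height):
    scale = "2"
    return width * height * scale
""".splitlines(keepends=True)

diff = list(
    difflib.unified_diff(
        old_version, new_version, fromfile="release_1", tofile="release_2"
    )
)

# Keep only the lines that were removed or added, skipping the file headers.
changed = [
    line
    for line in diff
    if line[0] in "+-" and not line.startswith(("+++", "---"))
]

print("".join(diff))
```

Out of the whole file, only the `scale` line shows up as changed, and both its type and its value are different, which is exactly the narrower scope of suspect code you want to look at.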

*Less* Code History

While Source Control helps preserve more of your history, it also helps you get rid of it!

…Stay with me here; I’ll elaborate. Prior to Source Control tool usage, this was what a typical section of long-lived code looked like:

…More comments than code!

It’s not an elaborate joke; these things do happen even in some present-day working environments. Every change that goes in is demarcated with a comment marking, for example, the date it was added and by whom, or for what purpose. It may have started out as a brave attempt to impose traceability onto the code, but over a span of just months, this can cause nightmare scenarios where tracing what the code actually does involves a trek over twice as many lines of comments explaining what it did. Productivity will plunge massively.

With good source control practices however, these situations need not exist: the tool tracks the code changes along with all the related metadata including notes on the changes, so you don’t have to leave permanent comments everywhere. Simply make your changes, deletes included, and ‘commit’ them (save them to the repository). All that remains on the current copy of the code is exactly what it does now, nothing more: you cut out all the clutter. Not to worry about previous work, though: your permanent history at the repository will ensure you can always go back and see what the code was like before, when it was changed, and by whom.
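Here is a small sketch of where that metadata lives instead. The `Commit` record, the `last_change` helper, the authors and the file contents are all hypothetical; the point is only that the who, when and why sit next to each version in the history, so the code itself can stay clean.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit record: the metadata lives here, not in comments."""

    author: str
    day: date
    message: str
    files: dict[str, str]  # file name -> full content after this commit


# An invented two-commit history for a single file.
history = [
    Commit("alice", date(2018, 1, 5), "Add tax calculation",
           {"tax.py": "rate = 0.07\n"}),
    Commit("bob", date(2018, 3, 2), "Raise tax rate",
           {"tax.py": "rate = 0.08\n"}),
]


def last_change(history: list[Commit], file_name: str) -> Commit:
    """Return the most recent commit that touched the given file."""
    for commit in reversed(history):
        if file_name in commit.files:
            return commit
    raise KeyError(file_name)


latest = last_change(history, "tax.py")
current_code = latest.files["tax.py"]   # no comment clutter in the code
previous_code = history[0].files["tax.py"]  # the old version is still there
```

The current file contains a single line of code, yet you can still answer who changed it, when, and what it looked like before.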

…Which brings us to the final function…

3. Conflict management

Inevitably, you’re going to have to work with others on the same code project. Just as inevitably, you’re going to work on the same files with them simultaneously. In situations like these, it is usually unfeasible to restrict access to these files to a single user at a time, especially when the users involved don’t physically share the same workspace (like offices in different countries and time zones).

For these cases, modern Source Control tools really come into their own. Most of today’s tools allow us to:

  1. Work simultaneously on each user’s own copy of the code, undisturbed by the changes made by another,
  2. Resolve code conflicts when we finally bring the separate users’ code back together (known as ‘conflict resolution’), and also
  3. Implement traceability on the code. As earlier explained: by marking both when code was changed and by whom, responsibility can be easily established. You know, just so you’d know who exactly to hunt down and murder — I mean, consult when something goes wrong.

This last feature is also Source Control’s most powerful one, and it is what differentiates it from merely taking regular zipped packages of your code and offloading them elsewhere: it’s pretty much the only way to support multiple-user access and to code along simultaneous independent paths, especially at a large scale as in the case of Open Source software.

An example of a source code graph (colors differentiate branches)

All Source Control implementations accomplish this by utilizing the concept of branches: like the branches of a tree, you can choose to branch off at any point from an existing ‘changeset’ (sets of changed code that are committed together) and continue adding more from that point on as you wish; just as others can continue adding more changesets on top of the original branch you diverged from, or even other branches.

Each branch continues on its own independent trajectory until its task is complete, after which it is ‘merged’ back into whatever ‘master’ branch it came from. How exactly this is implemented differs between the various tools; there is no ‘best’ way to do it. Rather, it all depends on your requirements for your code project.

The exact features and methods of implementation of branching deserve more than a few articles of their own and will not be covered in the scope of this article. A brief description of some of the common differentiating points will be provided below.
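One common way to picture this is as a graph where each changeset remembers its parent(s) and each branch is just a label pointing at its newest changeset. The sketch below follows that picture; the `Graph` class, the `c0`, `c1`… identifiers and the `feature` branch name are assumptions made for illustration, and real tools differ in how they implement it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Changeset:
    ident: str
    parents: tuple[str, ...]  # a merge has two parents, a root has none


class Graph:
    """Hypothetical changeset graph with named branches."""

    def __init__(self) -> None:
        self.changesets = {"c0": Changeset("c0", ())}
        self.branches = {"master": "c0"}  # branch name -> newest changeset

    def commit(self, branch: str) -> str:
        """Add a changeset on top of a branch and move the branch forward."""
        ident = f"c{len(self.changesets)}"
        self.changesets[ident] = Changeset(ident, (self.branches[branch],))
        self.branches[branch] = ident
        return ident

    def branch(self, new_branch: str, source: str) -> None:
        """Start a new branch at the current tip of an existing one."""
        self.branches[new_branch] = self.branches[source]

    def merge(self, source: str, target: str) -> str:
        """Join two branches with a changeset that has both tips as parents."""
        ident = f"c{len(self.changesets)}"
        parents = (self.branches[target], self.branches[source])
        self.changesets[ident] = Changeset(ident, parents)
        self.branches[target] = ident
        return ident


graph = Graph()
graph.commit("master")                     # c1
graph.branch("feature", "master")          # feature starts at c1
graph.commit("feature")                    # c2, independent of master
graph.commit("master")                     # c3, others keep working on master
merged = graph.merge("feature", "master")  # c4 joins c3 and c2
```

Note that this sketch only tracks the shape of the history. Combining the actual file contents during a merge is the hard part, and that is where conflicts come in.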

Common Source Control terms

There is a lot of specialized lingo associated with Source Control, which often proves to be the source (pun unintended) of most of the difficulty in understanding or differentiating the tools, especially since different implementations sometimes have different names for similar concepts. This section will attempt to provide a brief glossary of the common terms.

Changeset / Version

A set of changes spanning multiple lines and/or files that serve as a single checkpoint, with relevant metadata (e.g. date, name of user) of those changes saved as well.

Check in / Commit

The act of committing a changeset to a repository. Once done, the changes are made public so that others can peruse them.

Check out

The act of retrieving a set of changes from the repository. In some configurations, this can also be used to restrict access to files, e.g. by granting single-user change rights over a file until it is checked in again.

Branching / Forking

Like a branch striking out from the main trunk of a tree, this is a feature that allows a source controlled-project to pursue independent directions of further commits.

Merging

The opposite of branching: this is the act of bringing previously-separated code branches back together again, and the conflict management involved is the most complicated function of any Source Control tool.

Conflict

What happens when lines of code are simultaneously edited by multiple users, and the changes are then brought together. There are various tools that attempt to help walk you through and resolve them through a comparison of both (usually two at a time) parties’ code, and their effectiveness or ease of use can vary widely.
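A heavily simplified sketch of how a conflict can be detected: compare each party’s lines against the common version they both started from. The `merge_lines` function and the sample settings are hypothetical, and the sketch assumes all three versions have the same number of lines, which real merge tools do not require.

```python
from typing import Optional


def merge_lines(
    base: list[str], mine: list[str], theirs: list[str]
) -> tuple[list[Optional[str]], list[int]]:
    """Line-by-line three-way merge (toy version).

    Assumes all three versions have the same number of lines.
    Returns the merged lines (None where unresolved) and the
    1-based numbers of the conflicting lines.
    """
    merged: list[Optional[str]] = []
    conflicts: list[int] = []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both sides agree (or neither changed the line)
            merged.append(m)
        elif m == b:      # only they changed it: take their version
            merged.append(t)
        elif t == b:      # only I changed it: take my version
            merged.append(m)
        else:             # both changed it differently: a conflict
            merged.append(None)
            conflicts.append(number)
    return merged, conflicts


base = ["timeout = 30", "retries = 3", "verbose = False"]
mine = ["timeout = 60", "retries = 3", "verbose = True"]
theirs = ["timeout = 30", "retries = 5", "verbose = None"]

merged, conflicts = merge_lines(base, mine, theirs)
```

Lines 1 and 2 merge cleanly because only one party touched each of them. Line 3 was edited by both, so the tool cannot decide and a human has to step in.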

Repository

A collection of changesets. It can be hosted on a central location (public), on a local machine (private), or even both.

Trunk

Similar to branches; this is (you guessed it) the ‘main’, ‘source’ branch that all other branches ultimately converge back to. Treating one branch as the ‘main’ one shapes how projects that use a trunk organize their work.

Popular Source Control Options

The good news about Source Control options is that 1) after decades of usage and comparison, the ‘industry standard’ has largely settled down on a few mature examples, and that 2) thankfully, most of them are free!

Of course, an in-depth comparison of the various tools available is beyond both the scope of this introduction and the limits of my personal experience, and there are several articles just a Google search away that do a much better job. This section will instead focus on the few high-level differentiating factors that can help you form a general idea based on your current experience or requirements.

As you can probably guess from the rest of this article, most of the popular tools today are functionally equivalent in terms of their competencies. These include:

  • Branching & merging
  • Conflict detection & resolution via code compare tools
  • User access control
  • Code rollback (in case you ever want to undo specific changesets that never should’ve happened)
  • Backup systems (because even a network server can suffer a catastrophic failure)

One must note, however, that while these implementations are considered sufficiently competent across all these tools, the actual mechanics of their implementations along with the ease with which they can be picked up and mastered may differ greatly, and that may come as a shock to anyone cross-training between them.

When you really boil them down, however, where they really differ comes down to just two points:

1. Centralized vs Distributed

a. Centralized systems

The simplest, most intuitive form of Source Control is as a centralized system: there is a single network location (not counting backups) that serves as the source of truth of your codebase. All attempts to view, modify and publish code will pass through this system. It thus serves as a centralized hub through which everyone’s work is shared: it’s simple, so it’s easy to pick up, and it’s also easy to administer. So, if John wants to see the history of all the changes I’ve done to the code, I’d have to check my code into the server, after which John would have to get the latest version of the code (my changes) from that same server. The same will of course be true for the reverse direction.

However, if I want to commit even a single change or check out a file, the server (and in turn all participants) will be able to see the contents of my action, even if it might be unfinished. Even if my code is incomplete, if I were to go on a holiday and needed my code checked in somewhere for safekeeping, I’d have nowhere but the server itself to upload my changes to, incomplete code and all. Also, in some very strict implementations which implement single-user locks on checked-out files, if someone checks out a file and then leaves for a holiday or suffers a computer failure, the file will stay locked and prevent others from checking it out for themselves.

Lastly, it’d also mean that for me to do anything to the code on my local machine (even browse the file listing), I’d have to be constantly connected to the server. This can cause problems with a large userbase or machines suffering from a poor connection to the server.
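The strict, lock-based flavor of a centralized system can be sketched as follows. The `CentralServer` class, its method names, the users and the file are all made up for illustration and do not mirror any specific product.

```python
class CentralServer:
    """Toy central server (hypothetical): one copy of each file, one lock per file."""

    def __init__(self, files: dict[str, str]) -> None:
        self.files = dict(files)
        self.locks: dict[str, str] = {}  # file name -> user holding the lock

    def check_out(self, user: str, name: str) -> str:
        """Lock a file for one user and hand over its current content."""
        holder = self.locks.get(name)
        if holder is not None and holder != user:
            raise PermissionError(f"{name} is locked by {holder}")
        self.locks[name] = user
        return self.files[name]

    def check_in(self, user: str, name: str, content: str) -> None:
        """Publish new content and release the lock."""
        if self.locks.get(name) != user:
            raise PermissionError(f"{user} has not checked out {name}")
        self.files[name] = content  # immediately visible to everyone
        del self.locks[name]


server = CentralServer({"report.py": "total = 0\n"})
server.check_out("me", "report.py")

# I leave for a holiday without checking in, so John is locked out.
try:
    server.check_out("john", "report.py")
    john_blocked = False
except PermissionError:
    john_blocked = True

# Once I check in, my change is on the server and the lock is released.
server.check_in("me", "report.py", "total = 42\n")
johns_copy = server.check_out("john", "report.py")
```

Every action goes through the one server object, which is what makes the model easy to reason about and also what makes a forgotten lock or a bad connection everyone’s problem.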

b. Distributed systems

On the other hand, you have distributed systems: whenever you start work on a code project you actually download a whole copy of that project onto your PC, after which it is accessed entirely there. You can do whole strings of commits, rollbacks, check outs…all within the context of your local machine. This means no-one can or needs to see the varied commits you make over the course of your work until you’re finished and ready to condense and push your work back up to the source, where the system will work the same as with a centralized system from that point on.

This means you don’t need a constant network connection to the server. If you and someone else working on two separate copies of the same code want to merge without having to go through the server, you can do so directly with each other, should you choose to; only the single final copy of both your work will need to be uploaded in a single push up to the server.
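The distributed model can be sketched in the same toy style. The `Repository` class and its `clone` and `push` methods are hypothetical simplifications: commits are reduced to their messages, and `push` assumes nobody else has pushed in the meantime.

```python
import copy


class Repository:
    """Toy repository (hypothetical): a full history that can live anywhere."""

    def __init__(self) -> None:
        self.commits: list[str] = []  # commit messages stand in for changesets

    def commit(self, message: str) -> None:
        """Record a commit in this repository only."""
        self.commits.append(message)

    def clone(self) -> "Repository":
        """Download a complete, independent copy of the repository."""
        return copy.deepcopy(self)

    def push(self, remote: "Repository") -> int:
        """Send the commits the remote lacks; returns how many were sent.

        Assumes the remote's history is a prefix of ours (no one else pushed).
        """
        missing = self.commits[len(remote.commits):]
        remote.commits.extend(missing)
        return len(missing)


server = Repository()
server.commit("Initial import")

laptop = server.clone()            # the whole project, on my own machine
laptop.commit("Try approach A")    # works without any connection
laptop.commit("Fix approach A")

visible_before = list(server.commits)  # nobody sees my work yet
pushed = laptop.push(server)           # one push when I am ready
```

Until the push, the server only knows about the initial import. All the intermediate commits happened locally, with no connection required.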

2. Free vs Paid

This doesn’t need much explanation — software is always divided upon this line no matter the purpose. The usual differences between these two are all there: paid/licensed source control systems can cost quite a lot of money, especially when nothing can be more free than free.

The former type justifies its price tag by:

  • The amount of user support available: both when nasty bugs happen and in terms of continually-added features to keep the technology up to date with the times. The speed of this support is also a factor.
  • The ease of use: this could mean the difference between having to do everything by obscure console commands after a convoluted installation process, or a snazzy graphical user interface delivered by a one-click installer. There are of course legions of purists that swear by the former, but for beginners the latter is definitely a draw.
  • The amount of integration available with other tools: in working environments these come into play, where source control tools can be tagged to work items, issue resolution, automated deployments, code quality review etc.

The most common tools (that I have experienced) are divided amongst these factors as such:

Free + Distributed:

Free + Centralized:

Paid + Centralized:

There’s also a third common differentiator which is more of the net result of all these factors: how widely the tool is used. This is where Git, and GitHub built on top of it, really stands out, as a large amount of open-source code is hosted there. This is an important consideration because you might find that a library or code sample you’re looking for is housed on GitHub, for example, which forces you to use Git in order to access it.

It’s worth noting, however, that Microsoft has shaken things up a lot in recent years: it’s moved its TFVC product into the cloud under its Visual Studio Team Services (VSTS) package which offers much more powerful benefits via access to its Azure cloud-based services. The basic implementation of Source Control hasn’t changed all that much; it’s just less localized (which can still be a good thing). Also, its recent purchase of GitHub and its intent to integrate Git more and more deeply into its existing Team Foundation products all the more crystallize the dominance of Git in the userbase.

Chances are, you’ll still end up picking up Git sooner rather than later, and it’s not a bad choice at all. However, do know that the other options do still exist whether due to their respective strengths or through sheer corporate inertia, and it’s certainly a lot easier to pick up a centralized, UI-based system like SVN if you’re just getting introduced to the concepts.

Conclusion

I hope this high-level introduction to the juggernaut that is Source Control has, by avoiding getting bogged down in too much detail, delivered enough of the concept to give you an idea of why to use it and what you can try on your (hopefully) long journey down the path of serious coding.

Make no mistake: knowledge of this skill is effectively a requirement if you’re going to contribute to an Open-Source community or work on any serious software project or department. Its mastery, however, is an invaluable advantage towards both accelerating your team’s work and mitigating several of the common disasters that can visit it.

Good luck; and Happy Coding!
