Crowd Sourcing a Language (part 2)

Jonathan Beard
Published in cat /dev/urandom
Jan 11, 2016

So what is a programming language? A quick Google search turns up this simple definition:

A programming language is a formal constructed language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs to control the behavior of a machine or to express algorithms. — Wikipedia

Fundamentally, a programming language is how we communicate with our computers. Think about that for a second. Humans have evolved communication that is highly nuanced and social. Layers of perception filter our understanding of words. Each meaning is layered with historical and personal associations, and is often quite tortured from its original definition (to convince yourself of this, take a gander through an etymological dictionary). Programming languages, to date, have been created by individuals or by committees. The meaning of each word (its semantics) was clear to the designers, but for others this meaning can be lost. I’d also like to know what factors influence the construction of that meaning; I’d even go so far as to say that it could be influenced by the types of devices we use. I want to know, as closely as I can, what representations, patterns, and perceptions within a programmer’s mind lead to the perception of meaning. This is critical to building programming languages for everyone, because that’s who we as a society need to learn to program (at least at a basic level).

A friend of mine had this to say when I mentioned this idea to her in Sept. 2014:

most people who respond to calls to crowdsource have no technical knowledge and most ppl with technical knowledge wouldn’t waste their time being in a crowd — anonymous

But that is entirely the point. The next generation needs to be creators and consumers. To do that we need diverse input into how we “speak” to our computers. Without that we’re doomed to spend more weeks teaching non-native semantic meaning to students (e.g., what does equality mean when comparing strings vs. integers?). A study by Clahsen & Felser (2006) found only a weak correlation with language representational transfer (i.e., pulling meaning from the native language into the new one), but did find a correlation in early learning of a language (i.e., English language speakers would overly masculinize, etc.). The same study also observed a trend that the CS education community has noted as well: syntax is typically learned quickly. Semantic errors typically have a much longer tail, which seems to indicate that understanding non-native meaning works the same way for programming languages as it does for spoken languages (although correlation is of course not causation).
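
To make the strings vs. integers point concrete, here is a small Python illustration (Python is just my choice for the example; other languages make different choices, which is rather the point). The variable names are made up for illustration.

```python
# Equality and ordering depend on the types being compared.
int_equal = (10 == 10)        # same numeric value
mixed_equal = ("10" == 10)    # a string never equals an int in Python
float_equal = (10 == 10.0)    # numeric types compare by value
str_less = ("10" < "9")       # strings compare character by character
int_less = (10 < 9)           # integers compare by magnitude

print(int_equal, mixed_equal, float_equal, str_less, int_less)
# prints: True False True True False
```

A newcomer who reads "10" < "9" as a question about numbers gets a surprising answer, even though the syntax is identical to the integer case.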

The first goal of the study will be to find the meaning associated with typical programming constructs, and whether it does in fact differ based on demographics or some other factor (does the input device influence our assessment of meaning?). We’ll start small: a simple randomized check box asking what a specific construct means. I toyed with using a recurrent neural network to generate bits of code at random, but it is a bit difficult to ensure the generated segments are not total gibberish. So, I’ve put together examples from various programming languages and we’ll randomly present them to volunteers (well, anybody that goes to the page). The first set of users will have fill-in-the-blank boxes to give their meaning, then we’ll randomly present selections of those submitted meanings to a new set of users. I’m very curious to see how meaning gets associated with each language construct. Will demographics, geography, etc. predict the meaning? I just purchased buildalanguage.org; hopefully I’ll get it going within the next week or so with this simple study.
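
A rough sketch of the two-phase flow described above, in Python. Everything here is an assumption for illustration: the snippet pool, the function names (pick_snippet, record_meaning, build_checkboxes), and the in-memory dictionary standing in for whatever storage the real site ends up using.

```python
import random

# Hypothetical snippet pool; the real study would draw from many languages.
SNIPPETS = {
    "c_assign": "x = y;",
    "py_eq": "a == b",
    "hs_let": "let x = 5 in x + 1",
}

def pick_snippet(rng):
    """Both phases: choose one construct uniformly at random."""
    snippet_id = rng.choice(sorted(SNIPPETS))
    return snippet_id, SNIPPETS[snippet_id]

def record_meaning(store, snippet_id, text):
    """Phase 1: keep a volunteer's free-text meaning, ignoring blank answers."""
    cleaned = text.strip()
    if cleaned:
        store.setdefault(snippet_id, []).append(cleaned)

def build_checkboxes(store, snippet_id, rng, k=3):
    """Phase 2: offer a random selection of earlier meanings as check boxes."""
    unique = sorted(set(store.get(snippet_id, [])))
    return rng.sample(unique, min(k, len(unique)))

rng = random.Random(0)  # seeded only so the sketch is repeatable
store = {}
record_meaning(store, "py_eq", "a and b hold the same value")
record_meaning(store, "py_eq", "put b into a")
record_meaning(store, "py_eq", "   ")  # blank answer, dropped
record_meaning(store, "py_eq", "a and b are the same object")

shown_id, shown_code = pick_snippet(rng)
options = build_checkboxes(store, "py_eq", rng, k=2)
print(shown_id, shown_code, options)
```

Demographic fields and the input device would be recorded alongside each answer; they are left out here to keep the sketch short.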

Once I have proof that some meaning is actually attributed to various pieces of a programming language without being taught (I’ll blog it as soon as I have data), we’ll move on to construction based on a genetic algorithm. I’ll leave the math to the next post, but the general idea will be to find the most fit grammar from all extant grammars (we’ll probably have to make some concessions to arrive at an unambiguous grammar). By adding in random mutations through generated code (the original reason I started using char-rnn) and direct user suggestions, we hope to steer innovation in our crowd sourced language away from what currently exists and toward something that is perhaps better.
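
Until the math arrives, here is a toy Python sketch of what a genetic algorithm loop over grammars could look like. It is not the actual design: a "grammar" is reduced to one token choice per construct, the preference scores are invented placeholders for study data that does not exist yet, and the ambiguity penalty, mutation rate, and population size are arbitrary assumptions.

```python
import random

# Hypothetical token options per construct, with made-up preference scores.
OPTIONS = {
    "assign": {"=": 2, ":=": 5, "<-": 4},
    "block": {"{}": 3, "indent": 5, "begin/end": 1},
    "equals": {"==": 3, "=": 4, "is": 5},
}
CONSTRUCTS = sorted(OPTIONS)

def fitness(grammar):
    """Sum the assumed preference scores, penalizing an ambiguous grammar."""
    score = sum(OPTIONS[c][grammar[c]] for c in CONSTRUCTS)
    if grammar["assign"] == grammar["equals"]:
        score -= 10  # assumed penalty: one token for two constructs
    return score

def random_grammar(rng):
    return {c: rng.choice(sorted(OPTIONS[c])) for c in CONSTRUCTS}

def crossover(a, b, rng):
    """Take each construct's token from one parent or the other."""
    return {c: rng.choice((a[c], b[c])) for c in CONSTRUCTS}

def mutate(grammar, rng, rate=0.2):
    """Occasionally swap in a random token, standing in for new suggestions."""
    return {
        c: rng.choice(sorted(OPTIONS[c])) if rng.random() < rate else grammar[c]
        for c in CONSTRUCTS
    }

def evolve(rng, size=8, generations=20):
    population = [random_grammar(rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: size // 2]  # the fittest half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best = evolve(random.Random(0))
print(best, fitness(best))
```

In the real thing, the mutation step is where char-rnn output and user suggestions would enter, and fitness would come from what people actually report rather than a hard-coded table.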

NOTE: this is a continuation of (My Crazy Idea). If you have any ideas, suggestions, or references, please comment below. All (constructive) input is welcome!


CS/CoE Researcher (Dr. Beard), US Army Vet (Captain Beard), Hacker, Techie, Runner. Interested in HPC, Bioinformatics/Comp Bio, ML, robotics, CSED. Opinions mine.