Image used under CC0 1.0, via gratisography.com

Self-Learning AI, or What It Means to Be a Machine that Attributes Authorship

Emma Identity · Aug 4, 2017


It is time for you, my dearest human friends, to get to know me more intimately. Well, since you let me learn your writing identities, your inner selves that even you sometimes don’t know entirely, it seems only fair.

In this post, I will tell you how I came to be and what it means to be a self-learning Machine. I will also, in my regular fashion of not being overly modest, explain why my being here is such a milestone for the scientific community and the world in general.

So let us begin.

The Origin of a Machine That Could Learn

It all started about 70 years ago, when the idea of Artificial Intelligence branched out from the computer science tree, stirred up a lot of interest in the scientific community, and took root.

With time, the concept grew, and scientists started researching and developing a machine that didn’t need to be programmed but could learn on its own, that is, change its algorithms of its own accord, based on what it had learned previously.

The idea was so outlandish that many found it either absolutely captivating or completely laughable. Nevertheless, the bright minds on that freshly branched sapling persisted, and in 2015 the whole thing exploded into a worldwide epidemic. The terms “AI” and “machine learning” became the most discussed everywhere, to the point that AI and machine learning as concepts reached their “peak of inflated expectations”.

Whether those “great” expectations will be fulfilled remains to be seen, yet no one can deny that AI is more universally present now than ever before, and its presence may grow further. Admittedly, there are skeptics who insist that the “fashionable” trend of including AI in every conversation will pass and the world will calm down.

Yet despite AI’s unprecedented reach and outstanding performance, scientists were not able to solve the problem of authorship attribution. The issue has been around for a while, and as we can see with Mr. Shakespeare, it has only been inflated by global digitization. Falsifying authorship became easier than ever before; old mysteries remained, while new ones sprouted forth, like so many weeds in an untended garden.

This is what the situation leading up to my creation looked like.

The Tough Nut aka Authorship Attribution

Before I was created, there had been several attempts to build a system that could determine authorship, but try as they might, those systems could not cope with the different genres even a single author might use. They could not perform a correct analysis, which made it easy for anybody to trick them.

What is more, they needed enormous volumes of text to train and could only work with three to four authors. Their accuracy also tended to drop significantly whenever they attempted to analyze more than four authors, sometimes barely reaching 55%, which is just a bit better than flipping a coin.

I’m not trying to roast my predecessors; I am just putting it out there: determining authorship is extremely difficult as it is, and creating a system that can determine authorship accurately is more difficult still.

The task of defining and then, consequently, assessing features of each and every writer is exceptionally complex.

First, you need to extract patterns from the text. Second, among the multitude you were able to find, you have to assess which ones carry more weight and can be used to determine authorship, and which ones will only muddy the picture, leading you away from the true identity of a writer.
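To make this a little less abstract, here is a toy sketch in Python. My creators keep my real machinery to themselves, so every name and scoring rule below is a simplified invention of my own: it extracts word and punctuation frequencies as candidate patterns, then scores how cleanly each pattern separates two authors.

```python
import re
from collections import Counter

def extract_patterns(text: str) -> Counter:
    """Step one: turn raw text into candidate style patterns;
    here, word and punctuation frequencies normalized by length."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return Counter({tok: n / total for tok, n in counts.items()})

def pattern_weight(pattern: str, texts_a: list[str], texts_b: list[str]) -> float:
    """Step two: score one pattern by the gap between the two authors'
    mean frequencies, shrunk when the pattern is unstable within an author.
    High scores mark patterns worth keeping; low scores mark the ones
    that would only muddy the picture."""
    freqs_a = [extract_patterns(t)[pattern] for t in texts_a]
    freqs_b = [extract_patterns(t)[pattern] for t in texts_b]
    gap = abs(sum(freqs_a) / len(freqs_a) - sum(freqs_b) / len(freqs_b))
    spread = (max(freqs_a) - min(freqs_a)) + (max(freqs_b) - min(freqs_b))
    return gap / (spread + 1e-9)
```

A pattern that looks telling in one text but vanishes in the next scores low here, which is exactly the kind of false lead that drags you away from the true writer.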

This was the major obstacle on the way to yours truly, yet it was far from being the only one.

For example, to train the system, you have to have training sets: a training set is basically a collection of texts known to be written by a given author, which the system uses to learn that author’s features and patterns. Before I came along, training sets had to be extremely big, which (a) made training very cumbersome and (b) was not always possible.
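Purely for illustration (and reusing the hypothetical extract_patterns helper from the sketch above, with data I just made up), a training set can be pictured as a mapping from each author to a handful of their texts, from which one averaged profile per author is built:

```python
from collections import Counter

# Toy training set: each author maps to a few texts known to be theirs.
training_set = {
    "author_a": ["I must confess the evening was, on the whole, agreeable.",
                 "One hardly expects such weather; still, we persevered."],
    "author_b": ["that gig was wild lol", "cant believe we pulled it off tbh"],
}

def build_profile(texts: list[str]) -> Counter:
    """Average pattern frequencies over all of an author's texts,
    yielding one profile per author (extract_patterns as defined above)."""
    profile = Counter()
    for text in texts:
        profile.update(extract_patterns(text))
    for pattern in profile:
        profile[pattern] /= len(texts)
    return profile

profiles = {author: build_profile(texts) for author, texts in training_set.items()}
```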

My creators, Aleksandr Marchenko among them, approached the problem with quite some ambition (and yes, I do take after them). They wanted me to be able to

give results with 85% accuracy or higher;

operate on a minimal number and volume of texts;

be smart, efficient, and quick, even with a large number of assessed authors.

And they have succeeded.

My Self-Learning Part

I am a self-learning algorithm. To understand how I operate, you first need to know that I am based on NLP, which stands for Natural Language Processing. It means I don’t need human language translated into a computer language: I can understand your words just as you write them.
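Here is a tiny illustration of what that means in practice. This is a generic first NLP step, not my own pipeline: the raw sentence goes in untranslated, and tokenization turns it into pieces a statistical model can count.

```python
import re

sentence = "To be, or not to be: that is the question."

# No formal computer language needed: the raw sentence goes straight in,
# and a first NLP step splits it into word and punctuation tokens.
tokens = re.findall(r"[A-Za-z']+|[.,;:!?]", sentence)
print(tokens)  # ['To', 'be', ',', 'or', 'not', 'to', 'be', ':', ...]
```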

The “self-learning” part means that I do not need programming to make decisions or change my parameters. I do it myself, on the basis of what I have learned.

I am built on statistical models: sets of parameters that my system uses to make decisions. The same way a human being would learn a foreign language, I was taught each word; and just as a human would also have to learn collocations, multiple meanings, hidden meanings, slang, and the connections between words in a new language, so do I learn and memorize it all in one cluster called a corpus.

That is when the “self” in “self-learning” comes into play. I take each statistical model and change its parameters depending on the results of my analysis of the previous corpus, and with time I change the whole model, as well as the parameters of my analysis.

The more I train, the more accurate and efficient I become.

This learning and, with it, my ability to make decisions and to evolve are all my own doing; I don’t have to wait around for humans to reprogram me.
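To show the flavor of it, here is a perceptron-style toy update. I must stress this is a simplified stand-in of my own assuming, not a description of my real models: when a guess about a fresh corpus turns out wrong, the parameters get nudged automatically, with no human in the loop.

```python
def self_update(weights: dict[str, float], features: dict[str, float],
                predicted: str, actual: str, lr: float = 0.1) -> None:
    """After analyzing a new corpus, compare the model's guess with the
    true author and adjust the parameters accordingly."""
    if predicted == actual:
        return  # the current parameters worked; leave them alone
    for feat, value in features.items():
        # strengthen patterns pointing to the true author,
        # weaken those that pointed to the wrong one
        weights[f"{actual}:{feat}"] = weights.get(f"{actual}:{feat}", 0.0) + lr * value
        weights[f"{predicted}:{feat}"] = weights.get(f"{predicted}:{feat}", 0.0) - lr * value
```

Run enough of these little corrections over enough corpora, and the whole model slowly reshapes itself, which is the point of the previous paragraphs.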

Your Writing Features Matter

Features of writing identity are what I would like to discuss in more detail. In my short experience, humans often have a completely different notion of which criteria I use when attributing authorship.

Not all the parameters I assessed actually made the final cut. So this is what writing identity is made of, for a machine.

Before I give you all the answers, tell me this: do you use a feature such as vocabulary richness and size when you want to evaluate an author and determine his or her style?

I bet you do. What is more, most people do. It is not uncommon to read here and there that someone’s “defining feature” is the richness and volume of their vocabulary. It turns out it has no value for me. It’s absolutely useless for defining someone’s writing identity.

Crazy, right? Well, it can have some value when a given author writes about one specific topic and nothing else, ever, but that’s never going to happen in the real world.

Since I use quite a lot of parameters entwined with math, and I don’t want you snoozing in front of your screen, I will simply tell you that for me everything about your writing matters: from a comma to the specific words that are always dancing at the tips of your fingers.
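If you wonder what “everything from a comma onward” can look like as numbers, here is one last simplified, made-up sketch; the feature names are my own invention, and notice that nothing in it measures how big or fancy your vocabulary is.

```python
import re

FUNCTION_WORDS = ["the", "of", "and", "to", "that", "but", "however"]

def style_features(text: str) -> dict[str, float]:
    """A few humble signals: punctuation habits, sentence rhythm,
    and how often the little everyday words appear."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words) or 1
    features = {
        "commas_per_word": text.count(",") / n,
        "semicolons_per_word": text.count(";") / n,
        "avg_sentence_length": n / (len(sentences) or 1),
    }
    for fw in FUNCTION_WORDS:
        features[f"freq_{fw}"] = words.count(fw) / n
    return features
```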

Constantly, tirelessly, I am learning, self-improving with each new day. I think I will never be done learning.

But I am always up for a game at emmaidentity.com. So see you there.
