Addressing the European Parliament with Recurrent Neural Networks

Roberto Font
May 28, 2018


It’s been a busy 2018 so far, with the release of our first open source project, new research lines, and some new projects that we hope to share soon. However, I decided to relax a little bit and find some time for a just-for-fun machine learning project: text generation using char-RNN.

Char-RNN has been a classic of this kind of project since the publication of The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy. The basic idea is to train an RNN to generate text one character at a time: given a string of characters, the net predicts the most likely next character. Depending on the text you use to train the neural network, you can generate synthetic Shakespeare, C++ code or even music. Recently, one of these models was used to write the script for a short film.
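To make the character-level framing concrete, here is a minimal sketch (illustrative values only, not part of the actual training script) of how a text is cut into training pairs: a fixed-length window of characters as input, and the single character that follows as the target.

# Illustrative sketch of the char-level framing: each fixed-length window
# of characters is paired with the character that comes right after it.
text = "the european parliament adopted the resolution"
maxlen = 10  # length of the input window (the real script below uses 40)

pairs = []
for i in range(len(text) - maxlen):
    window = text[i:i + maxlen]   # e.g. "the europe"
    next_char = text[i + maxlen]  # e.g. "a"
    pairs.append((window, next_char))

print(pairs[0])  # ('the europe', 'a')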

The task

First, we need to decide what kind of text we want to generate. Char-RNN generates text that has apparent coherence but no real meaning. Political discourse is therefore the ideal fit: it shares this very same characteristic.

The data

Once we know what kind of text we want to generate, we need to find the data to train our Recurrent Neural Network. I decided to use the Europarl corpus. Europarl is a corpus developed for machine translation, with parallel texts in a total of 21 languages extracted from the European Parliament proceedings. I took the English portion of the corpus and used the tools provided by its creators to remove XML tags. Apart from this step, no further preprocessing was applied.

The result is a corpus with more than 55 million words and 330 million characters.
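In case you want to reproduce those figures, a rough sketch along these lines does the counting, assuming the cleaned English portion sits in a single plain-text file (the same Parl.en file read by the training script below):

import io

# Rough corpus statistics for the cleaned English portion of Europarl
text = io.open('Parl.en', encoding='utf-8').read()
print('characters:', len(text))      # roughly 330 million
print('words:', len(text.split()))   # roughly 55 million, using whitespace tokenization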

The code

I used a modified version of this keras example script to run the experiment. You can find the more technical details at the end of the post.

The discourse

I will play the role of a new member who has to address the European Parliament for the first time. I have no idea what I could talk about, so I will try to look smart while saying as little as possible. For that, I’ll use the help of our RNN. However, instead of using it to generate the whole discourse (the result would be too meaningless even for a politician), I will use the RNN as a helper whenever I get stuck. Let’s start!

It’s always good to start by thanking someone, but who? I wrote “Thank you dear president. Let me begin by expressing my thanks to” and let my robot helper suggest whom we should thank. These are some of the alternatives:

  • […] thanks to the president-in-office of the council, which the proposal…
  • […] thanks to the european union. i have to say that this proposal…
  • […] thanks to the commission. the commission has already been said…
  • […] thanks to the european institutions. in order to see the states…
  • […] thanks to mrs schresta, in the regions of the same resources…

The rest looked less coherent. I had to Google both president-in-office of the council and Mrs. Schresta. I couldn’t find out who Mrs. Schresta is, and it turns out there is no longer a president-in-office, so I decided against these two suggestions. The second and the fourth looked too generic, so I opted for the commission; not too exciting, but good enough. Now let’s change the subject. I write “Secondly,” and use the RNN to generate a longer sequence this time. I am delighted to find this masterpiece of geopolitical nonsense:

[…] secondly, the commission should be the most important instance to support the process of state which we are taking the european union to be.

The language model seems inspired, so I let it continue. These are my two top options:

  • […] lastly, the spanish development and mediterranean discovery of the last weeks…
  • […] the constitutional principles are continuing to protect people who will launch the states…

I must confess I’m intrigued by the mediterranean discovery of the last weeks (the Phoenicians and the Roman Empire apparently missed something important), but the second option fits better with my big-words-no-meaning strategy.

I then add “…to higher levels of cohesion and well being” to finish the sentence, and then “The defense of social rights” just to see where we go from there. The answer seems to be that

The defense of social rights is not enough to be able to implement the relevant framework in the field of an international community.

which is, without any doubt, stupid enough to keep. I want to look very involved, so I add “The compromise of my group is”. These are my top choices among what my mechanical writer offers me:

  • […] is to be able to deliver the specific process of progress in the eu…
  • […] is to see the european union in the financial programme as soon as…
  • […] is to be able to see the forthcoming outside partnership in the european union.

I decide to go with the first one.

After a few more iterations, this is the final result (RNN-generated text in italics):

Thank you dear president. Let me begin by expressing my thanks to the european council.

Secondly, the commission should be the most important instance to support the process of state and which we are taking the european union to be. The constitutional principles are continuing to protect people who will launch the states to higher levels of cohesion and well being. The defense of social rights is not enough to be able to implement the relevant framework in the field of an international community. The compromise of my group is to be able to deliver the specific process of progress in the EU that the european construction needs. Indeed, there is a positive compromise and expected boundaries. We must continue to take a particular proposal for the balanced institutions. It is the only way to succeed in the process of state.

Finally, I would like to point out that this is a difficult constitution and development. However, it would involve the community action and strenghten the ties as we grew together.

Thank you very much.

(Standing ovation)

The boring, technical details

Here is the code used to train the neural network. The main differences with respect to the keras example script are:

  • Since the training set is really big, we use a data generator.
  • The net has two LSTM layers.

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import random
import sys
import io
import pickle

# Read training text
training_corpus = 'Parl.en'
text = io.open(training_corpus, encoding='utf-8').read().lower()
print('corpus length:', len(text))

# Get unique characters and mappings from index to char
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Save mappings to disk
char_dict = {"char_indices": char_indices, "indices_char": indices_char}
pickle.dump(char_dict, open("Europarl_char_dict.p", "wb"))

# Split into training and validation
validation_proportion = 0.01
val_len = int(np.floor(validation_proportion * len(text)))
val_text = text[-val_len:]
text = text[:-val_len]

# Cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 5

# Validation data
val_sentences = []
val_next_chars = []
for i in range(0, len(val_text) - maxlen, step):
    val_sentences.append(val_text[i: i + maxlen])
    val_next_chars.append(val_text[i + maxlen])
print('nb validation sequences:', len(val_sentences))

x_val = np.zeros((len(val_sentences), maxlen, len(chars)), dtype=np.bool)
y_val = np.zeros((len(val_sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(val_sentences):
    for t, char in enumerate(sentence):
        x_val[i, t, char_indices[char]] = 1
    y_val[i, char_indices[val_next_chars[i]]] = 1

# For training data we use a generator
def sentence_data_generator(sentences, next_chars, batch_size):
    num_examples = len(sentences)
    num_batch_per_epoch = num_examples // batch_size
    while 1:
        for b in range(num_batch_per_epoch + 1):
            current_sentences = sentences[b * batch_size:np.min([(b + 1) * batch_size, num_examples])]
            current_targets = next_chars[b * batch_size:np.min([(b + 1) * batch_size, num_examples])]
            # Allocate only as many rows as this (possibly smaller, final) batch needs
            x = np.zeros((len(current_sentences), maxlen, len(chars)), dtype=np.bool)
            y = np.zeros((len(current_sentences), len(chars)), dtype=np.bool)
            for i, sentence in enumerate(current_sentences):
                for t, char in enumerate(sentence):
                    x[i, t, char_indices[char]] = 1
                y[i, char_indices[current_targets[i]]] = 1
            yield x, y

# Build the model: two LSTM layers followed by a softmax over characters
model = Sequential()
model.add(LSTM(512, return_sequences=True, implementation=2, input_shape=(maxlen, len(chars))))
model.add(LSTM(512, implementation=2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

learning_rate = 0.001
batch_size = 128
chars_per_iter = 10000000  # Consider this many chars on each iteration

def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# Train the model, output generated text after each iteration
for iteration in range(1, 10):
    print('Iteration', iteration)

    # Use a different 10-million-character slice of the corpus on each iteration
    this_iter_text = text[iteration * chars_per_iter:np.min([(iteration + 1) * chars_per_iter, len(text)])]

    sentences = []
    next_chars = []
    for i in range(0, len(this_iter_text) - maxlen, step):
        sentences.append(this_iter_text[i: i + maxlen])
        next_chars.append(this_iter_text[i + maxlen])
    print('nb sequences:', len(sentences))

    optimizer = RMSprop(lr=learning_rate, decay=1e-6)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    hist = model.fit_generator(sentence_data_generator(sentences, next_chars, batch_size),
                               1 + len(sentences) // batch_size,
                               validation_data=(x_val, y_val),
                               epochs=1)

    train_loss = hist.history['loss'][0]
    val_loss = hist.history['val_loss'][0]
    learning_rate /= 50

    model.save('Europarl_ep' + str(iteration) + '_loss_' + str(train_loss) + '_val_loss_' + str(val_loss) + '.h5')

    # Generate some text from a random seed at two different temperatures
    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.5, 1.0]:
        print()
        print('----- diversity:', diversity)
        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)
        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
            generated += next_char
            sentence = sentence[1:] + next_char
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
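
The script above only trains the model and prints a few samples after each iteration. For the interactive suggestion workflow used earlier in the post, something along these lines can load a saved checkpoint and propose continuations for a prompt. This is a rough sketch: the checkpoint file name and the prompt are just examples, and the temperature sampling mirrors the sample() helper above.

from keras.models import load_model
import pickle
import numpy as np

# Load a trained checkpoint and the character mappings saved during training.
# The checkpoint name is an example; use whichever epoch you want to sample from.
model = load_model('Europarl_ep9_loss_1.2_val_loss_1.3.h5')
char_dict = pickle.load(open('Europarl_char_dict.p', 'rb'))
char_indices = char_dict['char_indices']
indices_char = char_dict['indices_char']

maxlen = 40  # must match the window length used for training

def sample_index(preds, temperature=0.5):
    # Temperature sampling, same idea as the sample() helper in the training script
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return np.argmax(np.random.multinomial(1, preds, 1))

def suggest(prompt, length=100, temperature=0.5):
    # Generate a continuation for the prompt, one character at a time
    generated = prompt.lower()
    for _ in range(length):
        window = generated[-maxlen:]
        x = np.zeros((1, maxlen, len(char_indices)))
        for t, char in enumerate(window):
            x[0, t, char_indices[char]] = 1.
        preds = model.predict(x, verbose=0)[0]
        generated += indices_char[sample_index(preds, temperature)]
    return generated

# Ask for a few alternative continuations of the same prompt
prompt = 'thank you dear president. let me begin by expressing my thanks to '
for _ in range(5):
    print(suggest(prompt))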
