Metis Weeks 6–8

NLP & my Grandfather’s Letters During WWII

Darien Mitchell-Tontar
May 25, 2020
My grandfather, Silvio Tontar, while serving in the U.S. Army as a medical officer

The fourth project at Metis was an exciting one because it was the first time we learned how to deal with text data in raw form. Previously, we only worked with categorical data, turned it into dummy variables and fit our models. Now we sought to get actual meaning from large collections of text (a corpus).

It was a struggle for me to come up with an idea for this project. Then I remembered that years ago my aunt had typed up 300+ letters that my grandfather, Silvio Tontar, wrote to my grandmother, Annette, during WWII. I was initially unsure whether this would be an acceptable topic, but after some encouragement from instructors and TAs, I felt confident moving forward with it.

The Problem

I have to admit, I dove into this project without a super clear purpose. From my perspective, this was an opportunity to get to know my grandfather better, who passed away when I was five. This is not exactly a research question, but I suppose I wanted to know how he felt during his time overseas. I wanted to see if I could use the text data to empathize with his situation, even just a little bit. When was he happy, if at all? When was he upset? How did his language change over time? My goal was to use data science to answer these questions more easily, as opposed to re-reading the letters over and over again.

My grandparents. This photo was taken right before my grandfather was sent overseas.

The Data

Perhaps this is obvious, but my data was the 310 letters my grandpa wrote from May 13, 1942 to November 10, 1945 during his time serving in the United States Army as a medical officer; my grandparents married in February of 1942. He spent three and a half years in the Southwest Pacific, mostly in Australia, New Guinea, and the Philippines, and ended his time in Japan after its surrender.

During this time my grandfather never lost sight of what was most important to him: my grandmother. I admire how consistently he wrote her, and how positive he was the entire time (more on that later).

The vast majority of letters were written with very little time between them and the previous one.

As you can see he managed to write letters fairly consistently throughout his time at war. One could imagine that the writing was therapeutic for him in addition to being the only way he could connect with his wife and the rest of his family. For troops in general during that time, letters were a huge motivating factor, and they loved sending/receiving mail.

Sentiment Analysis

Before cleaning the text data, I wanted to get as much information as I could out of the raw text. A sentiment analysis lent itself well to this goal. I wanted to see if I could use his words to understand how he was feeling during the war. I used VADER sentiment analysis for this, which takes in a small block of text and returns four score metrics. The positive, neutral and negative polarity scores represent the proportions of positive, neutral and negative sentiment in the text, respectively. Lastly, the compound score is a normalized, weighted single metric between -1 and 1 for measuring the overall sentiment of the text, typically interpreted as follows:

  1. positive sentiment: compound score >= 0.05
  2. neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
  3. negative sentiment: compound score <= -0.05
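
The thresholds above can be sketched as a small helper. This is a minimal illustration rather than the code used for the project: the function name `label_sentiment` is hypothetical, and the compound score is assumed to come from the `compound` key of the dictionary that VADER's `SentimentIntensityAnalyzer.polarity_scores` returns.

```python
def label_sentiment(compound: float) -> str:
    """Map a VADER compound score (between -1 and 1) to a coarse label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


# Hypothetical compound scores, used only to exercise the thresholds.
for score in (0.62, 0.0, -0.31):
    print(score, label_sentiment(score))
```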

I decided that a good place to start was plotting positive and negative sentiment over time.

Percentage of positive and negative words used in each letter

It’s amazing how level-headed my grandpa was during the war. He was consistently more positive than negative in his language, although the gap between the sentiment scores starts to diminish closer to the end of the war; he had probably had enough and wanted to come home.

I was most interested in the huge spikes after looking at this chart. I sorted the letters by positive polarity and found the most positive letters were short postcards like this one:

New Guinea, May 5, 1944

Darling,

All is well. I am feeling tops. I enjoyed a good swim and sunbath.

Love,

Silvio

This little letter warms my heart, and it’s no surprise that it has a high positive polarity score.

I decided to order by compound score instead to see if it would return longer letters. Here is an excerpt from his most positively scored letter, which was written September 6, 1942 in Australia:

…I have a feeling that things will change. For one thing, I do not feel as pessimistic as I felt when I left the country. How are you, darling? The heat will be over by the time you receive this letter and you will enjoy the fall. I am moving toward the spring and the hot weather. darling, do not worry about me. I am feeling fine, have a good appetite, and eat good food. If I had you here with me, I would be perfectly happy…

This letter is quite optimistic, and early on in his time overseas. Starting with a sentiment analysis allowed me to improve my subject area knowledge by reading more letters and get a sense of common themes seen in them.

Even when my grandfather was feeling down, he still found positivity in the situation. Here is a piece from his most negatively scored letter, written much later on July 7, 1944 in the South Pacific (during the battle of Biak):

…I do not exaggerate in saying that more than a dozen artillery shells burst within a radius of ten yards from me, some came so near that I had dirt fall upon me. The machine gun and rifle bullets continued their sinister whistling only a few feet from me, sometimes only a few inches above me. That day they were after my scalp and in a big way. I was under fire all day, but twice within two hours, they kept dropping their projectiles so near me that I came out untouched, unscratched, and unhurt only by a miracle. Sure I am lucky…

That excerpt is not even the worst of the letter. At other points he talks about how he is losing hope, losing faith in war and questioning its purpose. He does manage to end on a more positive note, however. From the same letter as above:

…I am well fed now. I have all the water to drink. I can take a bath daily. I went for weeks without washing and shaving! I smelled like a skunk! It’s a sad story. But that’s life. I am feeling well and love you very, very much. I have some time now to dream, to daydream, long, and yearn for the time I’ll be able to come home to you. meanwhile, I love you.

I am yours,

Silvio

After learning more about the letters, I felt ready to move on to a more in-depth analysis.

Exploratory Data Analysis

A common starting point in Natural Language Processing is to remove the most common words from your corpus. This is because they do not give you valuable information about each document. I wanted to know what makes each letter, or collection of letters, special so that I can eventually extract meaning from them.

After removing the most common words, and some other mundane ones, I produced the sequence of word clouds below, tracking the language he used in his letters and matching it with their positive polarity:

Most common words in the letters, grouped by quarter (every three months)

What I love about this visual is that it tells a story by itself. It shows common themes throughout his service. You can see his season’s greetings, you can see that he was perhaps a bit more relaxed at the beginning of the war, you can make associations between how he was feeling and the words he used and finally, at the end, it seems like he expected the war to end soon.

One interesting historical fact is that the aforementioned Battle of Biak was from May 27, 1944 to June 22, 1944 and my grandfather’s division got hit very hard during that battle. You can see that his positive polarity decreased starting in June 1944.

This is a very cool visual, and it tells us a lot about the letters. Next, I wanted to dig deeper and try to extract some of the main recurring themes in his writing.

Topic Modeling

Topic modeling is an unsupervised machine learning technique that allows us to reduce a feature space from thousands of dimensions (in this case, thousands of words) down to fewer than ten. One can then look at the most prominent words in each of those “topics” and interpret them, hence the name topic modeling. It is unsupervised because the text documents do not have a predetermined topic label assigned to them at the beginning. This is in contrast to my last project, where each U.S. county was labeled as “at risk” or “not at risk,” and I trained the model using those labels. Here, we are basically inferring our own labels based on what we find from our analysis.

One thing that’s true about almost all text data is that it’s very messy. Some steps I took to clean up the data were:

  1. Preprocessing: removing punctuation, making all words lower case, and removing unimportant words (in my case, names of locations and my grandparents’ names, since these were mentioned in almost every letter).
  2. Lemmatization: grouping together inflected forms of words so they can be analyzed as a single item (for example, “walk” and “walking” are basically the same for our purposes).
  3. Filtering: keeping only the nouns and adjectives in each letter. This allowed me to extract more meaning after I ran my initial models.
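
The first of these steps can be sketched with the Python standard library alone. This is an illustrative sketch, not the project's code: the function name `preprocess` and the contents of `CUSTOM_STOP_WORDS` are hypothetical, since the exact word list is not given here. Lemmatization and part-of-speech filtering (steps 2 and 3) are left out because they generally rely on an NLP library such as NLTK or spaCy.

```python
import re
import string

# Hypothetical list of words to drop; the post removed place names and
# the grandparents' names, but the exact list is not given.
CUSTOM_STOP_WORDS = {"silvio", "annette", "australia"}

# ASCII punctuation plus the curly quotes that often appear in typed letters.
PUNCTUATION = string.punctuation + "’‘“”"


def preprocess(letter: str) -> list[str]:
    """Lowercase, strip punctuation, and drop unwanted words."""
    text = letter.lower()
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in CUSTOM_STOP_WORDS]


print(preprocess("Darling, all is well in Australia. Love, Silvio"))
```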

After the text had been cleaned up I was almost ready to fit a model, but the main problem here is that computers tend to prefer numbers.

Using term frequency–inverse document frequency, or TF-IDF, we can turn a corpus of text into a data frame where each row is a document (a letter, in this case) and each column is a word. The entries are TF-IDF values, and we can think of them as weights for how important a word is in a letter. What’s nice about TF-IDF is that it also considers how many of the letters a word appears in. Words that show up in nearly every letter get a lower weight, since they do not add much value for determining any unique, abstract themes in the text.
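
To make the weighting concrete, here is a minimal sketch using the textbook definition, tf × log(N / df). The function name `tf_idf` and the toy letters are invented for illustration. Libraries such as scikit-learn use a smoothed, normalized variant, so their numbers would differ from these.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy tokenized "letters", invented for illustration.
letters = [
    ["love", "golf", "dinner"],
    ["love", "ship", "typhoon"],
    ["love", "hospital", "duty"],
]
for row in tf_idf(letters):
    print(row)
```

Because “love” appears in every toy letter, its weight is zero, while words unique to one letter keep a positive weight.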

Finally, on to the exciting part. I used Nonnegative Matrix Factorization (NMF) to cluster the letters, and below are the results:

Topic 1
war, hospital, expect, medical, overseas, end, news, duty, officer, change

Topic 2
christmas, picture, morning, went, dinner, cold, golf, played, people, warm

Topic 3
union telegram, western union, western union telegram, western, union, tontar, telegram, francisco, san francisco, san

Topic 4
jap, island, native, philippine, rain, philippine island, enemy, campaign, wet, landed

Topic 5
ship, sea, land, typhoon, calm, tonight, kure, morning, bay, island

Topic 6
birthday, wife, beloved, send, happy, beloved wife, kiss, remember, feeling, able

I encourage you to ponder over these topics on your own to see if you agree with my interpretation of them below:

Topic 1. Duty
Topic 2. Comfort/Normalization
Topic 3. Money (he sent money home frequently)
Topic 4. Inner Conflict/Struggle
Topic 5. Traveling at Sea
Topic 6. Romance
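
For readers curious how word lists like the ones above come out of NMF, here is a rough, dependency-free sketch. It factors a small nonnegative matrix of word weights (letters × words) into a letters × topics matrix and a topics × words matrix using the classic multiplicative update rules. It is not the code used for this project; in practice a library implementation would normally be used, and the toy matrix, word list and function names are all invented for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Swap the rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix v (documents x words) into w @ h.

    w is documents x topics and h is topics x words. The multiplicative
    updates keep every entry nonnegative.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(n_topics)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy weights: rows are letters, columns are the words below (invented data).
words = ["golf", "dinner", "ship", "typhoon"]
v = [
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.1, 0.0, 0.8, 1.0],
]
w, h = nmf(v, n_topics=2)
for k, row in enumerate(h, start=1):
    top = sorted(zip(row, words), reverse=True)[:2]
    print(f"Topic {k}:", ", ".join(word for _, word in top))
```

Each row of `h` plays the role of a topic (its largest entries are the topic's most prominent words), and each row of `w` says how strongly a letter expresses each topic.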

A nice way to study these topics further is to see how their appearance in the letters changed over time.

Topic prevalence in each letter over time. The letters are grouped together by quarter

As you can see, the topic of travel is more prevalent in letters at the beginning and end of his service. Topics like “Duty” and “Romance” are much more common towards the beginning of his service, while “Inner Conflict/Struggle” and “Duty” spike near the end.
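
One way to build a chart like this is to average each topic's per-letter weight within a quarter. The sketch below assumes calendar quarters, which may not match the exact three-month windows used for the chart. The function names, dates and weights are hypothetical.

```python
from collections import defaultdict
from datetime import date


def quarter_label(d: date) -> str:
    """Return a label such as '1944Q2' for the calendar quarter of d."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


def mean_by_quarter(dated_weights):
    """Average a per-letter topic weight within each calendar quarter."""
    buckets = defaultdict(list)
    for letter_date, weight in dated_weights:
        buckets[quarter_label(letter_date)].append(weight)
    return {q: sum(ws) / len(ws) for q, ws in sorted(buckets.items())}


# Illustrative (date, topic weight) pairs; the values are made up.
sample = [
    (date(1944, 5, 5), 0.25),
    (date(1944, 6, 20), 0.75),
    (date(1944, 7, 7), 0.5),
]
print(mean_by_quarter(sample))
```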

Let’s again look at some snippets of text, this time from the letter in which each topic above features most prominently (I am leaving out “Money” and “Traveling at Sea”).

Topic 1: Duty from July 31, 1945

…I have been overseas so long and have done my duty with distinction and honor. I want to go home to come home with honor and with my conscience clear and unsoiled. The other way is to wait and hope as you know. I’m healthy, I have not lost one day since I came into the army I can only get out by playing the part of a “psycho” or “nut .” I disdain it…

Topic 2: Comfort/Normalization from October 19, 1942

…I played golf and was I bad. Can you imagine me playing with an old champion? I met him at the golf course. He invited me to join his group. Paul could not come, he was busy. Not only did I play golf with him, he invited me to his house to dinner and then to sleep…

Topic 4: Inner Conflict/Struggle from April 9, 1945

…I sailed in a small boat that kept rolling pitching and rocking all the time I vomited so many times that even my bile the yellow fluid that the liver excretes found a way out through my mouth. I was miserable and wretched…

…I heard some bullets whistling in the air, but none landed near me…

…The terrain was another trouble. It’s rugged up and down mountains. Some places I had to crawl up and down steep mountains. I had to wade belt deep in rivers and creeks. I got wet and slept more than one night in the rain wet from the belt down…

…I have met many Filipinos and heard their gruesome experiences under Jap rule, the people are ragged…

This particular letter is interesting because it addresses several of the different types of struggle he dealt with: sickness, danger, the elements and the enemy’s treatment of the native people, something he spoke of quite regularly in his letters.

Topic 6: Romance from April 18, 1943

…Do you remember I waved to you from the moving car? You were standing by the window and I could see tears in your eyes. A whole year has gone by and look where I am now, in the jungle so far away from you. I am asking myself “how long will they keep me away from my beloved wife ,” a year is a long time too long…

I recognize that I am cherry-picking from each letter, yes, but the topics are making it much easier for me to cherry-pick, and I think they did a good job parsing out the letters by “topic.”

My grandfather and me

Conclusion

I knew before starting this project that my grandfather was a great man, but now I understand why. My grandfather risked his life and endured suffering for three and a half years to keep millions of people he did not know safe. This is what our health care workers are doing for all of us as I write this blog entry. These are people who are special beyond what I am capable of describing, and for that reason, I think it is appropriate to conclude with my grandfather’s words instead of my own, for he was a hero but didn’t consider himself such. There’s a lot we can learn from him and others like him.

From April 18, 1943 (the same letter as above)

…It’s a great feeling to know that when I save a man from certain death I’ll make somebody in the states that I never saw and never will see very happy. It’s a feeling that no money can buy. It partially compensates for the hardship and discomfort a doctor goes through. I am glad and willing to do my part. I hope they will consider it and give me a break so that I can go home soon and be with the best lady in the whole world, my beloved wife…

Special Thank You

I’m extremely grateful to have had this rare opportunity, and it would not have been possible if it weren’t for my aunt, Silvia Tontar, who worked hard to type all these letters in order to preserve them. Just reading through them is a treasure, but having the opportunity to analyze them has been a truly exceptional experience. Thank you, Silvia!

Darien Mitchell-Tontar

Former high school math teacher documenting their journey onto the next stage of life