Using PCA & T-SNE to mine ideas for what to write next

Word visualizations are a helpful tool for considering the context in which keywords are used when talking about a topic. In this case, that topic is Data Science.

Joseph Alan Epstein
Analytics Vidhya
4 min read · Dec 29, 2020


All you need is direction and magnitude… Photo from Unsplash by: Steve Harvey

Have you ever found yourself in a situation with an overwhelming amount of text data, but you want insights right away?

Well, if you’re in a library, just grab a book and sit down. It’s gonna take a while.

But, if you’re on the internet, then download my GitHub repository, modify it for your own use case, and take a look at a graph of a bunch of word vectors.

What can Math tell you about Language?

Simple answer: A whole lot!

From a linguist’s perspective, this can be summed up by the Distributional Hypothesis (the crux of most ML approaches to semantics):

“Linguistic items with similar distributions have similar meanings.”

With this hypothesis, we can enter the realm of everything from simple co-occurrence models, like BoW, to more complicated ML models, like Word2Vec. But the idea is the same: look at the target word, compare it to the surrounding words, take down some sort of metric, and repeat.
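To make that loop concrete, here is a minimal sketch of the simplest possible metric: counting which words show up within a small window around each target word. The function name, the window size, and the toy sentence are all made up for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each word appears within `window` positions of each target word."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[target][tokens[j]] += 1
    return counts


# Toy sentence, purely illustrative.
tokens = "data science helps customers and data science helps products".split()
counts = cooccurrence_counts(tokens, window=1)
print(counts["science"].most_common())
```

With a window of 1, “science” is seen next to “data” twice and next to “helps” twice. Models like Word2Vec replace these raw counts with learned vectors, but the target-word-plus-context loop is the same idea.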

Finding Ideas for my Upcoming Articles

1. Find a few websites that blog about Data Science, and scrape the text from many articles on each one

2. Clean the text and run it through Doc2Vec

3. Plot with PCA to get an idea of the “Principal Components”

What does this actually look like?

We can see in this graph that even though we have reduced the data to just 2 principal components, the text is still spread out across a few different topics.

This is the beauty of a linear matrix decomposition. Like spokes on a wheel, we can clearly see that there are some “directions” of thought.
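For the curious, here is a rough sketch of what that decomposition amounts to, written with NumPy’s SVD instead of a ready-made PCA class. The `pca_svd` helper and the random matrix standing in for word vectors are assumptions for illustration, not the data behind the graph.

```python
import numpy as np


def pca_svd(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)           # PCA works on mean-centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]            # orthonormal directions ("spokes")
    projected = X_centered @ components.T     # 2-D coordinates to plot
    explained = (S ** 2) / np.sum(S ** 2)     # share of variance per component
    return projected, components, explained[:n_components]


# Random placeholder for a (n_words, vector_size) matrix of word vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
projected, components, explained = pca_svd(X)
print(projected.shape, explained.sum())
```

The `explained` values are worth a glance: when the two components capture only a small share of the variance, a lot of the structure lives in directions the 2-D plot cannot show.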

For instance, the leftmost “spoke” has words like “risk”, “could”, and “experience”, near other words like “customer”, “products”, and “RPA”.

To me, this means that Data Science blogs are concerned with solving real-world problems in an automated fashion, giving the most back to their users.

4. Now for some T-SNE, ‘cause who doesn’t like non-linear dimensionality reduction? (…Yes, we first reduce the dimensions with PCA, but that’s ok, and it’s common practice)
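A hedged sketch of this step with scikit-learn is below. The `pca_then_tsne` helper, the number of PCA dimensions, the perplexity, and the random input matrix are all assumptions chosen for the example, so treat them as starting points rather than the settings behind the graph.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_then_tsne(X, pca_dims=20, perplexity=30.0, seed=0):
    """Reduce with PCA first, then embed the result in 2-D with t-SNE."""
    reduced = PCA(n_components=pca_dims, random_state=seed).fit_transform(X)
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of rows
        init="pca",
        random_state=seed,
    )
    return tsne.fit_transform(reduced)


# Random placeholder for a (n_words, vector_size) matrix of word vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 50))
embedding = pca_then_tsne(X)
print(embedding.shape)
```

One caveat when reading the output: t-SNE preserves local neighborhoods, so which words land near each other is meaningful, while distances between far-apart clusters are much less so.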

In this new T-SNE graph, the spokes have turned into clusters! (I think this is so cool lol).

Look at the left-hand side again. We still see words like “customer”, “products”, and “RPA”. Now they are a bit more obviously grouped near words like “better”, “decision”, “real”, and “help”.

What does this all mean?

I think it’s best for this article not to go into every single insight that could be gleaned from these graphs, mainly because this is a subjective exercise. (Wait, I thought Math was supposed to be objective? Maybe.)

The point of these visualization techniques is not for the computer to just output the answer, but to make it easier to do the work yourself.

So, what did I interpret from these “Spokes” and “Clusters”?

For the one “spoke” and “cluster” that we talked about in this article, I realized that with big data comes big responsibility. You cannot simply create something nifty; it has to be something practical.

Something that improves the “decisions” made by your “customers”, in order to enhance their “experience” with your “product”, ideally in an “automated” fashion.

My current line of work is Web Development, and UX is incredibly important. Nobody cares about the backend if the UX is clunky.

So what’s a trendy automated UX??? CHATBOTS. Yup, that’s my takeaway: I want to learn as much as I can about Chatbots in the near future.

Since the kind of product that I personally create is a website, and I now see that everyone is talking about automated experiences for their products, learning how to make Chatbots is the clear interpretation for me.

Look out for my upcoming series on Chatbots! I’m starting from scratch, so I hope to make this instructive as I welcome you on my journey through this technology.

Happy New Year Everyone :)



Web Developer by day, Data Scientist by night. Also, I enjoy Chess.