Data Science and Writing CRM White Papers

Microsoft Dynamics CRM 2015: a typical CRM used by many companies


The other day I was talking to the product manager for customer relationship management (CRM), and he said he would like to do more white papers about CRM. I have written a few white papers myself, and making good ones takes a lot of time and effort. Then it occurred to me that anything worth saying about CRM has probably already been said on the internet, and what hasn’t been said isn’t worth talking about. All I need to do is collect text about CRM from the internet, filter it somehow, and auto-generate white papers.

My first thought was to scrape some CRM blogs, but I’ve already scraped blogs for a previous project, so I decided scraping Twitter would be more fun. Twitter streams also have the benefit of being constantly generated by many users, which will hopefully make my CRM white paper “cutting-edge”. No relying on old blog news for me.

Raspberry Pi 2

The first thing I found was that people don’t really tweet about CRM on a Sunday evening. Who knew? Still, I was getting maybe one tweet every 5 seconds, so I set up a Raspberry Pi (a credit-card-sized computer) to continuously look for new tweets overnight and left it running while I was at work on Monday.

Auto-CRM White papers

In the end I managed to collect 3191 CRM tweets before closing my Twitter stream. Just for fun I thought I’d look at the top hashtags associated with CRM; I bet #CRM is number one. Here are the top 10 hashtags. It seems like finding a job in CRM is Twitter’s main priority, so my white papers may be slightly biased towards this topic.

#CRM, #Jobs, #Job, #Marketing, #msdyncrm, #sales, #salesforce, #erp, #hiring, #cem
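
A minimal sketch of how a hashtag count like this could be done, assuming the tweets are already available as a list of strings. The function name `top_hashtags` and the sample tweets are made up for illustration, and lower-casing the tags is an assumption (the list above keeps the original capitalisation, so the real count probably did not do this).

```python
import re
from collections import Counter

# A hashtag is taken to be "#" followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#\w+")

def top_hashtags(tweets, n=10):
    """Return the n most common hashtags, ignoring case (an assumption)."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts.most_common(n)

# Made-up sample tweets, not taken from the collected data.
sample_tweets = [
    "Hiring a #CRM consultant #jobs",
    "Five tips for better #crm adoption #sales",
    "New #CRM role open in London #Jobs #hiring",
]
print(top_hashtags(sample_tweets, n=2))
# [('#crm', 3), ('#jobs', 2)]
```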

Now I have to create my algorithm for generating white papers. To do that I am going to use an n-gram model; it’s the same basic idea behind predictive text on your phone. Basically it takes a word and guesses what the next word will be. This is called a bi-gram, as you put one word in and you get one word out, for a total of two words. A tri-gram uses the two previous words to guess the next one, for a total of three words. And so on, all the way up to any number “n” you can think of (n-gram).
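
The bi-gram idea can be sketched in a few lines of Python. This is only an illustration under some assumptions, not necessarily the exact code used here: words are split on whitespace and lower-cased, the next word is picked at random from the words that followed the current one in the corpus, and the names `build_bigram_model` and `generate` and the sample tweets are all made up.

```python
import random
from collections import defaultdict

def build_bigram_model(tweets):
    """Map each word to the list of words that followed it in the corpus."""
    model = defaultdict(list)
    for text in tweets:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate(model, seed, max_words=20, rng=random):
    """Walk the model from a seed word, picking a random follower each step."""
    words = [seed]
    while len(words) < max_words:
        followers = model.get(words[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        words.append(rng.choice(followers))
    return " ".join(words)

# Made-up sample tweets, not taken from the collected data.
sample_tweets = [
    "crm is the best tool for sales",
    "the best crm is the one you use",
]
model = build_bigram_model(sample_tweets)
print(generate(model, "crm", max_words=8, rng=random.Random(0)))
```

Because repeated followers are kept in the list, a word that follows another more often is more likely to be picked, which is what makes the output sound vaguely like the source text.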

I also removed all the URLs in the tweets, assuming they would look bad in my white paper. Now I was ready to auto-generate white papers. Here are some extracts from my white paper:

“the crm community cloud sherpas: helping hand”
“true. many of greek people doesn’t include extending microsoft dynamics crm solution for small office including the artificial intelligence realestate software 365 and #ibm offer custom crm is the crm”
“customer engagement part 3 crm watchlist customer can do you a website and paste contact management tools) — digital recruitment is hiring! #crm is the best for its interpretation take a webform for potter’s house physician, the #internetofthings battle.”

Strange that Greek people don’t extend Microsoft Dynamics CRM for their small offices. Maybe they could ask the cloud sherpas for help. OK, so it’s not Pulitzer standard, but it was a lot easier than writing a whole white paper from scratch.


The quality of the “writing” would probably have been better if I had used blogs instead of Twitter, but the n-gram algorithm would still have produced nonsense. I could also have used a tri-gram model instead of the bi-gram model, which would probably have given slightly better results.
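
For comparison, here is a rough sketch of what the tri-gram version of the lookup table might look like, assuming the same whitespace splitting and lower-casing. The name `build_trigram_model` and the sample tweets are made up for illustration. The only real change is that the key is a pair of words rather than a single word, so each guess has a bit more context to go on.

```python
from collections import defaultdict

def build_trigram_model(tweets):
    """Map each pair of consecutive words to the words that followed the pair."""
    model = defaultdict(list)
    for text in tweets:
        words = text.lower().split()
        for first, second, third in zip(words, words[1:], words[2:]):
            model[(first, second)].append(third)
    return model

# Made-up sample tweets, not taken from the collected data.
sample_tweets = [
    "crm is the best tool for sales",
    "the best crm is the one you use",
]
trigrams = build_trigram_model(sample_tweets)
print(trigrams[("is", "the")])
# ['best', 'one']
```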

I may also come back to this dataset and do some visualisations to explore it a bit further, maybe doing some topic analysis: grouping tweets based on their topics and making a separate n-gram model for each topic.

Technical Notes

Everything was coded in Python, using the Twython library for streaming Twitter data.
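
The URL clean-up step could look something like the sketch below. The regular expression is an assumption about what counts as a link (anything starting with http://, https:// or www.), and the function name `strip_urls` and the example tweet are made up for illustration.

```python
import re

# Assumed definition of a link: http(s)://... or www.... up to the next space.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Remove URLs from a tweet and tidy up the leftover whitespace."""
    cleaned = URL_RE.sub("", text)
    return " ".join(cleaned.split())

print(strip_urls("Top 5 #CRM tips https://t.co/abc123 for small teams"))
# Top 5 #CRM tips for small teams
```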