I never dreamt of becoming a writer
But here is the story. A short version of what will become my next project. The final one, my Game of Thrones, my One Hundred Years of Solitude.
For my final project at General Assembly I chose to go with fake news. Not CNN, the real fake news. There’s been so much hype about it that the next time I am at a bar and some drunk dude yells “fake news” in my face, I can adjust my glasses and say, “I actually did a study on that, and you know what…”
And it won’t be interesting because it’s not another spam email, it’s not an attempt to sell you Viagra or let you win a brazilion dollars from my Nigerian granny. It’s another type of lie. Disguised as a sensation, it’s a lie about people whose lives have made them so public that every story seems like a tale.
My search for those lies was like a paved road to hell. I found them quickly and easily. I asked Google, and it brought me to Kaggle.com. Kaggle has a plethora of interesting stories with data you can play with to prove their authors right or wrong. I trust them. So I borrowed a dataset that was one story short of 13 thousand fake news items and brought it to my instructors, beaming, because now I could check whether any of the stories they shared on Facebook was a lie. I thought I had all of them on my new MacBook Pro.
No, they said, you also need true news. I hadn’t heard of those before. I know they are all true, but we don’t trust them, especially the ones from the right wing. So I thought. I went on the March and I don’t trust them. But you can’t say that to a data scientist. You have to have both fake and real, because how else would your new MacBook Pro learn the difference? And so I scraped. Day and night, night and day. Same day and night. It took me several hours. 15 news sites, six thousand stories. All presumably real. I don’t know, but I had to presume, because when you have fake news, the rest are presumably real. I had them all: CNN, Politico, and Huffington on the left, National Monitor and USA Today in the center and, oh my lord, Breitbart on the very right. I could not tell the scraper how many stories to harvest. It all depended, but one day I got 3,000 and they all leaned heavily left. The ethics of an ethical data scientist would not let me go to bed until I found some more from the right side. And I did: Western Journalism and other news. They are all in the basket now, equally distant from the center.
Luckily for me, Katharine Jarmul, a prominent Python enthusiast, had just published an article on DataCamp in which she applied Multinomial Naive Bayes to a real/fake news dataset. I did not think twice: I took that dataset and used the same approach. It worked just fine on that dataset. But it behaved very peculiarly on the data I scraped from the Internet.
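For the curious, here is roughly what Multinomial Naive Bayes does under the hood: count how often each word shows up in each class, then pick the class that makes the words of a new text most likely. This is only a minimal sketch in plain Python, not the code from the DataCamp article or from this project. The toy headlines and the names `tokenize`, `train` and `predict` are made up for illustration, and a real pipeline would most likely lean on a library instead.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


def train(docs, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter(labels)       # label -> number of documents
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(text, model, alpha=1.0):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy headlines, for illustration only.
docs = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking miracle cure",
    "senate passes budget bill after long debate",
    "city council approves new budget for schools",
]
labels = ["fake", "fake", "real", "real"]

model = train(docs, labels)
print(predict("shocking miracle secret", model))
print(predict("council debate on budget bill", model))
```

The model only knows the words it was trained on, which may be one reason a classifier that looks fine on one dataset acts strangely on text gathered somewhere else.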
So Breitbart is more credible than CNN. Eat it, lefties! But seriously, NO. I am going with a different model. I will try RandomForest next. See you in the next post.