Hi Amir,
Michael Gu

Your model is expecting 16,413 features, which I’m guessing is the result of one-hot encoding a larger data set with many more distinct strings. The last few paragraphs of the post mention this problem, which is bound to occur at inference/score time. My solution is to keep around the list of columns you used for training so you can add any missing ones back to your query dataframe, which is what the post does by pickling the columns array.
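To make the idea concrete, here is a minimal sketch of saving the training columns and aligning a query to them. The toy data, the variable names (`model_columns`, `columns_blob`) and the use of `reindex` are my assumptions for illustration, not necessarily the exact code from the post.

```python
import pickle

import pandas as pd

# Training time: one-hot encode and remember the resulting column list.
train = pd.DataFrame({"age": [25, 40, 31], "city": ["Paris", "Tokyo", "Lima"]})
X_train = pd.get_dummies(train)
model_columns = list(X_train.columns)

# Serialize the column list; in practice you would write this to a file
# that ships alongside the pickled model.
columns_blob = pickle.dumps(model_columns)

# Inference time: the query only contains one city, so get_dummies
# produces far fewer columns than the model was trained on.
query = pd.DataFrame({"age": [29], "city": ["Tokyo"]})
X_query = pd.get_dummies(query)

# Restore the saved columns and align the query to them.
# Columns missing from the query are added and filled with 0.
saved_columns = pickle.loads(columns_blob)
X_query = X_query.reindex(columns=saved_columns, fill_value=0)
```

After the `reindex` call, `X_query` has the same columns, in the same order, as `X_train`, so it can be passed to the model's `predict`.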

Another problem that may arise is a new feature (a new string, in your case) that was not present at training time. In that case you have no choice but to remove it from your query. Basically, the columns you send to the model should match exactly the columns it saw at train time.
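A rough sketch of handling an unseen value follows. The column list and the query are hypothetical, made up for this example.

```python
import pandas as pd

# Hypothetical column list saved at training time.
model_columns = ["age", "city_Lima", "city_Paris", "city_Tokyo"]

# The query contains a city the model never saw during training.
query = pd.DataFrame({"age": [52], "city": ["Oslo"]})
X_query = pd.get_dummies(query)

# Find the columns the model does not know about and drop them.
unseen = [col for col in X_query.columns if col not in model_columns]
X_query = X_query.drop(columns=unseen)

# Then add back any training columns the query is missing, filled with 0.
X_query = X_query.reindex(columns=model_columns, fill_value=0)
```

Note that `reindex` on its own would also discard the unseen columns; computing `unseen` explicitly is just a convenient way to log or monitor how often new values show up.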

If you are using textual features, it probably makes more sense to use a vectorizer (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). In that case you fit the vectorizer on the training data and reuse that same fitted vectorizer at inference time, very similar to what I did with pickling the columns array. For a larger text corpus you can use a HashingVectorizer (http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick).
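Here is a small sketch of the vectorizer approach, assuming scikit-learn's `CountVectorizer` as the vectorizer (any of the text vectorizers would follow the same pattern). The example texts and the `n_features` value are made up.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]

# Training time: fit the vectorizer on the training data only,
# then persist it together with the model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
vectorizer_blob = pickle.dumps(vectorizer)

# Inference time: load the fitted vectorizer and call transform (not fit).
# Words that were not in the training vocabulary are simply ignored,
# so the number of features always matches what the model expects.
loaded_vectorizer = pickle.loads(vectorizer_blob)
X_query = loaded_vectorizer.transform(["the zebra sat"])

# HashingVectorizer is stateless: there is no vocabulary to fit or store,
# and the output width is fixed by n_features.
hasher = HashingVectorizer(n_features=2**10)
X_hashed = hasher.transform(["the zebra sat"])
```

The trade-off with hashing is that you cannot map a feature index back to the original word, and distinct words may occasionally collide in the same column.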
