Published in Analytics Vidhya

Personal Film Content Analysis

I keep a daily log, and the films and dramas I watch are recorded in it. After collecting around four years of daily log data, I wanted to see what kind of film I watch the most. I approached this question in two ways. Both are based on the WordCloud module but take different input: the first uses only the film titles, and the second uses a type label that I defined for each film.

The first step is to get the data and clean it. The only information we need is the film content, so we first create a data frame with the columns we need. Then we filter out the null values with the condition ac_log['Film'].isnull()==False.
After that, we add "news" to every row that does not already include it, because I actually watch the news every day but at some point stopped recording it.

Adding "news" to the film content
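The original code for this step is not reproduced here, so the following is only a minimal sketch of how it could look with pandas. The sample rows, the "Date" column name and the helper add_news are assumptions made for illustration; only ac_log and the "Film" column come from the text above.

```python
import pandas as pd

# Hypothetical sample of the daily log; the real column layout is an assumption.
ac_log = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "Film": ["news/Drama A", None, "Documentary B"],
})

# Keep only the columns we need and drop the rows without a film record.
# notnull() is equivalent to isnull() == False.
film = ac_log.loc[ac_log["Film"].notnull(), ["Date", "Film"]].copy()


def add_news(entry: str) -> str:
    """Append 'news' to a slash-separated entry if it is not listed yet."""
    items = [item.strip() for item in entry.split("/")]
    if "news" not in [item.lower() for item in items]:
        items.append("news")
    return "/".join(items)


film["Film"] = film["Film"].apply(add_news)
```

Rows that already list "news" are left unchanged, so running the step twice does not duplicate the entry.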

The next step is to adjust the format. Films are recorded in the form XXX/XXX/XXX, with a slash as the separator. To make each row an independent film entry, we split each record on the slash and append the pieces to the data frame we created earlier.

Adjust the data format
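One way to sketch this step is with pandas' str.split and explode, which produce one row per film in a single pass. This swaps in explode for the split-and-append loop described above, and the sample data and the "Date" column are assumptions.

```python
import pandas as pd

# Hypothetical cleaned log with slash-separated film entries.
film = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-03"]),
    "Film": ["news/Drama A", "Documentary B/news"],
})

# Split each entry on "/" into a list, then turn every list element
# into its own row while repeating the date.
film_long = (
    film.assign(Film=film["Film"].str.split("/"))
        .explode("Film")
        .reset_index(drop=True)
)

# Remove stray spaces around the titles.
film_long["Film"] = film_long["Film"].str.strip()
```

Each date is repeated for every film watched on that day, which keeps the year available for the filtering that follows.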

After handling this, we have the result shown in the following table.

Table after format adjustment

The data ranges from 2017 to 2021. However, labeling is only done for 2017 to 2020, so we need to filter out the data from 2021. At the same time, we also remove the entries containing "-", which indicates no value. After that, we create a column called "count" to help with the calculations later.

Filter out the 2021 data
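A minimal sketch of this filtering step, assuming the log has a datetime "Date" column and that a lone "-" marks an empty entry (matching it exactly avoids dropping titles that merely contain a hyphen). The sample rows are made up.

```python
import pandas as pd

# Hypothetical one-film-per-row data; "Date" as a datetime column is an assumption.
film_long = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2020-06-01", "2021-01-05", "2020-07-01"]),
    "Film": ["Drama A", "-", "Drama A", "news"],
})

# Keep only the years covered by the label table (up to 2020).
in_label_range = film_long["Date"].dt.year <= 2020

# Drop placeholder entries; assumes "-" alone means "no value".
has_value = film_long["Film"].str.strip() != "-"

film_clean = film_long[in_label_range & has_value].copy()

# Helper column so that counts can be obtained by summing later.
film_clean["count"] = 1
```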

Then we group the data by Film and sum the count column to get the count of each film. The following table shows the result, and it is obvious that I watched quite a lot of Chinese history dramas. I actually watched a lot of documentaries as well, but because these documentaries do not belong to the same series, they do not show up in this count ranking.

Film count table
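The count table could be produced roughly as follows. This is a sketch with made-up titles; film_clean and film_count are illustrative names rather than the ones used in the original code.

```python
import pandas as pd

# Hypothetical cleaned data with the helper "count" column.
film_clean = pd.DataFrame({
    "Film": ["Drama A", "Drama A", "news", "Drama A", "Documentary B"],
    "count": [1, 1, 1, 1, 1],
})

# Sum the helper column per title and rank the titles by frequency.
film_count = (
    film_clean.groupby("Film", as_index=False)["count"]
    .sum()
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
```

A long-running series accumulates many rows under one title, while one-off documentaries each count once, which is why the latter never reach the top of this ranking.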

To find out what kind of content I actually watch, a label table is indispensable. At first, I looked for an AI tool that could label these films for me, but unfortunately I haven't found one yet. If you're reading this article and know how to do it, please leave me a message, as this would save me tons of time 🙏. Until I find such a tool, the only way is to label the films manually, one by one. Overall, I labeled my film content with 30 categories based on my subjective judgement. The result looks like the following table.

Film label table

As we don't have any film record data for 2016, we need to filter out the 2016 data. In addition, the label data for 2017 and 2018 is not complete either, but here we just make do with it, as labeling data is really tedious 😞. The following table shows the film label content.

However, the film labels are also separated by slashes, just like the film records, so they need the same format adjustment. The result looks like the following table.

Film label table after adjustment

The next step is to match the film record data with the label table, which we do with the difflib module. If a film record finds a film name in the label table that is similar enough to match, the record is given the label table's film name; otherwise it is given the value 0.

Film matching process
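The text does not say which difflib function or similarity threshold was used, so the sketch below assumes difflib.get_close_matches with its default cutoff of 0.6. The helper match_title and all titles are hypothetical.

```python
import difflib

# Hypothetical film names from the label table.
label_titles = ["Nirvana in Fire", "Planet Earth", "news"]


def match_title(title, candidates, cutoff=0.6):
    """Return the most similar candidate, or 0 if none is similar enough.

    The cutoff of 0.6 is difflib's default and an assumption here.
    """
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else 0


# Hypothetical film records as they might appear in the daily log.
records = ["Nirvana in Fire 2", "Planet Earth II", "Unknown Show"]
matched = [match_title(title, label_titles) for title in records]
```

With a pandas data frame, the same helper could be applied to the film column, for example film["Film"].apply(lambda t: match_title(t, label_titles)). Raising the cutoff gives fewer false matches but more records left at 0.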

After that, we check the lost data and find that around 870 records are lost, about one third of the data, which is not ideal. As we know, the film data from 2017 to 2018 is not really clean, and not all film records from these two years went through the labeling process. We will not handle this for now and will just use the data as it is to see what the result looks like.

Lost data table
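Checking the lost data can be sketched as below, assuming the matching result is stored in a column named "Matched" (a hypothetical name) in which unmatched records hold the value 0. The rows are made up and do not reproduce the real figures.

```python
import pandas as pd

# Hypothetical matching result: 0 marks a record without a label match.
film = pd.DataFrame({
    "Film": ["Drama A", "Old Show", "news", "Drama A ep2", "Another Show", "news"],
    "Matched": ["Drama A", 0, "news", "Drama A", 0, "news"],
})

# Records that could not be matched to the label table.
lost = film[film["Matched"] == 0]

# Share of the data that is lost.
lost_share = len(lost) / len(film)
```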

Then we merge the matched label table with the film record table, as follows.

Merged result of the film record with the label

Then we group the data by film label. Here we can see that the top five film categories I watched are news, history, drama, world and documentary.

Film label count ranking
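The merge and the label ranking could look roughly like this. The column names "Matched", "Title" and "Label" and the sample rows are assumptions. Because a film can carry several labels after the format adjustment, each record counts once for every label it has.

```python
import pandas as pd

# Hypothetical film records with the matched label-table name.
film = pd.DataFrame({
    "Film": ["news", "Drama A", "Drama A ep2", "Doc B"],
    "Matched": ["news", "Drama A", "Drama A", "Doc B"],
    "count": [1, 1, 1, 1],
})

# Hypothetical label table after the format adjustment: one row per film and label.
labels = pd.DataFrame({
    "Title": ["news", "Drama A", "Drama A", "Doc B"],
    "Label": ["news", "drama", "history", "documentary"],
})

# Attach the labels to each record through the matched name.
merged = film.merge(labels, left_on="Matched", right_on="Title", how="inner")

# Count the records per label and rank the labels.
label_count = merged.groupby("Label")["count"].sum().sort_values(ascending=False)
```

Using an inner merge drops the unmatched records (value 0), which is consistent with treating them as lost data.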

Lastly, let's look at the word clouds from the two approaches mentioned above.
The first is based on the film titles and the second on the film labels.

Word cloud of the film title
Word cloud of the film label
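The two word clouds can be sketched with the wordcloud package by counting how often each title or label occurs and passing the counts to WordCloud.generate_from_frequencies. The helper names, sample data, image size and file names are assumptions, and the rendering calls are commented out so that the counting part runs even without wordcloud installed.

```python
from collections import Counter


def frequencies(items):
    """Count how often each non-empty title or label occurs."""
    cleaned = (item.strip() for item in items)
    return dict(Counter(item for item in cleaned if item))


def render_word_cloud(freqs, path):
    """Render a word cloud image from a frequency dict (needs `wordcloud`)."""
    from wordcloud import WordCloud  # third-party package

    # Size and background colour are arbitrary choices for this sketch.
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(freqs)
    cloud.to_file(path)


# Hypothetical inputs: one entry per watched film.
titles = ["news", "Drama A", "Drama A", "Doc B", "news", "news"]
film_labels = ["news", "history", "history", "documentary", "news", "news"]

title_freqs = frequencies(titles)
label_freqs = frequencies(film_labels)

# Approach 1: word cloud of the film titles.
# render_word_cloud(title_freqs, "title_cloud.png")
# Approach 2: word cloud of the film labels.
# render_word_cloud(label_freqs, "label_cloud.png")
```

Feeding frequencies instead of raw text keeps multi-word titles together as one term. If the titles contain Chinese characters, WordCloud most likely needs its font_path argument pointed at a font that covers them.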

To sum up, my top five film categories are news, history, drama, world and documentary. The dramas I watched are mostly history-related, and the documentaries are often history-related as well. Accordingly, we can infer that I am probably a person who likes to listen to stories, learn from history, and is curious about what is happening around the world.


