Hello there! If you’re reading this then you’re probably interested in Data Science, or the Film Industry, or perhaps both. Before I get into the data and my first project for Flatiron School, I think it is important to talk about my background and how it has made me think about the data in different ways. Growing up in the Los Angeles area I was surrounded by film studios, but not only that, my family has been working in the film and television industry for two generations. Many times I was able to join my dad when he went to work and see some of the “behind the scenes” aspects of producing a movie. Most of that work was post-production, which involved receiving multiple raw chunks of film, making the necessary edits, adding effects, and later piecing them together into one seamless film. Many aspects of this are nearly identical to the tasks of a Data Scientist, where the first step is collecting the data, followed by cleaning and editing it, and finally modeling and analyzing the results. In the past I always considered a job in the film industry like many in my family, but by chance I happened to stumble across Data Science.
To be honest, I never knew that there was an official title for this field of work. In past jobs I was doing similar analysis on a smaller scale and without the programming skills, but I absolutely loved it! I had always thought that maybe the work I was doing was unique and that I’d never find another job like it, but as soon as I learned of Data Science I jumped straight down the rabbit hole to find out how I could join this exciting field. I started practicing with Codecademy’s free lessons and video tutorials, and challenging myself with HackerRank and LeetCode, but I craved more. Eventually this led me to Flatiron School and their Data Science track. Now I am one month into the program and it is everything I had hoped it would be. Data Science has been a way to show off my creative side while still using my analytic side, something I never thought would be possible, and being able to connect my data to something so personal to me has been a great experience. In just four short months I will have graduated from Flatiron, and I am still so early in my learning adventure. I’ve just barely scratched the surface of what a Data Scientist can do, and I am so excited to see where I will be in another four months when I finish my final project. Whether you’re a professional Data Scientist, a fellow student, or anywhere in between, I hope that my project findings are of some interest and show the passion I have for both film and data.
For our first Flatiron phase project we are tasked with helping Microsoft understand the film industry. Because other big companies like Amazon Prime Video and Netflix are creating their own original video content, Microsoft wants to get in on the fun and provide a service that can compete. We must use our data to create actionable insights to present to the head of the new movie studio, including our suggestions on what types of movies to create.
First Step: Accessing the Preliminary Data
This is the first time I’ve ever had a project like this, and going into it I didn’t know what to expect. Maybe the optimistic part of me had hoped that the data provided would be absolutely perfect and contain every bit of information I could be curious about, but in reality I doubt many projects will ever be like that. We were provided a few CSVs, each covering a different topic from the IMDb website. I knew that I would need to combine these so I could view everything together and make cleaning the data easier later on. The way I combined my DataFrames was by making sure each pair shared an identically named column containing the same data, then merging on it. Perhaps there are better and cleaner ways to do this, but this worked for me and got the results I wanted. Having the data all in one DataFrame made things a lot easier to view, but there was still a lot of work to be done to clean and filter out values that I, or a new film studio, would not be as interested in.
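The merge described above can be sketched with pandas. The DataFrames and column names below (`tconst`, `primary_title`, `averagerating`) are illustrative stand-ins for the IMDb CSVs, not the exact files from the project:

```python
import pandas as pd

# Hypothetical stand-ins for two of the IMDb CSVs
basics = pd.DataFrame({
    'tconst': ['tt001', 'tt002', 'tt003'],
    'primary_title': ['Movie A', 'Movie B', 'Movie C'],
})
ratings = pd.DataFrame({
    'tconst': ['tt001', 'tt002'],
    'averagerating': [7.1, 5.4],
})

# Merge on the shared ID column; an inner join keeps only the
# titles present in both DataFrames
df = basics.merge(ratings, on='tconst', how='inner')
print(df)
```

The `how` argument controls which rows survive the merge: an inner join drops titles missing from either file, while `how='left'` would keep every row of `basics` and fill missing ratings with NaN.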
Second Step: Sorting and Filtering
One of the first columns I wanted to sort and filter through was the ratings. Not many people want to see a bad movie or go looking for one. Because there are some films that people claim are “so bad they’re good,” like The Room and Sharknado, I decided to start the filtering at a rating of 6.5. I assumed that most movies that are actually good would have a rating ranging from 8–10, but some cult favorites would land just slightly below. The next way I filtered my data was to include only movies that had a release in America. This didn’t exclude films made by studios outside the US as long as they were still showing in America. It was important to me to include foreign films because they are still very popular among American viewers, and I wanted my data to reflect that.

At this point things were starting to look a little better, and while I had more filtering in mind, I wanted to start getting rid of unneeded columns and duplicate values. Many of these columns had nearly all their values as ‘0’ or a boolean, or just didn’t have information relevant to the analysis I wanted to do. I also searched for duplicate values based on the title ID that IMDb uses to categorize its data. There were definitely some movies that shared a name but had different IDs and, based on the other columns, appeared to be two different movies; there were even more movies that had the same name (or a variation of it) and the same ID. This took me from 1213 rows down to 742. At this point I felt I had done as much filtering as I could with the data I had, and it was time to add new metrics to filter from.
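The filtering and deduplication steps above can be sketched like this. Again, the toy DataFrame and column names (`averagerating`, `region`) are assumptions standing in for the real merged data:

```python
import pandas as pd

# Toy frame standing in for the merged IMDb data — columns are illustrative
df = pd.DataFrame({
    'tconst': ['tt001', 'tt002', 'tt002', 'tt003'],
    'primary_title': ['Movie A', 'Movie B', 'Movie B', 'Movie C'],
    'averagerating': [8.2, 6.9, 6.9, 5.1],
    'region': ['US', 'US', 'US', 'FR'],
})

# Keep titles rated at least 6.5 that had a US release,
# then drop rows that repeat the same title ID
filtered = (
    df[(df['averagerating'] >= 6.5) & (df['region'] == 'US')]
    .drop_duplicates(subset='tconst')
)
print(filtered)
```

`drop_duplicates(subset='tconst')` keeps the first occurrence of each title ID, which matches the goal of collapsing rows that share the same IMDb ID.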
Third Step: Creating Metrics
The first metric I wanted to add was a calculation between worldwide gross and the production budget. Because these columns were stored as strings (dollar signs and commas included), I ran into obvious issues trying to do my calculations. Initially I had planned to just do a simple calculation when making the new column, which would look like:
```python
df2['ProfitReturn'] = df1['Worldwide_Gross'] / df1['Production_Budget']
```
Unfortunately it would not be this simple. On the bright side, this gave me an opportunity to look more into list comprehensions, which I ended up using many more times in this project (and will probably use a ton beyond it as I continue to learn). The final evolution of my code ended up as:
```python
df2['ProfitReturn'] = [
    int(x.strip('$').replace(',', '')) / int(y.strip('$').replace(',', ''))
    for x, y in zip(df1['Worldwide_Gross'], df1['Production_Budget'])
]
```
Not only did I get to learn a lot about what you can do with list comprehensions, I also learned about the zip function, which I used to access and connect the two columns. Once my column was made I wanted to keep only the movies with a ProfitReturn greater than 1. I wanted to make sure we only included movies that were able to earn back their production investment, because there is no use in learning from movie data that did not make money. Lastly, I sorted the data by the profit return value and put the top 100 results in a new DataFrame to work from.
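For anyone curious, the same cleaning can also be done with pandas string methods instead of a list comprehension. This is a minimal sketch with made-up sample values, and the `to_number` helper is my own illustrative name, not part of the original project:

```python
import pandas as pd

# Made-up sample rows in the same format as the budget data
df1 = pd.DataFrame({
    'Worldwide_Gross': ['$1,500,000', '$400,000'],
    'Production_Budget': ['$500,000', '$800,000'],
})

def to_number(col):
    """Strip the '$' and thousands separators, then convert to int."""
    return (col.str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(int))

df1['ProfitReturn'] = (to_number(df1['Worldwide_Gross'])
                       / to_number(df1['Production_Budget']))

# Keep only movies that earned back their budget, then take the top 100
top = df1[df1['ProfitReturn'] > 1].nlargest(100, 'ProfitReturn')
print(top)
```

The vectorized version avoids the `zip` loop entirely, and `nlargest` handles the sort-and-slice in one call.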
Fourth Step: Journey into APIs
At this point I felt like I had a pretty good set of data to do some of my analysis on, but there was more that interested me that was not in the CSVs provided. One thing that was always an interest of mine was where movies are filmed. I would often hear friends and family talk about their film work taking them to places like British Columbia and Ontario in Canada, or to Georgia and the Carolinas in America, to avoid more expensive filming locations like Los Angeles, New York City, or places far abroad. I wanted to see how common it was for movies to be filmed in these locations and whether this had any effect on gross and profit return. This led me to search for APIs with this information. I knew that most IMDb pages include the city where a movie was filmed, and sometimes even street addresses of certain filming locations, so I went to rapidapi.com and found all sorts of APIs available.

Once I found an API that met my needs and could access the filming location information from IMDb, I started to write code. Since the API could only handle one request at a time, I wanted to loop through the title ID for each movie. Doing this manually would be the same number of requests, but I felt automating it would make things easier and faster. Unfortunately, writing the code to automate this process took far longer than it would have to type out all one hundred title IDs and requests by hand. Three hours later, when I finally got it to work, I was so happy I was nearly in tears. The end result required me to put the get request in a function and use a conditional statement in case no location data was provided for a certain movie. Then I made a for loop to call the function and put the data it received in a column. Even though it wasn’t many lines of code, I was super proud of what I created and how much I learned in a short amount of time.
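The function-plus-conditional-plus-loop pattern described above might look something like this. The real RapidAPI endpoint and payload aren’t shown in the post, so `fetch`, `fake_fetch`, and the response shape here are all hypothetical placeholders for the actual `requests.get` call:

```python
def get_location(title_id, fetch):
    """Request filming-location data for one title ID.

    `fetch` stands in for the real API call (e.g. a requests.get
    wrapper with a key); it is injected so the pattern can be shown
    without a live endpoint.
    """
    response = fetch(title_id)
    # Guard against titles with no location data in the response
    if response and response.get('locations'):
        return response['locations'][0]
    return None

def fake_fetch(title_id):
    # Hypothetical responses standing in for the API payload
    data = {'tt001': {'locations': ['Vancouver, BC, Canada']}, 'tt002': {}}
    return data.get(title_id)

# Loop over each movie's title ID and collect a location per row
title_ids = ['tt001', 'tt002']
locations = [get_location(t, fake_fetch) for t in title_ids]
print(locations)
```

The conditional is the important part: without it, a single movie with no location data would raise an error and kill the whole loop.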
When I received the data containing the filming locations, I realized that it had no coordinates. This led me to Google Cloud Services, where I enrolled to use the APIs they offer. Like many, I had not learned from the pain of automating my previous API request, and it was time to do it again. Luckily I was able to use my last attempt as the structure for my code, and it took me about half the time! Similarly to before, I used a function with a conditional statement to find the coordinates of the addresses I had received. This time I had to add a line in case a movie had no locations listed, after quickly learning that without it my request would fail. Now, with my columns of coordinates created, I felt ready to start my visualizations.
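The geocoding pass follows the same shape, with the extra guard for movies that had no location listed. The `geocode` callable and its result format below are illustrative stand-ins for the Google Geocoding API, not the real client:

```python
def get_coordinates(address, geocode):
    """Look up (lat, lng) for an address via an injected geocoder.

    `geocode` stands in for the Google Geocoding API call; the
    names and result shape here are illustrative.
    """
    if not address:          # the extra guard: skip movies with no location
        return (None, None)
    result = geocode(address)
    if result:
        return (result['lat'], result['lng'])
    return (None, None)

def fake_geocode(address):
    # Hypothetical geocoder response
    known = {'Vancouver, BC, Canada': {'lat': 49.2827, 'lng': -123.1207}}
    return known.get(address)

addresses = ['Vancouver, BC, Canada', None]
coords = [get_coordinates(a, fake_geocode) for a in addresses]
print(coords)
```

Checking for a missing address before calling the geocoder is exactly the line that, per the paragraph above, stops the request from failing on movies without locations.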
Plotting the Data
Going into this course, I always enjoyed plotting data. Many times I’d use Microsoft Excel or other programs to neatly plot my data from a spreadsheet, but doing the behind-the-scenes code has been a challenge for me. Whether it was Matplotlib, Seaborn, or Plotly, I struggled! Part of the problem is probably that I’m a novice at reading documentation, afraid of messing things up, and still learning how to read error messages. Initially, when planning out my data visualizations, I planned on using Seaborn. In our lesson about visualizations it seemed that Seaborn was significantly easier than Matplotlib. While that may be true, after countless Google searches and a lot of time on Stack Overflow, I decided to give up on it.
We never had an actual lesson on Plotly, but it was briefly mentioned and seemed to be even more user friendly than Seaborn. I was discouraged when setting up Plotly and had a brief “woe is me” moment when I ran into an issue with my visualizations loading. No error codes, just a very sad frowny face where the visualization should have been. I was terrified that if I couldn’t get Seaborn or Plotly to work then I’d have to use Matplotlib, which I’m sure is a very nice library that I will come to love as I develop my skills, but at this point it seems about as terrifying as the moment your IDE crashes with your unsaved code. In an effort to save Plotly and my remaining sanity, I furiously searched Google for a solution, and lo and behold, it worked! Seeing my data visualized was just as satisfying as the moment I got my API automation to work. Not only that, but the next visualizations I made were so easy with Plotly that I honestly spent more time picking a color for my graph.

Having data in a DataFrame is interesting, but I think the important message this project is trying to show us is how much visualizations matter. In our future careers we won’t always be presenting to people who are tech savvy or comfortable with Excel spreadsheets and mathematics. In the end I focused my visualizations on things I feel a business would be interested in — worldwide gross and the profit return ratio. Of course, I also had to add my maps of where movies were made, because after all that time spent getting the API to work, I’d be devastated not to present it and show off my hard work.
Final Thoughts About the Data and Project
With one project under my belt, it’s hard to feel anything but proud and accomplished. A week ago I started this project with some CSVs and a few ideas, and now I have a full GitHub repository and a strong slide deck to present to my peers. This project has been such a great inspiration for me and makes me very optimistic about the future after Flatiron. Data Scientists are unique and can fit into any field they want, because data is all around us. I believe that having a personal connection to a certain industry and its data will only make me a better coder and Data Scientist. For the longest time I was apprehensive about decisions related to my future career, whether it was something simple like a family member asking what I wanted to be when I grew up, or declaring a major at university. Data Science has helped alleviate the anxiety of being trapped in a field I might later get bored of, and reassured me that if I wanted to work in multiple different fields, I could. So maybe I won’t work in the film industry with my family and friends, writing, producing, and editing film. But that doesn’t mean I can’t work in the industry alongside them and analyze data like in this project to help them on their next feature film.
Thank you for taking the time to read about my first project with Flatiron and what went into getting it done. If you wish to view my project’s Github Repository you can find it here: