Movie List — Data Cleaning Using Pandas and ChatGPT — Unguided Project Part 1

Maria Inc
7 min readJul 22, 2024

--

This is my second time using Pandas to clean data and I’ve to say it's a challenging one for me. I’ve tried another dataset, and it's a lot bigger than my first Data Cleaning Project. However, this time there is no video guidance on what I should do step by step. Hence, I took a little help from ChatGPT to give me some ideas or advice, if needed.

This will be split into 2 parts; Part 1 is the Data Cleaning using Pandas and ChatGPT and Part 2 will be answering the Business Questions.

Sources:

This is a Movie dataset, it contains the following columns:

  • Movies: The title of the movie.
  • Year: The release year of the movie.
  • Genre: The genre(s) of the movie.
  • Rating: The IMDb rating of the movie.
  • One-Line: A one-line description or tagline of the movie.
  • Stars: A mixed column containing both directors and stars of the movie.
  • Votes: The number of votes the movie received on IMDb.
  • RunTime: The runtime of the movie in minutes.
  • Gross: The gross revenue of the movie.

Scenario (by ChatGPT):

A movie production company is evaluating its portfolio and future strategies. The company wants to make data-driven decisions to enhance its content strategy, marketing efforts, and collaborations with key talent.

By addressing these business questions, stakeholders can leverage the dataset to make strategic decisions about content production, marketing, and investments, ultimately aligning their actions with market trends and maximizing success.

Business Questions:

  1. What are the most common genres in the dataset?
  2. Which movies have the highest and lowest ratings, and what are their common attributes (e.g., genre, runtime, stars)?
  3. Which movies have the highest and lowest votes, and what are their common attributes (e.g., genre, runtime, stars)?
  4. Who are the most influential directors and stars in terms of movie ratings and gross revenue?

Here are the steps I took:

Step 1

Create a data frame (df) and then call your raw dataset (excel). This time I still used the path method.

Step 2

Data exploration and observation. I looked through each column and row, for any duplicates, missing data, or formatting discrepancies. From here we would know what changes we need to make to answer the business questions.

  • Drop duplicated rows
  • Strips unnecessary characters to a string
  • Removing column(s) that are not needed
  • Change to necessary formats

and many more.

Step 3:

Remove duplicated rows. Now, this dataset is huge, hence, we can’t eyeball if there are duplicated rows. Hence, we could use the following Pandas methods.

  • Shows which row is a duplicate

Using the method below will show us which row(s) consist of duplicated rows. (False = Not duplicated / True = Duplicated)

  • Shows the total number of duplicated row
  • Drop the duplicated rows

Step 4:

Remove the unwanted columns such as the ‘One-line’ column.

Step 5:

Cleaning and standardizing the columns.

a. YEAR Column

  • Removing the brackets from the strings
  • Challenges: After removing the brackets, I can’t seem to strip the hyphen from the standalone year (eg. 2021-)
  • Solution: I consulted ChatGPT to learn how to remove the hyphen from the standalone year by using the ‘.replace()’ and regex pattern.
  • Learned: Remove the whitespace first and then use ‘.replace’ to replace any hyphen value with an empty string ‘ ‘ by using r’–\s$’*, a regex pattern.
  • Remove the whitespace by using ‘.strip()’.
  • Use ‘.replace()’ with regex to replace the hyphen for the standalone year (eg. 2021-)

Let’s break down this regex pattern from ChatGPT:

  • ‘–’ : Matches the hyphen character (you can also use the standard hyphen ‘-’ if it matches the specific hyphen used in your data).
  • \s*: Matches zero or more whitespace characters (spaces, tabs, etc.).
  • $: Asserts the position at the end of the string.

Basically saying that hyphen — followed by zero or more spaces at the end of the string (eg. ‘2021-’ , ‘2013-’) — is removed. Hence, it left ‘2010–2022’ unchanged.

b. GENRE column

  • Remove the slashes ‘\’ and ’n’ which are called newline characters from the string

c. STARS Column

Among all the columns, this is the most challenging one yet. I’ve to do a couple of reverts to make it right.

  • Remove whitespace in the Stars column
  • Split the Stars column into 2 columns ‘Director’ and ‘Stars’
  • Replacing the ‘Directors:\n’ with an empty string and then removing white spaces
  • Replacing the ‘Star:\n’ and ‘Stars:\n’ with an empty string and then removing whitespace

The next step will show how strings that contain ‘Star:\n’ and ‘Stars:\n’ will be shifted to the Stars column. Now, this part took a while for me to understand but thanks to ChatGPT, it's able to break down the solution that’s easier to understand.

  • First, we have to identify the rows that contain ‘Star:\n’ or ‘Stars:\n’ under the Director column. Those that are ‘TRUE’ are rows that contain such string. While ‘FALSE’ was do not.
  • Create a new variable called ‘mask’ to store the identified rows that contain the said strings above.
  • Then with the said variable ‘mask’, we will use ‘.loc’ that’s related to the row index to move the string from the Director column to the Stars column in the same row index number.
  • For the rows under the Director column that are empty, we will set it as an empty string
  • After moving the appropriate strings to the Stars column, we will have to clean it up from whitespace and strip unnecessary strings such as ‘Star:\n’ or ‘Stars:\n’, and ‘\n’.
  • Then go back to the Director column and strip the newline characters ‘\n’ as well.

d. Overall columns

  • In the overall dataset, we will fill in all the Nan / None values and drop the ‘STARS’ column which is not needed anymore.

Step 7:

Reset the index column to make it consecutive.

Step 8:

  • Finally, converting the dataset into my chosen format which is Microsoft Excel. You can do another format as well if you’d like as seen below.

Conclusions

Here is our final dataset with a clean and standardized formatting, hence, it will be easier to view and answer our Business questions — which will be in our next post — Part 2!

My experience with a large dataset was daunting because it was not a guided project, unlike my first data-cleaning project with Pandas. I would have to use my critical thinking to ensure that the way I clean the table would be useful when I start answering the Business Questions.

Thankfully, ChatGPT is available to guide me whenever I’m in doubt about how to do certain methods, most especially Step 5, c. — Stars Column.

Moreover, I believe there will be more data wrangling when it comes to answering the business questions. So that’s more to come to our Part 2 for our Movie dataset project.

You can visit my Github to see more on this project, click here.

Next step / Action

  1. I will post Part 2 of Data Cleaning using Pandas with ChatGPT —an unguided project, where I will answer the Business Questions listed above.

2. In The next lesson I will start exploring more in-depth Data Analysis methods using Pandas and SQL.

“Maria”

--

--

Maria Inc
0 Followers

Aspiring Data Analyst / Business Analyst | Continuously learning as we go