Analyzing Amazon Forest Fire Spots with Python Part 1

Bernard Kurka
Published in Analytics Vidhya
5 min read · Dec 1, 2019
Picture by Greg Neise https://flic.kr/p/6dhVkZ

Dear reader, in this post I will show how I used Python to explore a data set of fire spots in the Amazon forest.

In part 1 you will learn:

  • How to import multiple CSV files into a single Pandas DataFrame.
  • How to create a Python file with customized functions and use them in a Jupyter notebook.
  • How to explore a data set with 1 million rows.
  • How to change column data types.
  • How to create auxiliary columns.

If you are new to Python, check out my post on writing your first Python code; if you are new to Pandas data frames, check out this post. To install Python and Jupyter notebook on Windows, you can watch this YouTube tutorial.

Data source:

The data consists of the latitude and longitude of fire spots located in the Amazon Forest inside Brazil’s territory from 01/2007 to 08/2019. The data was collected from the Brazilian National Institute for Space Research (INPE), accessed on September 20, 2019. You can also download the files directly from my GitHub.

INPE Base de Dados de Queimadas (INPE fire database)

Importing libraries:

import pandas as pd
import my_functions

The “my_functions” module is a Python file that I created with customized functions; it is saved in the same folder as the Jupyter notebook. It’s a way to keep your notebook clean and store reusable code that can be used in other projects. The only caveat is that you need to restart the notebook kernel to import any updates made to the my_functions file. To create a Python file, you can use a text editor such as Notepad: insert your Python functions and save the file with .py at the end of the file name. I’ve been using the Sublime Text editor https://www.sublimetext.com/ and Notepad++ https://notepad-plus-plus.org/.
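For illustration only, here is a minimal sketch of what such a file could contain; the helper name and body below are hypothetical, not the actual contents of my my_functions.py.

# my_functions.py (hypothetical example)
import pandas as pd

def quick_overview(df):
    # Print the shape and column types of a DataFrame,
    # so any notebook that imports my_functions can reuse this check.
    print(df.shape)
    print(df.dtypes)

In the notebook, after import my_functions, you would call it with my_functions.quick_overview(df).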

Importing data:

The code below imports one CSV file.

path = './Local Data/'  # folder where the CSV files are stored
file = 'filename.csv'   # name of a single CSV file
df = pd.read_csv(path + file, sep=';')  # the files use ';' as the separator

In this project, the data is stored in 13 CSV files. I’ve created a customized function that imports all of them into a data frame. The original column names are in Portuguese, so the function also renames columns and changes column types.

Get the full function here: https://github.com/berkurka/amazonfire/blob/master/my_functions.py

Using the function above, I imported all the CSV files into a single data frame.
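For readers who don’t want to open the link, here is a rough sketch of what such an import helper could look like. The helper name and the Portuguese-to-English rename map below are assumptions; the real function is the one in my_functions.py on GitHub.

import glob
import pandas as pd

def import_fire_data(path, sep=';'):
    # Hypothetical helper: read every CSV in `path` into one DataFrame.
    files = glob.glob(path + '*.csv')
    df = pd.concat((pd.read_csv(f, sep=sep) for f in files), ignore_index=True)
    # Assumed rename map -- adjust to the real column names in the files.
    df = df.rename(columns={'datahora': 'Date', 'estado': 'State',
                            'municipio': 'County', 'satelite': 'Satellite'})
    return df

df = import_fire_data('./Local Data/')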

Exploring data:

These are some of the methods I use to explore a new data set.

  1. df.shape
  2. df.head() or df.sample()
  3. df.dtypes & memory_usage
  4. df.isnull().sum()
  5. df.describe()

1. df.shape: returns the number of rows and columns in the data frame.

2. df.head() or df.sample(5): returns the first 5 rows or a random sample of 5 rows. From the results below we can get a sense of the contents of each column.

3. df.dtypes & df.memory_usage(): It's always important to check whether the data types in the table are what you expect them to be. In this case, the Date column is an object and will need to be converted to datetime. With the code below you can check the type and memory usage of each column; note that string columns consume plenty of memory. I won’t deal with optimizing memory usage in this post, but one option is to convert string columns to the categorical dtype.

pd.DataFrame({'Column Type': df.dtypes, 'Memory Usage': df.memory_usage(deep=True)})
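As a quick illustration of the categorical option mentioned above (the 'State' column name is an assumption about this data set):

df['State'] = df['State'].astype('category')  # repetitive strings stored as categories use far less memory
df.memory_usage(deep=True)                    # re-check the per-column memory usage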

The Date column is a string; to convert it from object to datetime, I used the code below.

df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d %H:%M:%S')

If the date format is not like YYYY/mm/dd hh:mm:ss, you will need to change the format argument of the pd.to_datetime function; all formatting options can be found at http://strftime.org/.
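For example, if the dates were written day-first (a hypothetical case, not the format of this data set), the call would look like this:

# Dates such as 31/08/2019 14:30:00 would need a day-first format string
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M:%S')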

The Date column can also be converted when importing the CSV file.

date_column_names = ['Date']  # list of column names with date format
df = pd.read_csv('./file.csv', parse_dates=date_column_names)

After converting the Date column to datetime:

4. df.isnull().sum(): counts how many null values there are in each column.

5. df.describe(): creates descriptive statistics that summarize the data frame. The output will vary depending on the data types. In the first image, the parameter include=['float'] runs the describe function only on the float columns.
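A minimal sketch of those describe calls (which columns get summarized depends on the DataFrame’s dtypes):

df.describe(include=['float'])   # statistics for the float columns
df.describe(include=['object'])  # counts, unique values, and most frequent value for string columns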

São Félix do Xingu is the county with the most observations.

There is data from different states, 28 different satellites, 547 counties, and 562,618 unique date-times.

Creating auxiliary columns:

To facilitate data exploration, I created some auxiliary time columns.

df['Day'] = pd.to_datetime(df['Date']).dt.normalize()
df['Month'] = df['Date'].dt.strftime('%m').astype(int)
df['Year'] = df['Date'].dt.strftime('%Y').astype(int)

This is what the new columns look like:

With the Month and Year columns, I can easily use .value_counts():
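A minimal sketch of those calls on the auxiliary columns created above:

df['Month'].value_counts().sort_index()  # fire spot count per month, across all years
df['Year'].value_counts().sort_index()   # fire spot count per year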

From months 9 to 11 there are more fire spots; this makes sense since the peak of the rainy season in Brazil is in December and January.
2017 and 2015 were the years with the most fire spots before removing duplicates.

The code below does the count by year without creating auxiliary columns.

df["Date"].dt.year.value_counts() 
df.set_index(df["Date"])['State'].resample('Y').count()

In my next post, I will attempt to remove duplicates, since the same fire spot can be counted two or more times, and I will generate some cool visualizations.

Thanks for reading!

