Olas - Medium

Preparing for a Graduate Thesis Defence

Idalia Machuca — Fri, 27 Sep 2019 02:41:32 GMT

At a few months post-graduation, I look back at my “defence day” as one of the proudest and happiest days of my life. Shocking, I know! Had you asked me how I was handling my impending defence while I was still preparing for it, I would have tossed around the words — nervousness, anxiety, uncertainty, dread! It takes enormous levels of preparation and courage to prepare for and get through a defence. Let’s discuss 6 tips to keep in mind while getting ready for the big day!

1. You are the expert

After a series of edits prescribed by your peers, supervisor, and committee, you’ve probably read and reread your thesis to the point of memorization. No matter how complex or technical, the work to which you’ve dedicated years can begin to sound uninspiring. You know this is imposter syndrome, but it’s shaking your confidence nonetheless!

I’m here to remind you that — you are the expert! You’ve spent an extraordinary amount of time and effort working on your project. Take ownership of this great accomplishment. No one else has spent as much time as you have on understanding the nuances related to your project and field of expertise. Defending your choices and findings is, indeed, intimidating (and nauseating), but this final step is also one of celebration for your dedication to the work and contribution to your field.

2. Review the thesis, again

On a more logistical note, reviewing the document that you will be presenting is of highest priority. You may choose to read the full work or focus your energy on reviewing the more complex and detailed sections. Stepping back and viewing the document in its entirety can also be quite invigorating as it is a very concrete, visual representation of your efforts.

3. Practice answering difficult questions

As you review the thesis, compile a list of questions that might arise during your defence. It might help to carefully consider the specific areas that are likely to be supported or opposed by individual members of your committee based on their input during the thesis editing stage or from past meetings.

One of your most valuable resources, especially at this stage of your academic career, is your research group or cohort! A useful exercise is to conduct a “mock defence” with a “mock committee” made up of your own peers. This type of rehearsal allows you to practice answering challenging and unanticipated questions in a low-risk, though still realistically stressful, setting.

4. Prepare required and supplementary materials

How is a thesis defence conducted in your department? What is the goal of a defence? What are the outcomes of a defence and who determines these? What is expected of you? Schedule a meeting with your supervisor to clarify the expectations for your defence. In my department, for example, a defence is expected to begin with a ~20 minute presentation given by the student, and it is followed by 2–3 rounds of questions, each round allowing up to 10 minutes per committee member.

As additional preparation, also consider collecting key figures or extracts from your thesis and relevant literature and organizing these as a separate document to be presented if/when a question requires auxiliary visuals.

Finally, speak with members of the team who have attended past defences to gauge the culture regarding expected formalities and customs (e.g. committee introductions, dress code, coffee or snacks, restrictions for attendees such as significant others or family members).

5. Find your comfort

It’s human to feel uneasy about facing a challenging task. Everyone will express confidence and insecurity for different elements of their graduate career. Throughout your graduate program, however, you have probably discovered your comforts — items, activities, personal rituals that soothe you and help you regain a positive outlook on difficult situations.

On the days leading up to the defence, you could have dinner with friends, go for a long walk through your favourite park, catch up on a television series, or pick up a neglected personal project.

On the day of your defence, do what feels right for you. Have a good night’s sleep, spend extra time getting ready in the morning, bring your lucky pencil into the defence, or call your relatives on Skype. Looking back now, I remember all the small details that made my day more comfortable and, therefore, much more pleasant.

6. Celebrate

Celebrating can seem far off. After all, you first need to conquer a trial! And, this is absolutely true! What is also true is that the defence, no matter the outcome, is a major accomplishment following years of personal sacrifice and personal growth. This is a very real cause for celebration!

Today, I remember the day of my defence for being surrounded by my favourite people! I invited my closest friends and colleagues to the local community hall (the Royal Canadian Legion), and we had an unforgettable day filled with dancing, karaoke, surprise cakes, surprise gifts, and heart-to-hearts! We even had party hats (which graduate students and faculty were not just willing but actually excited to wear)! By the end of the night, we had closed down the town, taking away memories to last a lifetime.

While this was perfect for me, everyone has unique preferences and visions for a fun-filled day. One of the greatest benefits of planning a celebration, though, is envisioning a future beyond this difficult task and looking forward to treating yourself with what brings you joy!

Final thoughts

For those of you who are at the defence stage of your graduate program, congratulations! You’re at the very end of the tunnel, and you’ve displayed great resilience and determination to be here. This is the final push! You’ve done something truly great for your field; now is the time to show everyone! Take in the view!

For those of you who are knee-deep in research, readings, and writing, keep pushing! Remind yourself of your inspirations and goals. You can do this! And, by the end of it all, you will have learned so much about your field and yourself. Godspeed, my friends!

Preparing for a Graduate Thesis Defence was originally published in Olas on Medium, where people are continuing the conversation by highlighting and responding to this story.

Data at a glance with Pandas (Part 1: Numbers) — Airbnb in Belize

Idalia Machuca — Sat, 13 Jul 2019 00:04:20 GMT


Photo by Spencer Watson on Unsplash

Hi friends!

With the growing influence of Airbnb on the tourism and hospitality industry around the globe, I was curious about its presence in Belize. With sprawling jungles, coastal landscapes, and a rich, multicultural history, Belize is a hidden treasure for adventure-seekers. Though the hospitality sector in Belize still appears to be dominated by hotels and resorts, there is now chatter of an increasing number of local listings appearing on Airbnb.

This article is primarily a data analysis tutorial, though I take the liberty (at the end of the article) to comment on how these findings influence the current conversation in Belize.

That said, in this exercise, we review a few basic functions for quickly extracting meaning from data and practice troubleshooting techniques when data manipulation is not as straightforward as it first appears. Given the potential for further analysis, this article is part 1 of a series, and it seeks to calculate initial estimates (or “numbers”) from data.

Part 1: A quick look at the numbers

1. About the data

The data for current Airbnb listings in Belize was sourced from the Inside Airbnb website. The dataset is available under a Creative Commons CC0 1.0 Universal (CC0 1.0) “Public Domain Dedication” license.

For this exercise, I’ll be using the “Detailed Listings” dataset for Belize (filename: listings.csv) retrieved on 26 May, 2019.

2. About the code

Pandas is a data manipulation and analysis library with easy-to-use functions designed for relational or labelled data. Pandas is written for Python and builds on the functionality of the NumPy library. It is well suited for tabular data (for example, Excel spreadsheets), time series data, and matrix data with labelled columns and rows. Pandas uses Series and DataFrame objects. A Series is a one-dimensional labelled array, and a DataFrame is a two-dimensional tabular data structure with labelled rows and columns. One of the advantages of using Pandas is, as you will see in the code snippets below, its concise expression of computations, which often allows operations to be performed in fewer lines of code than with NumPy alone. I personally prefer to use NumPy in many cases since I get a “feel” for the inner workings of my code, though the time benefits of Pandas do make it a hard-to-resist option. For further background on Python programming, please check out this article.

3. How many accommodations are listed in Belize?

We begin by importing the tabular data in the .csv file as a DataFrame object df. The shape attribute returns a tuple representing the dimensionality of the DataFrame: the number of rows followed by the number of columns. Immediately, this tells us that there are 2558 total listings (the number of rows) in the country of Belize, each described by 106 columns. The columns attribute returns the labels of those 106 columns, and its values attribute returns them as a NumPy array, which allows the complete (instead of truncated) array of names to be printed. Without manually opening the large .csv file, we now have the names of the metrics we will explore.

import pandas as pd

df = pd.read_csv('listings.csv')

df.shape
(2558, 106)

df.columns.shape
(106,)

df.columns.values
array(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name',
       'summary', 'space', 'description', 'experiences_offered',
       'neighborhood_overview', 'notes', 'transit', 'access',
       'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
       'picture_url', 'xl_picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode',
       'market', 'smart_location', 'country_code', 'country', 'latitude',
       'longitude', 'is_location_exact', 'property_type', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type',
       'amenities', 'square_feet', 'price', 'weekly_price',
       'monthly_price', 'security_deposit', 'cleaning_fee',
       'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'number_of_reviews_ltm', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'requires_license', 'license',
       'jurisdiction_names', 'instant_bookable',
       'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture',
       'require_guest_phone_verification',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype=object)

4. What types of listings are advertised?

Let’s start isolating specific metrics in which we’re interested.

df['room_type'].count()
2558

df['room_type'].nunique()
3

df['room_type'].unique()
array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object)

df['room_type'].value_counts()
Entire home/apt    1552
Private room        965
Shared room          41
Name: room_type, dtype: int64

1552+965+41
2558

We take a closer look at the column room_type, which gives us information on the kinds of accommodations offered by the current listings in Belize. The count method returns the total number of non-NaN cells in this column. More specifically, the unique and nunique (i.e. number+unique) methods tell us that there are 3 distinct values or categories of accommodations, namely Entire home/apt, Private room, and Shared room. A more meaningful finding, however, is the relative population of each category, which is neatly printed using the value_counts method. Note that all operations are performed on df['room_type'], which is a Pandas Series. Also, while value_counts returns a Pandas Series, unique returns a NumPy array. It is always useful to keep track of the types and structures of data you are currently analyzing.
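
To put the raw counts in proportion, value_counts also accepts normalize=True. The sketch below is only an illustration: it rebuilds a small Series from the counts printed above instead of reading listings.csv, so the name room_type is a stand-in for df['room_type'].

```python
import pandas as pd

# Stand-in for df['room_type'], rebuilt from the counts printed above
room_type = pd.Series(
    ['Entire home/apt'] * 1552 + ['Private room'] * 965 + ['Shared room'] * 41
)

# normalize=True returns fractions of the total instead of raw counts
shares = room_type.value_counts(normalize=True)
print((shares * 100).round(1))
```

Multiplying by 100 turns the fractions into percentages, which are often easier to quote than counts.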

5. What is the average rating of a listing in Belize?

Next, we move on to quick statistics.

df['review_scores_rating'].min()
20.0

df['review_scores_rating'].max()
100.0

df['review_scores_rating'].mean()
93.80575945793338

Isolating the review_scores_rating column and performing minimum, maximum, and mean operations on this Pandas Series provides us with a quick overview of the quality of the accommodations in Belize as indicated by the satisfaction of reviewers. Note that each function returns a float.
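
As a rough illustration of how these statistics treat missing values, the sketch below uses a handful of made-up ratings (not values from the dataset) and assumes that a listing without reviews shows up as NaN. The describe method gathers the same quick statistics in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; NaN marks a listing without reviews
ratings = pd.Series([100.0, 95.0, np.nan, 20.0, 98.0])

print(ratings.count())     # number of non-NaN values
print(ratings.mean())      # NaN entries are skipped by default
print(ratings.describe())  # count, mean, std, min, quartiles, and max together
```

Because NaN entries are skipped, the mean reflects only the listings that have a rating.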

6. What is the average price for a listing in Belize?

We try the same quick statistics as above on the price column. Quickly, however, we run into a problem when we attempt to find the average price for a listing using the mean function.

df['price'].min()
'$0.00'

df['price'].max()
'$998.00'

df['price'].mean()
ValueError: could not convert string to float

Using the dtypes attribute (dtype is short for data type), the difference between review_scores_rating and price is easy to spot. While both are Pandas Series, they contain data of different types. The rating column is made up of floats (float64), while the price column is made up of objects (O), which in this case are strings (str).

df['review_scores_rating'].dtypes
dtype('float64')

df['price'].dtypes
dtype('O')

type(df['price'][0])
str

Upon closer inspection, we notice the quotes around the monetary values in price which indicate that the values are expressed as strings.

df['price'].unique()
array(['$75.00', '$35.00', '$95.00', '$60.00', '$140.00', '$185.00', '$49.00', '$26.00', '$85.00', '$117.00', '$88.00', '$79.00', '$89.00', '$189.00', '$200.00', '$149.00', '$130.00', '$110.00', '$72.00', '$99.00', '$148.00', '$59.00', '$250.00', '$58.00', '$239.00', '$120.00', '$125.00', '$105.00', '$150.00', '$139.00', '$119.00', '$225.00', '$135.00', '$98.00', '$1,947.00',....

Manipulating a string is different from manipulating a float. Finding the minimum and maximum of price didn’t produce an error because strings are compared character by character (lexicographically) rather than by numeric value. This is why the max method returned ‘$998.00’ even though we can easily spot a quantitatively larger value near the top of the column (‘$1,947.00’): after the shared dollar sign, the character ‘9’ sorts after ‘1’. The mean function did trigger an error, however, since averaging is not defined for strings.
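
The string comparison can be reproduced on a tiny Series. The three prices below are taken from the outputs above; the Series name prices is just for illustration.

```python
import pandas as pd

prices = pd.Series(['$75.00', '$998.00', '$1,947.00'])

# Strings compare character by character: after the shared '$',
# '9' sorts after '7' and '1', so '$998.00' is the "largest" string
print(prices.max())
print('$998.00' > '$1,947.00')
```

The second line prints True for the same reason, even though 998 is numerically smaller than 1947.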

In the code snippet below, we see that the zeroth component of the first value in price (i.e. ‘$75.00’) is the dollar sign. Therefore, we not only need to convert the values in price from strings to floats in order to perform computational operations, but we also need to remove characters such as the dollar signs and (here’s a surprise) thousands separators (for example, the comma in ‘$1,947.00’).

df['price'].unique()[0][0], df['price'].unique()[0][1]
('$', '7')

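First, we replace the dollar signs and commas with an empty string using replace. Passing regex=False makes sure the dollar sign is treated as a literal character rather than as a regular-expression anchor.

price_str = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)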

price_str.unique()
array(['75.00', '35.00', '95.00', '60.00', '140.00', '185.00', '49.00', '26.00', '85.00', '117.00', '88.00', '79.00', '89.00', '189.00', '200.00', '149.00', '130.00', '110.00', '72.00', '99.00', '148.00', '59.00', '250.00', '58.00', '239.00', '120.00', '125.00', '105.00', '150.00', '139.00', '119.00', '225.00', '135.00', '98.00', '1947.00',....

Next, we change the data type of the Series from strings to floats.

price_flt = price_str.astype(float)

price_flt.unique()
array([  75.,   35.,   95.,   60.,  140.,  185.,   49.,   26.,   85.,   117.,   88.,   79.,   89.,  189.,  200.,  149.,  130.,  110.,   72.,   99.,  148.,   59.,  250.,   58.,  239.,  120.,  125.,   105.,  150.,  139.,  119.,  225.,  135.,   98.,   1947.,....

Now we’re ready to implement the same operations as with the ratings column!

price_flt.min()
0.0

price_flt.max()
6200.0

price_flt.mean()
180.7587959343237
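
As an alternative sketch (not the method used above), the two replacements can be folded into one regular expression, and pd.to_numeric can perform the conversion. The prices below are hypothetical examples in the same format as the price column. The median is printed next to the mean because a few very expensive listings can pull the mean upwards.

```python
import pandas as pd

# Hypothetical prices in the same format as the price column
price = pd.Series(['$75.00', '$35.00', '$1,947.00', '$0.00'])

# A character class removes both symbols in one pass; regex=True is
# spelled out because the default has differed between pandas versions
price_num = pd.to_numeric(price.str.replace(r'[$,]', '', regex=True))

print(price_num.mean())    # pulled upwards by the single expensive listing
print(price_num.median())  # less sensitive to extreme values
```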

7. How many listings are advertised for each district (i.e. state) in Belize?

Let’s summarize what we’ve done until this point! The room_type column was populated with pre-assigned, closed-ended categories. The review_scores_rating column consisted of floats, which is a ready-to-use type of data for calculations. Next, we learned to manipulate data in the price column in order to perform operations we had previously used. In this final section, we take data processing to the next level using the state column, which is populated by open-ended responses.

Upon inspecting the first few entries in state (code snippet below), we notice that the data has likely been supplied by open-ended prompts or questions. Stann Creek, Belize, and Cayo are 3 of the 6 districts in the country of Belize. Belize City, however, is not. It is, in fact, a city located in the district of Belize. We also note that Toledo is labelled with the additional word “District”. How do we go about reformatting 2558 entries, especially under time constraints?

df['state']
0                        NaN
1                        NaN
2                        NaN
3                Stann Creek
4                Belize City
5                        NaN
6                     Belize
7            Toledo District
8                Stann Creek
9                       Cayo
10                       NaN

Let’s first establish the names of the states or districts of Belize. The first 6 names in the list dists (code snippet below) are all the districts of Belize. Note, however, that Ambergris Caye and Caye Caulker are such popular tourist destinations (and geographically separate from the mainland of Belize) that we will consider them as separate categories for a more realistic and detailed analysis of listing locations.

dists = ['corozal', 'orange walk', 'belize', 'cayo', 'stann creek', 'toledo', 'ambergris caye', 'caye caulker']

We start processing the data by converting all letters to lower case. That way, entries for Cayo, cayo, and CAYO are counted under the same district.

state_orig = df['state'].str.lower()

print(state_orig.value_counts())

cayo district                              329
corozal district                           307
stann creek district                       243
belize district                            192
belize                                     130
stann creek                                116
cayo                                       108
ambergris caye                             103
be                                          93
st                                          45
corozal                                     37
toledo district                             22
orange walk district                        16
ca                                          15
toledo                                      11
bz                                          10
caye caulker                                 7
ambergris caye, belize                       3
orange walk                                  3
belize city                                  3
bze                                          2
san pedro                                    2
*                                            2
cayo district, belize, central america       1
ontario                                      1
cayo, belize                                 1
san ignacio                                  1
corozal, belize                              1
ambergis caye, belize                        1
quintana roo                                 1
caribbean sea                                1
coronal district                             1
caribben sea                                 1
belize central america                       1
placencia, stann creek district, belize      1
ambergris caye, south                        1
toledo district belize ca                    1
belize c.a                                   1
Name: state, dtype: int64

Straight away, we notice a few simple fixes (e.g. excess words like ‘district’). Some issues require further investigation (e.g. the entry ‘be’ with 93 listings, which is not insignificant and should therefore not be ignored). There are also a couple of egregious errors, which could be addressed with manual cleaning (e.g. Quintana Roo is not even in the country of Belize).

To be on the safe side, we make a copy of the Series state_orig.

state_new = state_orig.copy()

We tackle instances when the name of a district is surrounded by excess information. If an entry contains a district name from the list dists (defined above), we want to replace the entire entry with just the name of the district. In other words, we are trimming excess words or characters from these entries.

The ‘.*’ matches any run of characters. All entries with characters before and/or after the district name are rewritten with just the district name. For example, the entry ‘toledo district’ will be reduced to ‘toledo’. Note that the names are tried in the order they appear in dists, so an entry naming two of them (such as ‘toledo district belize ca’) is assigned to whichever comes first in the list, here ‘belize’.

def extract_district_name(dists, state_new):
    # Replace any entry containing a district name with that name alone
    for dist_name in dists:
        state_new = state_new.str.replace('.*' + dist_name + '.*', dist_name, regex=True)
    return state_new

Instantly, we have reduced the number of unique entries and added listings to the 8 categories defined for districts. For example, the number of listings determined to be in Cayo has risen from 329 to 437.

state_new = extract_district_name(dists, state_new)
state_new.value_counts()

cayo                437
stann creek         359
corozal             345
belize              335
ambergris caye      104
be                   93
st                   45
toledo               33
orange walk          19
ca                   15
bz                   10
caye caulker          7
*                     2
san pedro             2
bze                   2
san ignacio           1
quintana roo          1
caribben sea          1
caribbean sea         1
coronal district      1
ontario               1
Name: state, dtype: int64
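
The trimming step can be tried on a toy Series to see what it does and does not catch. The sketch below repeats the function so that it runs on its own; the four sample entries are chosen from the value counts above, and the name sample is only for illustration.

```python
import pandas as pd

dists = ['corozal', 'orange walk', 'belize', 'cayo', 'stann creek',
         'toledo', 'ambergris caye', 'caye caulker']

def extract_district_name(dists, state):
    # Same logic as the function above, with regex=True spelled out
    for dist_name in dists:
        state = state.str.replace('.*' + dist_name + '.*', dist_name, regex=True)
    return state

sample = pd.Series(['toledo district', 'corozal, belize',
                    'toledo district belize ca', 'be'])
print(extract_district_name(dists, sample).tolist())
# ['toledo', 'corozal', 'belize', 'be']
```

The third entry ends up as ‘belize’ because ‘belize’ is tried before ‘toledo’, and the abbreviation ‘be’ is left untouched because it does not contain a full district name.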

We still have the issue of many listings falling into nondescript categories like be, st, ca, and bze. There must be more information we can gather from other areas of the dataset that could narrow down the location of these listings. Perhaps users incorporated the name of the district into their entries for city.

In the snippet below, we perform two actions. First, we ask if entries in state_new are in the list of defined districts dists. If a certain entry is not in the list of districts, the answer is False and the second action is performed.

The second action is that the entry in state is reassigned as the corresponding entry for city. As a minor detail, this entry is dropped to lower case as previously done.

state_new.loc[state_new.isin(dists) == False] = df['city'].str.lower()

Example: The entries be, st, ca, and bze in state return False because they are not in the list dists. Consequently, these entries are replaced with their corresponding entries in city.
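
The same fallback can be sketched on a few made-up rows. This version uses Series.where, which is an equivalent formulation rather than the exact line above, and the names state and city stand in for state_new and df['city'].

```python
import pandas as pd

dists = ['cayo', 'stann creek']

# Hypothetical rows: cleaned state entries and their matching city entries
state = pd.Series(['cayo', 'st', 'be', None])
city = pd.Series(['San Ignacio', 'Placencia', 'Belmopan', 'Hopkins'])

# Keep recognised districts; otherwise fall back to the lower-cased city
state = state.where(state.isin(dists), city.str.lower())
print(state.tolist())
# ['cayo', 'placencia', 'belmopan', 'hopkins']
```

Note that missing state entries are also not in dists, so they are filled from city as well.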

state_new.value_counts()

cayo                                                  437
stann creek                                           359
corozal                                               346
belize                                                338
san pedro                                             253
caye caulker                                          183
ambergris caye                                        129
placencia                                              82
san ignacio                                            69
toledo                                                 33
belize city                                            28
san pedro, ambergris caye                              25
orange walk                                            22
placencia, stann creek district, belize                 9
cayo belize                                             9
hopkins, stann creek                                    8
placencia village, stann creek district belize, ca      7

The list above is only a snippet of the first few unique entries. We notice that our strategy has been successful. Entries like be, st, ca, and bze have been replaced by more descriptive entries. We also note that the new entries oftentimes contain the names of the districts. Therefore, we can execute the function extract_district_name again!

state_new = extract_district_name(dists, state_new)
state_new.value_counts()

Notice that, in executing this function once more, most entries fall inside the defined categories for districts. Still there are a few highly populated strays with city names like placencia and san ignacio.

cayo                               440
belize                             404
stann creek                        368
corozal                            346
san pedro                          253
caye caulker                       188
ambergris caye                     158
placencia                           82
san ignacio                         69
toledo                              33
orange walk                         22
belmopan                             6
dangriga                             5
san pedro town                       5
hopkins                              3

We can address the few strays by manually replacing the city names with the correct district names. Clearly, a more automated method would be necessary if there were more incorrect entries. Also note that this method can only be done by someone with in-depth knowledge of the content. With more time, perhaps, one could produce a script that gathers this knowledge from an existing reference or online source. For the purpose of this exercise, however, this method proves to be sufficient.

state_new = state_new.str.replace('belmopan.*', 'cayo', regex=True)
state_new = state_new.str.replace('san pedro.*', 'ambergris caye', regex=True)
state_new = state_new.str.replace('coronal district.*', 'corozal', regex=True)
state_new = state_new.str.replace('san ignacio.*', 'cayo', regex=True)
state_new = state_new.str.replace('placencia.*', 'stann creek', regex=True)
state_new = state_new.str.replace('dangriga.*', 'stann creek', regex=True)
state_new = state_new.str.replace('hopkins.*', 'stann creek', regex=True)
state_new.value_counts()

Manually addressing the few remaining non-district entries has improved the relative counts. For example, the listings for Cayo have increased from 440 to 515. The remaining stray entries have a maximum of 2 listings each.

cayo                               515
stann creek                        464
ambergris caye                     416
belize                             404
corozal                            346
caye caulker                       188
toledo                              33
orange walk                         22
seine bight village                  2
long caye                            2
punta gorda                          2
sittee river village                 2
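
If the list of towns grew, the manual replacements could be driven by a lookup table instead of one line per town. The following is a minimal sketch under that assumption: the dictionary town_to_district and the helper towns_to_districts are hypothetical names, and the town-to-district pairs are the ones used in the manual replacements above.

```python
import pandas as pd

# Town-to-district lookup taken from the manual replacements above
town_to_district = {
    'belmopan': 'cayo',
    'san ignacio': 'cayo',
    'san pedro': 'ambergris caye',
    'placencia': 'stann creek',
    'dangriga': 'stann creek',
    'hopkins': 'stann creek',
}

def towns_to_districts(state, lookup):
    # Apply the same 'town.*' pattern as the manual replacements
    for town, district in lookup.items():
        state = state.str.replace(town + '.*', district, regex=True)
    return state

sample = pd.Series(['san pedro town', 'placencia', 'cayo', 'punta gorda'])
print(towns_to_districts(sample, town_to_district).tolist())
# ['ambergris caye', 'stann creek', 'cayo', 'punta gorda']
```

Towns missing from the lookup, such as ‘punta gorda’ here, pass through unchanged, so the table still has to be maintained by someone who knows the geography.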

For the final count, we categorize all entries that do not fall within a pre-defined district category as ‘other’.

state_fin = state_new.copy()

state_fin.loc[(state_fin.isin(dists) == False) & (pd.isna(state_fin) == False)] = 'other'

state_fin.value_counts()

The final count shows the relative population of listings in all pre-defined districts. Of course, more time could be dedicated to improving the accuracy of this exercise, but we are now able to derive meaning from our data — at a glance!

cayo              515
stann creek       464
ambergris caye    416
belize            404
corozal           346
caye caulker      188
toledo             33
other              23
orange walk        22
Name: state, dtype: int64
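
As a quick follow-up, the final counts can be turned into percentage shares. The sketch below copies the counts from the output above into a new Series (final_counts is an illustrative name) rather than recomputing them from the dataset.

```python
import pandas as pd

# Final counts copied from the output above
final_counts = pd.Series({
    'cayo': 515, 'stann creek': 464, 'ambergris caye': 416, 'belize': 404,
    'corozal': 346, 'caye caulker': 188, 'toledo': 33, 'other': 23,
    'orange walk': 22,
})

total = final_counts.sum()   # listings with a usable location entry
shares = (final_counts / total * 100).round(1)
print(total)   # 2411
print(shares)
```

The total is lower than the 2558 listings in the dataset because value_counts skips entries that are still NaN after cleaning.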

Key findings

There are 2558 Airbnb listings in Belize.
Airbnb advertises 3 possible options for room types, namely ‘Entire home/apartment’, ‘Private room’, and ‘Shared room’. All types are available in Belize with the ‘Entire home/apartment’ category garnering the most listings.
Assuming review ratings are scored out of 100 points, the average score in Belize is ~94%, which is equivalent to 4.7/5 stars!
The average price of a listing in Belize is ~$181, with the most expensive listing being $6200. This analysis assumes the variable used for this calculation is specific to a one-night’s stay since other pricing variables are provided for weekly and monthly stays.
Given a few approximations, the district with the most Airbnb listings in Belize is Cayo. This is followed, in descending order, by Stann Creek, Ambergris Caye, Belize, Corozal, Caye Caulker, Toledo, and Orange Walk.

Comments and considerations

The number of Airbnb listings is not indicative of the number of active hosts in Belize since property owners or managers could post multiple listings.
The few, more expensive options likely skew the national mean. For example, the maximum price found is a shocking $6200 per night. It is important to note that extravagant prices likely originate from all-inclusive, luxury resorts with experiences especially tailored to foreign visitors. Therefore, the result of an average price of $181 per night does not necessarily mean that local Belizeans are receiving this much from renting their personal property.
It is, therefore, important to ask — Where are the most expensive listings (in the cayes and popular tourism spots)? Are the most expensive listings part of larger businesses (hotels, resorts)? What is the average price for listings that are personal property of the locals? What percentage of the listings are large businesses vs personal property? Where are listings being booked more often; does this align with the most popular tourist attractions? How do prices change depending on the tourist season (i.e. “low” and “high” seasons)?
The questions raised above are of particular importance in light of recent developments regarding taxation of Airbnb listings by the country’s tourism board. Any regulation implemented should consider the differences in property type and ownership, booking capacity, and popularity of the listing locations. For example, a conversation could be had regarding taxation of small private properties in Orange Walk versus the luxury resorts in Stann Creek.
Finally, regarding the methods for analyzing districts, it is important to note that some hosts may (correctly) consider Ambergris Caye and Caye Caulker to be in the Belize district. One should also be careful to consider any possible confusion given that one of the districts has the same name as the country. In general, further care should be taken in avoiding any double counting. For the purpose of this tutorial, however, the methods employed were successful in producing a quick overview of the data.
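
The first consideration above, listings versus hosts, could be checked with the host_id column listed in section 3. The sketch below uses a made-up DataFrame named listings, not the real data, to show the idea.

```python
import pandas as pd

# Hypothetical listings: three belong to the same host
listings = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'host_id': [10, 10, 10, 20, 30],
})

print(len(listings))                       # number of listings
print(listings['host_id'].nunique())       # number of distinct hosts
print(listings['host_id'].value_counts())  # listings per host
```

Applied to df, the same calls would show how many distinct hosts stand behind the 2558 listings.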

Data at a glance with Pandas (Part 1: Numbers) — Airbnb in Belize was originally published in Olas on Medium, where people are continuing the conversation by highlighting and responding to this story.

8 tips to successfully write a thesis for graduate school — and live

Idalia Machuca — Wed, 10 Jul 2019 02:23:26 GMT

Photo by JESHOOTS.COM on Unsplash

Hi friends!

Writing a thesis or dissertation for your graduate degree is a daunting task. After years of conducting independent research, running experiments, compiling data, and staying up to date with the most recent advancements in your field, it is easy to feel overwhelmed by that final (and arguably most significant) stage. Remember, everyone’s thesis-writing experience is unique. The expectations for your research and thesis vary depending on your topic, field, supervisor, department, university, and even academic “generation”. That said, this is a list of tips to keep in mind as you navigate through the process.

1. Be clear (and firm) about the deadlines you’d like to keep

As an independent researcher, it is up to you to manage your time and prioritize the activities and tasks that will not only nurture your academic and professional development, but also get you through the door. This is quite difficult since graduate students are just that — students! You’re still learning to design a major research project, to predict possible challenges and setbacks, and to estimate how long it will take to pass each milestone of your project. Remember, by the end of your academic journey, you’ll be an expert!

Every university has its own set of graduation requirements and deadlines. Study these carefully, and discuss them with your graduate supervisor. If you have your eye set on a specific degree granting date, schedule your milestones starting from that date, moving backwards. Be realistic about how long each stage of your research and writing will take, and allow for unanticipated difficulties. Keep a calendar or checklist with important deadlines on the cover of your planner, over your work desk, or on the refrigerator door. This will help you remember that everything you do that day, week, or month should be in service of your ultimate goal, ideally within your planned timeline.

2. Find your writing space

The kind of creativity needed for conducting research can be quite different to that which supports writing (even if the writing is technical). For example, while your most productive days of research might be in the lab or library, you may find that writing could be more enjoyable for you at a coffee shop or park. Everyone has unique preferences for the environment that is most conducive to their progress. In general, however, try to aim for a simple, clean, and comfortable space with limited distractions.

Find your writing time, too. Perhaps you prefer to answer emails and do technical work in the morning and dedicate the entire afternoon to writing. You might also be a night owl who is most creative at night when the ruckus of the day has subsided. Some folks prefer to write for extended periods of time, while others prefer to work in short bursts using the Pomodoro Technique.

Cherish your favourite places and times, but remember that you are capable of producing excellent work even if you are not in your ideal work environment. For example, if you have an important deadline coming up, you may choose to work during your daily 15-minute bus ride from campus to your apartment. Don’t despair! Sometimes, it is these short, unencumbered moments that spark the greatest ideas!

3. Formulate a writing strategy that works for you

This is tricky! How do you tackle the intimidating, blank page in front of you? There are 2 parts to your writing strategy.

First, plan the order in which you will write your thesis chapters. Some students write their conclusions chapter first and structure the entire thesis based on it. Others prefer to start with the methodologies chapter because, in many cases, it is the least intensive. I personally tackled the introduction chapter first. This strategy helped me feel confident about the scientific contributions that my research was offering to my field. My introduction also set the tone for the remainder of the thesis. This strategy worked for me and it helped to garner compliments on my finished thesis.

Second, what editing process works best for you? Many find it easier to write a very rough draft quickly with free-flowing ideas and then refine their work through multiple rounds of self-review. I personally prefer to painstakingly work towards a polished draft from the start and bypass future self-editing iterations. Remember that your writing strategy depends not only on your personal preference, but also on the proximity of your deadlines and the preferences of your thesis committee.

4. Spend time outlining your project’s key findings or products

Condensing years of research into a single document is difficult! At the end of their graduate school career, many students have already published a couple of articles in peer-reviewed journals. Many students haven’t. Sometimes, the university requires a thesis to be written as a completely separate entity from any published works based on the same project. In any case, one of the most useful techniques for finding a path through your mind’s labyrinth is compiling a document (only a few pages long) with the key findings of your work. For example, what helped me was outlining approximately 7 key points (in bullet format), each with a brief 1-paragraph description, and pairing these with the figures and tables that would serve as highlights for my thesis.

5. Hold a meeting with your thesis committee

You’re on the last leg of the marathon. You are fatigued. Your vision is blurred. And, at times, it seems as if the finish line keeps moving farther back. You’re determined to finish.

Throughout your project, it’s important to hold frequent meetings with your thesis committee to understand their expectations for your work and to gain perspective from their extensive experience as leaders in their fields. In this stage of your project, they are both your cheerleaders and your ultimate judges. Do they have any concerns about your abilities as an independent researcher? Do they have any lingering questions regarding your work that they would like addressed before you submit and defend your thesis? Do they approve of your graduation timeline? What final improvements do they recommend to raise your work from an A to an A+? The answers to these questions will give you the stamina and confidence for that final push.

6. Be open to feedback

Feedback is an extension of your writing strategy. Whether you’ve written your first chapter or the entire thesis, ask for feedback! Ask members of your research group or writing group, great writers who are experts in other areas of research, even your scientifically literate or academic neighbour! You could also take a writing break for a few days and return to your work with fresh eyes. When you’re ready, submit your draft to your graduate supervisor and start the thesis-editing process! Keep in mind that multiple rounds of revisions by your graduate supervisor, thesis committee, and external reviewers could take months!

7. Review your university’s requirements

Your thesis is finished! You’ve passed your graduate defence! You have a signed and stamped form recommending your degree to be approved by the university. But wait, there is one final step. Every university has its own set of requirements for graduate theses and dissertations. Visit your university library for a drop-in or scheduled consultation for thesis formatting or citation management help. Carefully run through a requirements checklist, and ensure that you’ve used a pre-approved template for your document (if one exists).

8. Take care of yourself

One of the greatest challenges academics are facing today is mental health. Throughout the thesis-writing and editing process, always remember one thing — you are only human! Your graduate supervisor is only human. The most accomplished member of your research group is also only human.

Nurture a genuine and caring support system. Take pride in your accomplishments and praise yourself for your hard work. Share your triumphs with the elders of your family or with childhood friends. Treat yourself to simple joys after tough days, no matter if you’ve written an entire thesis chapter or if you’ve finally solved a small, yet persistent issue. Find a corner of the day to do what makes you happy — running, singing, going to the gym, praying, chatting with your parents, or watching that episode of The Office for the umpteenth time.

You are more than your thesis. You’ve already made a great contribution to your field of expertise. Now, it’s only a matter of telling everyone about it!

Godspeed!

8 tips to successfully write a thesis for graduate school — and live was originally published in Olas on Medium, where people are continuing the conversation by highlighting and responding to this story.

Introduction to object-oriented data visualization with Python and Matplotlib

Idalia Machuca — Mon, 08 Jul 2019 22:29:23 GMT

Photo by rawpixel on Unsplash

Hi friends!

With Python gaining popularity, you might be curious about all the hype surrounding it. One of Python’s greatest strengths is that it is an object-oriented programming language. In this article, we take a quick look at a data visualization example to help us build an intuition for what “object-oriented” means (rather, how it feels) while introducing a few basic terms along the way. Let’s get started!

1. Set up your Jupyter Notebook

import matplotlib.pyplot as plt
import numpy as np 
from numpy import genfromtxt
%matplotlib inline

There are many great beginner guides for using Jupyter Notebook, for example this one by Dataquest.
Matplotlib is a 2D plotting library for Python programming that can be imported into a Jupyter Notebook. Here, we use the module (collection of functions) Pyplot and give it the alias plt for brevity.
NumPy is a package used for scientific computing with Python. We will only use NumPy very briefly to create an array of numbers.
Data can be stored in different file formats (CSV, text, NetCDF, etc). In this example, we will use data from a .csv file. Since we are only interested in the numeric data in the file, we can use the genfromtxt function.
When using Jupyter Notebook, %matplotlib inline (known as a line-oriented magic function) allows graphics to be displayed as static images in the notebook just below the cell that produces them.

2. Download and load the data you’ll use for the visualization

The data used in this example was downloaded from the NASA Goddard Institute for Space Studies website (citation below):

GISTEMP Team, 2018: GISS Surface Temperature Analysis (GISTEMP). NASA Goddard Institute for Space Studies. Dataset accessed 2018–09–09 at https://data.giss.nasa.gov/gistemp/.
Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change, Rev. Geophys., 48, RG4004, doi:10.1029/2010RG000345.

The dataset contains monthly averages for global land and sea temperature anomalies from 1880–2018.

We see the dataset (modified from the original download to run from years 1880–2017) has 139 rows (a header row plus one row per year) and 13 columns (the year plus 12 monthly temperature anomalies). The first 6 rows look like:

# load data using its file name
filename = './GLB.Ts+dSST_modified.csv'
data = genfromtxt(filename, delimiter=',')

# assign data to variables
years = data[1:,0]
jan = data[1:,1]
aug = data[1:,8]

We load the data and extract the columns for Year (col 0), Jan (col 1), and Aug (col 8), starting at the second row (row 1) to exclude the header row (row 0).

Brief reminder for those still new to Python: we start counting from 0.
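To see why the slices start at row 1, here is a small sketch that runs genfromtxt on a tiny in-memory stand-in for the file. The three-line CSV text and its values are made up for illustration; only the layout (a header row, then Year, Jan, ... columns) follows the description above.

```python
import io

import numpy as np

# Hypothetical three-line file standing in for the real CSV (values are made up).
csv_text = "Year,Jan,Feb\n1880,-0.30,-0.21\n1881,-0.10,-0.14\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# The header strings cannot be parsed as numbers, so row 0 comes back as NaN.
header_is_nan = bool(np.isnan(data[0]).all())

# Slicing from row 1 onward drops that header row; counting starts at 0.
years = data[1:, 0]   # column 0: Year
jan = data[1:, 1]     # column 1: Jan

print(data.shape)
print(header_is_nan)
print(years, jan)
```

If the header row were left in, the NaN values would travel along into the arrays being plotted, which is likely why the slices above skip it.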

3. Construct a simple line plot — a time series of temperature anomalies for January and August from years 1880 to 2017

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))

matplotlib.pyplot.subplots is a function which creates and returns a Figure object and either an Axes object or an array of Axes objects.

The Axes object will contain the plot or graph you choose to create. The Figure object is the frame that will hold all your Axes objects.

In this case, we only want to build one Axes object (you can also call this a subplot) inside the Figure object, so we say there is only 1 row and 1 column. We give these objects the names ax and fig, respectively. Giving the Figure and Axes names allows us to add to, modify, and customize these objects individually.

In the same code cell, we continue:

ax.plot(years, jan, label='January', c='steelblue', ls='-', lw=3, marker='.', ms=7, mfc='#50514F', zorder=2)

ax.plot(years, aug, label='August', c='#F25F5C', ls='--', lw=3, zorder=2)

leg = ax.legend(loc=2, ncol=2, numpoints=1, fontsize=15, shadow=True)

Axes.plot (called above as ax.plot) is a plotting function in Matplotlib. This function is called a method because it belongs to an object (i.e. ax). Here, Axes.plot needs just two arguments, the x (i.e. years) and y (i.e. temperature anomaly for a particular month) values. It also allows for optional plotting properties, such as a line label (label), line colour (c), line style (ls), line width (lw), marker style (marker), marker size (ms), marker face colour (mfc), and plotting order (zorder). The line labels entered as properties in the ax.plot calls are automatically used by Axes.legend. The legend object it returns is given the name leg and allows for its own properties, such as location, number of columns, number of marker points, font size, and shadowing.
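As a standalone aside (separate from the notebook cell being built), the sketch below suggests that the short property names are simply aliases for longer keyword names, and that the labels passed to ax.plot are what the legend picks up. The data points are arbitrary, and the Agg backend line is an assumption so the snippet can run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Short aliases, in the style used in this article ...
(short,) = ax.plot([0, 1], [0, 1], label='short', c='steelblue', ls='--',
                   lw=3, ms=7, mfc='#50514F')
# ... and the equivalent long-form keyword names.
(long_,) = ax.plot([0, 1], [1, 0], label='long', color='steelblue', linestyle='--',
                   linewidth=3, markersize=7, markerfacecolor='#50514F')

# Both lines should end up with identical styling.
same_style = (
    short.get_color() == long_.get_color()
    and short.get_linestyle() == long_.get_linestyle()
    and short.get_linewidth() == long_.get_linewidth()
    and short.get_markersize() == long_.get_markersize()
    and short.get_markerfacecolor() == long_.get_markerfacecolor()
)

# These are the handles and labels that ax.legend() would use by default.
handles, labels = ax.get_legend_handles_labels()
print(same_style, labels)
```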

If you’ve used languages like MATLAB, you may be used to a “state-machine approach”, which Matplotlib also offers through expressions such as plt.plot. In this case, the plotting function is applied to the active figure or axes. With ax.plot, however, you specify the object on which the action is performed, which provides greater flexibility in your workflow.
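The difference between the two styles can be sketched with a figure that holds two Axes objects. This is a standalone aside with arbitrary data; the names ax_left and ax_right are hypothetical, and the Agg backend line is an assumption so the snippet can run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2)

# State-machine style: plt.plot draws on whichever Axes is currently active.
plt.sca(ax_right)               # make the right-hand Axes the active one
plt.plot([0, 1], [0, 1])        # this line lands on ax_right

# Object-oriented style: the target Axes is named explicitly in each call.
ax_left.plot([0, 1], [1, 0])
ax_left.plot([0, 1], [0.5, 0.5])

print(len(ax_left.lines), len(ax_right.lines))
```

With the explicit style there is no need to keep track of which Axes happens to be active, which tends to matter more as the number of subplots grows.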

In the same cell, we use the axhline method to add a horizontal line at zero across the object ax. Note that the lines are drawn in the order specified by zorder (lowest first), not in the order in which the plotting calls appear in the code cell. So, the two temperature anomaly lines (for January and August) will be drawn over the zero line.

ax.axhline(0, lw=2, c='dimgray', ls='-', alpha=0.5, zorder=1)
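As a standalone aside (separate from the notebook cell being built), the sketch below compares the order in which lines are added to an Axes with the order implied by their zorder values. The labels and data are hypothetical, and sorting by zorder is only an approximation of what Matplotlib does internally when it renders the figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Code order: the two "top" lines are added first, the baseline last.
(top_a,) = ax.plot([0, 1], [0, 1], zorder=2, label='top_a')
(top_b,) = ax.plot([0, 1], [1, 0], zorder=2, label='top_b')
baseline = ax.axhline(0.5, zorder=1, label='baseline')

# Order in which the lines were added to the Axes.
code_order = [line.get_label() for line in ax.lines]

# Lower zorder values are drawn first, so they end up underneath.
# sorted() is stable, so lines with equal zorder keep their insertion order.
draw_order = [line.get_label()
              for line in sorted(ax.lines, key=lambda line: line.get_zorder())]

print(code_order)
print(draw_order)
```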

In the same code cell, we continue to customize the objects ax, leg, and fig using the functions set_ylabel, set_xlabel, set_title, set_xticklabels, set_yticklabels, set_xlim, set_ylim, and grid.

ax.set_ylabel('Temperature\nAnomaly [$^o$C]', fontsize=20)
ax.set_xlabel('Year', fontsize=20)

ax.set_title('Yearly Temperature Anomaly Means for January and August', fontsize=24, y=1.05)

xticklabels = np.arange(1880, 2020, 10) 
# plot extends to year 2020 for extra white space on the right side
ax.xaxis.set_ticks(xticklabels)
ax.set_xticklabels(xticklabels, fontsize=15, rotation=45)

yticklabels = np.arange(-1.5, 2, 0.5)
ax.yaxis.set_ticks(yticklabels)
ax.set_yticklabels(yticklabels, fontsize=15)

ax.set_xlim([1880, 2020])
ax.set_ylim([-1.5, 1.5])

ax.grid()

leg.set_title(title='Month', prop={'size': 15})

fig.patch.set_facecolor('lightsteelblue')

You may have also noticed that objects can be nested inside one another in the code above. For example, the object ax has a specific attribute xaxis, which is itself an object. Its tick locations can be managed with the more general method set_ticks, which the x and y axis objects both inherit from a shared parent class. Again, the specificity allowed by object-oriented programming offers more flexible customization.
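A short sketch can make this relationship between ax, xaxis, and set_ticks easier to see. It is a standalone aside that inspects Matplotlib's own classes (matplotlib.axis.Axis and matplotlib.axis.XAxis); the variable names are hypothetical, and the Agg backend line is an assumption so the snippet can run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
from matplotlib.axis import Axis, XAxis

fig, ax = plt.subplots()

# The Axes object holds an XAxis object as an attribute (one object inside another).
holds_xaxis = isinstance(ax.xaxis, XAxis)

# XAxis is a subclass of the more general Axis class ...
is_subclass = issubclass(XAxis, Axis)

# ... and set_ticks is defined on Axis, then reused unchanged by XAxis.
shares_set_ticks = XAxis.set_ticks is Axis.set_ticks

print(holds_xaxis, is_subclass, shares_set_ticks)
```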

The last step is to use a function that saves the figure.

fig.savefig('./jan_aug_tempanom.png', dpi=150, bbox_inches='tight', format='png', facecolor=fig.get_facecolor())

Executing this code cell produces the following image:

Summary and challenges

We’ve explored the flexibility with which we can customize a simple line plot by taking advantage of object-oriented programming with Python using the Matplotlib library. If you’re new to data visualization in Python, I challenge you to continue testing the various properties for line plots, other types of visualizations for linear data, or even different figure layouts (such as separate subplots for each month). There are endless possibilities for creativity!
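As one possible starting point for the layout challenge, here is a minimal sketch that gives each month its own subplot. It uses randomly generated stand-in values (NOT the GISTEMP data), and the Agg backend line is an assumption so the snippet can run outside a notebook. With more than one row, plt.subplots returns an array of Axes objects, so each one can be customized in a loop.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data for illustration only (not the real anomalies).
years = np.arange(1880, 2018)
rng = np.random.default_rng(0)
series = {
    'January': rng.normal(0, 0.3, years.size),
    'August': rng.normal(0, 0.3, years.size),
}

# Two rows, one column: axes is an array holding two Axes objects.
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(20, 8), sharex=True)

for ax, (month, values) in zip(axes, series.items()):
    ax.plot(years, values, lw=2, zorder=2)
    ax.axhline(0, c='dimgray', alpha=0.5, zorder=1)  # zero line underneath
    ax.set_title(month)
    ax.set_ylabel('Temperature\nAnomaly [$^o$C]')

# Only the bottom subplot needs an x label because the x axis is shared.
axes[-1].set_xlabel('Year')

print(axes.shape, [ax.get_title() for ax in axes])
```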

Introduction to object-oriented data visualization with Python and Matplotlib was originally published in Olas on Medium, where people are continuing the conversation by highlighting and responding to this story.