<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Olas - Medium]]></title>
        <description><![CDATA[Academics, the environment, and data science from the perspective of a scientist in industry. - Medium]]></description>
        <link>https://medium.com/idalia-machuca?source=rss----703372415320---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Olas - Medium</title>
            <link>https://medium.com/idalia-machuca?source=rss----703372415320---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 18 May 2026 06:29:48 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/idalia-machuca" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Preparing for a Graduate Thesis Defence]]></title>
            <link>https://medium.com/idalia-machuca/preparing-for-a-graduate-thesis-defence-d7ae4198b876?source=rss----703372415320---4</link>
            <guid isPermaLink="false">https://medium.com/p/d7ae4198b876</guid>
            <category><![CDATA[public-speaking]]></category>
            <category><![CDATA[presentations]]></category>
            <category><![CDATA[conference]]></category>
            <category><![CDATA[graduate-school]]></category>
            <category><![CDATA[thesis]]></category>
            <dc:creator><![CDATA[Idalia Machuca]]></dc:creator>
            <pubDate>Fri, 27 Sep 2019 02:41:32 GMT</pubDate>
            <atom:updated>2019-09-27T02:41:32.577Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vi4sv5aT9tEakPcBrHPAgA.jpeg" /></figure><p>At a few months post-graduation, I look back at my “defence day” as one of the proudest and happiest days of my life. Shocking, I know! Had you asked me how I was handling my impending defence while I was still preparing for it, I would have tossed around the words — nervousness, anxiety, uncertainty, dread! It takes enormous levels of preparation and courage to prepare for and get through a defence. Let’s discuss <strong>6 tips</strong> to keep in mind while getting ready for the big day!</p><h3>1. You are the expert</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lRAAu4ztCpoY0Iim9gJctg.jpeg" /></figure><p>After a series of edits prescribed by your peers, supervisor, and committee, you’ve probably read and reread your thesis to the point of memorization. No matter how complex or technical, the work to which you’ve dedicated years can begin to sound uninspiring. You know this is imposter syndrome, but it’s shaking your confidence nonetheless!</p><p>I’m here to remind you that — <strong>you are the expert!</strong> You’ve spent an extraordinary amount of time and effort working on your project. Take ownership of this great accomplishment. No one else has spent as much time as you have on understanding the nuances related to your project and field of expertise. Defending your choices and findings is, indeed, intimidating (and nauseating), but this final step is also one of celebration for your dedication to the work and contribution to your field.</p><h3>2. Review the thesis, again</h3><p>On a more logistical note, reviewing the document that you will be presenting is of highest priority. You may choose to read the full work or focus your energy on reviewing the more complex and detailed sections. 
Stepping back and viewing the document in its entirety can also be quite invigorating as it is a very concrete, visual representation of your efforts.</p><h3>3. Practice answering difficult questions</h3><p>As you review the thesis, compile a list of questions that might arise during your defence. It might help to carefully consider the specific areas that are likely to be supported or opposed by individual members of your committee based on their input during the thesis editing stage or from past meetings.</p><p>One of your most valuable resources, especially at this stage of your academic career, is your research group or cohort! A useful exercise is to conduct a “mock defence” with a “mock committee” made up of your own peers. This type of rehearsal allows you to practice answering challenging and unanticipated questions in a low-risk, though still realistically stressful, setting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NsEUgMA1Sy8gkTA-SDQ6Rg.jpeg" /></figure><h3>4. Prepare required and supplementary materials</h3><p>How is a thesis defence conducted in your department? What is the goal of a defence? What are the outcomes of a defence and who determines these? What is expected of you? Schedule a meeting with your supervisor to clarify the expectations for your defence. In my department, for example, a defence is expected to begin with a ~20 minute presentation given by the student, and it is followed by 2–3 rounds of questions, each round allowing up to 10 minutes per committee member.</p><p>As additional preparation, also consider collecting key figures or extracts from your thesis and relevant literature and organizing these as a separate document to be presented if/when a question requires auxiliary visuals.</p><p>Finally, speak with members of the team who have attended past defences to gauge the culture regarding expected formalities and customs (eg. 
committee introductions, dress code, coffee or snacks, restrictions for attendees — significant others or family members).</p><h3>5. Find your comfort</h3><p>It’s human to feel uneasy about facing a challenging task. Everyone will express confidence and insecurity for different elements of their graduate career. Throughout your graduate program, however, you have probably discovered your comforts — items, activities, personal rituals that soothe you and help you regain a positive outlook on difficult situations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6rCVIapmbH8b22-e1BgQww.jpeg" /></figure><p>On the days leading up to the defence, you could have dinner with friends, go for a long walk through your favourite park, catch up on a television series, or pick up a neglected personal project.</p><p>On the day of your defence, do what feels right for you. Have a good night’s sleep, spend extra time getting ready in the morning, bring your lucky pencil into the defence, or call your relatives on Skype. Looking back now, I remember all the small details that made my day more comfortable and, therefore, much more pleasant.</p><h3>6. Celebrate</h3><p>Celebrating can seem far off. After all, you first need to conquer a trial! And, this is absolutely true! What is also true is that the defence, no matter the outcome, is a major accomplishment following years of personal sacrifice and personal growth. This is a very real cause for celebration!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aYZtdrAO3GtYsNaIRDVFKQ.jpeg" /></figure><p>Today, I remember the day of my defence for being surrounded by my favourite people! I invited my closest friends and colleagues to the local community hall (the Royal Canadian Legion), and we had an unforgettable day filled with dancing, karaoke, surprise cakes, surprise gifts, and heart-to-hearts! 
We even had party hats (which graduate students and faculty were not just willing but actually excited to wear)! By the end of the night, we had closed down the town, taking away memories to last a lifetime.</p><p>While this was perfect for me, everyone has unique preferences and visions for a fun-filled day. One of the greatest benefits of planning a celebration, though, is envisioning a future beyond this difficult task and looking forward to treating yourself with what brings you joy!</p><h3>Final thoughts</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RWG7ZPcygZDKUbA0mFsRng.jpeg" /></figure><p>For those of you who are at the defence stage of your graduate program, congratulations! You’re at the very end of the tunnel, and you’ve displayed great resilience and determination to be here. This is the final push! You’ve done something truly great for your field; now is the time to show everyone! Take in the view!</p><p>For those of you who are knee-deep in research, readings, and writing, keep pushing! Remind yourself of your inspirations and goals. You can do this! And, by the end of it all, you will have learned so much about your field and yourself. Godspeed, my friends!</p><hr><p><a href="https://medium.com/idalia-machuca/preparing-for-a-graduate-thesis-defence-d7ae4198b876">Preparing for a Graduate Thesis Defence</a> was originally published in <a href="https://medium.com/idalia-machuca">Olas</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data at a glance with Pandas (Part 1: Numbers) — Airbnb in Belize]]></title>
            <link>https://medium.com/idalia-machuca/data-at-a-glance-with-pandas-part-1-numbers-airbnb-in-belize-682c1ad32b01?source=rss----703372415320---4</link>
            <guid isPermaLink="false">https://medium.com/p/682c1ad32b01</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[vacation]]></category>
            <category><![CDATA[airbnb]]></category>
            <dc:creator><![CDATA[Idalia Machuca]]></dc:creator>
            <pubDate>Sat, 13 Jul 2019 00:04:20 GMT</pubDate>
            <atom:updated>2019-07-13T00:04:20.323Z</atom:updated>
            <content:encoded><![CDATA[<h3>Data at a glance with Pandas (Part 1: Numbers) — Airbnb in Belize</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OBG9ajwKILnjvGEtj0_StQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@thebrownspy?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Spencer Watson</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p><strong>Hi friends!</strong></p><p>With the growing influence of Airbnb on the tourism and hospitality industry around the globe, I was curious about its presence in Belize. With sprawling jungles, coastal landscapes, and a rich, multicultural history, Belize is a hidden treasure for adventure-seekers. Though the hospitality sector in Belize still appears to be dominated by hotels and resorts, there is now chatter of an increasing number of local listings appearing on Airbnb.</p><p><strong>This article is primarily a data analysis tutorial</strong>, though I take the liberty (at the end of the article) of commenting on how these findings influence the current conversation in Belize.</p><p>That said, in this exercise, we review a few basic functions for quickly extracting meaning from data and practice troubleshooting techniques for when data manipulation is not as straightforward as it first appears. Given the potential for further analysis, this article is part 1 of a series, and it seeks to calculate initial estimates (or “numbers”) from data.</p><h3>Part 1: A quick look at the numbers</h3><h4>1. About the data</h4><p>The data for current Airbnb listings in Belize was sourced from the <a href="http://insideairbnb.com/get-the-data.html">Inside Airbnb</a> website. 
The dataset is available under a <a href="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons CC0 1.0 Universal (CC0 1.0) “Public Domain Dedication”</a> license.</p><p>For this exercise, I’ll be using the “Detailed Listings” dataset for Belize (filename: <a href="http://data.insideairbnb.com/belize/bz/belize/2019-05-26/data/listings.csv.gz">listings.csv</a>) retrieved on 26 May 2019.</p><h4>2. About the code</h4><p>Pandas is a data manipulation and analysis library with easy-to-use functions designed for relational or labelled data. Pandas is written for Python and builds on the functionality of the NumPy library. It is well suited for tabular data (for example, Excel spreadsheets), time series data, and matrix data with labelled columns and rows. Pandas uses Series and DataFrame objects. A Series is a one-dimensional labelled array, and a DataFrame is a two-dimensional tabular data structure. One of the advantages of using Pandas, as you will see in the code snippets below, is its concise syntax, which often allows operations to be performed in fewer lines of code than with NumPy alone. I personally prefer to use NumPy in many cases since I get a “feel” for the inner workings of my code, though the time savings of Pandas do make it a hard-to-resist option. For further background on Python programming, please check out this <a href="https://medium.com/idalia-machuca/introduction-to-object-oriented-data-visualization-with-python-and-matplotlib-962308265d37">article</a>.</p><h4>3. How many accommodations are listed in Belize?</h4><p>We begin by importing the tabular data in the .csv file as a DataFrame object <strong>df</strong>. The <strong>shape</strong> attribute returns a tuple representing the dimensionality of the DataFrame. The <strong>columns</strong> attribute returns the labels for the 106 columns in the DataFrame. 
Immediately, this information tells us that there are 2558 total listings (as indicated by the number of rows in the DataFrame) in the country of Belize. The <strong>values</strong> attribute returns a NumPy representation of the array of column names, which allows for the complete (instead of truncated) array of names to be printed. Without manually opening the large .csv file, we now have the names of the metrics we will explore.</p><pre><strong>import pandas as pd</strong></pre><pre><strong>df = pd.read_csv(&#39;listings.csv&#39;)</strong></pre><pre><strong>df.shape</strong><br>(2558, 106)</pre><pre><strong>df.columns.shape<br></strong>(106,)</pre><pre><strong>df.columns.values</strong><br>array([&#39;id&#39;, &#39;listing_url&#39;, &#39;scrape_id&#39;, &#39;last_scraped&#39;, &#39;name&#39;,<br>       &#39;summary&#39;, &#39;space&#39;, &#39;description&#39;, &#39;experiences_offered&#39;,<br>       &#39;neighborhood_overview&#39;, &#39;notes&#39;, &#39;transit&#39;, &#39;access&#39;,<br>       &#39;interaction&#39;, &#39;house_rules&#39;, &#39;thumbnail_url&#39;, &#39;medium_url&#39;,<br>       &#39;picture_url&#39;, &#39;xl_picture_url&#39;, &#39;host_id&#39;, &#39;host_url&#39;,<br>       &#39;host_name&#39;, &#39;host_since&#39;, &#39;host_location&#39;, &#39;host_about&#39;,<br>       &#39;host_response_time&#39;, &#39;host_response_rate&#39;, &#39;host_acceptance_rate&#39;,<br>       &#39;host_is_superhost&#39;, &#39;host_thumbnail_url&#39;, &#39;host_picture_url&#39;,<br>       &#39;host_neighbourhood&#39;, &#39;host_listings_count&#39;,<br>       &#39;host_total_listings_count&#39;, &#39;host_verifications&#39;,<br>       &#39;host_has_profile_pic&#39;, &#39;host_identity_verified&#39;, &#39;street&#39;,<br>       &#39;neighbourhood&#39;, &#39;neighbourhood_cleansed&#39;,<br>       &#39;neighbourhood_group_cleansed&#39;, &#39;city&#39;, &#39;state&#39;, &#39;zipcode&#39;,<br>       &#39;market&#39;, &#39;smart_location&#39;, &#39;country_code&#39;, 
&#39;country&#39;, &#39;latitude&#39;,<br>       &#39;longitude&#39;, &#39;is_location_exact&#39;, &#39;property_type&#39;, &#39;room_type&#39;,<br>       &#39;accommodates&#39;, &#39;bathrooms&#39;, &#39;bedrooms&#39;, &#39;beds&#39;, &#39;bed_type&#39;,<br>       &#39;amenities&#39;, &#39;square_feet&#39;, &#39;price&#39;, &#39;weekly_price&#39;,<br>       &#39;monthly_price&#39;, &#39;security_deposit&#39;, &#39;cleaning_fee&#39;,<br>       &#39;guests_included&#39;, &#39;extra_people&#39;, &#39;minimum_nights&#39;,<br>       &#39;maximum_nights&#39;, &#39;minimum_minimum_nights&#39;,<br>       &#39;maximum_minimum_nights&#39;, &#39;minimum_maximum_nights&#39;,<br>       &#39;maximum_maximum_nights&#39;, &#39;minimum_nights_avg_ntm&#39;,<br>       &#39;maximum_nights_avg_ntm&#39;, &#39;calendar_updated&#39;, &#39;has_availability&#39;,<br>       &#39;availability_30&#39;, &#39;availability_60&#39;, &#39;availability_90&#39;,<br>       &#39;availability_365&#39;, &#39;calendar_last_scraped&#39;, &#39;number_of_reviews&#39;,<br>       &#39;number_of_reviews_ltm&#39;, &#39;first_review&#39;, &#39;last_review&#39;,<br>       &#39;review_scores_rating&#39;, &#39;review_scores_accuracy&#39;,<br>       &#39;review_scores_cleanliness&#39;, &#39;review_scores_checkin&#39;,<br>       &#39;review_scores_communication&#39;, &#39;review_scores_location&#39;,<br>       &#39;review_scores_value&#39;, &#39;requires_license&#39;, &#39;license&#39;,<br>       &#39;jurisdiction_names&#39;, &#39;instant_bookable&#39;,<br>       &#39;is_business_travel_ready&#39;, &#39;cancellation_policy&#39;,<br>       &#39;require_guest_profile_picture&#39;,<br>       &#39;require_guest_phone_verification&#39;,<br>       &#39;calculated_host_listings_count&#39;,<br>       &#39;calculated_host_listings_count_entire_homes&#39;,<br>       &#39;calculated_host_listings_count_private_rooms&#39;,<br>       &#39;calculated_host_listings_count_shared_rooms&#39;, &#39;reviews_per_month&#39;],<br>      
dtype=object)</pre><h4>4. What types of listings are advertised?</h4><p>Let’s start isolating the specific metrics in which we’re interested.</p><pre><strong>df[&#39;room_type&#39;].count()<br></strong>2558</pre><pre><strong>df[&#39;room_type&#39;].nunique()<br></strong>3</pre><pre><strong>df[&#39;room_type&#39;].unique()<br></strong>array([&#39;Entire home/apt&#39;, &#39;Private room&#39;, &#39;Shared room&#39;], dtype=object)</pre><pre><strong>df[&#39;room_type&#39;].value_counts()<br></strong>Entire home/apt    1552<br>Private room        965<br>Shared room          41<br>Name: room_type, dtype: int64</pre><pre><strong>1552+965+41<br></strong>2558</pre><p>We take a closer look at the column <strong>room_type</strong>, which gives us information on the kinds of accommodations offered by the current listings in Belize. The <strong>count</strong> method returns the total number of non-NaN cells in this column. The <strong>unique</strong> and <strong>nunique</strong> (i.e. number + <strong>unique</strong>) methods then tell us that there are 3 distinct values or categories of accommodations, namely <em>Entire home/apt</em>, <em>Private room</em>, and <em>Shared room</em>. A more meaningful finding, however, is the relative population of each category, which is neatly printed using the <strong>value_counts</strong> method. Note that all operations are performed on <strong>df[&#39;room_type&#39;]</strong>, which is a Pandas Series. Also, while <strong>value_counts</strong> returns a Pandas Series, <strong>unique</strong> returns a NumPy array. It is always useful to keep track of the types and structures of the data you are currently analyzing.</p><h4>5. 
What is the average rating of a listing in Belize?</h4><p>Next, we move on to quick statistics.</p><pre><strong>df[&#39;review_scores_rating&#39;].min()</strong><br>20.0</pre><pre><strong>df[&#39;review_scores_rating&#39;].max()</strong><br>100.0</pre><pre><strong>df[&#39;review_scores_rating&#39;].mean()</strong><br>93.80575945793338</pre><p>Isolating the <strong>review_scores_rating</strong> column and performing minimum, maximum, and mean operations on this Pandas Series provides us with a quick overview of the quality of the accommodations in Belize as indicated by the satisfaction of reviewers. Note that each method returns a float.</p><h4>6. What is the average price for a listing in Belize?</h4><p>We try the same quick statistics as above on the <strong>price</strong> column. Quickly, however, we run into a problem when we attempt to find the average price for a listing using the <strong>mean</strong> method.</p><pre><strong>df[&#39;price&#39;].min()</strong><br>&#39;$0.00&#39;</pre><pre><strong>df[&#39;price&#39;].max()</strong><br>&#39;$998.00&#39;</pre><pre><strong>df[&#39;price&#39;].mean()</strong><br>ValueError: could not convert string to float</pre><p>Using the <strong>dtypes</strong> attribute (short for data types), the difference between <strong>review_scores_rating</strong> and <strong>price</strong> is easy to spot. While both are Pandas Series, they contain data of different types. 
The rating column is made up of floats (<strong>float64</strong>), while the price column is made up of objects (<strong>O</strong>), namely strings (<strong>str</strong>).</p><pre><strong>df[&#39;review_scores_rating&#39;].dtypes</strong><br>dtype(&#39;float64&#39;)</pre><pre><strong>df[&#39;price&#39;].dtypes</strong><br>dtype(&#39;O&#39;)</pre><pre><strong>type(df[&#39;price&#39;][0])</strong><br>str</pre><p>Upon closer inspection, we notice the quotes around the monetary values in <strong>price</strong>, which indicate that the values are expressed as strings.</p><pre><strong>df[&#39;price&#39;].unique()<br></strong>array([&#39;$75.00&#39;, &#39;$35.00&#39;, &#39;$95.00&#39;, &#39;$60.00&#39;, &#39;$140.00&#39;, &#39;$185.00&#39;, &#39;$49.00&#39;, &#39;$26.00&#39;, &#39;$85.00&#39;, &#39;$117.00&#39;, &#39;$88.00&#39;, &#39;$79.00&#39;, &#39;$89.00&#39;, &#39;$189.00&#39;, &#39;$200.00&#39;, &#39;$149.00&#39;, &#39;$130.00&#39;, &#39;$110.00&#39;, &#39;$72.00&#39;, &#39;$99.00&#39;, &#39;$148.00&#39;, &#39;$59.00&#39;, &#39;$250.00&#39;, &#39;$58.00&#39;, &#39;$239.00&#39;, &#39;$120.00&#39;, &#39;$125.00&#39;, &#39;$105.00&#39;, &#39;$150.00&#39;, &#39;$139.00&#39;, &#39;$119.00&#39;, &#39;$225.00&#39;, &#39;$135.00&#39;, &#39;$98.00&#39;, &#39;$1,947.00&#39;,....</pre><p>Manipulation of a string is different from that of a float. Finding the minimum and maximum of <strong>price</strong> didn’t produce an error because strings are compared lexicographically, that is, character by character. This is why the <strong>max</strong> method returned ‘$998.00’ even though we can easily spot a numerically larger value near the top of the column (‘$1,947.00’): the character ‘9’ sorts after the character ‘1’. The <strong>mean</strong> method did trigger an error, however, since this is an illogical (and, therefore, impossible) operation on a string.</p><p>In the code snippet below, we see that the zeroth character of the first value in <strong>price</strong> (i.e. ‘$75.00’) is the dollar sign. 
Therefore, we not only need to convert the values in <strong>price</strong> from strings to floats in order to perform computational operations, but we also need to remove characters such as the dollar signs and (here’s a surprise) thousands separators (for example, the comma in ‘$1,947.00’).</p><pre><strong>df[&#39;price&#39;].unique()[0][0], df[&#39;price&#39;].unique()[0][1]<br></strong>(&#39;$&#39;, &#39;7&#39;)</pre><p>First, we replace the dollar signs and commas with empty strings using <strong>str.replace</strong>. Passing <strong>regex=False</strong> makes sure the dollar sign is treated as a literal character rather than as a regular-expression anchor.</p><pre><strong>price_str = df[&#39;price&#39;].str.replace(&#39;$&#39;, &#39;&#39;, regex=False).str.replace(&#39;,&#39;, &#39;&#39;, regex=False)</strong></pre><pre><strong>price_str.unique()<br></strong>array([&#39;75.00&#39;, &#39;35.00&#39;, &#39;95.00&#39;, &#39;60.00&#39;, &#39;140.00&#39;, &#39;185.00&#39;, &#39;49.00&#39;, &#39;26.00&#39;, &#39;85.00&#39;, &#39;117.00&#39;, &#39;88.00&#39;, &#39;79.00&#39;, &#39;89.00&#39;, &#39;189.00&#39;, &#39;200.00&#39;, &#39;149.00&#39;, &#39;130.00&#39;, &#39;110.00&#39;, &#39;72.00&#39;, &#39;99.00&#39;, &#39;148.00&#39;, &#39;59.00&#39;, &#39;250.00&#39;, &#39;58.00&#39;, &#39;239.00&#39;, &#39;120.00&#39;, &#39;125.00&#39;, &#39;105.00&#39;, &#39;150.00&#39;, &#39;139.00&#39;, &#39;119.00&#39;, &#39;225.00&#39;, &#39;135.00&#39;, &#39;98.00&#39;, &#39;1947.00&#39;,....</pre><p>Next, we change the data type of the Series from strings to floats.</p><pre><strong>price_flt = price_str.astype(float)</strong></pre><pre><strong>price_flt.unique()</strong><br>array([  75.,   35.,   95.,   60.,  140.,  185.,   49.,   26.,   85.,   117.,   88.,   79.,   89.,  189.,  200.,  149.,  130.,  110.,   72.,   99.,  148.,   59.,  250.,   58.,  239.,  120.,  125.,   105.,  150.,  139.,  119.,  225.,  135.,   98.,   1947.,....</pre><p>Now we’re ready to implement the same operations as with the <strong>review_scores_rating</strong> column!</p><pre><strong>price_flt.min()</strong><br>0.0</pre><pre><strong>price_flt.max()</strong><br>6200.0</pre><pre><strong>price_flt.mean()</strong><br>180.7587959343237</pre><h4>7. 
How many listings are advertised for each district (i.e. state) in Belize?</h4><p>Let’s summarize what we’ve done up to this point! The <strong>room_type</strong> column was populated with pre-assigned, closed-ended categories. The <strong>review_scores_rating</strong> column consisted of floats, which is a ready-to-use type of data for calculations. Next, we learned to manipulate data in the <strong>price</strong> column in order to perform operations we had previously used. In this final section, we take data processing to the next level using the <strong>state</strong> column, which is populated by open-ended responses.</p><p>Upon inspecting the first few entries in <strong>state</strong> (code snippet below), we notice that the data has likely been supplied by open-ended prompts or questions. Stann Creek, Belize, and Cayo are 3 of the 6 districts in the country of Belize. Belize City, however, is not. It is, in fact, a city located in the district of Belize. We also note that Toledo is labelled with the additional word “District”. How do we go about reformatting 2558 entries, especially under time constraints?</p><pre><strong>df[&#39;state&#39;]</strong><br>0                        NaN<br>1                        NaN<br>2                        NaN<br>3                Stann Creek<br>4                Belize City<br>5                        NaN<br>6                     Belize<br>7            Toledo District<br>8                Stann Creek<br>9                       Cayo<br>10                       NaN</pre><p>Let’s first establish the names of the states or districts of Belize. The first 6 names in the list <strong>dists</strong> (code snippet below) are all the districts of Belize. 
Note, however, that Ambergris Caye and Caye Caulker are such popular tourist destinations (and geographically separate from the mainland of Belize) that we will consider them as separate categories for a more realistic and detailed analysis of listing locations.</p><pre><strong>dists = [&#39;corozal&#39;, &#39;orange walk&#39;, &#39;belize&#39;, &#39;cayo&#39;, &#39;stann creek&#39;, &#39;toledo&#39;, &#39;ambergris caye&#39;, &#39;caye caulker&#39;]</strong></pre><p>We start processing the data by dropping all letters to lower case. So, entries for Cayo, cayo, and CAYO are counted for the same district.</p><pre><strong>state_orig = df[&#39;state&#39;].str.lower()</strong></pre><pre><strong>print(state_orig.value_counts())</strong></pre><pre>cayo district                              329<br>corozal district                           307<br>stann creek district                       243<br>belize district                            192<br>belize                                     130<br>stann creek                                116<br>cayo                                       108<br>ambergris caye                             103<br>be                                          93<br>st                                          45<br>corozal                                     37<br>toledo district                             22<br>orange walk district                        16<br>ca                                          15<br>toledo                                      11<br>bz                                          10<br>caye caulker                                 7<br>ambergris caye, belize                       3<br>orange walk                                  3<br>belize city                                  3<br>bze                                          2<br>san pedro                                    2<br>*                                            2<br>cayo district, belize, central america       1<br>ontario                                      
1<br>cayo, belize                                 1<br>san ignacio                                  1<br>corozal, belize                              1<br>ambergis caye, belize                        1<br>quintana roo                                 1<br>caribbean sea                                1<br>coronal district                             1<br>caribben sea                                 1<br>belize central america                       1<br>placencia, stann creek district, belize      1<br>ambergris caye, south                        1<br>toledo district belize ca                    1<br>belize c.a                                   1<br>Name: state, dtype: int64</pre><p>Straight away, we notice a few simple fixes (e.g. excess words like ‘district’). Some issues require further investigation (e.g. the entry ‘be’ with 93 listings, which is not insignificant and should therefore not be ignored). There are also a couple of egregious errors, which could be addressed with manual cleaning (e.g. Quintana Roo is not even in the country of Belize).</p><p>To be on the safe side, we make a copy of the Series <strong>state_orig</strong>.</p><pre><strong>state_new = state_orig.copy()</strong></pre><p>We first tackle instances where the name of a district is surrounded by excess information. If an entry contains a district name from the list <strong>dists</strong> (defined above), we want to replace the entire entry with just the name of the district. In other words, we are trimming excess words or characters from these entries.</p><p>The <strong>‘.*’</strong> pattern is a regular expression that matches any sequence of characters (including none). All entries with characters before and/or after the district name are rewritten with just the district name. 
For example, the entry ‘toledo district’ is reduced to ‘toledo’. When an entry contains more than one name from <strong>dists</strong>, it is reduced to whichever name appears first in the list (for instance, ‘toledo district belize ca’ becomes ‘belize’). The <strong>regex=True</strong> argument tells Pandas to interpret the pattern as a regular expression.</p><pre><strong>def extract_district_name(dists, state_new):<br>    for dist_name in dists:<br>        state_new = state_new.str.replace(&#39;.*&#39;+dist_name+&#39;.*&#39;, dist_name, regex=True)<br>    return state_new</strong></pre><p>Instantly, we have reduced the number of unique entries and added listings to the 8 categories defined for districts. For example, the entries ‘cayo district’ (329 listings) and ‘cayo’ (108 listings) are now combined into a single ‘cayo’ category with 437 listings.</p><pre><strong>state_new = extract_district_name(dists, state_new)<br>state_new.value_counts()</strong></pre><pre>cayo                437<br>stann creek         359<br>corozal             345<br>belize              335<br>ambergris caye      104<br>be                   93<br>st                   45<br>toledo               33<br>orange walk          19<br>ca                   15<br>bz                   10<br>caye caulker          7<br>*                     2<br>san pedro             2<br>bze                   2<br>san ignacio           1<br>quintana roo          1<br>caribben sea          1<br>caribbean sea         1<br>coronal district      1<br>ontario               1<br>Name: state, dtype: int64</pre><p>We still have the issue of many listings falling into nondescript categories like <em>be</em>, <em>st</em>, <em>ca</em>, and <em>bze</em>. There must be more information we can gather from other areas of the dataset that could narrow down the location of these listings. Perhaps users incorporated the name of the district into their entries for <strong>city</strong>.</p><p>In the snippet below, we perform two actions. First, we ask whether each entry in <strong>state_new</strong> is in the list of defined districts <strong>dists</strong>. If a certain entry is not in the list of districts, the answer is <em>False</em> and the second action is performed.</p><p>The second action is that the entry in <strong>state_new</strong> is reassigned as the corresponding entry for <strong>city</strong>. 
As a minor detail, this entry is converted to lowercase, as was done previously.</p><pre><strong>state_new.loc[state_new.isin(dists)==False] = df['city'].str.lower()</strong></pre><p>Example: The entries <em>be</em>, <em>st</em>, <em>ca</em>, and <em>bze </em>in <strong>state</strong> return <em>False</em> because they are not in the list <strong>dists</strong>. Consequently, these entries are replaced with their corresponding entries in <strong>city</strong>.</p><pre><strong>state_new.value_counts()</strong></pre><pre>cayo                                                  437<br>stann creek                                           359<br>corozal                                               346<br>belize                                                338<br>san pedro                                             253<br>caye caulker                                          183<br>ambergris caye                                        129<br>placencia                                              82<br>san ignacio                                            69<br>toledo                                                 33<br>belize city                                            28<br>san pedro, ambergris caye                              25<br>orange walk                                            22<br>placencia, stann creek district, belize                 9<br>cayo belize                                             9<br>hopkins, stann creek                                    8<br>placencia village, stann creek district belize, ca      7</pre><p>The list above is only a snippet of the first few unique entries. We notice that our strategy has been successful. Entries like <em>be</em>, <em>st</em>, <em>ca</em>, and <em>bze </em>have been replaced by more descriptive entries. We also note that the new entries oftentimes contain the names of the districts. 
Therefore, we can execute the function <strong>extract_district_name</strong> again!</p><pre><strong>state_new = extract_district_name(dists, state_new)<br>state_new.value_counts()</strong></pre><p>Notice that, after executing this function once more, most entries fall inside the defined categories for districts. Still, there are a few highly populated strays with city names like <em>placencia</em> and <em>san ignacio.</em></p><pre>cayo                               440<br>belize                             404<br>stann creek                        368<br>corozal                            346<br>san pedro                          253<br>caye caulker                       188<br>ambergris caye                     158<br>placencia                           82<br>san ignacio                         69<br>toledo                              33<br>orange walk                         22<br>belmopan                             6<br>dangriga                             5<br>san pedro town                       5<br>hopkins                              3</pre><p>We can address the few strays by manually replacing the city names with the correct district names. Clearly, a more automated method would be necessary if there were more incorrect entries. Also note that this method can only be applied by someone with in-depth knowledge of the content. With more time, perhaps, one could produce a script that gathers this knowledge from an existing reference or online source. 
For the purpose of this exercise, however, this method proves to be sufficient.</p><pre><strong>state_new = state_new.str.replace('belmopan.*', 'cayo', regex=True)<br>state_new = state_new.str.replace('san pedro.*', 'ambergris caye', regex=True)<br>state_new = state_new.str.replace('coronal district.*', 'corozal', regex=True)<br>state_new = state_new.str.replace('san ignacio.*', 'cayo', regex=True)<br>state_new = state_new.str.replace('placencia.*', 'stann creek', regex=True)<br>state_new = state_new.str.replace('dangriga.*', 'stann creek', regex=True)<br>state_new = state_new.str.replace('hopkins.*', 'stann creek', regex=True)<br>state_new.value_counts()</strong></pre><p>Manually addressing the few remaining non-district entries has improved the relative counts. For example, the listings for Cayo have increased from 440 to 515. The remaining stray entries have a maximum of 2 listings each.</p><pre>cayo                               515<br>stann creek                        464<br>ambergris caye                     416<br>belize                             404<br>corozal                            346<br>caye caulker                       188<br>toledo                              33<br>orange walk                         22<br>seine bight village                  2<br>long caye                            2<br>punta gorda                          2<br>sittee river village                 2</pre><p>For the final count, we categorize all entries that do not fall within a pre-defined district category as ‘other’.</p><pre><strong>state_fin = state_new.copy()</strong></pre><pre><strong>state_fin.loc[(state_fin.isin(dists)==False) &amp; (pd.isna(state_fin)==False)] = 'other'</strong></pre><pre><strong>state_fin.value_counts()</strong></pre><p>The final count shows the relative population of listings in all pre-defined districts. 
Of course, more time could be dedicated to improving the accuracy of this exercise, but we are now able to derive meaning from our data at a glance!</p><pre>cayo              515<br>stann creek       464<br>ambergris caye    416<br>belize            404<br>corozal           346<br>caye caulker      188<br>toledo             33<br>other              23<br>orange walk        22<br>Name: state, dtype: int64</pre><h3>Key findings</h3><ul><li>There are 2558 Airbnb listings in Belize.</li><li>Airbnb advertises 3 possible options for room types, namely ‘Entire home/apartment’, ‘Private room’, and ‘Shared room’. All types are available in Belize with the ‘Entire home/apartment’ category garnering the most listings.</li><li>Assuming review ratings are scored out of 100 points, the average score in Belize is ~94%, which is equivalent to 4.7/5 stars!</li><li>The average price of a listing in Belize is ~$181, with the most expensive listing being $6200. This analysis assumes the variable used for this calculation is specific to a one-night stay since other pricing variables are provided for weekly and monthly stays.</li><li>Given a few approximations, the district with the most Airbnb listings in Belize is Cayo. This is followed, in descending order, by Stann Creek, Ambergris Caye, Belize, Corozal, Caye Caulker, Toledo, and Orange Walk.</li></ul><h3>Comments and considerations</h3><ul><li>The number of Airbnb listings is not indicative of the number of active hosts in Belize since property owners or managers could post multiple listings.</li><li>The few, more expensive options likely skew the national mean. For example, the maximum price found is a shocking $6200 per night. It is important to note that extravagant prices likely originate from all-inclusive, luxury resorts with experiences especially tailored to foreign visitors. 
Therefore, the result of an average price of $181 per night does not necessarily mean that local Belizeans are receiving this much from renting their personal property.</li><li>It is, therefore, important to ask: Where are the most expensive listings (in the cayes and popular tourism spots)? Are the most expensive listings part of larger businesses (hotels, resorts)? What is the average price for listings that are personal property of the locals? What percentage of the listings are large businesses vs personal property? Where are listings being booked more often; does this align with the most popular tourist attractions? How do prices change depending on the tourist season (i.e. “low” and “high” seasons)?</li><li>The questions raised above are of particular importance in light of recent developments regarding taxation of Airbnb listings by the country’s tourism board. Any regulation implemented should consider the differences in property type and ownership, booking capacity, and popularity of the listing locations. For example, a conversation could be had regarding taxation of small private properties in Orange Walk versus the luxury resorts in Stann Creek.</li><li>Finally, regarding the methods for analyzing districts, it is important to note that some hosts may (correctly) consider Ambergris Caye and Caye Caulker to be in the Belize district. One should also be careful to consider any possible confusion given that one of the districts has the same name as the country. In general, further care should be taken in avoiding any double counting. 
For the purpose of this tutorial, however, the methods employed were successful in producing a quick overview of the data.</li></ul><hr><p><a href="https://medium.com/idalia-machuca/data-at-a-glance-with-pandas-part-1-numbers-airbnb-in-belize-682c1ad32b01">Data at a glance with Pandas (Part 1: Numbers) — Airbnb in Belize</a> was originally published in <a href="https://medium.com/idalia-machuca">Olas</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[8 tips to successfully write a thesis for graduate school — and live]]></title>
            <link>https://medium.com/idalia-machuca/8-tips-to-successfully-write-a-thesis-for-graduate-school-and-live-e433d98391ca?source=rss----703372415320---4</link>
            <guid isPermaLink="false">https://medium.com/p/e433d98391ca</guid>
            <category><![CDATA[writing-tips]]></category>
            <category><![CDATA[dissertation]]></category>
            <category><![CDATA[thesis]]></category>
            <category><![CDATA[writing]]></category>
            <category><![CDATA[graduate-school]]></category>
            <dc:creator><![CDATA[Idalia Machuca]]></dc:creator>
            <pubDate>Wed, 10 Jul 2019 02:23:26 GMT</pubDate>
            <atom:updated>2019-07-10T02:46:18.369Z</atom:updated>
            <content:encoded><![CDATA[<h3>8 tips to successfully write a thesis for graduate school — and live</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*64npeqa57xpKgcQB3jnfbQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/pUAM5hPaCRI?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">JESHOOTS.COM</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p><strong>Hi friends!</strong></p><p>Writing a thesis or dissertation for your graduate degree is a daunting task. After years of conducting independent research, running experiments, compiling data, and staying up to date with the most recent advancements in your field, it is easy to feel overwhelmed by that final (and arguably most significant) stage. Remember, everyone’s thesis-writing experience is unique. The expectations for your research and thesis vary depending on your topic, field, supervisor, department, university, and even academic “generation”. That said, this is a list of tips to keep in mind as you navigate through the process.</p><h3>1. Be clear (and firm) about the deadlines you’d like to keep</h3><p>As an independent researcher, it is up to you to manage your time and prioritize the activities and tasks that will not only nurture your academic and professional development, but also get you through the door. This is quite difficult since graduate students are just that — students! You’re still learning to design a major research project, to predict possible challenges and setbacks, and to estimate how long it will take to pass each milestone of your project. Remember, by the end of your academic journey, you’ll be an expert!</p><blockquote>…everything you do that day, week, or month should be in service of your ultimate goal…</blockquote><p>Every university has its own set of graduation requirements and deadlines. 
Study these carefully, and discuss them with your graduate supervisor. If you have your eye set on a specific degree granting date, schedule your milestones starting from that date, moving backwards. Be realistic about how long each stage of your research and writing will take, and allow for unanticipated difficulties. Keep a calendar or checklist with important deadlines on the cover of your planner, over your work desk, or on the refrigerator door. This will help you remember that everything you do that day, week, or month should be in service of your ultimate goal, ideally within your planned timeline.</p><h3>2. Find your writing space</h3><p>The kind of creativity needed for conducting research can be quite different from the kind that supports writing (even if the writing is technical). For example, while your most productive days of research might be in the lab or library, you may find that writing is more enjoyable at a coffee shop or park. Everyone has unique preferences for the environment that is most conducive to their progress. In general, however, try to aim for a simple, clean, and comfortable space with limited distractions.</p><p>Find your writing <em>time</em>, too. Perhaps you prefer to answer emails and do technical work in the morning and dedicate the entire afternoon to writing. You might also be a night owl who is most creative at night when the ruckus of the day has subsided. Some folks prefer to write for extended periods of time, while others prefer to work in short bursts using the Pomodoro Technique.</p><blockquote>Sometimes, it is these short, unencumbered moments that spark the greatest ideas!</blockquote><p>Cherish your favourite places and times, but remember that you are capable of producing excellent work even if you are not in your ideal work environment. For example, if you have an important deadline coming up, you may choose to work during your daily 15-minute bus ride from campus to your apartment. Don’t despair! 
Sometimes, it is these short, unencumbered moments that spark the greatest ideas!</p><h3>3. Formulate a writing strategy that works for you</h3><p>This is tricky! How do you tackle the intimidating, blank page in front of you? There are two parts to your writing strategy.</p><p>First, plan the order in which you will write your thesis chapters. Some students write their conclusions chapter first and structure the entire thesis based on it. Others prefer to start with the methodologies chapter because, in many cases, it is the least intensive. I personally tackled the introduction chapter first. This strategy helped me feel confident about the scientific contributions that my research was offering to my field. My introduction also set the tone for the remainder of the thesis. This strategy worked for me, and it helped garner compliments on my finished thesis.</p><p>Second, what editing process works best for you? For many, it’s easier to write a very rough draft quickly with free-flowing ideas and then refine their work through multiple rounds of self-review. I personally prefer to painstakingly work towards a polished draft from the start and bypass future self-editing iterations. Remember that your writing strategy depends on not only your personal preference, but also the proximity of your deadlines and the preferences of your thesis committee.</p><h3>4. Spend time outlining your project’s key findings or products</h3><p>Condensing years of research into a single document is difficult! At the end of their graduate school career, many students have already published a couple of articles in peer-reviewed journals. Many students haven’t. Sometimes, the university requires a thesis to be written as a completely separate entity from any published works based on the same project. In any case, one of the most useful techniques for finding a path through your mind’s labyrinth is compiling a document (only a few pages long) with the key findings of your work. 
For example, what helped me was outlining approximately 7 key points (in bullet format) with a brief 1-paragraph description and pairing these with the figures and tables that would serve as highlights for my thesis.</p><h3>5. Hold a meeting with your thesis committee</h3><p>You’re on the last leg of the marathon. You are fatigued. Your vision is blurred. And, at times, it seems as if the finish line keeps moving farther back. You’re determined to finish.</p><p>Throughout your project, it’s important to hold frequent meetings with your thesis committee to understand their expectations for your work and to gain perspective from their extensive experience as leaders in their fields. In this stage of your project, they are both your cheerleaders and also your ultimate judges. Do they have any concerns about your abilities as an independent researcher? Do they have any lingering questions regarding your work that they would like addressed before you’ve submitted and defended your thesis? Do they approve of your graduation timeline? What final improvements do they recommend to raise your work from an A to an A+? The answers to these questions will give you the stamina and confidence for that final push.</p><blockquote>… You are fatigued… And, at times, it seems as if the finish line keeps moving farther back.</blockquote><h3>6. Be open to feedback</h3><p>Feedback is an extension to your writing strategy. If you’ve written your first chapter or the entire thesis, ask for feedback! Ask members of your research group or writing group, great writers who are experts in other areas of research, even your scientifically literate or academic neighbour! You could take a writing break for a few days and return to your work with fresh eyes. When you’re ready, submit your draft to your graduate supervisor and start the thesis-editing process! Keep in mind, multiple rounds of revisions by your graduate supervisor, thesis committee, and external reviewers could take months!</p><h3>7. 
Review your university’s requirements</h3><p>Your thesis is finished! You’ve passed your graduate defence! You have a signed and stamped form recommending your degree to be approved by the university. But wait, there is one final step. Every university has its own set of requirements for graduate theses and dissertations. Visit your university library for a drop-in or scheduled consultation for thesis formatting or citation management help. Carefully run through a requirements checklist, and ensure that you’ve used a pre-approved template for your document (if one exists).</p><h3>8. Take care of yourself</h3><blockquote>You are only human!</blockquote><p>One of the greatest challenges academics are facing today is mental health. Throughout the thesis-writing and editing process, always remember one thing — you are only human! Your graduate supervisor is only human. The most accomplished member of your research group is also only human.</p><p>Nurture a genuine and caring support system. Take pride in your accomplishments and praise yourself for your hard work. Share your triumphs with the elders of your family or with childhood friends. Treat yourself to simple joys after tough days, no matter if you’ve written an entire thesis chapter or if you’ve finally solved a small, yet persistent issue. Find a corner of the day to do what makes you happy — running, singing, going to the gym, praying, chatting with your parents, or watching that episode of The Office for the umpteenth time.</p><p>You are more than your thesis. You’ve already made a great contribution to your field of expertise. 
Now, it’s only a matter of telling everyone about it!</p><p><strong>Godspeed!</strong></p><hr><p><a href="https://medium.com/idalia-machuca/8-tips-to-successfully-write-a-thesis-for-graduate-school-and-live-e433d98391ca">8 tips to successfully write a thesis for graduate school — and live</a> was originally published in <a href="https://medium.com/idalia-machuca">Olas</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introduction to object-oriented data visualization with Python and Matplotlib]]></title>
            <link>https://medium.com/idalia-machuca/introduction-to-object-oriented-data-visualization-with-python-and-matplotlib-962308265d37?source=rss----703372415320---4</link>
            <guid isPermaLink="false">https://medium.com/p/962308265d37</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[climate-change]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-vizualisation]]></category>
            <dc:creator><![CDATA[Idalia Machuca]]></dc:creator>
            <pubDate>Mon, 08 Jul 2019 22:29:23 GMT</pubDate>
            <atom:updated>2019-07-10T02:47:11.481Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XL7CyJjdZfY1C3LJYmqg3Q.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/Xnv7O4jBAEQ?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">rawpixel</a> on <a href="https://unsplash.com/search/photos/business-meeting?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>Hi friends!</p><p>With Python gaining popularity, you might be curious about all the hype surrounding it. One of Python’s greatest strengths is that it is an object-oriented programming language. In this article, we take a quick look at a data visualization example to help us build an intuition for what “object-oriented” means (rather, how it feels) while introducing a few basic terms along the way. Let’s get started!</p><h3>1. Set up your Jupyter Notebook</h3><pre>import matplotlib.pyplot as plt<br>import numpy as np <br>from numpy import genfromtxt<br>%matplotlib inline</pre><ul><li>There are many great beginner guides for using Jupyter Notebook, for example <a href="https://www.dataquest.io/blog/jupyter-notebook-tutorial/">this one</a> by Dataquest.</li><li>Matplotlib is a 2D plotting library for Python programming that can be imported into a <a href="http://jupyter.org/">Jupyter Notebook</a>. Here, we use the module (collection of functions) <a href="https://matplotlib.org/api/pyplot_summary.html">Pyplot</a> and give it the alias <em>plt</em> for brevity.</li><li><a href="http://www.numpy.org/">NumPy</a> is a package used for scientific computing with Python. We will only use NumPy very briefly to create an array of numbers.</li><li>Data can be stored in different file formats (CSV, text, NetCDF, etc). In this example, we will use data from a .csv file. 
Since we are only interested in the numeric data in the file, we can use the <em>genfromtxt</em> function.</li><li>When using Jupyter Notebook, <em>%matplotlib inline</em> (known as a line-oriented magic function) allows graphics to be displayed as static images in the notebook just below the cell that produces them.</li></ul><h3>2. Download and load the data you’ll use for the visualization</h3><p>The data used in this example was downloaded from the NASA Goddard Institute for Space Studies website (citations below):</p><ul><li>GISTEMP Team, 2018: GISS Surface Temperature Analysis (GISTEMP). NASA Goddard Institute for Space Studies. Dataset accessed 2018-09-09 at <a href="https://data.giss.nasa.gov/gistemp/">https://data.giss.nasa.gov/gistemp/</a>.</li><li>Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: <a href="https://pubs.giss.nasa.gov/abs/ha00510u.html">Global surface temperature change</a>, Rev. Geophys., <strong>48</strong>, RG4004, doi:10.1029/2010RG000345.</li></ul><p>The dataset contains monthly averages for global land and sea temperature anomalies from 1880–2018.</p><p>The dataset (modified from the original download to run from 1880 to 2017) has 139 rows (a header row followed by one row per year) and 13 columns (the year followed by 12 monthly temperature anomalies). The first 6 rows look like:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7-GrucvF3CbJU8ow6SpOWA.png" /></figure><pre># load data using its file name<br>filename = &#39;./GLB.Ts+dSST_modified.csv&#39;<br>data = genfromtxt(filename, delimiter=&#39;,&#39;)</pre><pre># assign data to variables<br>years = data[1:,0]<br>jan = data[1:,1]<br>aug = data[1:,8]</pre><p>We load the data and extract the columns for <strong>Year </strong>(col 0), <strong>Jan </strong>(col 1), and <strong>Aug </strong>(col 8), starting at the second row (row 1) to exclude the header row (row 0).</p><p>Brief reminder for those still new to Python: we start counting from 0.</p><h3>3. 
Construct a simple line plot: a time series of temperature anomalies for January and August from years 1880 to 2017</h3><pre>fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))</pre><p><a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html">matplotlib.pyplot.subplots</a> is a function which creates and returns a <a href="https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure">Figure</a> object and either an <a href="https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes">Axes</a> object or an array of Axes objects.</p><p>The <a href="https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes">Axes</a> object will contain the plot or graph you choose to create. The <a href="https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure">Figure</a> object is the frame that will hold all your <a href="https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes">Axes</a> objects.</p><p>In this case, we only want to build one <a href="https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes">Axes</a> object (you can also call this a subplot) inside the <a href="https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure">Figure</a> object, so we say there is only 1 row and 1 column. We give these objects the names <strong>ax </strong>and <strong>fig</strong>, respectively. 
Giving the <a href="https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure">Figure</a> and <a href="https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes">Axes</a> names allows us to add to, modify, and customize these objects individually.</p><p>In the same code cell, we continue:</p><pre>ax.plot(years, jan, label=&#39;January&#39;, c=&#39;steelblue&#39;, ls=&#39;-&#39;, lw=3, marker=&#39;.&#39;, ms=7, mfc=&#39;#50514F&#39;, zorder=2)</pre><pre>ax.plot(years, aug, label=&#39;August&#39;, c=&#39;#F25F5C&#39;, ls=&#39;--&#39;, lw=3, zorder=2)</pre><pre>leg = ax.legend(loc=2, ncol=2, numpoints=1, fontsize=15, shadow=True)</pre><p><a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html#matplotlib.axes.Axes.plot">Axes.plot</a> (expressed above as <strong>ax.plot</strong>) is a plotting <em>function</em> in Matplotlib. This function is called a <a href="https://docs.python.org/2/tutorial/classes.html"><em>method</em></a> because it belongs to an object (i.e. <strong>ax</strong>). <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html#matplotlib.axes.Axes.plot">Axes.plot</a> requires just two <em>parameters</em>, the x (i.e. years) and y (i.e. temperature anomaly for a particular month) values. It also allows for optional plotting <em>properties</em>, such as a line label (<strong>label</strong>), line colour (<strong>c</strong>), line style (<strong>ls</strong>), line width (<strong>lw</strong>), marker style (<strong>marker</strong>), marker size (<strong>ms</strong>), marker colour (<strong>mfc</strong>), and plotting order (<strong>zorder</strong>). 
The line labels entered as properties in the <strong>ax.plot</strong> function are automatically used in <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.legend.html">Axes.legend</a>, which is given the name <strong>leg</strong> and allows for its own properties, such as location, number of columns, number of marker points, font size, and shadowing.</p><p>If you’ve used languages like MATLAB (which uses a “state-machine approach”), you may have created the line plot using the expression <strong>plt.plot</strong>. In this case, the plotting function would be applied to the active figure or axes. With <strong>ax.plot </strong>in Python, however, you are specifying the object on/for which you’re applying this action, which provides greater flexibility to your workflow.</p><p>In the same cell, we use the <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axhline.html"><strong>axhline</strong></a><strong> </strong>function to simply add a zeroth line across the object <strong>ax</strong>. Note that the plotted elements are drawn in the order specified by <strong>zorder</strong>, not in the order in which they appear in the code cell. 
So, the two temperature anomaly lines (for January and August) will be drawn over the zeroth line.</p><pre>ax.axhline(0, lw=2, c=&#39;dimgray&#39;, ls=&#39;-&#39;, alpha=0.5, zorder=1)</pre><p>In the same code cell, we continue to customize the objects <strong>ax</strong>, <strong>leg</strong>, and <strong>fig</strong> using the functions <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylabel.html">set_ylabel</a>, <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html">set_xlabel</a>, <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_title.html">set_title</a>, <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html">set_xticklabels</a>, <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_yticklabels.html">set_yticklabels</a>, <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlim.html">set_xlim</a>, <a href="https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylim.html">set_ylim</a>, and <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.grid.html">grid</a>.</p><pre>ax.set_ylabel(&#39;Temperature\nAnomaly [$^o$C]&#39;, fontsize=20)<br>ax.set_xlabel(&#39;Year&#39;, fontsize=20)</pre><pre>ax.set_title(&#39;Yearly Temperature Anomaly Means for January and August&#39;, fontsize=24, y=1.05)</pre><pre>xticklabels = np.arange(1880, 2020, 10)<br># plot extends to year 2020 for extra white space on the right side<br>ax.xaxis.set_ticks(xticklabels)<br>ax.set_xticklabels(xticklabels, fontsize=15, rotation=45)</pre><pre>yticklabels = np.arange(-1.5, 2, 0.5)<br>ax.yaxis.set_ticks(yticklabels)<br>ax.set_yticklabels(yticklabels, fontsize=15)</pre><pre>ax.set_xlim([1880, 2020])<br>ax.set_ylim([-1.5, 1.5])</pre><pre>ax.grid()</pre><pre>leg.set_title(title=&#39;Month&#39;, prop = {&#39;size&#39;:15})</pre><pre>fig.patch.set_facecolor(&#39;lightsteelblue&#39;)</pre><p>You may have also noticed that objects can contain other objects in the code 
above. For example, the object <strong>ax</strong> has a specific attribute <strong>xaxis</strong>, for which the tick locations and labels could be managed with the more general function <a href="https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.axis.Axis.set_ticks.html">set_ticks</a>. Again, the specificity allowed by object-oriented programming offers more flexible customization.</p><p>The last step is to use a function that saves the figure.</p><pre>fig.savefig(&#39;./jan_aug_tempanom.png&#39;, dpi=150, bbox_inches=&#39;tight&#39;, format=&#39;png&#39;, facecolor=fig.get_facecolor())</pre><p>Executing this code cell produces the following image:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gyxJb6TFampiwUoGzX1_3g.png" /></figure><h3>Summary and challenges</h3><p>We’ve explored the flexibility with which we can customize a simple line plot by taking advantage of object-oriented programming with Python using the Matplotlib library. If you’re new to data visualization in Python, I challenge you to continue testing the various properties for line plots, other types of visualizations for linear data, or even different figure layouts (such as separate subplots for each month). There are endless possibilities for creativity!</p><hr><p><a href="https://medium.com/idalia-machuca/introduction-to-object-oriented-data-visualization-with-python-and-matplotlib-962308265d37">Introduction to object-oriented data visualization with Python and Matplotlib</a> was originally published in <a href="https://medium.com/idalia-machuca">Olas</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>