Day 4: Two steps back, still battling Unicode

Roo Harrigan
Published in Making Athena
2 min read · Oct 30, 2015

>>> Brief summary == Where I struggled:

Today was disheartening, or rather, I let today get me down. My main objective was to gather a list of countries from Wikipedia and use that list to query for each country’s capital. There is no such thing as a list of countries on Wikipedia. The closest thing you can get is the List of Sovereign States, which has a particularly unwieldy table and a ton of Unicode characters that English-speaking programmers seem to mostly ignore. This is less Wikipedia’s fault and more a reflection of how many disputes over recognition and sovereignty exist in the world, which Wikipedia admirably attempts to display. I spent a long time thinking about what to do with Kurdistan, for example. Political controversy aside, the process brought up two ideas that I want to get down in case I can return to them:

  1. Using an HTML scraper (BeautifulSoup) to grab countries from the U.S. State Department’s website instead.
  2. Instead of using a flat map of the world to represent countries on a computer screen, it might serve us better to use a graph database. Not just in this project, but in life in general.
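A rough sketch of the first idea, for whenever I come back to it. To keep it dependency-free this uses Python's built-in `html.parser` rather than BeautifulSoup, and the markup in `SAMPLE_HTML` is made up for illustration; it is not the State Department's actual page structure. The class name `FirstCellParser` is hypothetical too.

```python
from html.parser import HTMLParser


class FirstCellParser(HTMLParser):
    """Collect the text of the first <td> in every table row."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._cell_index = -1        # position of the current <td> in its row
        self._in_first_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1    # new row: reset the cell counter
        elif tag == "td":
            self._cell_index += 1
            if self._cell_index == 0:
                self._in_first_cell = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            self._in_first_cell = False
            name = "".join(self._buffer).strip()
            if name:
                self.names.append(name)

    def handle_data(self, data):
        # Text inside nested tags (such as links) still lands here.
        if self._in_first_cell:
            self._buffer.append(data)


# Hypothetical markup, standing in for a real country table.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td><a href="#">Brazil</a></td><td>Brasília</td></tr>
  <tr><td>Côte d'Ivoire</td><td>Yamoussoukro</td></tr>
  <tr><td> Kiribati </td><td>Tarawa</td></tr>
</table>
"""

parser = FirstCellParser()
parser.feed(SAMPLE_HTML)
country_names = parser.names
print(country_names)
```

Header rows use `<th>`, so they are skipped without any special casing, and the accented names come through untouched because everything stays a Unicode string from start to finish.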

However, I really need to keep pushing towards my MVP. So instead of trying out a new tool, I wrote about 50 mindless lines of string clean-up to get the weird characters out, and another 50 lines of ‘if’ statements to fix the capitals coming back from my capitals query. They were getting chopped up and cut off and, worst of all, the parser was sending back strangely encoded/decoded Unicode characters (the accented i in Brasília comes back as ‘xed’, for example) that I couldn’t resolve programmatically. My seed.py function is a horrible mess, I can’t get Ireland or Scotland in for the life of me, and Switzerland’s capital was still ‘none,’ so I’ve unfortunately been grooming things manually. Here’s a short letter I’ve composed to my new friend PostgreSQL:

Dear psql,

Can I call you that? Just wanted to drop you a note and say that I am very appreciative of your INSERT INTO statements today. Thanks for being there for me when I couldn’t get my data straight. Also, we have the same favorite animal.

Love,

Roo
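For anyone hitting the same accented-character wall, here is a minimal sketch of what may be going on with Brasília. It assumes the ‘xed’ above is the escape `\xed` (code point U+00ED, an i with an acute accent) with its backslash lost somewhere along the way; that is a guess, not something I verified against the parser. The helper name `to_ascii` is made up for this example, and the code is Python 3.

```python
import unicodedata


def to_ascii(text):
    """Strip accents: decompose each character, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "\xed" is just another way to write the single character í.
capital = "Bras\xedlia"

# Round trip: encode to bytes, decode with the SAME codec, nothing is lost.
raw = capital.encode("utf-8")
decoded = raw.decode("utf-8")

# Decoding with the WRONG codec is where garbled text comes from.
mojibake = raw.decode("latin-1")

print(decoded == capital)    # True
print(mojibake == capital)   # False
print(to_ascii(capital))     # Brasilia
```

If the database or the front end can handle Unicode, keeping `capital` as-is is the better option. `to_ascii` is a lossy fallback that only makes sense when a plain-ASCII name is truly required.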

>>> Thoughtful takeaway:

I am accidentally learning all sorts of things about the world from reading through these Wikipedia pages. Did you know that France and Britain both claim territories in the Antarctic? How much do you know about the Caribbean, and which colonial powers are still hanging around there? Have you ever heard of Kiribati before? Embarrassingly, I hadn’t. Until now.

Thanks Wikipedia.
