Breaking the HTML Enigma Machine

(or officially, Jour72326: Scraping for Journalists)


April 23 — May 21, Thursdays, 6–8:50 p.m. (2 hours and 50 minutes), Room 436
A five-week elective at the CUNY School of Journalism.


You may have heard: some people believe that all journalists should learn to code. I am one of those people, and skills like web scraping are one of the reasons why.

My name is Sisi Wei. I design, code, and do investigative journalism for a living. In order to accomplish that journalism, I scrape websites. By the end of this class, you’ll be scraping websites too.

In fact, by the end of our five weeks together, you’ll know the basics of programming and understand why web scraping gives you a leg up as a journalist. You’ll also be able to make judgment calls about when to ask for data, when to scrape and when to back off.


Class Philosophy

This is a skills class. I will be teaching you why web scraping is one of the most important tools available to journalists, but in only five weeks, I will primarily be teaching you the technical skill of how to scrape data from a website. Know that as a journalist, this is not enough. The real key is in what you do with the data after you’ve acquired it.

But before you can write that brilliant piece, or create that brilliant graphic, you need the skills to acquire data that isn’t easily downloadable. For most of you, this class will be the first time you’ve ever heard of the concepts we cover. Be prepared to get frustrated. But also be prepared to feel empowered.

Everything I teach you both will, and will not, be available on the Internet. I mean this in two ways.

First, the concepts and demos that drive this class are almost all literally available online. You are free to go through them on your time, as many times as you’d like.

Second, the answers to all of your homework assignments are also somewhere on the Internet. Not literally, but in the form of other people’s questions and answers. Make use of this, but don’t plagiarize. We’ll talk about plagiarism and academic integrity later in the syllabus.

But because of these vast resources available to you, including help from me by appointment, I expect not only that all work will be done on time, but also that it will be good work.


Grading and Class Policies

Homework

Problem sets will make up 30% of your grade.
Your final scraping project will make up 20% of your grade.

Every week you’ll get a new problem set that helps you practice the concepts we learned in class that day. These assignments will always be due at 11:59 p.m. the following Wednesday, and I’ll be going over the answers during the next class. Therefore, late assignments will automatically receive zero points. Turn in assignments by the deadline, even if you’re unsure of some of your answers. Being able to meet deadlines is essential to working in a newsroom.

In addition to weekly problem sets, you will pitch a website that you’d like to scrape as your final project, due the Sunday after our course ends. If you really can’t think of a site you’d like to scrape, I’ll also provide you with a list of websites to choose from, but fair warning, it will probably be a harder project than a site you pitch yourself. You’ll get guidance on this as we progress, so don’t worry about it too much.

Attendance

Your physical and mental attendance will make up 50% of your grade.

Be physically present. There are only five classes. Unless you have an excused absence, missing a class will immediately result in a lower grade. If you show up more than 10 minutes late, I’ll consider you absent for the day. The same goes for leaving early. If you have special circumstances that would require you to miss a portion of class, you must contact me about it beforehand to get approval.

Be mentally present. Attendance is more than showing up physically. Feel free to use laptops and cell phones in class, but if you’re too busy browsing Facebook to keep up with demos, you aren’t mentally present and I’ll consider you absent for the day.

Excused absences. For an absence to count as excused, you must have checked with me first. Illnesses and family emergencies will almost always be excused. CUNY-related duties may be excused. Excluding extreme cases, it will be your responsibility to figure out what you missed in class and get back up to speed.

Software Requirements

For this class, all you’ll need is an account at PythonAnywhere, which should already be created for you. If this is not the case, please email me.

Getting Help

Since I also work full-time, I don’t have set office hours. If you need help, we can either set up an appointment to talk via Google Hangouts, or you can email me the problem you’re having. I’ll try my best to get back to you within 24 hours.

Here’s what you’re required to send me in order to get help:

  1. A copy of the code you’re using.
    “Hi Sisi, attached is scraper.py, the code I’m trying to get to work.”
  2. What you expect to happen.
    “I’m trying to write a function that calculates the degrees in Fahrenheit if you tell it the Celsius.”
  3. What is actually happening.
    “No matter what I do, the answer always comes out zero. I know this can’t be right, but I’m not sure what I’m doing wrong. The equation is accurate.”
  4. The exact error message you get (if any).
    “I don’t get any error messages from Python. My code is running.”
  5. What you’ve tried already to fix this problem. You must try to fix it yourself before emailing me for help.
    “I’ve tried googling ‘python always getting zero’ and found this solution on Stack Overflow…oh…actually…yeah this fixed it. Nevermind!”

If you don’t send me all five steps, I’ll just respond with a link to the syllabus. In the real world, following these steps is what will get you quality answers online, so I want you to start getting the hang of it.

Since homework is supposed to help you learn, I will never respond with an exact solution to an assignment. However, I will point out what you need to scrutinize in order to solve your problem.

Finally, please try to avoid sending emergency emails on Wednesday evenings. I can promise I’ll do my best to get back to you within 24 hours, but I can’t promise to get back to you before the midnight deadline. Instead, take a shot at your homework earlier in the week and email me ahead of time. That way it’s guaranteed that I’ll respond in time.


Do Your Own Work

Don’t cheat. I’m here to evaluate your work, not someone else’s, and doing your own work is the best way to learn. Cheating also comes with terrible consequences: If you plagiarize while working at a journalism organization, you will be fired. If you plagiarize at CUNY, you could be dismissed from the program.

When it comes to our class specifically, I encourage you to help each other on your problem sets, but not to copy each other’s work. Oftentimes this can be a fine line. If you ever have a question about what counts as “cheating,” read the Academic Honesty policy section from a programming course at Harvard. It goes into great detail with examples and will be the policy I will be using in this class.

Here’s the more official policy from CUNY. Read it:

“It is a serious ethical violation to take any material created by another person and represent it as your own original work. Any such plagiarism will result in serious disciplinary action, possibly including dismissal from the CUNY J-School. Plagiarism may involve copying text from a book or magazine without attributing the source, or lifting words, code, photographs, videos, or other materials from the Internet and attempting to pass them off as your own. Please ask the instructor if you have any questions about how to distinguish between acceptable research and plagiarism.

In addition to being a serious academic issue, copyright is a serious legal issue. Never “lift” or “borrow” or “appropriate” or “repurpose” graphics, audio, or code without both permission and attribution. This guidance applies to scripts, audio, video clips, programs, photos, drawings, and other images, and it includes images found online and in books.

The exception to this rule is fair use: if your story is about the image itself, it is often acceptable to reproduce the image. If you want to better understand fair use, the Citizen Media Law Project is an excellent resource.

When in doubt: ask.”


Class Schedule

Here’s my general plan for what we’ll be doing for five weeks. This schedule might change depending on how the class goes.

If an assignment is “Due by next class,” it must be digitally accessible and completed by 11:59 p.m. the following Wednesday.

All problem sets and the final project details will be posted on GitHub: https://github.com/sisiwei/2015-spring-cuny-web-scraping

Week #1: Intro & Programming Basics

Introductions and discussing the syllabus. What is web scraping. How to use PythonAnywhere. Command line basics. Introduction to programming with Python, as far as we can get: comments, functions, variables, data types, if statements, loops.

Due by next class:
- Read about this robot that writes stories on earthquakes.
- Complete Problem Set #1.

Week #2: Build your first scraper

Finish intro to programming with Python. Basics of HTML structure and how to identify individual HTML elements for scraping. Writing a fully functional web scraper using Requests and BeautifulSoup. Talk about final project ideas.

Due by next class:
- Complete Problem Set #2.
- Spend some time digging around for data that’s online, but not downloadable. Email me a list of three potential datasets (or their URLs) that would be useful for journalists to scrape.

Week #3: Hacking URLs and forms

How scrapers can navigate form-based websites, or websites that show their parameters in the URL, and scrape a targeted subset of a site. Scraping and saving multiple pages of data.
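As a rough illustration of a website that shows its parameters in the URL, here is a minimal sketch of building the address for each page of results. It is not taken from the class materials: the example.com address, the "county" and "page" parameter names, and the build_page_url helper are all made up for this example, and it assumes Python 3 with only the standard library.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical search page: example.com and the parameter names are
# placeholders, not a real data source.
BASE_URL = "https://example.com/inspections"


def build_page_url(county, page):
    """Return the URL for one page of results for one county."""
    params = {"county": county, "page": page}
    return BASE_URL + "?" + urlencode(params)


# Build the URLs for the first three pages of results.
urls = [build_page_url("Kings", page) for page in range(1, 4)]
for url in urls:
    print(url)

# Going the other way: read the parameters back out of a URL.
query = parse_qs(urlparse(urls[0]).query)
print(query)
```

Once the pattern in the URL is clear, a scraper can loop over the page numbers and request each address in turn instead of clicking through the site by hand.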

Due by next class:
- Complete Problem Set #3.
- Narrow down which website you’ll be scraping for your final project.

Week #4: Ethics and getting complicated

Ethics and etiquette of web scraping. When to scrape and when not to scrape. Honoring robots.txt. Where to get data. Tackling websites that were built terribly. If we have time: next-level web scraping with Selenium, how to use an API.
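To make "honoring robots.txt" concrete, here is a minimal sketch that checks whether a scraper is allowed to fetch a page. It is an illustration only: the robots.txt rules, the example.com addresses, the "my-class-scraper" user agent and the allowed helper are invented for this example. It assumes Python 3 and uses the standard library's urllib.robotparser, reading the rules from a string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt. A real one lives at the root of a site,
# for example https://example.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="my-class-scraper"):
    """Return True if the rules above let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/data/page1.html"))
print(allowed("https://example.com/private/records.html"))
```

With these rules, the first address is allowed and the second is not. A scraper would run a check like this before each request and skip anything the site asks robots to stay away from.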

Due by next class:
- Work on final project

Week #5: Final Projects

Work on final scrapers in class. Class discussions on problems or struggles. Doing your coding outside of PythonAnywhere. General wrap-up.

🎉 Party 🎉

Due by noon on Sunday, May 24th:
- Complete final projects