Using Python to Download Multiple PDFs Quickly…

Photo by: Danni Simmonds (freeimages.com)

Sometimes finding data feels a lot like taking on a mountain. Recently I came across some data related to international adoption on Travel.State.Gov. The data was laid out nicely in Plotly, but not in the way I wanted to look at it. Surely they provided a raw data download link? Nope, just a link to annual PDF reports that contain the data I need. Doh!


It hurts my heart a little to realize that Jerry Maguire came out in 1996. Maybe if the movie were made today, 40-somethings all over the world would shout the mantra, “SHOW ME THE DATA!”

If you are a data scientist looking for data, there are only so many free resources to download before you realize that you need to learn how to scrape web pages.

Before we start, let me state that web scraping should be done responsibly, with as little impact as possible on the host servers, and we should all be respectful of others’ creative work and copyrighted content. Be sure to read a website’s robots.txt file to understand what is and is not allowed, and stick to the rules. …
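As a rough illustration of the idea, here is a minimal sketch that pulls PDF links out of a page and keeps only the ones the site's robots.txt rules permit. It uses only the Python standard library. The URL, the HTML snippet, the robots.txt rules, and every function name here are hypothetical placeholders, not the actual Travel.State.Gov page or the code from the full article.

```python
import os
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class PdfLinkParser(HTMLParser):
    """Collect the absolute URL of every <a href> that ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def filter_allowed(links, robots_lines, user_agent="*"):
    """Keep only the links that the given robots.txt lines allow."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return [link for link in links if rules.can_fetch(user_agent, link)]


def download_all(links, delay_seconds=2.0):
    """Download each link, pausing between requests to go easy on the host."""
    for link in links:
        filename = os.path.basename(urlparse(link).path)
        urllib.request.urlretrieve(link, filename)
        time.sleep(delay_seconds)


# Hypothetical inputs so the sketch runs without touching the network.
BASE_URL = "https://example.com/reports/"
HTML = (
    '<a href="2018.pdf">2018</a> '
    '<a href="/private/2019.pdf">2019</a> '
    '<a href="about.html">About</a>'
)
ROBOTS_LINES = ["User-agent: *", "Disallow: /private/"]

pdf_links = find_pdf_links(HTML, BASE_URL)
allowed_links = filter_allowed(pdf_links, ROBOTS_LINES)
print(allowed_links)
# download_all(allowed_links) would fetch the files; it is not called here.
```

On a real site you would fetch the page and its robots.txt first (for example with `RobotFileParser.set_url` and `read`), then pass the allowed links to `download_all`.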


The pythonic way…

Photo by Colin Nixon at www.freeimages.com

Have you ever scanned a document into a PDF as an image and later realized that you actually needed to edit the document? Adobe has built-in optical character recognition (OCR) software that can make for an easy fix, if you have Adobe Professional. If you don’t have this luxury but have a few minutes, keep reading.

What you need…

  1. Python 3
  2. Tesseract OCR: sudo apt-get install tesseract-ocr
  3. These Python libraries: wand, Pillow, pyocr, PySimpleGUI

Set up your virtual environment, choose your Python version, install the libraries, and run the code:

import PySimpleGUI as sg
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import…

And how it can help you reach your fitness goals!

Photo courtesy of Kenia Castro at www.freeimages.com

I was in the final few months of the second year of my medical residency. I had gained over twenty pounds since I entered the program, I was sleep deprived, eating whenever and whatever I could, and felt like crap. Then I took a military-required fitness test and scored the lowest of my career. Something needed to change. That was over ten years ago. Here’s what I’ve learned.

Goals Matter:

When I decided I needed a change, I made a pivotal mistake. Instead of clarifying what those changes needed to be, I just took immediate action with what I thought would work. I told my wife I was going for a run and walked out the door. Three months later, I had lost 30 pounds, was sleeping better, and had definitely improved my running pace (more on efficiency in a later article). But here’s the thing: I hate running. Even worse, when I looked at myself in the mirror, I was not happy with the results. I was skinny. Not the good skinny, but the gaunt, “could be blown over with a strong wind” type of skinny. That’s when something amazing happened. I sprained my foot. I could barely walk, and running was completely out of the question. …


photo “data” by CyberHades is licensed under CC BY-NC 2.0

As a physician data scientist and healthcare administrator, one of the frequent complaints I hear from other data scientists is that it is difficult to get clinicians to accept the validity of their “new prediction tool”. While I personally feel that the perception of the clinical community is shifting towards embracing big data and predictive analytics, I am also acutely aware that there is indeed an environment of mistrust between clinicians, administrators, and data analysts/scientists. What steps can we take to change these perceptions and shift towards an environment of collaboration? Here are my thoughts.

1) Clinicians and Data Scientists speak different languages.

Over 77% of clinicians have undergraduate degrees in either the biological sciences, premed, or another physical science (see US Bureau of Labor link below). Further, while most have participated in some type of bench research, they tend to focus on the biochemical aspects of the research more than on statistical and mathematical modeling. There are certainly exceptions, and finding an MD/PhD, a clinical nurse informaticist, or a clinician with an MPH in the field is not necessarily difficult. Such clinicians bridge a critical gap between the academic and operational medical communities. That said, while one or two classes of calculus are a requirement for medical school admission, statistics is not. …


Ahh… “Just right!”

photo by Mercelo Gerpe at freeimages.com

Finding the right IDE is like stepping into a “The Three Bears” storybook. This one is too simple, this one is too complicated, this one is ‘Just Right’…except for the annoying fact that it’s using Python 2.7.

At least that was my issue with Sublime. I’ve tried other editors and they all work fine, but for some reason I like the look and feel of Sublime. In an attempt to rectify my issue, I of course turned to DuckDuckGo and found the internet was largely silent on an answer. …


By Dietmar Rabich, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38134508

I’m a little bit of an efficiency freak. Little things that make more work drive me a little…well…crazy. Things like uncommented code, improperly labeled files, inconsistent use of camelCase/PascalCase/underscore naming conventions, or having the toilet paper on the roll backwards really, really annoy me (it should roll over the top!). From a data science standpoint, my biggest pet peeve, outside of the use of Excel as a document program, is untidy data. If you don’t know what tidy data is, read this!

With that off my chest, I was playing with some data related to worldwide Systolic Blood Pressure trends (fitting right?) …
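For readers new to the term, here is a minimal sketch of what tidying can look like: reshaping a "wide" table (one column per year) into a "long" one (one row per observation). The helper name, the column names, and the numbers are made-up placeholders for illustration only; they are not the blood pressure data discussed in the article.

```python
def melt_wide_rows(rows, id_key, var_name, value_name):
    """Turn one-column-per-variable rows into one-row-per-observation rows."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every tidy row
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Placeholder values, not real measurements.
wide_rows = [
    {"country": "A", "1990": 120.0, "2000": 122.5},
    {"country": "B", "1990": 130.0, "2000": 128.0},
]

tidy_rows = melt_wide_rows(wide_rows, "country", "year", "sbp")
for row in tidy_rows:
    print(row)
```

Each tidy row now holds exactly one country, one year, and one measurement, which is the shape most plotting and modeling tools expect. With pandas installed, `DataFrame.melt` does the same reshaping.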


hdhut.blogspot.com

Understanding the Central Limit Theorem, Bootstrapping, and why you should care…

“The central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.”

— Wikipedia

For medical providers, data scientists, and statisticians, the central limit theorem is akin to the law of gravity; it is a fundamental truth that holds steady until some important assumptions are violated (see some discussion here). …
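A small simulation can make both ideas concrete. The sketch below is an illustration under my own assumptions (an exponential population, samples of size 50, 2,000 repetitions, and a fixed seed), not the example from the full article. It first shows that means of samples from a skewed distribution cluster around the population mean with spread close to sigma / sqrt(n), then bootstraps a rough 95% interval for the mean from a single sample.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples):
    """Mean of each of n_samples independent samples of size sample_size."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


def bootstrap_means(observed, n_resamples, rng):
    """Means of resamples drawn with replacement from the observed data."""
    size = len(observed)
    return [
        statistics.fmean(rng.choices(observed, k=size))
        for _ in range(n_resamples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Population: exponential with rate 1 (mean 1, sd 1), strongly right-skewed.
SAMPLE_SIZE = 50
clt_means = sample_means(lambda: rng.expovariate(1.0), SAMPLE_SIZE, 2000)
mean_of_means = statistics.fmean(clt_means)
spread_of_means = statistics.stdev(clt_means)
theoretical_spread = 1.0 / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n)

# Bootstrap: only one observed sample, resampled many times.
observed = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
boot = sorted(bootstrap_means(observed, 2000, rng))
ci_low = boot[int(0.025 * len(boot))]
ci_high = boot[int(0.975 * len(boot)) - 1]

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f} (theory {theoretical_spread:.3f})")
print(f"bootstrap 95% interval for the mean: {ci_low:.3f} to {ci_high:.3f}")
```

The percentile interval used here is the simplest bootstrap variant; other variants exist and may behave better for small or heavily skewed samples.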


Image Source: medium.com

Deriving data from PDF files: What do you do when the data you need isn’t easily accessible?

I recently found myself needing to parse out over 2,000 ICD codes from a PDF file at the start of a proof-of-concept project. I have a reasonable grasp of regex and more than novice experience in Python. Further, I have parsed some data and tables from PDFs in the past, but the table layout in this PDF limited the options available for parsing the data. In the past, I have used Tabula-py with good success. Unfortunately, I was having multiple issues getting Tabula to parse out the data I needed in this case. If you’re interested in the original PDF file, the link is here. …
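To give a flavor of the regex route, here is a minimal sketch that assumes the PDF text has already been extracted to a plain string and that each row starts with a code shaped like ICD-10-CM (a letter, a digit, an alphanumeric character, then an optional decimal part). Both assumptions are mine: the teaser does not say which ICD revision or extraction tool was used, and the pattern, function name, and sample text are hypothetical.

```python
import re

# Assumed ICD-10-CM shape: letter, digit, alphanumeric, optional ".xxxx".
ICD10_ROW = re.compile(r"^([A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?)\s+(.+)$")


def parse_icd_rows(text):
    """Return (code, description) pairs for lines that start with a code."""
    rows = []
    for line in text.splitlines():
        match = ICD10_ROW.match(line.strip())
        if match:
            rows.append((match.group(1), match.group(2)))
    return rows


# Hypothetical extracted text, including header and footer noise.
SAMPLE_TEXT = """Code Description
E11.9 Type 2 diabetes mellitus without complications
I10 Essential (primary) hypertension
Page 3 of 40
J45.909 Unspecified asthma, uncomplicated
"""

icd_rows = parse_icd_rows(SAMPLE_TEXT)
for code, description in icd_rows:
    print(code, "|", description)
```

Matching line by line keeps header and page-footer noise out of the results, since only lines that begin with a code-shaped token are kept. Real PDFs often wrap long descriptions across lines, which this sketch does not handle.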

About

MB

Physician Data Scientist & Pythonista
