How to scrape a PDF for keywords using Ruby
I recently started a company called Jobcrate in Austin, Texas. The goal of Jobcrate is to build a platform that makes it easy for talent to apply to hundreds of technology companies and startups, while offering those companies and startups an affordable way to source and recruit the best talent in the tech industry.
To do this, we need to get the best talent signed up on our platform, and we have to make it easy for companies to search through that talent. We require talent to apply with a PDF version of their resume. As you may know, applying with a PDF resume is pretty standard in today’s job market.
PDFs are typically pretty hard to work with unless you have the original MS Word file, Photoshop file, or whatever you originally created the PDF from. At Jobcrate, however, we have to analyze every single application and resume that comes our way.
To do this, we “scrape” specific keywords, skills and talents using the pdf-reader gem (https://github.com/yob/pdf-reader). Our platform is built on Rails (STILL THE BEST!), and here is a method in one of our models that checks a resume for each keyword.
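io = URI.open(user.resume_pdf.to_s) # needs require 'open-uri'; plain open(url) stopped working in Ruby 3.0
reader = PDF::Reader.new(io)
reader.pages.each do |page|
  string = page.text
  KeywordHelper.keywords.each do |word|
    # Assumed from here on (the original snippet is cut off): store each new match.
    user.keywords << word if string.include?(word) && !user.keywords.include?(word)
  end
end
user.save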
Since this method could take a while to run, because it checks every page against every keyword, we run it in a background job so the user never has to wait on it.
In this method, we pass in the user object. We then grab the actual PDF (we use Paperclip and store the PDF in an S3 bucket). Once we have that instance of the resume, we go through each page (typically there is just one, but you never know). We then turn the page into one LOONGGGG string. Internally, we have built out a rather large list of skills, talents and basic keywords. We turn this into an array that looks something like this:
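The actual list is not reproduced here, so the following is only a minimal sketch of the idea. The KEYWORDS values, the find_keywords helper and the sample page text are all hypothetical stand-ins, not Jobcrate's real code. It also swaps the plain substring check for a case-insensitive whole-word match, so that "Java" is not reported just because "JavaScript" appears.

```ruby
# Hypothetical keyword list; a real one would be much larger.
KEYWORDS = ['Angular', 'Ruby', 'Rails', 'Java', 'JavaScript', 'Postgres'].freeze

# Returns the keywords that appear in the text as whole words, ignoring case.
def find_keywords(text, keywords = KEYWORDS)
  keywords.select do |word|
    # The lookarounds reject matches that sit inside a longer word or number.
    pattern = /(?<![[:alnum:]])#{Regexp.escape(word)}(?![[:alnum:]])/i
    text.match?(pattern)
  end
end

# Each string stands in for the text of one PDF page.
pages = [
  'Built Angular front ends backed by Rails APIs.',
  'Databases: Postgres. Languages: JavaScript, some rails scripting.'
]

# Collect matches across all pages and drop duplicates.
found = pages.flat_map { |page_text| find_keywords(page_text) }.uniq
puts found.inspect
```

Matching whole words costs one regular expression per keyword per page, which is one more reason to keep this work off the request path.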
With Rails 5 and Postgres, our user model can have an array field in the database. We simply have a keywords column, and we store the keywords we find in a resume in that array.
This allows us to easily create a search feature for companies. Let’s say a company needs to find an Angular developer who is looking for a job. That company can now simply search ‘Angular’ and find awesome engineers who have Angular experience. We have plenty of other features that help companies find talent, but this is one of our core features.
And that’s it! If you have any questions, are looking for a job in Austin, or simply want to chat, shoot me an email: firstname.lastname@example.org
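The search code itself is not shown, so here is a framework-free sketch of the lookup, assuming each user carries an array of keywords as described above. The Candidate struct, the sample records and the search_candidates helper are hypothetical; in the real app this would be a database query against the keywords column rather than an in-memory filter.

```ruby
# Hypothetical in-memory stand-in for user records with a keywords array.
Candidate = Struct.new(:name, :keywords)

CANDIDATES = [
  Candidate.new('Ada', %w[Angular JavaScript]),
  Candidate.new('Grace', %w[Ruby Rails Postgres]),
  Candidate.new('Linus', %w[angular Rails])
].freeze

# Returns the candidates whose keywords include the search term, ignoring case.
def search_candidates(term, candidates = CANDIDATES)
  needle = term.downcase
  candidates.select { |c| c.keywords.any? { |k| k.downcase == needle } }
end

matches = search_candidates('Angular').map(&:name)
puts matches.inspect
```

With a Postgres array column, the same membership test can be pushed into SQL with a condition along the lines of 'Angular' = ANY (keywords), which avoids loading every user into memory.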