Preserving web content of links provided in Word Documents, using AWS: EC2 and S3

Natalie Olivo
12 min read · Aug 23, 2018


Thomas Edison State University (TESU) of Trenton, NJ is continuing its tradition of taking the next steps towards democratizing education.

“Identified by Forbes magazine as one of the top 20 colleges and universities in the nation in the use of technology to create learning opportunities for adults, Thomas Edison State University is a national leader in the assessment of adult learning and a pioneer in the use of educational technologies. The New York Times has stated that Thomas Edison State University is ‘the college that paved the way for flexibility.’”

https://www.tesu.edu/academics/catalog/about-tesu

TESU is a member of a network of Competency-Based Education institutions, and one of the unique opportunities TESU provides is the ability for students to demonstrate mastery before enrolling.

TESU makes syllabi built on open-source materials, covering both career skills and the liberal arts, available to its students and to the public. Widely dispersed learning material is a familiar concept in many industries; to learn Python, for example, a student can draw on a number of open-source resources for self-guided learning. TESU aims to make its course content similarly accessible, and it is using Amazon Web Services (AWS) to deliver. I had the privilege of being contracted to help them adopt a new technology to accomplish their goals.

The Business Case

TESU has hundreds of lesson syllabi (saved as MS Word documents), each with 20–80 URLs in the text. It is not practical for an instructor to go through every document each semester and make sure all of the links are up to date and functional.

Why would a link stop working?

  • A URL change. Maybe the content moved to another part of the site, or perhaps the site has a new name.
  • Deletion. Maybe the content was deleted altogether.

Solution

Retrieve, store and host each website’s html code so the learning material is still available.

To do this manually, you would

  1. Go to the website with content you wish to preserve.
  2. In a Chrome browser, right-click anywhere on the page that is not a link and click “View Page Source”.
  3. Copy and paste the HTML content into its own document, and host it.

Note that this only preserves the HTML; images and links to other resources are copied as-is. That means if a website references graphics using relative hyperlinks (rather than absolute hyperlinks), or if the content has interactive components that rely on javascript, those features will not be preserved.
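To make the distinction concrete, here is a minimal Python sketch (the page and image URLs are hypothetical) showing how a relative link resolves against the page it came from, which is why it breaks once the HTML is re-hosted somewhere else:

from urllib.parse import urljoin

# A relative link only makes sense in the context of the page that contains it.
original_page = "https://example.edu/courses/bio-101/syllabus.html"   # hypothetical source page
relative_img = "images/cell-diagram.png"                              # hypothetical relative hyperlink

print(urljoin(original_page, relative_img))
# https://example.edu/courses/bio-101/images/cell-diagram.png  <- resolves on the original site

# Once the HTML is copied to S3, the same relative path resolves against the S3 URL instead.
hosted_copy = "https://s3.amazonaws.com/my-bucket/html_content/link_001_005.html"  # hypothetical hosted copy
print(urljoin(hosted_copy, relative_img))
# https://s3.amazonaws.com/my-bucket/html_content/images/cell-diagram.png  <- nothing is stored here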

For AWS Admins and Practitioners:

Step 1a: Set up IAM Users and Permissions

AWS best practices call for doing everything through IAM users rather than the root account. To accomplish this project, the IAM users will need full access to both the S3 and EC2 services.

  1. Navigate to IAM

2. Click “Users” on the side panel

3. Click “Add User” and allow both programmatic and console access.

4. All of the information for this account will now need to be securely transmitted to the intended account holder:

  • Download and send the CSV containing the access keys (including the Secret Access Key)
  • Email instructions to log in using the “Send email” link
  • Send or tell them the password you assigned for their first log-in.

In case you miss this screen, you can regenerate access keys by going to IAM Security Credentials and clicking “Create Access Key”.

5. To grant permissions to your user, go to “Users” from the IAM dashboard again, click “Add Permissions”, then “Attach existing policies directly”, and add Full Access for both S3 and EC2.

Step 1b: As an IAM user, here’s what to do with the keys you just received:

Prerequisites: Install the AWS CLI, configure it, and manage profiles

Type the following into your command line, replacing your access key and secret access key accordingly:

aws configure set profile.cbe.aws_access_key_id AKIAJIQH6TFZ...
aws configure set profile.cbe.aws_secret_access_key ga9i3PumE...
aws configure set profile.cbe.region us-east-1

Remember: these keys are for your eyes only! Treat them as passwords. If they are made public anywhere on the internet, bots will find them and try to use your EC2 resources. Both GitHub and AWS have safeguards in place for this, including e-mail notifications and suspending your AWS account if they suspect it has been accessed by someone besides you. You can recover your account by following the steps AWS Support provides.

If you have multiple profiles, you will also need to run the following in your command line before each session (this tip comes slightly out of order; it will come up again after we set up the .pem key and SSH into the instance):

export AWS_PROFILE=name_of_aws_profile_in_your_.aws/credentials_file

Step 2: Upload word docs to S3 using the GUI

  1. Navigate to S3
  2. Create a bucket, leaving everything as it is until you get to Permissions. For the purposes of this project, grant public read access to this bucket.

3. Create folders in your S3 bucket to stay organized. We will save our Word docs in the “docs” folder, while “html_content” and “pdfs” will be where we store the HTML content and PDFs that belong to each syllabus link. These folders are not strictly required right now; our Python code creates them if they don’t already exist. At this point, “html_content” and “pdfs” would not have any files in them.

4. Upload your docs (the lesson syllabi) into the “docs” folder!

Ok now we’ve set up our storage. The html content s3 folder will eventually be a repository of static websites. We will be able to replace the links in the syllabi with new ones we can rely on to not change or be deleted.

Step 3: Host the Python script on EC2 by way of configuring Jupyter Notebook in your EC2 instance

Now we need a place to run the code that will scrape the docs, store the HTML content in the “html_content” folder, and assist in quality checking.

  1. Set up your EC2 instance: navigate to the EC2 dashboard and click “Create Instance”.

2. Choose a free tier Linux instance and configure it:

  • Add Storage (8 GiB is fine; this will be the size of your volume, which is automatically created and attached to each EC2 instance)
  • Add tags (none is fine)
  • Configure Security Group: click Add Rule
  • Custom TCP Rule, port 8888, source 0.0.0.0/0

3. Click Launch — yay you now have a virtual computer you can remote into!

Great! Remember where this connection information is stored, especially the public DNS name; you will need it later.

4. Download the .pem file (the private key for the key pair created with the instance); the administrator will need to give this .pem file to the relevant IAM users.

Best practices for where to store your .pem file: you only need to do the rest of this step the first time you connect to an EC2 instance.

Using your command line, in a new window (CMD+n) type the following:

mkdir .ssh

^This creates a folder called .ssh in your home directory on your local machine (a new terminal window opens there by default). You won’t see it because the period preceding the name makes it a hidden folder.

mv Downloads/cbe.pem .ssh           #I named my pem key "cbe"

^This moves your key from your Downloads folder to your .ssh folder.

chmod 400 ~/.ssh/cbe.pem

^This sets the access permissions on your .pem key so that only you can read it; SSH will refuse to use a key with looser permissions.

Now to SSH in (and yes, “SSH” is a verb). Before you do, note that if you have multiple AWS profiles, you will need to be on the right one.

To view profiles:

nano ~/.aws/credentials
Make sure you can see the profile you wish to use.

If you have multiple profiles, you will need to tell AWS which profile is being used in order to modify the EC2 instance. To be in the right profile, type the following into the command line before you start:

export AWS_PROFILE=cbe              #because I named my profile cbe

Now you can SSH. SSH is short for Secure Shell and it’s a way to securely access a remote computer.

ssh -i ~/.ssh/cbe.pem ec2-user@public_dns_key
# put your own public DNS name after the @ sign; keep ec2-user the same

Yay! You did it!

If this times out, AWS provides resources to troubleshoot your connection.

For Pythonistas:

Prerequisites: Install everything you need onto your EC2 Instance.

Install Anaconda by typing the following. The Hacker Noon blog I linked does a great job of giving more detail. I got the link to the Anaconda download files from the Anaconda Installer archive, as detailed in the Hacker Noon blog.

In the EC2 terminal:

wget https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh
bash Anaconda3-5.3.1-Linux-x86_64.sh
# respond affirmatively to the prompts: press ENTER to continue, then answer 'yes'
# to continue the installation
# set anaconda as your default environment when asked
source .bashrc
which python
# check which Python interpreter is now the default
# To get out of the Python 3 REPL just hold CONTROL then hit "d" or type quit()

Agree to the license by typing yes, press ENTER, and then type yes again when prompted.

Install other packages:

sudo yum install python36
sudo pip-3.6 install boto3

If you come across any errors about missing packages later on, install them the same way.
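A quick way to confirm that boto3 is installed and can reach your account (assuming credentials are available to it on the instance, either via aws configure or an attached IAM role) is to list your buckets:

import boto3

# Lists the buckets visible to the configured credentials; if this prints the
# bucket you created in Step 2, the instance is ready to run the scraper.
s3_client = boto3.client("s3")
for bucket in s3_client.list_buckets()["Buckets"]:
    print(bucket["Name"])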

Notebooks are a great way to see how your code works and explain to others what it’s doing, so we will use Jupyter Notebook. The following blog does a great job of walking you through the process; it is repeated here as briefly as possible.

Configure Jupyter Notebook on your EC2 Instance.

  1. Create a jupyter notebook password.

In the EC2 instance terminal:

ipython
from IPython.lib import passwd
passwd()
# type your password when prompted
# Save the sha hash it returns.
# I provide a doctored sha hash as an example of what it should look like:
# sha1:89b37eb3bb72:097a36~*lol*~b8b95f23705cd8141b8fc
exit

2. Configure the notebook to work from the browser, and create an SSL certificate so the browser will trust our Jupyter server:

jupyter notebook --generate-config
mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
# it will prompt you for some information; what you type in is inconsequential, so you can type whatever you like
cd

3. Edit the Jupyter configuration file using vim. Vim lets you open files and edit them like a text editor within the terminal.

vim .jupyter/jupyter_notebook_config.py
If it doesn’t look like this, click [ESC] and type :q!, then type cd [ENTER]

Hit “i” to enter insert mode, and make the following edits. This part deviates from the blog I linked. Remember when you preserved your sha hash? You will use it here!

c = get_config()

# Kernel config
c.IPKernelApp.pylab = 'inline'
# if you want plotting support always on in your notebook

# Notebook config
c.NotebookApp.certfile = u'/home/ec2-user/certs/mycert.pem'  # location of your certificate file
c.NotebookApp.ip = '*'
# sometimes this produces an error later;
# if so, replace '*' with '0.0.0.0'
c.NotebookApp.open_browser = False  # so the notebook does not open a browser by default
c.NotebookApp.password = u'sha1:89b37eb3bb72:097a36~*lol*~b8b95f23705cd8141b8fc'  # edit this with the SHA hash that you generated
c.NotebookApp.port = 8888  # this is the port we opened in the security group in Step 3

Once you have that in, hit [ESC] and type :wq to save and quit out of vim.

4. Create a folder in your EC2 instance for your notebooks. In your EC2 terminal:

mkdir Notebooks
cd Notebooks
# start your notebook!
jupyter notebook

It should start the Jupyter server and begin printing log output in the terminal.

If this produces an error, check here for a solution. You may need to go back into the vim editor and replace c.NotebookApp.ip = '*' with c.NotebookApp.ip = '0.0.0.0'.

5. Access it from your browser.

Open up a new tab and type into the address bar: https://your_ec2_instance_public_dns_key:8888

If you’re in Chrome, you will see a warning about the self-signed certificate: click “Advanced”, then “Proceed to…”.

Now enter the password you created back in step 1 of the configuration. Great! Now you’re ready to get scraping, storing, and checking! Below I describe the scraper; please refer to my GitHub to see how I checked that it was working as intended.

As a reminder: to exit the EC2 instance in the terminal, type enter, tilde, period: [ENTER]+~+.

Step 1: Read in each .doc file from the S3 bucket and save it onto the EC2 instance.
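As a rough sketch of this step (the bucket name is a placeholder, and I’m assuming the docs/ prefix from Step 2), boto3 can list everything under docs/ and download each file onto the instance:

import os
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("tesu-syllabi-demo")     # hypothetical bucket name from Step 2

os.makedirs("docs", exist_ok=True)
for obj in bucket.objects.filter(Prefix="docs/"):
    if obj.key.endswith((".doc", ".docx")):
        # Save each syllabus locally so the XML parser in the next step can open it.
        bucket.download_file(obj.key, os.path.join("docs", os.path.basename(obj.key)))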

Step 2: Using the xml library, get the XML content from each .doc file

from xml.etree.cElementTree import XML
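For context, a .docx file is a zip archive whose body lives in word/document.xml, so one way to get the XML string (a sketch that assumes the modern .docx format rather than the legacy binary .doc, and a hypothetical file name) is:

import zipfile
from xml.etree.ElementTree import XML   # plain ElementTree; cElementTree above is an older alias for the same parser

# A .docx file is a zip archive; the document body is stored as word/document.xml.
with zipfile.ZipFile("docs/lesson_01.docx") as docx:      # hypothetical file name
    xml_bytes = docx.read("word/document.xml")

tree = XML(xml_bytes)                    # parsed element tree, if you want to walk the nodes
xml_str = xml_bytes.decode("utf-8")      # raw XML string handed to the regex in the next step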

Step 3: Using regex, find all URLs in the text and store them in a list.

import re

link_list = re.findall(r'>http.*?<', xml_str)[1:]

Step 4: Store all html content and pdfs in an S3 bucket

s3.Bucket(bucket_name).put_object(Key=key, Body=content, ContentType='text/html')
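Putting the pieces around that call, a sketch of the fetch-and-store loop might look like the following; the bucket name, key naming scheme, and request timeout are my own assumptions rather than TESU’s exact code, and link_list comes from Step 3:

import boto3
import requests

s3 = boto3.resource("s3")
bucket_name = "tesu-syllabi-demo"                          # hypothetical bucket name

for i, raw_link in enumerate(link_list, start=1):
    url = raw_link.strip('><')                             # the regex in Step 3 keeps the surrounding > and < characters
    key = "html_content/link_001_{:03d}.html".format(i)    # e.g. link_001_005 for the fifth link in document 1
    try:
        content = requests.get(url, timeout=30).text       # skip pages that hang
    except requests.RequestException:
        continue                                           # record the failure as a comment instead
    s3.Bucket(bucket_name).put_object(Key=key, Body=content, ContentType='text/html')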

Step 5: Keep a dataframe recording all documents, urls, and comments

For this particular scraper, each link is assigned an identifier based on which document it is in and which link it is; for example, “link_001_005” is the fifth link in the first document. We do this with counters that increment as the code runs. We also store the document name and comments noting whether the link is a YouTube video or a PDF.
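A minimal sketch of that bookkeeping with pandas; the column names and document name are my own placeholders:

import pandas as pd

records = []
doc_counter = 1                                          # increments once per document
for link_counter, raw_link in enumerate(link_list, start=1):
    url = raw_link.strip('><')
    link_id = "link_{:03d}_{:03d}".format(doc_counter, link_counter)
    comment = ""
    if "youtube.com" in url or "youtu.be" in url:
        comment = "youtube video"
    elif url.lower().endswith(".pdf"):
        comment = "pdf"
    records.append({"link_id": link_id,
                    "document": "lesson_01.docx",        # hypothetical document name
                    "url": url,
                    "comment": comment})

link_df = pd.DataFrame(records)
link_df.to_csv("link_directory.csv", index=False)        # keep a copy alongside the notebook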

Step 6: Create a list of all the links so there is a central directory of all your work and it can easily be referenced:

Here is an example of what I’m talking about: https://s3.amazonaws.com/aals1/index.html

To make that work, you will need to programmatically change the metadata of each item in your S3 bucket so that links open in the browser rather than downloading a .html file onto the user’s computer when clicked.
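One way to do that with boto3 (a sketch; the bucket name is a placeholder) is to copy each object onto itself while replacing its metadata, setting ContentType to text/html and ContentDisposition to inline so browsers render the page instead of downloading it:

import boto3

s3_client = boto3.client("s3")
bucket_name = "tesu-syllabi-demo"                        # hypothetical bucket name

paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name, Prefix="html_content/"):
    for obj in page.get("Contents", []):
        # Copy the object onto itself, replacing its metadata so the browser
        # renders the page inline instead of saving a .html file.
        s3_client.copy_object(
            Bucket=bucket_name,
            Key=obj["Key"],
            CopySource={"Bucket": bucket_name, "Key": obj["Key"]},
            ContentType="text/html",
            ContentDisposition="inline",
            MetadataDirective="REPLACE",
        )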

Step Always: Check your work

This is an important aspect of any project.

Checking each link against what was actually stored was helpful in identifying which pages got skipped because they took too long to load or had other connectivity issues.
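A sketch of one such check: request each stored URL again with a short timeout and log the result, so anything that timed out or returned an error can be flagged for a retry (this assumes link_list from Step 3 and that the sites respond to HEAD requests):

import requests

problem_links = []
for raw_link in link_list:
    url = raw_link.strip('><')
    try:
        status = requests.head(url, timeout=10, allow_redirects=True).status_code
    except requests.RequestException as err:
        status = repr(err)                   # timeouts and connection errors land here
    if status != 200:
        problem_links.append((url, status))

for url, status in problem_links:
    print(status, url)                       # review these by hand or re-run the scraper on them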

Next Steps:

Update course materials

Use the data produced in Step 5 to update the course syllabi with the new links to the preserved content. The following command, run from your local machine (not the EC2 instance), will save a copy of the index.html file to your local machine’s Desktop.

scp -i ~/.ssh/cbe.pem -r ec2-user@public_dns_key.compute-1.amazonaws.com:Notebooks/index.html ~/Desktop/

Perfect the process and make it more efficient and user friendly

There are a number of limitations, including but not limited to:

  • Website interactivity and functionality being disabled on our static copies, since only the HTML is preserved
  • Websites blocking automated requests

Feel free to submit pull requests and suggestions to help us tackle these issues.
