Essential HTML Tags for Web Scraping Every Data Scientist Should Know

Pallavi Padav
Women in Technology
7 min read · Jun 11, 2024
https://www.verbolia.com/6-html-tags-to-improve-your-seo-and-rank-better/

In the world of data science, proficiency in programming languages like Python and R often takes center stage. However, understanding the foundational elements of web development, such as HTML, can significantly enhance a data scientist’s skill set. HTML (HyperText Markup Language) is the backbone of web pages, enabling the structuring and presentation of content on the internet. In this blog, we will delve into the essential HTML tags that every data scientist should know, so that you can do web scraping and work with Flask, a web framework for Python.

HTML is the language developers use to build web pages, and HTML tags are its key tools. They’re like instructions or labels that you put around different parts of your web content to tell web browsers how to display them.

Most tags come in pairs: an opening tag such as <p> and a closing tag such as </p>, each enclosed in angle brackets (“<” and “>”).

Basic HTML Tags

1. DOCTYPE <!DOCTYPE>

The HTML document type declaration, also known as DOCTYPE, is the first line of code in every HTML or XHTML document. <!DOCTYPE> is not an HTML tag; instead, it is “information” to the browser about what version of HTML the page is written in.

Doctype syntax for HTML5 and beyond:

<!DOCTYPE html>

Doctype syntax for strict HTML 4.01

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
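
When scraping, it can be handy to check which doctype a page declares. Here is a minimal sketch using Python's built-in html.parser module. The class name DoctypeFinder and the helper find_doctype are hypothetical names chosen for this illustration, and the sample pages are made up from the two doctypes above.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Records the first <!DOCTYPE ...> declaration seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "DOCTYPE html".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html_text):
    """Return the doctype declaration text, or None if the page has none."""
    finder = DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.doctype


html5_page = "<!DOCTYPE html><html><head><title>Welcome</title></head></html>"
html4_page = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
              '"http://www.w3.org/TR/html4/strict.dtd"><html></html>')

print(find_doctype(html5_page))  # DOCTYPE html
print(find_doctype(html4_page))
```

A page without any declaration simply returns None, which is a common situation on older sites.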

2. HTML Tag <html>

Syntax <html> … </html>

The <html> tag is the root element of a web page. All other HTML elements are written between the <html> and </html> tags. The <html> tag comes right after the <!DOCTYPE html> declaration at the beginning of the file.

<html>Content</html>

3. Head Tag <head>

Syntax <head> … </head>

The <head> tag contains metadata about the webpage, including the title, meta tags, and linked stylesheets. It is part of the document’s structure, but its contents are not displayed in the page body.

<!DOCTYPE html>  
<html>
<head>
<title>Welcome to our page</title>
</head>
</html>

4. Title Tag <title>

Syntax <title> </title>

The <title> tag specifies the title of the web page. It is placed inside the <head> tag. The content between <title> and </title> appears on the tab or title bar of the browser window. Since it is written within <head> … </head>, it is not displayed on the webpage itself.

<title>Title tag: the ultimate reference guide to make it work for you</title>
https://www.conductor.com/academy/title-tag/
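
The page title is often the first thing a scraper collects. The sketch below shows one way to read it with Python's standard html.parser module. TitleParser and get_title are hypothetical names for this example, and in practice a library such as Beautiful Soup is a common alternative.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text while we are inside the <title> element.
        if self.in_title:
            self.title_parts.append(data)


def get_title(html_text):
    """Return the page title as a string (empty if there is no <title>)."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip()


page = """<!DOCTYPE html>
<html>
<head>
<title>Welcome to our page</title>
</head>
</html>"""

print(get_title(page))  # Welcome to our page
```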

5. Body Tag <body>

Syntax <body> … </body>

The <body> tag defines the main content of an HTML document, which is displayed in the browser. It can contain text, paragraphs, headings, images, tables, links, videos, etc.

The <body> element must come after the <head> element, that is, between the </head> and </html> tags.

<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>
<h1>HTML Tag guide</h1>
<hr>
<p>HTML tags are the key tools in this language. They're like instructions or labels that you put around different parts of your web content to tell web browsers how to display them.</p>
</body>
</html>

6. Paragraph Tag <p>

Syntax <p> </p>

The <p> tag defines a paragraph. When displaying the page, the browser collapses any run of consecutive spaces and line breaks inside the paragraph into a single space.

<p>  
I am
going to provide
you a tutorial on HTML
and hope that it will
be very beneficial for you.
</p>
<p>
Look, I put here a lot
of spaces but I know, Browser will ignore it.
</p>
<p>
You cannot determine the display of HTML</p>
<p>because resized windows may create different result.
</p>
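
Scraped paragraph text keeps the raw spaces and line breaks from the HTML source, so it usually needs the same clean-up a browser applies. The following is a rough approximation of that behaviour in Python, not an exact copy of the browser's rules. The function name collapse_whitespace is a hypothetical name for this sketch.

```python
def collapse_whitespace(text):
    """Approximate how a browser renders whitespace inside a paragraph."""
    # str.split() with no arguments splits on any run of spaces, tabs,
    # or newlines, so joining with one space collapses every run.
    return " ".join(text.split())


raw_paragraph = """
I am
going    to provide
you a tutorial on HTML
"""

print(collapse_whitespace(raw_paragraph))
# I am going to provide you a tutorial on HTML
```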

7. Heading Tags <h1> to <h6>

Titles and subtitles can be displayed on the webpage using heading tags. There are six heading tags in HTML: <h1>, <h2>, <h3>, <h4>, <h5>, and <h6>. The <h1> tag represents the most important heading, while <h6> represents the least important one.

<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>
<h1>HTML Tag guide</h1>
<h1>Heading using h1 </h1>
<h2>Heading using h2</h2>
<h3>Heading using h3</h3>
<h4>Heading using h4</h4>
<h5>Heading using h5</h5>
<h6>Heading using h6</h6>
</body>
</html>
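
Because heading tags carry a level from 1 to 6, they are a convenient way to recover the outline of a scraped page. Below is a small sketch with Python's built-in html.parser. HeadingParser and get_headings are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingParser(HTMLParser):
    """Collects (level, text) pairs for every heading in a page."""

    def __init__(self):
        super().__init__()
        self.current_level = None   # level of the heading being read, if any
        self.buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.current_level = int(tag[1])   # "h3" -> 3
            self.buffer = []

    def handle_data(self, data):
        if self.current_level is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.current_level is not None:
            # Normalise whitespace the way a browser would display it.
            text = " ".join("".join(self.buffer).split())
            self.headings.append((self.current_level, text))
            self.current_level = None


def get_headings(html_text):
    parser = HeadingParser()
    parser.feed(html_text)
    parser.close()
    return parser.headings


page = """<body>
<h1>HTML Tag guide</h1>
<h2>Heading using h2</h2>
<h3>Heading using h3 </h3>
</body>"""

for level, text in get_headings(page):
    # Indent each heading by its level to show the outline.
    print("  " * (level - 1) + text)
```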

8. Image Tag <img>

Syntax : <img src="image.jpg" alt="Description of the image">

The <img> tag is used to insert an image in an HTML document. It is an empty tag that contains attributes only, so it has no closing tag.

Attributes of the image tag include:

  • src: The path or source of the image file. It tells the browser where to look for the image. The image may be located in the same directory or on another server. When a web page loads, the browser gets the image from a web server and inserts it into the page. If the browser cannot find the image, it shows a broken-link icon and the alt text.
  • alt: The required alt attribute provides alternate text for an image if the user for some reason cannot view it.
  • width and height: Define the width and height of the image in pixels.
  • style: Specifies the width and height of an image through inline CSS. It prevents style sheets from changing the size of the image.
<img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">
<img src="img_girl.jpg" alt="Girl in a jacket" style="width:500px;height:600px;">

Image from another folder.

<img src="/images/html5.gif" alt="HTML5 Icon" style="width:128px;height:128px;">

Image from another site

<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>
<h1>Good Morning</h1>

<img src="https://images.pexels.com/photos/1266810/pexels-photo-1266810.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1" alt="Good Morning">
</body>
</html>
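
For scraping, the interesting parts of an <img> tag are its attributes. The sketch below collects the src and alt values of every image on a page using Python's standard html.parser. ImageParser and get_images are hypothetical names for this example, and the sample markup reuses the image tag shown earlier.

```python
from html.parser import HTMLParser


class ImageParser(HTMLParser):
    """Collects the src and alt attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; <img> has no closing tag,
        # so everything we need arrives with the start tag.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append({
                "src": attr_map.get("src"),
                "alt": attr_map.get("alt", ""),
            })


def get_images(html_text):
    parser = ImageParser()
    parser.feed(html_text)
    parser.close()
    return parser.images


page = """<body>
<h1>Good Morning</h1>
<img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">
<img src="/images/html5.gif">
</body>"""

for image in get_images(page):
    print(image["src"], "|", image["alt"] or "(no alt text)")
```

Images that lack an alt attribute come back with an empty string here, which makes them easy to flag.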

9. Anchor Tag <a>

Syntax:

<a href="Document URL" … attributes-list>Link Text</a>

The <a> tag is used to create hyperlinks to other web pages, files, locations within a page, or any other URL. The “href” attribute specifies the URL that the link points to.

<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>
<h1>Good Morning</h1>
<p>Click following link</p>
<a href="https://indianexpress.com/" target="_self">News Headlines!</a>
</body>
</html>
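
Links are what let a scraper move from one page to the next. Here is a minimal sketch that collects every link and its text with Python's html.parser, and uses urllib.parse.urljoin to turn relative href values into absolute URLs. LinkParser, get_links, the example.com base URL, and the "/about" link are all assumptions made for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects (absolute_url, link_text) pairs for every <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None    # URL of the link currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def get_links(html_text, base_url):
    parser = LinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


page = """<body>
<a href="https://indianexpress.com/" target="_self">News Headlines!</a>
<a href="/about">About</a>
</body>"""

for url, text in get_links(page, "https://example.com/news/"):
    print(text, "->", url)
```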

Default appearance of HTML anchor tags

An unvisited link is displayed underlined and blue.

A visited link is displayed underlined and purple.

An active link is underlined and red.


10. Div Tag <div>

Syntax : <div> … </div>

The <div> tag is a container that is used to group other page elements, divide an HTML document into sections, and apply styles to them.


<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>
<h1>Good Morning</h1>
<p>Click following link</p>
<a href="https://indianexpress.com/" target="_self">News Headlines!</a>

<div style="border:4px solid pink;padding:10px;font-size:20px">
<h2>New Delhi</h2>
<p>New Delhi is the capital of India and one of Delhi city's 11 districts.
New Delhi is the seat of all three branches of the Government of India,
hosting the Rashtrapati Bhavan, Sansad Bhavan, and the Supreme Court.</p>
</div>
<div style="border:2px solid green;padding:10px;font-size:20px">
<h2>London</h2>
<p>London is the capital city of England.</p>
<p>London has over 13 million inhabitants.</p>
</div>
</body>
</html>
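
Since each <div> groups related content, scrapers often treat a <div> as one record. The sketch below gathers the text of every top-level <div> with Python's html.parser, keeping a depth counter so that nested <div> tags stay inside their parent's record. DivSectionParser and get_div_sections are hypothetical names for this example.

```python
from html.parser import HTMLParser


class DivSectionParser(HTMLParser):
    """Collects the visible text of each top-level <div> as one string."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <div> elements we are currently inside
        self.sections = []
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._chunks = []    # a new top-level section starts
            self.depth += 1

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse whitespace like a browser.
                self.sections.append(" ".join(" ".join(self._chunks).split()))


def get_div_sections(html_text):
    parser = DivSectionParser()
    parser.feed(html_text)
    parser.close()
    return parser.sections


page = """<body>
<p>Click following link</p>
<div style="border:2px solid green;padding:10px;font-size:20px">
<h2>London</h2>
<p>London is the capital city of England.</p>
</div>
</body>"""

print(get_div_sections(page))
# ['London London is the capital city of England.']
```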

11. Form Tag <form>

Syntax:

<form action="server url" method="get|post">
<!-- input controls, e.g. text field, textarea, radio button, button -->
</form>

The <form> tag is used to create an HTML form for user input. It is the section of a document that contains controls such as text fields, password fields, checkboxes, radio buttons, submit buttons, menus, etc.

The <form> element can contain one or more of the following form elements:

  • <input> : It is used to create form fields, to take input from user.
<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>

<form>
Enter your name <br>
<input type="text" name="username">
</form>

</body>
</html>
  • <textarea> : Used to insert multiple-line text in a form. The size of a <textarea> can be specified either with the “rows” and “cols” attributes or with CSS.
<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
</head>
<body>

<form>
Enter your address:<br>
<textarea rows="2" cols="20"></textarea>
</form>

</body>
</html>
  • <input type="radio"> : A radio button is used to select one option from multiple options. Radio buttons that share the same “name” belong to one group.
<form>
<p>Gender:</p>
<input type="radio" id="male" name="gender" value="male">
<label for="male">Male</label>
<input type="radio" id="female" name="gender" value="female">
<label for="female">Female</label><br>
</form>
  • <label> : This tag defines a label for an <input> element. Its “for” attribute should match the “id” of the input it describes.
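
Before submitting a form from a script, it helps to know which fields the form expects. Here is a small sketch that lists the form controls on a page with Python's html.parser. FormFieldParser and list_form_fields are hypothetical names, and the sample form is an assumption that combines the text input and radio buttons discussed above.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects basic details of <input> and <textarea> controls."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "textarea"):
            attr_map = dict(attrs)
            # An <input> without a type attribute behaves as a text field.
            field_type = attr_map.get("type", "text") if tag == "input" else tag
            self.fields.append({
                "type": field_type,
                "name": attr_map.get("name"),
                "value": attr_map.get("value"),
            })


def list_form_fields(html_text):
    parser = FormFieldParser()
    parser.feed(html_text)
    parser.close()
    return parser.fields


form_html = """<form>
Enter your name <br>
<input type="text" name="username">
<input type="radio" id="male" name="gender" value="male">
<input type="radio" id="female" name="gender" value="female">
</form>"""

for field in list_form_fields(form_html):
    print(field["type"], field["name"], field["value"])
```

The "name" values are the keys a server receives when the form is submitted, so they are usually what a scraper needs to send.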

12. Table Tag <table>

The <table> tag is used to display data in tabular form (rows × columns).

A table row is defined by the <tr> tag, a header cell by the <th> tag, and a data cell by the <td> tag.


<!DOCTYPE html>
<html>
<head>
<style>
table, th, td {
border: 1px solid black;
}

th, td {
padding: 10px;
}
</style>
</head>
<body>

<table>
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
<tr>
<td>January</td>
<td>$100</td>
</tr>
<tr>
<td>February</td>
<td>$80</td>
</tr>
</table>

</body>
</html>
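
Tables are the most directly useful structure for data scientists, since they map naturally onto rows and columns. The sketch below turns a simple table into a list of rows with Python's html.parser. TableParser and parse_table are hypothetical names, and the sketch assumes a single table without nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Converts <tr>/<th>/<td> markup into a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # cells of the row being read
        self._cell = None    # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html_text):
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


table_html = """<table>
<tr><th>Month</th><th>Savings</th></tr>
<tr><td>January</td><td>$100</td></tr>
<tr><td>February</td><td>$80</td></tr>
</table>"""

header, *records = parse_table(table_html)
print(header)    # ['Month', 'Savings']
print(records)   # [['January', '$100'], ['February', '$80']]
```

From here, the header and records can be passed to whatever tabular tool you prefer.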

EndNote:

Thanks for reading the blog. Have thoughts or questions? We’d love to hear from you! Feel free to leave a comment below.

I would love to connect with you on LinkedIn. Feel free to mail me with any queries.

Stay tuned for more exciting content. Until then, happy reading!

