Introduction to growth hacking: How to expand your contacts database virtually infinitely?

Boris Maysel
Published in Analytics Vidhya · 9 min read · Feb 2, 2020

Photo by Jukebox Print on Unsplash

I’ll show you how to add hundreds or even thousands of industry leaders to your marketing database in just a matter of minutes, using Python, data science, and growth hacking methods.

This post is more related to marketing than to sales or BD.

Contacts is a complicated topic — the roadmap

Since taking care of contacts is an extremely important task, I’ve divided this subject into several parts that will be covered in different posts. As these posts are published, I’ll add links to the list of topics below:

  1. How to get new contacts to your database (this part)
  2. In the next post, I’ll cover how to come up with a list of target companies for your marketing and business development folks, as well as how to retrieve physical addresses and, in some cases, contact details. We’ll use a technique very similar to the one shown here
  3. Why your existing contacts database sucks and what you can do about it: what contact details you really need and how to deal with missing data. One of the subjects is how to normalize company names, which has many implications, like sorting PoS reports
  4. We’ll use machine learning to perform smart classification of the database to significantly improve targeting and marketing performance
  5. After we sort out the existing contacts database, we’ll use machine learning to guess emails with high probability. We’ll then return to this post and see how that can be used to add emails to the extracted contacts

Before we start, a few words about growth hacking. In my opinion, growth has to do with achieving a massive outcome, and hacking has to do with using unconventional methods, being lean and having a very limited budget. Here is the part I liked from the Wikipedia definition:

“The typical growth hacker often focusses on finding smarter, low-cost alternatives to traditional marketing”.

This is exactly what we are going to do here. It will require some basic knowledge of HTML and, of course, Python. Nothing too complicated, though.

How to add more contacts to your database?

I’m not talking about the organic growth of your contacts database through campaigns and forms that encourage users to give their contact details to your company, usually in exchange for something valuable. Nor am I talking about buying contact lists or using services like ZoomInfo.

We are dealing with growth hacking. Therefore, we are going to use non-traditional methods and stay lean.

The goal here is to add batches of hundreds of contacts to the marketing database. You can, of course, replicate this technique and run several iterations, increasing the number of added contacts to thousands.

How are we going to do that?

  1. Identify contacts sources
  2. Understand how contacts data is structured
  3. Code

Contacts sources

We’ll focus on publicly available contact sources: conference sites, members of industry organizations, lists of dealers, etc.

It doesn’t matter what industry you are targeting. The starting point is industry events. So, the first question is: what are the leading events in this industry? And the second: who are the speakers?

A quick Google search will return a bunch of events. To name a few:

  1. Broadband World Forum — https://tmt.knect365.com/bbwf/
  2. Content Marketing World — https://www.contentmarketingworld.com/
  3. Broadband Communities Summit — https://www.bbcmag.com/events/summit-2019/; I took this one because it is a little bit more complex. I just want to illustrate the robustness of the method

Contacts structure on the web page

Usually, these event sites contain a speakers page. All the speaker entries are organized in a dedicated table, and each entry contains a name, a title and a company name (the order and format may differ). This is exactly what we need.

Let’s start with the Broadband World Forum:

Right-click on a speaker’s name, and click Inspect

In the HTML code, we can see that there is a ‘div’ element that contains all the speakers’ entries:

And there are speaker entries under each table element:

You can see that the speaker’s name, for example, is defined in a ‘p’ element with the class ‘c-speaker-block__name’; the same applies to the job title and the company name.

In this case, every entry has a unique class name, which is not always the case. In the second example, we’ll see a more cumbersome structure. The good news is that the code is robust enough to handle all these cases.

Coding everything together

On top of Pandas, we’ll rely on Python’s BeautifulSoup library, which enables easy manipulation of HTML code and extraction of different types of information.

First, let’s define a utility function responsible for downloading a web page and returning a string containing the HTML code. We’ll use requests and socket (the latter comes with Python 3).

import pandas as pd
from bs4 import BeautifulSoup
import requests
import socket

def read_url(url, timeout=10):
    socket.setdefaulttimeout(timeout)
    return requests.get(url).text
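As a side note, requests also accepts a per-call timeout, so a variant of read_url could skip the global socket default entirely. This is a sketch, not the original code; the choice to raise on HTTP errors is my assumption:

```python
import requests

def read_url(url, timeout=10):
    # Variant of read_url: pass the timeout straight to requests instead of
    # setting a process-wide socket default, and fail loudly on HTTP errors
    # instead of returning the error page's HTML.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Either version works for the examples below; the per-call timeout just avoids side effects on other sockets in the same process.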

Second, to make the code more reusable, we’ll take an OO approach and define a base class responsible for parsing a site (passed in the constructor). Each site you need to extract contacts from will get its own class that inherits from the base class and overrides its 2 methods:

  1. get_speakers_entries — responsible for identifying the location of the speaker entries (the table) and returning a list of entries
  2. parse_speaker_entry — parses each entry and returns the name, title and company

This is the implementation of the SpeakerParser base class:

class SpeakerParser:
    def __init__(self, url):
        """
        Class constructor
        :param url: url of the contacts page
        """
        self.contacts = pd.DataFrame()
        self.url = url

    def get_speakers_entries(self, soup):
        """
        Parses the main table of the contacts. Should be overridden by the inheriting class.
        :param soup: soup object
        :return: a list of speaker entries
        """
        return None

    def parse_speaker_entry(self, speaker_entry):
        """
        Parses each contact entry. Should be overridden by the inheriting class.
        :param speaker_entry: a single entry returned by get_speakers_entries
        :return: [name, title, company]
        """
        return ['', '', '']

    def parse(self):
        """
        Performs the parsing
        :return: DataFrame of contacts, with the following columns: ['name', 'title', 'company']
        """
        self.contacts = pd.DataFrame()

        page_html = read_url(self.url)
        soup = BeautifulSoup(page_html, "html.parser")
        speakers_entries = self.get_speakers_entries(soup)

        if speakers_entries is not None:
            contacts = [self.parse_speaker_entry(speaker_entry) for speaker_entry in speakers_entries]
            self.contacts = pd.DataFrame(contacts, columns=['name', 'title', 'company'])

        return self.contacts

Now, we’ll implement the class that parses the BBWF speakers section:

class BBWF(SpeakerParser):
    def get_speakers_entries(self, soup):
        return soup.find('div', {'class': 'c-speakers-table'}).find_all('div', {'class': 'col-xs-12 col-sm-6'}, limit=None)

    def parse_speaker_entry(self, speaker_entry):
        name = speaker_entry.find('p', {'class': 'c-speaker-block__name'}).get_text(strip=True)
        title = speaker_entry.find('p', {'class': 'c-speaker-block__job'}).get_text(strip=True)
        company = speaker_entry.find('p', {'class': 'c-speaker-block__company'}).get_text(strip=True)
        return [name, title, company]

bbwf_2019 = BBWF(r'https://tmt.knect365.com/bbwf/speakers/')
contacts = bbwf_2019.parse()
contacts.to_csv('bbf_2020_speakers.csv', index=False)

Using just a few lines of code, we were able to download 224 contacts in a matter of minutes!

Second example

On the speakers’ page of Content Marketing World, let’s point at a speaker’s name and click Inspect.

This is the HTML code:

Each speaker entry is defined in an element with the speaker class. The speaker’s name is under the text-fit class, but the company and the title come in a slightly trickier format: they are separated by a comma, and if you look further, there are several cases: only a title is present, or there are multiple titles. When there are multiple titles, the company is usually the last element.

So, the parsing code will be:

class ContentMarketingWorld(SpeakerParser):
    def get_speakers_entries(self, soup):
        return soup.find_all('div', {'class': 'speaker'}, limit=None)

    def parse_speaker_entry(self, speaker_entry):
        name = speaker_entry.find('span', {'class': 'text-fit'}).get_text(strip=True)
        title_company = speaker_entry.find('span', {'class': 'description'})

        # if there is no title and no company, return just the name
        if title_company is None:
            return [name, '', '']

        title_company = title_company.get_text(strip=True).split(',')

        # if there is only one element in title_company, assume it is a title
        if len(title_company) < 2:
            return [name, title_company[0], '']

        return [name, ','.join(title_company[:-1]), title_company[-1]]

ContentMarketingWorld(r'https://www.contentmarketingworld.com/speakers/').parse().to_csv('content_marketing_world_2019_speakers.csv', index=False)

The parse_speaker_entry method needs to deal with these different formats of the title_company string. There are 3 cases:

  1. When the ‘description’ element is missing, return just the speaker’s name
  2. When it has a single element, assume that this is a title
  3. When there are several elements, the last one is a company name and all the rest will be the title
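The comma-splitting logic behind these three cases can be isolated into a plain string helper, which makes it easy to test. This is my own sketch; split_title_company is a hypothetical name, not part of the original code, and it also trims whitespace around the pieces:

```python
def split_title_company(description):
    # Hypothetical helper mirroring the three cases above: takes the text of
    # the 'description' span (or '' when the element is missing) and returns
    # a (title, company) pair.
    if not description:
        return '', ''                        # case 1: nothing besides the name
    parts = [p.strip() for p in description.split(',')]
    if len(parts) < 2:
        return parts[0], ''                  # case 2: a lone element is a title
    return ', '.join(parts[:-1]), parts[-1]  # case 3: last element is the company
```

For example, split_title_company('Founder, CEO, Acme Corp') yields ('Founder, CEO', 'Acme Corp').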

Boom, 73 new entries to our database!

A bit more complex example

The information in the HTML code is not always well organized. It depends on many factors, and there is no guarantee that all the tags will be ordered and organized in a meaningful way.

Let’s look at the third site:

After inspecting the speaker element, we can see that there is no unique tag for the table that hosts the contacts. In fact, there are 2 tables with the same class name: ‘event-table-speakers’.

After examining these 2 tables, we find that the second one hosts all the speaker entries. Let’s see what these entries look like.

We can see here that the speaker’s name is located in the text of a span element with the class ‘style10’. The title and the company are located under a span element with the class ‘plop’ and are divided by a br element.

If we look further, we’ll find that some entries are missing a title or a company.

Well, let’s see what the code looks like:

class BBC(SpeakerParser):
    def get_speakers_entries(self, soup):
        return soup.find_all('table', {'class': 'event-table-speakers'})[1].find_all('tr', limit=None)

    def parse_speaker_entry(self, speaker_entry):
        name = speaker_entry.find('span', {'class': 'style10'}).get_text(strip=True)
        title_company = speaker_entry.find('span', {'class': 'plop'})

        # if there is no title and no company, return just the name
        if title_company is None:
            return [name, '', '']

        title_company = title_company.get_text(strip=True, separator="\n").split('\n')

        # if there is no title or no company, return just the name
        if len(title_company) < 2:
            return [name, '', '']

        return [name, title_company[0], title_company[1]]

BBC(r'https://www.bbcmag.com/events/summit-2019/2019-speakers').parse().to_csv('bbc_2019_speakers.csv', index=False)

The same structure, with a few notes:

  1. In the get_speakers_entries method, we take the second table returned by find_all. This is because there are two identical tables, and the second one contains the actual speakers’ details
  2. The parse_speaker_entry method has some error-handling logic to deal with a missing title or company name. For the sake of simplicity, in this case, the method returns just the person’s name if the title or the company name is missing
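To see why find_all(...)[1] picks the right table, here is a toy run on a hand-written snippet that mimics the page’s structure. The markup below is illustrative, not copied from the real site:

```python
from bs4 import BeautifulSoup

# two tables share the class name; only the second holds speakers
html = """
<table class="event-table-speakers"><tr><td>header art</td></tr></table>
<table class="event-table-speakers">
  <tr><td><span class="style10">Jane Doe</span></td></tr>
  <tr><td><span class="style10">John Roe</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
# index [1] skips the first, decorative table
rows = soup.find_all('table', {'class': 'event-table-speakers'})[1].find_all('tr')
names = [r.find('span', {'class': 'style10'}).get_text(strip=True) for r in rows]
# names == ['Jane Doe', 'John Roe']
```

If the site ever swaps the order of the tables, this index is the one line you would need to change.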


We just added 167 new names! And since we reused the code from the previous example, it took even less time than before.

Summary

  1. In just 3 examples, we’ve downloaded almost 500 contacts, including company names and titles
  2. We’ve created reusable, readable code that can be used to download contacts from different sources
  3. The output format is a DataFrame, so you can use its various methods to analyze the data. For example, running contacts['company'].value_counts() will display the companies with the most speakers
  4. What you can do with this information is mind-blowing!
    You can download contacts from past events by the same organizers to see how the industry landscape is changing: who are the newcomers, who among the incumbents became more or less dominant, etc.
    You can use the conference name or, after a little research, the subject of the speaker’s slot or panel as conversation starters…
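The value_counts idea from point 3 looks like this on a toy DataFrame (the rows below are made up for illustration):

```python
import pandas as pd

# a stand-in for the scraped contacts DataFrame (made-up rows)
contacts = pd.DataFrame(
    [['Jane Doe', 'CTO', 'Acme'],
     ['John Roe', 'VP Marketing', 'Acme'],
     ['Mary Major', 'CEO', 'Initech']],
    columns=['name', 'title', 'company'])

# companies sorted by number of speakers, most frequent first
counts = contacts['company'].value_counts()
```

A company sending many speakers to an industry event is usually a dominant player there, which is exactly the kind of signal this analysis surfaces.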

Cheatsheet

  1. requests.get(url).text is useful for downloading the HTML of a web page
  2. Calling socket.setdefaulttimeout(timeout) before the request sets a timeout (in seconds), so that if there is no Internet connection or the URL is wrong, the process won’t get stuck
  3. The following nesting of BeautifulSoup calls: speaker_entry.find('span', {'class': 'plop'}).get_text(strip=True, separator="\n").split('\n') will:
    find the first span element with the class 'plop'
    get the text of this element, stripping whitespace from the front and the end, and replacing the br element with a new line
    split the string into a list of strings, using the new line as a separator
  4. Another nesting example: BBC(…).parse().to_csv(…, index=False). It creates the object, parses the site and saves all the results to a csv file, all in one line of code
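Point 3 of the cheatsheet can be checked on a one-line snippet; the markup below is a minimal illustrative stand-in for one speaker entry:

```python
from bs4 import BeautifulSoup

# minimal stand-in for one speaker entry (illustrative markup)
html = ('<td><span class="style10">Jane Doe</span>'
        '<span class="plop"> VP Marketing <br> Acme Corp </span></td>')
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text fragment; separator="\n" stands in for the <br>
parts = (soup.find('span', {'class': 'plop'})
             .get_text(strip=True, separator='\n')
             .split('\n'))
# parts == ['VP Marketing', 'Acme Corp']
```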
