A growth hacker's guide to business intelligence
Continuing the growth hacking theme, we'll be discussing several very powerful methods and tools for gathering business intelligence at scale. Scale means that you'll be able to perform, in minutes, a task that would otherwise take a team of analysts to accomplish.
The blog contains three parts. You can read each part independently, so feel free to skip ahead:
- How to identify industry players
- How to get a company’s business information
- How to scale and automate the process for 100’s or even 1,000’s of companies
Though the code examples are in Python, the focus is on the methodology, data sources, data processing, and analysis.
You'll be able to identify industry players and gather their business characteristics, such as company URL, country, the geography of business activities, an estimate of their size, and their relative position within their main geography and within the global industry.
There are plenty of paid BI tools, but even the paid tools don't always provide the necessary insight. They usually don't work well for small and medium-size companies, and for big companies they don't provide the needed granularity. For example, there is no tool that tells you which business unit to target inside a big organization.
The paradigms are generic and can be implemented in other programming languages. The tools we'll focus on are free of charge and don't even require you to sign up!
We are going to use some very robust growth hacking techniques:
- How to clean the data using Regular Expressions — an effective way to clean textual data
- How to access APIs of different online resources — an API (Application Programming Interface) is an essential tool for requesting information. Most online data services rely on this approach for data exchange, and it is very easy to implement in Python
- How to scale the above methods to get information about 100's of companies without being blocked or denied by the service provider
The programming will be a little more advanced. I'll explain the concepts, give examples, and emphasize some of the key points. Feel free to reach out if you need the complete code or a more detailed explanation.
If you don’t feel comfortable with programming, the tools I’ll cover can be used manually. They are still very useful for business intelligence and competitive analysis.
Last, but not least: the data you'll get won't be 100% accurate or complete, so it shouldn't be used as the only source for your decision-making process.
Part 1: How to identify industry players
In the previous blog (Introduction to growth hacking: How to expand your contacts database virtually infinitely), I introduced the technique of discovering new contacts. In minutes, you can add 1,000’s of new names to your marketing database.
You can still use the company names from the previous blog; here we'll look at additional sources of industry players.
Who are the industry players?
How do we identify companies in a specific industry? I would suggest using the following sources as a starting point:
- Members of industry organizations
- Companies of the speakers in the main industry events (from the previous blog)
- Event sponsors
- Partners of other companies in the industry
Let’s get our hands dirty.
How to get the names of the companies
Let's take the Open RAN Alliance as an example; it is one of the industry organizations mentioned above, and its members' directory is located here: https://www.o-ran.org/membership.
It is not straightforward: there are actually no names, just logos of the companies. I'll use the technique I showed in my other blog, Introduction to growth hacking: How to expand your contacts database virtually infinitely, to extract the names of the companies.
As you can see, the company name is actually the value of the alt attribute of the img element.
Here is the code to extract all these attributes (we reuse some of the code from the previous blog; a minimal sketch of the read_url helper is shown below):
from bs4 import BeautifulSoup

url = r'https://www.o-ran.org/membership'
data = read_url(url)
soup = BeautifulSoup(data, "html.parser")
images = soup.find_all('img', {'class': 'thumb-image'})
company_names = [image['alt'] for image in images]
- images will contain the img objects for all the relevant images that represent company logos
- image['alt'] will retrieve the 'alt' attribute from each img element
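The read_url helper comes from the previous blog. For completeness, here is a minimal sketch of it, assuming it simply downloads the page and returns its HTML:

import requests

def read_url(url):
    # Download the page and return its raw HTML
    # (a simplified version of the helper from the previous blog)
    response = requests.get(url, timeout=10)
    return response.text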
However, a look at the company_names list shows it requires some cleaning:
['ATTTile.png','ChinaMobileTile.png','NTTTile.png','OrangeTile.png','T-MobileTile.png','AirtelLogoTile.png','BellCanadaTile.png','British_Telecom-400x400.png','ChinaTelecomLogoTile.png','ChinaUnicomTile.png','Chunghwa_400x400.png',…
Using RegEx to clean textual data
We’ll use regular expressions techniques to clean the data:
import re

def clean_company_name(company_name):
    # Strip the common logo-file suffixes from the extracted alt text
    p = re.compile(r'(Tile\.png|LogoTile\.png|[-_]400x400\.png|[_-]new|-400x400 \(1\)\.png|\.png)', re.I)
    company_name = p.sub('', company_name)
    # Replace underscores with spaces
    p = re.compile(r'[_]', re.I)
    return p.sub(' ', company_name)
The regular expression r'(Tile\.png|LogoTile\.png|[-_]400x400\.png|[_-]new|-400x400 \(1\)\.png|\.png)' will remove all the possible suffixes and return a clean company name. Note that dots and parentheses are escaped with a backslash, since they are special characters in RegEx.
- The expression in the parentheses () is the textual expression we’d like to look for and substitute/replace
- The | between different strings means OR. In other words, match any one of these expressions
- re.I flag indicates to ignore the case of the expression
Let’s apply this function to clean up our list, and export it to CSV using Pandas:
company_names = [clean_company_name(company_name) for company_name in company_names]
pd.DataFrame(company_names, columns=['Company']).to_csv('oran_members.csv')
Now we have a clean list of companies.
So this is our starting point. A few minutes and a few lines of code give us 145 companies.
Combining these names with the companies from the previous blog gives a pretty good overview of the industry players. Looking at the speakers' companies, the sponsors of the main industry events, and the members of the main industry organizations can point out which companies are more influential and which companies you need to partner with.
There may be some issues in combining all these lists, but we'll handle those in the next blog.
Now, let’s populate this table with some business information.
Part 2: How to get a company’s business information
There are paid sources for business intelligence data, including LinkedIn, ZoomInfo, Clearbit, etc. However, paying for a service won't always provide you with insightful information. In fact, the basic LinkedIn API is very lean, and getting access to more data requires an approval process that can take weeks or may even be denied. So, if you are in a bootstrapped startup mode, or want some quick results, it may not always work for you.
Just to be clear, I'm not telling you not to use other tools; I'm just introducing other approaches that may be extremely helpful and cost-effective.
Introduction to API
API stands for Application Programming Interface. The APIs we'll discuss here are based on the principle that a user (or a Python script) sends a request to the data service's server and the server replies. It's pretty much the same way an internet browser works: you type a URL, and the server responds with a web page.
There are two types of requests, GET and POST, and the requested information is encoded in the URL itself and/or in the HTTP headers of the request.
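Here is a minimal sketch of a GET request in Python, using the requests library and the public httpbin.org echo service purely for illustration:

import requests

# The query parameters are encoded into the URL's query string;
# httpbin.org simply echoes them back as JSON
response = requests.get('https://httpbin.org/get', params={'query': 'ATT'})
print(response.url)     # https://httpbin.org/get?query=ATT
print(response.json())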
Most APIs require some type of registration in order to obtain a KEY. The APIs in this article require neither registration nor a KEY to access the data.
Let’s see how it works.
How to get a company’s URL
There is a free API available from Clearbit to look up a company domain by company name. This API doesn't require registration.
DOMAIN_ENDPOINT_NOKEY = 'https://autocomplete.clearbit.com/v1/companies/suggest?query={}'
- The URL of the API is called the ‘Endpoint’.
- It usually contains a list of parameters; in this case it is 'query', which should contain the company's name
The request is sent using the GET method, so you can even type it in a browser. Replacing {} with ATT will get the following response:
[{"name":"AT\u0026T","domain":"att.com","logo":"https://logo.clearbit.com/att.com"},{"name":"Attentive","domain":"attentivemobile.com","logo":"https://logo.clearbit.com/attentivemobile.com"},{"name":"Attack of the Fanboy","domain":"attackofthefanboy.com","logo":"https://logo.clearbit.com/attackofthefanboy.com"},{"name":"AT\u0026T Performing Arts Center","domain":"attpac.org","logo":"https://logo.clearbit.com/attpac.org"},{"name":"Attendance on Demand","domain":"attendanceondemand.com","logo":"https://logo.clearbit.com/attendanceondemand.com"}]
We can see that the first dictionary contains the name of the company, its domain, and even a link to its logo. Actually, all that's needed to implement it in Python is to send the GET request (we've already done that) and parse the reply.
Most APIs return their reply in JSON format. And of course, there is a library to handle it: json.
def get_domain_name(company_name):
    endpoint = DOMAIN_ENDPOINT_NOKEY.format(company_name)
    result = requests.get(url=endpoint, data={}, headers=HEADERS, proxies=proxy)
    result = json.loads(result.text)
    d = {
        'name': result[0]['name'],
        'domain': result[0]['domain'],
    }
    return pd.Series(d)
- HEADERS and proxy will be discussed when we talk about automation and scalability
- Before using json.loads we need to import json. loads is a method that takes a string in JSON format and returns a Python object
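To connect this with Part 1, here is a hedged sketch of enriching the company list we exported earlier. It assumes the oran_members.csv file from Part 1 and the HEADERS and proxy definitions discussed in Part 3:

companies = pd.read_csv('oran_members.csv')
# Look up each company's canonical name and domain, and join the results back
companies = companies.join(companies['Company'].apply(get_domain_name))
companies.to_csv('oran_members_domains.csv')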
By the way, Clearbit is not very expensive, and one can purchase access to the full API.
As a side note, the Clearbit API works better for US-based companies. It may return a US-based domain even when searching for an international company.
How to get a company’s business information
We’ll be using a free API from Similarweb. This API doesn’t require registration either.
If you decide that implementing API access in Python is outside of your comfort zone, Similarweb is still a great business intelligence tool. It provides a very user-friendly analysis of internet traffic data, which can be extremely insightful for understanding a company's position, business activities, and competitive landscape.
TOTAL_TRAFFIC_ENDPOINT_NOKEY = 'https://data.similarweb.com/api/v1/data?domain={}'
This API endpoint requires the company's domain, which we got in the previous section. In return, it provides a wealth of information about the traffic: where it is coming from, the ranking of the company, its industry, etc.
{"SiteName":"att.com","Description":"visit att.com to switch and save on unlimited data plans, internet service, & tv with premium entertainment! america's best network is also the fastest.","TopCountryShares":[{"Value":0.9500846049919677,"Country":840},{"Value":0.01045316020720349,"Country":484},{"Value":0.0038907583335344616,"Country":356},{"Value":0.003625260293148592,"Country":124},{"Value":0.002542974682644363,"Country":630}],"Title":"at&t official site - unlimited data plans, internet service, & tv","Engagments":{"BounceRate":"0.5961736247674132","Month":"01","Year":"2020","PagePerVisit":"4.451902109368646","Visits":"9.756941889301659E7","TimeOnSite":"221.5620802830074"},"EstimatedMonthlyVisits":{"2019-08-01":127864051,"2019-09-01":112905075,"2019-10-01":96790400,"2019-11-01":99639971,"2019-12-01":105287460,"2020-01-01":97569418},"GlobalRank":{"Rank":329},"CountryRank":{"Country":840,"Rank":79},"IsSmall":false,"TrafficSources":{"Social":0.014144215350222455,"Paid Referrals":0.020342651724442284,"Mail":0.10367724601162684,"Referrals":0.05567331021727888,"Search":0.3391175301685869,"Direct":0.46704504652784257},"Category":"Computers_Electronics_and_Technology/Telecommunications","CategoryRank":{"Rank":"5","Category":"Computers_Electronics_and_Technology/Telecommunications"},"LargeScreenshot":"https://site-images.similarcdn.com/image?url=att.com&t=1&h=c7506c43a742090dc4e676d63b2e82a257d212ee387c0b0dc8666db04bc07d66"}
At this point, we’ll use the following fields from the response:
def get_site_data(domain):
    endpoint = TOTAL_TRAFFIC_ENDPOINT_NOKEY.format(domain)
    result = requests.get(url=endpoint, data={}, headers=HEADERS, proxies=proxy)
    result = json.loads(result.text)
    # Note: 'Engagments' is the (misspelled) key actually returned by the API
    d = {
        'monthly_visits': int(float(result.get('Engagments', {}).get('Visits', 0))),
        'global_rank': result.get('GlobalRank', {}).get('Rank', 0),
        'country_rank': result.get('CountryRank', {}).get('Rank', 0),
        'top_country_name': get_country_name(result.get('CountryRank', {}).get('Country', 0)),
        'similar_web_site_name': result.get('SiteName', ''),
        'category': result.get('Category', ''),
        'category_rank': result.get('CategoryRank', {}).get('Rank', 0),
    }
    return pd.Series(d)
- monthly_visits — can be used as an indication of the company size
- global_rank — an indication of the company's position in the global market, compared to the internet traffic of all other companies
- country_rank — an indication of the company's position in its local market, compared to the internet traffic of all other companies
- top_country_name — the country most of the traffic is coming from. The API reveals other countries as well, which can indicate the global distribution of the company's business activities
- similar_web_site_name — the web site name according to SimilarWeb
- category — the industry category assigned by SimilarWeb
- category_rank — company's rank within the above category
In order to get the country name, I'm using a helper function, get_country_name(country_code), which returns the ISO alpha-3 country code.
import pycountry

def get_country_name(country_code):
    # pycountry stores ISO 3166 numeric codes as zero-padded three-character strings
    try:
        return pycountry.countries.get(numeric=str(country_code).zfill(3)).alpha_3
    except (AttributeError, KeyError):
        return ''
I'm using the pycountry library, which is very useful for country-related data manipulation. In the next blog, I'll show some other useful use-cases of pycountry.
It provides many methods to look up country names and codes. We are using the ISO 3166 convention, for which the pycountry.countries.get method is utilized. The numeric parameter specifies the country code as a zero-padded, three-character string.
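Putting it together, here is a minimal sketch of enriching the companies dataframe from the previous section (it assumes the domain column returned by get_domain_name):

# Query Similarweb for every domain we resolved earlier
companies = companies.join(companies['domain'].apply(get_site_data))
companies.to_csv('oran_members_enriched.csv')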
Other sources of business intelligence data
- Dun & Bradstreet — provides various sorts of company information, analytics, and industry reports, as well as credit score ratings, etc. The API is available here. There is a Python implementation of this API here (I haven't tested it yet); however, you'll need to obtain a username and a password
- Quandl — a discussion about business intelligence wouldn't be complete without mentioning Quandl. Most of the data is either financial or economic and primarily covers big companies. It provides an official Python library to access the data. You'll need an account and will be required to pay for most of the data
- Morningstar research — also financially focused information. I didn't use their API directly; I've used the Global Equity Classification Structure reports for segmentation and vertical-market analysis. The Morningstar data is also available in Quantopian and is very easily accessible in its research notebooks
Part 3: How to scale
What does it mean to scale? The goal is to be able to execute these API requests for 100's or 1,000's of companies. From a practical perspective, we basically need to take care of two things:
- Implement proper error handling
- Make sure that the server will not recognize us as a bot and won’t block access to the API or blacklist our IP
Error handling has more to do with programming and is less interesting to most readers, so I'll only show a minimal sketch below. Feel free to comment or reach out directly if you need more information about it.
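Still, for completeness, here is a minimal sketch of what such error handling could look like: a simple retry wrapper (the function name and retry policy are just an illustration, not part of the original code):

import time

def safe_api_call(func, *args, retries=3, delay=2):
    # Retry a flaky API call a few times before giving up
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception as e:
            print('Attempt {} failed: {}'.format(attempt + 1, e))
            time.sleep(delay)
    return None

# For example: site_data = safe_api_call(get_site_data, 'att.com')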
I’ll focus on the robustness of the API access.
Making the API requests look like a web browser
The additional information in an HTTP request, beyond the method and the URL, is contained in the HTTP headers. Among other fields, they may contain cookie information, authentication data, and the type of browser or application sending the request.
So, in order not to look like a bot, we can add some headers that will make us look like a regular browser:
HEADERS = {
"User-Agent": r"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
"Accept": r" text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": r"en-US,en;q=0.8"
}
- The "User-Agent" header identifies the browser; in this case, it will be identified as Chrome. More information can be found here
- The "Accept" header advertises which content types the client is able to understand; once again, this is a default value that corresponds to the Chrome browser. More info here
- The "Accept-Language" header defines which languages the client is able to accept
All that's left is to pass HEADERS as a parameter to the requests.get method, so it is sent as part of our request call.
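For example, a minimal sketch of such a call, reusing the Similarweb endpoint from Part 2:

result = requests.get(TOTAL_TRAFFIC_ENDPOINT_NOKEY.format('att.com'), headers=HEADERS, timeout=10)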
There are additional advanced techniques to emulate cookies and automate the browser, but they are not required here.
Making robust API calls
The API server may block API requests if too many requests come from the same IP address. In order to overcome this, we'll implement a random proxy selection.
Sending requests through a proxy server makes it look like the requests are coming from the IP of the proxy server and not from the real sender.
So, these are the implementation steps for using proxy servers:
- If you recall from the previous part, all the APIs we used require HTTPS requests. Therefore we need a list of free proxy servers that support HTTPS and are based in countries with good internet infrastructure, to minimize latency and increase reliability
- Verify the connectivity of these proxy servers, to make sure our requests will go through
- Implement a random selection mechanism, such that each API request will be sent through a randomly selected proxy server (see the sketch at the end of this part)
There is a list of free proxy servers that we can use, available at https://free-proxy-list.net/.
Here is what we’re going to do to get a list of reliable proxy servers:
- Scrape the table from the web site into a dataframe using the technique we're already familiar with
- Filter the proxy servers to select those which support HTTPS
- Check the connectivity
The function will return a list of servers that match the selection criteria and have internet connectivity. This list will be used in our request calls.
def get_proxies(num_of_proxies):
    PROXIES_URL = 'https://free-proxy-list.net/'
    CONNECTION_CHECK_URL = 'https://httpbin.org/ip'
    HEADERS = {
        "User-Agent": r"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
        "Accept": r"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": r"en-US,en;q=0.8"
    }
    # Scrape the proxy table into a dataframe
    resp = requests.get(url=PROXIES_URL, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    proxies_list_entries = soup.find('table', {'id': 'proxylisttable'}).find_all('tr')
    proxy_list = []
    for proxies_list_entry in proxies_list_entries:
        entries = proxies_list_entry.find_all('td')
        if len(entries) < 7:
            continue  # skip header/footer rows
        proxy_list.append(
            [entries[0].text.strip(), entries[1].text.strip(), entries[2].text.strip(),
             entries[4].text.strip(), entries[6].text.strip()])
    proxy_list = pd.DataFrame(proxy_list, columns=['ip', 'port', 'country', 'anonymity', 'https'])
    # Keep only HTTPS-capable proxies from a few well-connected countries
    proxy_list_valid = proxy_list[(proxy_list['https'] == 'yes') & (proxy_list['country'].isin(['US', 'DE', 'SG', 'TH']))]
    # Verify connectivity and keep up to num_of_proxies working proxies
    n = 0
    working_proxies = []
    for i, proxy_entry in shuffle(proxy_list_valid).iterrows():
        if n == num_of_proxies:
            break
        proxy = 'https://{}:{}'.format(proxy_entry['ip'], proxy_entry['port'])
        try:
            requests.get(CONNECTION_CHECK_URL, proxies={"https": proxy}, timeout=(2, 2))
            print('Adding proxy: {}'.format(proxy))
            working_proxies.append(proxy)
            n += 1
        except:
            print('Skipping proxy: {}'.format(proxy))
    return working_proxies
Some key points
- proxy_list_valid = proxy_list[(proxy_list['https'] == 'yes') & (proxy_list['country'].isin(['US', 'DE', 'SG', 'TH']))] — filters the proxy list to proxies that support HTTPS and are located in the following countries: USA, Germany, Singapore, or Thailand
- shuffle — a method from sklearn.utils. In order to use it, you'll need to add the following line to the code: from sklearn.utils import shuffle. It shuffles the rows of the dataframe
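Finally, a minimal sketch of the random selection mechanism itself: pick a random proxy from the verified list for every request (the helper name is just an illustration):

import random

proxies = get_proxies(10)

def get_random_proxy():
    # Route each request through a randomly chosen proxy from the verified list
    return {'https': random.choice(proxies)}

# For example, inside get_domain_name / get_site_data:
# result = requests.get(url=endpoint, headers=HEADERS, proxies=get_random_proxy())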
Summary
I tried to write a guide to an intelligent process for market research and business development in the data era.
There are many tools and techniques for gathering business intelligence information. Python and data science are a means to efficiently implement some of these techniques and help you build a complete picture of the industry and its stakeholders.
Business intelligence tools that we’ve covered:
- Similarweb
- Clearbit
- D&B
- Quandl
- Morningstar
These tools are useful as they are even without utilizing their APIs and programming.
In order to scale the approach to 100's or 1,000's of companies, automation is required. For robust automation, you'll need to 'hide' bot behavior by adding standard browser headers to the HTTP requests and utilizing proxy rolling techniques to access the service from different IP addresses.
Cheatsheet
The Python modules we’ve used in this post:
- re — regular expressions. We've used the compile and sub methods, which is just scratching the surface. Regular expressions are an essential tool for data cleaning and text processing. There is a cheat sheet here. My favorite tool is Regex101, an online tool that lets you test the syntax and generate code
- json — a library to handle JSON data objects, which are the usual reply format for API calls
- pycountry — a library to process countries' names. We'll use it again when we talk about taking care of the contacts database
- sklearn — we’ve just touched on this library. As we’ll see in the future, sklearn is one of the most useful libraries when it comes to AI and machine learning