Part 2: Scraping data of Football players using Python
FIFA stats of players and python
If you have come directly to this page, you should visit Part One of this tutorial first. In this section we will scrape all the details of a football player from their dedicated page on sofifa.com. Before you start, you should install Jupyter, since it lets you edit code and quickly re-run it to see different outputs. You should also use Chrome’s developer tools as much as possible to see what is going on under the hood.
pip install jupyter
# To run jupyter
jupyter notebook
On the left you can see the details of Cristiano Ronaldo. We need to scrape as much data as possible. When you inspect this page, you will see that the details are divided into three divs (red). The first div contains the top half of the details, and those stats are further divided into two more divs. The first (red) div is itself divided into four divs (pink). So we need to scrape the details of each player from all of these divs.
def player_all_details(url):
    all_details = {}
    soup = soup_maker(url)
    player_info = soup.find('div', {'class': 'player'})
    all_details.update(find_player_info(player_info))
    player_stats = soup.find('div', {'class': 'stats'})
    all_details.update(find_player_stats(player_stats))
    secondary_info = soup.find('div', {'class': 'teams'})
    all_details.update(find_player_secondary_info(secondary_info))
    fifa_info = soup.find_all('div', {'class': 'columns mb-20'})
    all_details.update(find_fifa_info(fifa_info))
    return all_details
In the code block above, I pass the player URL to the function player_all_details. Then we get the soup of that page and find the divs I mentioned:

- In the first red div, the first pink div has class: player
- The 2nd pink div has class: stats
- The 3rd pink div has class: teams
- Both the second and third red divs share the same class: columns mb-20

Now we find all these divs and pass each one to its respective function.
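To see how these find and find_all calls behave, here is a self-contained sketch on a toy HTML snippet (the markup below is invented for illustration; the real sofifa page is far larger):

```python
from bs4 import BeautifulSoup

# Toy markup mimicking the div layout described above (invented for illustration).
html = """
<div class="player"><h1>Player Name (ID: 1)</h1></div>
<div class="stats">94 Overall</div>
<div class="teams"><label>Position</label></div>
<div class="columns mb-20">first</div>
<div class="columns mb-20">second</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# .find returns only the first matching tag …
player_info = soup.find('div', {'class': 'player'})
print(player_info.h1.text)  # Player Name (ID: 1)

# … while .find_all returns a list, which is why the shared
# class "columns mb-20" yields both red divs at once.
fifa_info = soup.find_all('div', {'class': 'columns mb-20'})
print(len(fifa_info))  # 2
```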
def find_player_info(soup):
    player_data = {}
    player_data['image'] = soup.find('img')['data-src']
    player_data['full_name'] = soup.find('h1').text.split(' (')[0]
    span = soup.find('span', attrs={'class': None}).text.strip()
    dob = re.search(r'\(.*\)', span).group(0)
    player_data['dob'] = dob.replace('(', '').replace(')', '')
    infos = span.replace(dob + ' ', '').split(' ')
    player_data['pref_pos'] = infos[:infos.index('Age')]
    player_data['age'] = int(infos[infos.index('Age') + 1: -2][0])
    player_data['height'] = int((infos[infos.index('Age') + 2: -1][0]).replace('cm', ''))
    player_data['weight'] = int((infos[infos.index('Age') + 3:][0]).replace('kg', ''))
    return player_data
When you inspect the first div, you will see there is only one img and one h1 tag, so we can find those directly and extract their data with a little filtering. Next we need to find the span that has no class; its text contains all the remaining data such as DOB, age, height and weight. Using a regex I find the DOB, then remove it from the string. After that we split what is left and pull out everything we need by slicing. Notice that I use 'Age' to locate the information by relative position: 'Age' is the only constant string in there, while every other value changes, so I use it as an anchor to find all the infos.
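To make the slicing concrete, here is the same parsing logic run on a hand-written sample string (the format and values are my approximation of the real span text, and the indexing is simplified to the equivalent `infos[i + n]` form):

```python
import re

# Hand-written sample mimicking the span text (values invented for illustration).
span = 'ST LW (Feb 5, 1985) Age 33 185cm 80kg'

# The DOB is the only parenthesised chunk, so a greedy regex grabs it whole.
dob = re.search(r'\(.*\)', span).group(0)          # '(Feb 5, 1985)'
dob_clean = dob.replace('(', '').replace(')', '')  # 'Feb 5, 1985'

# Remove the DOB (it contains spaces), then split on spaces.
infos = span.replace(dob + ' ', '').split(' ')     # ['ST', 'LW', 'Age', '33', '185cm', '80kg']

# 'Age' anchors every other field by relative position.
i = infos.index('Age')
pref_pos = infos[:i]                               # ['ST', 'LW']
age = int(infos[i + 1])                            # 33
height = int(infos[i + 2].replace('cm', ''))       # 185
weight = int(infos[i + 3].replace('kg', ''))       # 80
```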
def find_player_stats(soup):
    player_data = {}
    info = re.findall(r'\d+', soup.text)
    player_data['rating'] = int(info[0])
    player_data['potential'] = int(info[1])
    player_data['value'] = int(info[2])
    player_data['wage'] = int(info[3])
    return player_data
In the second div, all the information is numeric, so a simple regex gives us every detail from that div. \d+ matches every run of digits in the given chunk of string, and I mapped the values in the order they appear.
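Here is a quick illustration of that extraction on a sample string (the text and numbers are invented stand-ins for the stats div):

```python
import re

# Sample text standing in for the stats div (values invented for illustration).
text = '94 Overall rating 94 Potential 95 Value 565 Wage'

# \d+ returns every run of consecutive digits, left to right.
info = re.findall(r'\d+', text)   # ['94', '94', '95', '565']

# Map them positionally, just like the function above does.
rating, potential, value, wage = (int(n) for n in info)
print(rating, potential, value, wage)  # 94 94 95 565
```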
def find_player_secondary_info(soup):
    player_data = {}
    player_data['preff_foot'] = soup.find('label', text='Preferred Foot').parent.contents[2].strip('\n ')
    player_data['club'] = soup.find_all('ul')[1].find('a').text
    player_data['club_pos'] = soup.find('label', text='Position').parent.find('span').text
    player_data['club_jersey'] = soup.find('label', text='Jersey number').parent.contents[2].strip('\n ')
    if soup.find('label', text='Joined'):
        player_data['club_joined'] = soup.find('label', text='Joined').parent.contents[2].strip('\n ')
        player_data['contract_valid'] = soup.find('label', text='Contract valid until').parent.contents[2].strip('\n ')
    if len(soup.find_all('ul')) > 2:
        player_data['country'] = soup.find_all('ul')[2].find('a').text
    return player_data
Now we come to the third and last pink div. The structure of the infos here looks like this:
<li>
<label>Preferred Foot</label>
Right
</li>
So the info “Preferred Foot” is inside a label tag, which we can easily find by passing the text attribute and its value to the .find method. Finding the label takes us to the exact position inside the li tag. But our value, “Right”, sits in the parent of the label, so we call .parent on the result to get the whole li containing the label. Then we take its contents and strip the unwanted characters. I have wrapped two of the values in if checks because not all players have those infos.
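The .parent and .contents navigation can be tried out on the exact li snippet quoted above:

```python
from bs4 import BeautifulSoup

# The <li> structure quoted above, reproduced as a standalone snippet.
html = '<li>\n<label>Preferred Foot</label>\nRight\n</li>'
soup = BeautifulSoup(html, 'html.parser')

# Find the label by its text, then climb to the enclosing <li>.
label = soup.find('label', text='Preferred Foot')

# The <li>'s contents are ['\n', <label>…</label>, '\nRight\n'],
# so index 2 holds the value, wrapped in newlines and spaces.
value = label.parent.contents[2].strip('\n ')
print(value)  # Right
```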
Now for the second and last red divs. They both have the same class, which means find_all returns a list, so we pass that list to the function. The structure of the second div is somewhat like this:
<div class="columns mb-20">
<div class="column col-3">
<ul class="pl">
<li>
<span class="label p84">84</span> Crossing </li>
<li>
<span class="label p93">93</span> Finishing </li>
<li>
<span class="label p85">85</span> Heading Accuracy </li>
....
</ul>
</div>
....
</div>
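Each skill li in this block starts with its rating inside a span, so the value can be pulled out of the li text alone. A standalone check on one li from the snippet above:

```python
from bs4 import BeautifulSoup

# One <li> from the skills block above, reproduced standalone.
html = '<li>\n<span class="label p84">84</span> Crossing </li>'
li = BeautifulSoup(html, 'html.parser').li

# li.text begins with the newline before the span, so the first
# space-separated token is '\n84'; stripping the newline leaves the rating.
value = int(li.text.split(' ')[0].replace('\n', ''))
print(value)  # 84
```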
The third and last red div’s structure contains “Traits” and “Specialities”, so we need to treat it a little differently.
<div class="columns mb-20">
<div class="column col-3">
....
....
</div>
....
<div class="column col-3">
<h4 class="card-title">Traits</h4>
<ul class="pl">
<li>Power Free Kick</li>
<li>Flair</li>
....
</ul>
<div><h4 class="card-title">Specialities</h4></div>
<ul class="pl">
<li>Speedster</li>
<li>Dribbler</li>
....
</ul>
</div>
</div>
First we will put all the player skill attributes we need inside a list like this:
fifa_stats = ['Crossing', 'Finishing', 'Heading Accuracy',
              'Short Passing', 'Volleys', 'Dribbling', 'Curve',
              'Free Kick Accuracy', 'Long Passing', 'Ball Control',
              'Acceleration', 'Sprint Speed', 'Agility', 'Reactions',
              'Balance', 'Shot Power', 'Jumping', 'Stamina', 'Strength',
              'Long Shots', 'Aggression', 'Interceptions', 'Positioning',
              'Vision', 'Penalties', 'Composure', 'Marking', 'Standing Tackle',
              'Sliding Tackle', 'GK Diving', 'GK Handling', 'GK Kicking',
              'GK Positioning', 'GK Reflexes']
Then the function definition will be:
def find_fifa_info(soup):
    player_data = {}
    divs_without_skill = soup[1].find_all('div', {'class': 'col-3'})[:3]
    more_lis = [div.find_all('li') for div in divs_without_skill]
    lis = soup[0].find_all('li') + more_lis[0]
    for li in lis:
        for stats in fifa_stats:
            if stats in li.text:
                player_data[stats.replace(' ', '_').lower()] = int(
                    (li.text.split(' ')[0]).replace('\n', ''))
    traits = soup[1].find('h4', text='Traits')
    if traits:
        player_data['traits'] = [li.text.replace('\xa0', '')
                                 for li in traits.parent.next_sibling.next_sibling.find_all('li')]
    specialities = soup[1].find('h4', text='Specialities')
    if specialities:
        player_data['specialities'] = [li.text.replace('\xa0', '')
                                       for li in specialities.parent.next_sibling.next_sibling.find_all('li')]
    return player_data
Here we first find all the li tags, since they contain all the required info. Remember that we received two soups, because two divs share the same class. The first one is simple: we can find all of its li directly. But the second div contains the “Traits” and “Specialities” entries, which sometimes match the FIFA stats names, so we need to exclude them before adding the remaining li together. All these li sit inside a ul, which sits inside a div with class: col-3, so we take the first three such divs and then their li. After that we run two loops and check whether each stat is present; if it is, we simply store its value under the corresponding stat key. For “Traits” and “Specialities” we need to find the div that contains the info, so we can keep track of which entries are traits and which are specialities. Therefore we find the h4, since in our structure above “Traits” sits inside an h4. We go to its parent, which is the div above it, and then to its next sibling, which in our structure is the ul. But in the real HTML we sometimes get a \n first, which is why I have to go to the next sibling once more before collecting all the li.
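The sibling-hopping can be verified on a minimal reproduction of this structure (markup hand-built to match the Specialities-style wrapping described above, with a literal \n between the div and the ul):

```python
from bs4 import BeautifulSoup

# Minimal reproduction of the Traits block (markup invented for illustration).
html = ('<div class="column col-3">'
        '<div><h4 class="card-title">Traits</h4></div>\n'
        '<ul class="pl"><li>Power Free Kick</li><li>Flair</li></ul>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

h4 = soup.find('h4', text='Traits')
# h4.parent is the wrapping <div>; its next sibling is the '\n' text node,
# so one more .next_sibling lands on the <ul> we actually want.
ul = h4.parent.next_sibling.next_sibling
traits = [li.text for li in ul.find_all('li')]
print(traits)  # ['Power Free Kick', 'Flair']
```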
url = 'http://sofifa.com/players?offset=0'

def soup_maker(url):
    r = requests.get(url)
    markup = r.content
    soup = bs(markup, 'lxml')
    return soup

def find_top_players(soup):
    final_details = []  # one dict per player
    table = soup.find('table', {'class': 'table-striped'})
    tbody = table.find('tbody')
    all_a = tbody.find_all('a', {'class': ''})
    for player in all_a:
        details = {'short_name': player.text}
        details.update(player_all_details('http://sofifa.com' + player['href']))
        final_details.append(details)
    return final_details
How to get all this information:

- Pass the url into soup_maker.
- Then pass that soup to find_top_players(), which finds every single-player URL on the page and passes each one to player_all_details.
- Then player_all_details() calls each helper function and returns the complete info of each player.
GitHub URL: https://github.com/rahul3103/footballers-tutorial/blob/master/scrapper.py
Part 1: Scraping Player URLs