Part 2: Scraping data of Football players using Python

FIFA stats of players and Python

Rahul Shrivastava
6 min read · Apr 21, 2017

If you have come directly to this page, you should visit Part One of this tutorial first. In this section we will scrape all the details of a football player from their dedicated page on sofifa.com. Before you start, install Jupyter, so that you can edit the code and re-run it easily to see how the output changes. You should also use Chrome's developer tools as much as possible to see what is happening under the hood.

pip install jupyter
# To run jupyter
jupyter notebook

On the left you can see the details of Cristiano Ronaldo. We need to scrape as much data as possible. When you inspect this page, you will see that these details are divided into three divs (red). The first div holds the top half of the details, and those stats are split across two more divs; looking closer, the first (red) div is further divided into four divs (pink). So we need to scrape each player's details from all of these divs.

def player_all_details(url):
    all_details = {}
    soup = soup_maker(url)
    player_info = soup.find('div', {'class': 'player'})
    all_details.update(find_player_info(player_info))
    player_stats = soup.find('div', {'class': 'stats'})
    all_details.update(find_player_stats(player_stats))
    secondary_info = soup.find('div', {'class': 'teams'})
    all_details.update(find_player_secondary_info(secondary_info))
    fifa_info = soup.find_all('div', {'class': 'columns mb-20'})
    all_details.update(find_fifa_info(fifa_info))
    return all_details

In the code block above, I pass the player's URL to the function player_all_details. We then get the soup of that page (soup_maker is the helper from Part One; it is repeated at the end of this post) and find the divs mentioned earlier:

  • In the first red div, we have the first pink div with class:player
  • The 2nd pink div with class:stats
  • The 3rd pink div with class:teams
  • Both the second and the third red div have the same class:columns mb-20

Now we find all these divs and pass each one to its respective function.

def find_player_info(soup):
    player_data = {}
    player_data['image'] = soup.find('img')['data-src']
    player_data['full_name'] = soup.find('h1').text.split(' (')[0]
    span = soup.find('span', attrs={'class': None}).text.strip()
    dob = re.search(r'(\(.*)\)', span).group(0)
    player_data['dob'] = dob.replace('(', '').replace(')', '')
    infos = span.replace(dob + ' ', '').split(' ')
    player_data['pref_pos'] = infos[:infos.index('Age')]
    player_data['age'] = int(infos[infos.index('Age') + 1: -2][0])
    player_data['height'] = int((infos[infos.index('Age') + 2: -1][0]).replace('cm', ''))
    player_data['weight'] = int((infos[infos.index('Age') + 3:][0]).replace('kg', ''))
    return player_data

When you inspect the first div you will see that there is only one img and one h1 tag, so we can find those directly and extract their data with a little filtering. Next we need to find the span which has no class; its text holds the remaining data: DOB, age, height and weight. With a regex I find the DOB and then remove it from the text. After that we split the remaining string and use slicing to get all the required information. Notice that I use 'Age' to locate the values by relative position, because 'Age' is the only string that stays constant while every other value changes from player to player.
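
To make the slicing concrete, here is a small standalone sketch that runs the same steps on an assumed span text. The exact wording on the live page may differ, so treat the string below purely as an illustration of how the 'Age' anchor and the slices line up:

import re

# Assumed example of the span's text; the real page may format it differently.
span = 'LW ST Age 32 (Feb. 5, 1985) 185cm 80kg'

dob = re.search(r'(\(.*)\)', span).group(0)      # '(Feb. 5, 1985)'
infos = span.replace(dob + ' ', '').split(' ')   # ['LW', 'ST', 'Age', '32', '185cm', '80kg']

pref_pos = infos[:infos.index('Age')]                                  # ['LW', 'ST']
age = int(infos[infos.index('Age') + 1: -2][0])                        # 32
height = int(infos[infos.index('Age') + 2: -1][0].replace('cm', ''))   # 185
weight = int(infos[infos.index('Age') + 3:][0].replace('kg', ''))      # 80
print(pref_pos, age, height, weight)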

def find_player_stats(soup):
    player_data = {}
    info = re.findall(r'\d+', soup.text)
    player_data['rating'] = int(info[0])
    player_data['potential'] = int(info[1])
    player_data['value'] = int(info[2])
    player_data['wage'] = int(info[3])
    return player_data

In the second div, all the information we want is numeric, so a simple regex gives us everything we need from it. \d+ means "give me every run of digits in this string". The numbers come back in the order they appear on the page, and I map them to the values accordingly.
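
To see what that regex actually returns, here is a tiny sketch on an assumed snippet of the div's text. The real wording and numbers come from the page, so this only illustrates \d+ and the index-based mapping:

import re

# Assumed sample of the stats div's text; on the page the numbers appear
# in the order rating, potential, value, wage.
text = 'Overall rating 94 Potential 94 Value 95 Wage 565'
info = re.findall(r'\d+', text)   # ['94', '94', '95', '565']

player_data = {
    'rating': int(info[0]),
    'potential': int(info[1]),
    'value': int(info[2]),
    'wage': int(info[3]),
}
print(player_data)   # {'rating': 94, 'potential': 94, 'value': 95, 'wage': 565}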

def find_player_secondary_info(soup):
    player_data = {}
    player_data['preff_foot'] = soup.find('label', text='Preferred Foot').parent.contents[2].strip('\n ')
    player_data['club'] = soup.find_all('ul')[1].find('a').text
    player_data['club_pos'] = soup.find('label', text='Position').parent.find('span').text
    player_data['club_jersey'] = soup.find('label', text='Jersey number').parent.contents[2].strip('\n ')
    if soup.find('label', text='Joined'):
        player_data['club_joined'] = soup.find('label', text='Joined').parent.contents[2].strip('\n ')
    player_data['contract_valid'] = soup.find('label', text='Contract valid until').parent.contents[2].strip('\n ')
    if len(soup.find_all('ul')) > 2:
        player_data['country'] = soup.find_all('ul')[2].find('a').text
    return player_data

Now we come to the third and last pink div. The info here is structured like this:

<li>
<label>Preferred Foot</label>
Right
</li>

The text "Preferred Foot" is inside a label tag, which we can find easily by passing the text attribute and its value to the .find method. Finding the label takes us inside the li, but the value we actually want, "Right", sits next to the label, i.e. inside the label's parent. So we call .parent on the result to get the whole li that contains the label, then take its contents and strip the unwanted characters. I have guarded two of the values with if statements because not all players have this info.
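
To try that navigation without hitting the site, you can feed the snippet above straight into BeautifulSoup. This is only a minimal sketch of the .parent / .contents trick, not part of the scraper itself:

from bs4 import BeautifulSoup

html = '''
<li>
<label>Preferred Foot</label>
Right
</li>
'''
soup = BeautifulSoup(html, 'lxml')
label = soup.find('label', text='Preferred Foot')
li = label.parent                   # the whole <li> that wraps the label
print(li.contents)                  # ['\n', <label>Preferred Foot</label>, '\nRight\n']
print(li.contents[2].strip('\n '))  # 'Right'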

Now for the second and the last red div. They both have the same class, which means find_all gives us a list in return, so we pass that list to the function. The structure of the second div looks roughly like this:

<div class="columns mb-20">
<div class="column col-3">

<ul class="pl">
<li>
<span class="label p84">84</span> Crossing </li>
<li>
<span class="label p93">93</span> Finishing </li>
<li>
<span class="label p85">85</span> Heading Accuracy </li>
....
</ul>

</div>
....
</div>
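
Each of those li lines pairs a number with a stat name, which is exactly what the matching loop in find_fifa_info (further down) relies on. Here is a quick sketch of how one such li breaks apart, using an assumed snippet copied from the structure above:

from bs4 import BeautifulSoup

# Assumed single stat entry, mirroring the markup above (note the newline before the number).
li_html = '<li>\n<span class="label p84">84</span> Crossing </li>'
li = BeautifulSoup(li_html, 'lxml').find('li')

print(repr(li.text))                                   # '\n84 Crossing '
value = int(li.text.split(' ')[0].replace('\n', ''))   # 84
key = 'Crossing'.replace(' ', '_').lower()             # 'crossing'
print(key, value)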

The third and last red div contains the "Traits" and "Specialities", so we need to treat it a little differently.

<div class="columns mb-20">
<div class="column col-3">
....
....
</div>
....
<div class="column col-3">
<h4 class="card-title">Traits</h4>
<ul class="pl">
<li>Power Free Kick</li>
<li>Flair</li>
....
</ul>
<div><h4 class="card-title">Specialities</h4></div>
<ul class="pl">
<li>Speedster</li>
<li>Dribbler</li>
....
</ul>
</div>
</div>

First, we put all the player skill attributes we need inside a list like this:

fifa_stats = ['Crossing', 'Finishing', 'Heading Accuracy',
              'Short Passing', 'Volleys', 'Dribbling', 'Curve',
              'Free Kick Accuracy', 'Long Passing', 'Ball Control',
              'Acceleration', 'Sprint Speed', 'Agility', 'Reactions',
              'Balance', 'Shot Power', 'Jumping', 'Stamina', 'Strength',
              'Long Shots', 'Aggression', 'Interceptions', 'Positioning',
              'Vision', 'Penalties', 'Composure', 'Marking', 'Standing Tackle',
              'Sliding Tackle', 'GK Diving', 'GK Handling', 'GK Kicking',
              'GK Positioning', 'GK Reflexes']

Then the function definition will be:

def find_fifa_info(soup):
    player_data = {}
    divs_without_skill = soup[1].find_all('div', {'class': 'col-3'})[:3]
    more_lis = [div.find_all('li') for div in divs_without_skill]
    lis = soup[0].find_all('li') + more_lis[0]
    for li in lis:
        for stats in fifa_stats:
            if stats in li.text:
                player_data[stats.replace(' ', '_').lower()] = int(
                    (li.text.split(' ')[0]).replace('\n', ''))

    traits = soup[1].find('h4', text='Traits')
    if traits:
        player_data['traits'] = [li.text.replace('\xa0', '') for li in
                                 traits.parent.next_sibling.next_sibling.find_all('li')]
    specialities = soup[1].find('h4', text='Specialities')
    if specialities:
        player_data['specialities'] = [li.text.replace('\xa0', '') for li in
                                       specialities.parent.next_sibling.next_sibling.find_all('li')]
    return player_data

Here we first find all the li tags, since they contain all the required info. Remember that the two divs share the same class, so we receive two soups. The first one is simple: we can find all of its li directly. But the second div contains the "Traits" and "Specialities", whose names sometimes match the FIFA stat names, so we cannot simply take every li from it; we take only the relevant ones and add them to the first div's li. All of these li sit inside a ul, which in turn sits inside a div with class:col-3, so we take the first three such divs and collect their li. Then we run two loops and check whether each stat name is present in a li; if it is, we store the value under the corresponding stat. For "Traits" and "Specialities" we need to locate the element that labels each block, so we can record which entries are traits and which are specialities. Therefore we find the h4: as the structure above shows, "Traits" sits inside an h4. We go to its parent, the div that wraps it, and then to its next sibling, which in our sketch is the ul. In the real HTML, however, there is often a '\n' text node in between, which is why I go to the next sibling twice before collecting all the li.
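
The double next_sibling hop is easier to see on a toy fragment. In this sketch (assumed markup mirroring the "Specialities" layout above) the newline between the two tags becomes its own text node, so the first next_sibling lands on '\n' and the second one lands on the ul:

from bs4 import BeautifulSoup

# Toy fragment, assumed to mirror the structure shown above.
html = '''
<div><h4 class="card-title">Specialities</h4></div>
<ul class="pl">
<li>Speedster</li>
<li>Dribbler</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
h4 = soup.find('h4', text='Specialities')
wrapper = h4.parent                          # the <div> that wraps the <h4>
print(repr(wrapper.next_sibling))            # '\n'  -> the stray text node
ul = wrapper.next_sibling.next_sibling       # the <ul> we actually want
print([li.text for li in ul.find_all('li')]) # ['Speedster', 'Dribbler']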

import re
import requests
from bs4 import BeautifulSoup as bs

url = 'http://sofifa.com/players?offset=0'

def soup_maker(url):
    r = requests.get(url)
    markup = r.content
    soup = bs(markup, 'lxml')
    return soup

def find_top_players(soup):
    final_details = []
    table = soup.find('table', {'class': 'table-striped'})
    tbody = table.find('tbody')
    all_a = tbody.find_all('a', {'class': ''})
    for player in all_a:
        # Collect one dict per player so earlier players are not overwritten.
        details = {'short_name': player.text}
        details.update(player_all_details('http://sofifa.com' + player['href']))
        final_details.append(details)
    return final_details

How to get all this information:

  • Pass the url into soup_maker.
  • Then pass that soup to find_top_players(), which finds every player's URL on the page and passes each one to player_all_details.
  • player_all_details() then calls each of the functions above and returns the complete info for every player; the driver is sketched below.
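
Putting it all together, a run over the first page looks roughly like this. It is only a sketch; dumping to JSON is just one convenient way to keep the result:

import json

url = 'http://sofifa.com/players?offset=0'
soup = soup_maker(url)
players = find_top_players(soup)   # one dict per player on this page

with open('players.json', 'w') as f:
    json.dump(players, f, indent=2)

print('Scraped', len(players), 'players')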

GitHub URL: https://github.com/rahul3103/footballers-tutorial/blob/master/scrapper.py

Part 1: Scraping Player URLs
