Exploratory Data Analysis with Feature Engineering of Cristiano Ronaldo’s Goals

Pritul Dave :)
Feb 8, 2022

About Ronaldo:-

Cristiano Ronaldo is a phenomenon. The football superstar reached another landmark by becoming the first player in the 21st century to score a staggering 500 goals, and he also became the top scorer for a club like Real Madrid. Here we take a look at the Portuguese’s career and some amazing facts and stats. Statistically, Ronaldo has been unreal.
Out of the 502 goals he has scored so far, 334 have come for Real Madrid, 118 for United, 55 for Portugal and 5 for Sporting Lisbon.

His completeness as a footballer can be seen in the following stats:

Right foot: 328
Left foot: 89
Head: 83
Other: 2

Columns in the dataset:-

import pandas as pd

df = pd.read_csv("yds_data.csv", index_col="Unnamed: 0")
df.columns

Obtaining null values in the dataset

df.isnull().sum()

The dataset contains many missing values, which makes prediction difficult.
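To see how severe the gaps are, the raw null counts can be turned into a share per column. The following is only a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the shots data
toy = pd.DataFrame({
    "location_x": [10.0, np.nan, 30.0, 40.0],
    "power_of_shot": [1.0, 2.0, np.nan, np.nan],
    "is_goal": [1.0, 0.0, 1.0, np.nan],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `df.isnull().mean()`, which makes it easier to compare columns than absolute counts.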

Performing the data cleaning:-

Based on domain knowledge and a look at the data, some of the columns are not important, so we drop them.

df.drop(labels=["match_event_id","knockout_match","shot_id_number","game_season","team_name","date_of_game","match_id","team_id"],axis=1,inplace=True)

For simplicity, we also rename some of the columns

df.rename(columns={"home/away":"home_away","lat/lng":"lat_lng","remaining_min.1":"remaining_min_1","power_of_shot.1":"power_of_shot_1","knockout_match.1":"knockout_match_1","remaining_sec.1":"remaining_sec_1","distance_of_shot.1":"distance_of_shot_1"},inplace=True)
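The same kind of renaming can also be done with a pattern instead of an explicit mapping. This is a small sketch, assuming the only characters to clean are "/" and "."; the `toy` frame is hypothetical and holds just a few of the column names from the mapping above:

```python
import pandas as pd

# Hypothetical empty frame carrying a few of the original column names
toy = pd.DataFrame(columns=["home/away", "lat/lng", "remaining_min.1", "is_goal"])

# Replace every "/" or "." in a column name with "_"
toy.columns = toy.columns.str.replace(r"[/.]", "_", regex=True)
print(list(toy.columns))
```

An explicit mapping, as used in the article, is still the safer choice when only specific columns should change.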

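The target column is “is_goal”. Dropping the rows where it is missing would shrink the dataset, so we impute the missing values with the category name “Unknown” and then convert the column to the boolean type. Keep in mind that the non-empty string “Unknown” becomes True under this cast, so the imputed rows are counted as goals.

df['is_goal'] = df['is_goal'].fillna(value='Unknown')
df['is_goal'] = df['is_goal'].astype("bool")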
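The effect of this cast is easy to check on a tiny example. The sketch below uses a made-up `target` series (an assumption for illustration) and also shows one possible alternative, keeping only the rows with a known label; the article itself does not do this:

```python
import numpy as np
import pandas as pd

# Hypothetical target values: goal, miss, missing, goal
target = pd.Series([1.0, 0.0, np.nan, 1.0], name="is_goal")

# Fill-then-cast: any non-empty string such as "Unknown" is truthy
filled = target.astype(object).fillna("Unknown").astype(bool)
print(filled.tolist())

# Alternative (not the article's approach): drop rows with an unknown target
labelled = target.dropna().astype(bool)
print(labelled.tolist())
```

Which variant is preferable depends on whether the unlabelled rows are meant to be used for training or held out for prediction.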

Applying pandas profiling and generating the profile report

import pandas_profiling as pf

profile = pf.ProfileReport(df=df, explorative=True)
profile.to_file(output_file="Profiling_report.html")

Analysis report from pandas profiler:-

1. Goal shot x and y location

Conclusion: Most of the goals come from location 0, followed by location 1

2. Area of Ronaldo's shots

Most of the shots are taken from the center

3. Most of the shots are made from the location (42.98, -71.44)

4. The most common shot type is shot - 39

5. The most common combined shot type is shot 3

6. Most of the shot power values are in the range 1–5

7. The most common shot distance is 20, whereas a distance of around 40 is, on average, the ideal shot distance

Removing type_of_shot and type_of_combined_shot

These two columns have the highest number of NaN values, so we drop them

df.drop(labels=['type_of_shot','type_of_combined_shot'],inplace=True,axis=1)

Determining the locations from which goals were scored

Conclusion: If the y location exceeds 250, the chances of a goal are reduced

import plotly.express as px

df_loc = df[['location_x','location_y','is_goal']]
fig = px.scatter(df_loc, x="location_x", y="location_y", color="is_goal", hover_data=['is_goal'])
fig.show()
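A conclusion like the one about y = 250 can be backed with numbers by comparing the goal rate on either side of the threshold. This is a minimal sketch on invented values (the `toy` frame is an assumption; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical shots: four at or below y = 250, four above
toy = pd.DataFrame({
    "location_y": [10, 120, 200, 240, 260, 300, 400, 500],
    "is_goal": [True, True, False, True, False, False, True, False],
})

# Split the shots into two bands around the threshold read off the scatter plot
toy["y_band"] = pd.cut(
    toy["location_y"],
    bins=[-float("inf"), 250, float("inf")],
    labels=["y <= 250", "y > 250"],
)

# The mean of a boolean column is the share of True values, i.e. the goal rate
goal_rate = toy.groupby("y_band", observed=True)["is_goal"].mean()
print(goal_rate)
```

The same pattern would work for other numeric columns, such as shot distance, by changing the column name and the bin edges.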

Determining relation between time and goal

Goals and misses both occur in every time interval, so it is difficult to determine a relation between the remaining time and the outcome.

df_time = df[['remaining_min','remaining_sec','is_goal']]
fig = px.scatter(df_time, x="remaining_sec", y="is_goal", color="is_goal", hover_data=['is_goal'])
fig.show()
fig = px.scatter(df_time, x="remaining_min", y="is_goal", color="is_goal", hover_data=['is_goal'])
fig.show()
df['power_of_shot'].unique()

Relation between power of shot and goal

Most of the goals fall in the power-of-shot range 1–3

px.box(df, x="power_of_shot", y="is_goal", points="all")

The best distance for a goal is 20 to 40. Moreover, when the distance exceeds 63, the chances of a goal are reduced

Relation between is_goal and area of shot

  • Since the counts are very close for goals and non-goals, no clear conclusion can be drawn
import cufflinks as cf
cf.go_offline()  # enables .iplot() on pandas objects

df[df['is_goal']][['area_of_shot']].value_counts().iplot(kind='bar')
df[df['is_goal']==False][['area_of_shot']].value_counts().iplot(kind='bar')
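When two bar charts look alike, a normalized cross-tabulation puts the goal and miss shares side by side for each area. The sketch below uses a made-up `toy` frame, and the `outcome` column is a helper introduced only for this illustration:

```python
import pandas as pd

# Hypothetical shots with their area and outcome
toy = pd.DataFrame({
    "area_of_shot": ["Center", "Center", "Left", "Left", "Right", "Center"],
    "is_goal": [True, False, True, False, False, True],
})

# Readable labels for the outcome (helper column, not part of the dataset)
toy["outcome"] = toy["is_goal"].map({True: "goal", False: "miss"})

# Share of goals and misses within each area; every row sums to 1
rates = pd.crosstab(toy["area_of_shot"], toy["outcome"], normalize="index")
print(rates)
```

If the goal share is nearly identical across areas in the real data, that would support the observation that area of shot alone says little about the outcome.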

Area of shot for goals

For a distance between 20 and 40 and a power of shot between 1 and 4, the area of shot should be the center, the left side or the right side

t = df[df['area_of_shot'].notna()]
px.scatter(x='power_of_shot',y='distance_of_shot',color='area_of_shot',data_frame=t)

Filling NaN values in numeric data using the KNN imputer

import numpy as np
from sklearn.impute import KNNImputer

# Impute each missing value in the float64 columns from the 4 most similar rows
imputer = KNNImputer(n_neighbors=4)
temp_df = imputer.fit_transform(df[df.columns[np.where(df.dtypes == 'float64')]].copy())
temp_df = pd.DataFrame(temp_df,columns=df.columns[np.where(df.dtypes == 'float64')])
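To make the imputation step concrete, here is a small sketch of how `KNNImputer` fills a gap. The `toy` values are invented and `n_neighbors=2` is chosen only to keep the example readable; the missing distance is filled with the mean distance of the two rows whose remaining features are closest:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric columns with one missing distance
toy = pd.DataFrame({
    "distance_of_shot": [20.0, 22.0, np.nan, 40.0],
    "power_of_shot": [1.0, 2.0, 2.0, 5.0],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Observed values are left untouched; only the NaN cells are replaced.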

Finding the outliers

  • To find the outliers, we first scale the data and then draw a box plot.
  • For scaling, I am using the quantile transformer.
  • Since the box plots have no data points beyond the whiskers, there are no outliers in the numeric data.
from sklearn import preprocessing

df_scale = temp_df.copy()
df_scale = preprocessing.QuantileTransformer().fit_transform(df_scale)
df_scale = pd.DataFrame(df_scale,columns=df.columns[np.where(df.dtypes == 'float64')])
df_scale.iplot(kind='box')
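The box plot check can also be expressed numerically with the usual 1.5 × IQR whisker rule. The helper `iqr_outlier_counts` and the `toy` data below are assumptions made for illustration, not part of the original notebook:

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values per column that fall outside the box plot whiskers."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Hypothetical columns: one with an extreme value, one without
toy = pd.DataFrame({
    "distance_of_shot": [1.0, 2.0, 3.0, 4.0, 100.0],
    "power_of_shot": [1.0, 2.0, 3.0, 4.0, 5.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Since `QuantileTransformer` maps each feature to a uniform distribution on [0, 1] by default, it may be worth running such a check on the unscaled columns as well, because the scaled values cannot fall outside the whiskers.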

Correlation of the numeric data

fig = px.imshow(df_scale.corr(), text_auto=True)
fig.update_layout(
    width=1000,
    height=1000,
    paper_bgcolor="LightSteelBlue",
)

Multicollinearity detection

  • There is no multicollinearity present, based on the Pearson coefficient
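Instead of reading the heatmap by eye, strongly correlated pairs can be listed programmatically. This is a sketch only: the helper `high_corr_pairs`, the 0.8 threshold and the `toy` data are all assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Hypothetical data: b is an exact multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

An empty result on `df_scale` would be consistent with the conclusion that no multicollinearity is present at the chosen threshold.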

Finding important features using ANOVA analysis

from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn import metrics
from sklearn.metrics import auc

def apply_f_classif(x, y, k):
    # Keep the k features with the highest ANOVA F-scores
    select_features = SelectKBest(f_classif, k=k)
    x_new = select_features.fit_transform(x, y)
    return pd.DataFrame(x_new)

def logistic_fn(x_train, y_train):
    model = LogisticRegression(solver='saga')
    model.fit(x_train, y_train)
    return model
result_dict = {}
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

def build_model(Y, features, X, preprocess_fn, *hyperparameters):
    X = preprocess_fn(X, Y, *hyperparameters)

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

    model = logistic_fn(x_train, y_train)
    y_pred = model.predict(x_test)

    # pos_label is left at its default: the labels are boolean, so there is no class 2
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
    print(fpr, tpr)

    # The value stored under 'accuracy' is the ROC AUC of the hard predictions
    acc = metrics.roc_auc_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    return {'accuracy': acc,
            'precision': prec,
            'recall': recall}
FEATURES = list(df_scale.columns[:-1])
result_dict = {}

# X (the feature DataFrame) and Y (the is_goal target) are assumed to be defined beforehand
for i in range(1, 12):
    result_dict['f_classif - ' + str(i)] = build_model(Y, FEATURES, X, apply_f_classif, i)

li = []

def compare_results(result_dict):
    for key in result_dict:
        print('Test: ', key)
        li.append(result_dict[key]['accuracy'])
        print("accuracy_score : ", result_dict[key]['accuracy'])
        print("precision_score : ", result_dict[key]['precision'])
        print("recall_score : ", result_dict[key]['recall'])
        print()

compare_results(result_dict)
Test: f_classif - 1

accuracy_score : 0.5365195451448821
precision_score : 0.5862383230804283
recall_score : 0.7468795355587808

Test: f_classif - 2

accuracy_score : 0.545386405803008
precision_score : 0.5937211449676824
recall_score : 0.7452912199362504

Test: f_classif - 3

accuracy_score : 0.5440874613027396
precision_score : 0.6010589318600368
recall_score : 0.7453611190408221

Test: f_classif - 4

accuracy_score : 0.5382528095567639
precision_score : 0.5820356578650417
recall_score : 0.7556401992382069

Test: f_classif - 5

accuracy_score : 0.5342773761095279
precision_score : 0.5740615868734547
recall_score : 0.7553978112984324

Test: f_classif - 6

accuracy_score : 0.542633584963786
precision_score : 0.5887640449438202
recall_score : 0.762292697119581

Test: f_classif - 7

accuracy_score : 0.5402292912741782
precision_score : 0.5870997047467635
recall_score : 0.75254730713246

Test: f_classif - 8

accuracy_score : 0.5491327495086207
precision_score : 0.5994550408719346
recall_score : 0.7599309153713298

Test: f_classif - 9

accuracy_score : 0.5421240303289414
precision_score : 0.5814831261101243
recall_score : 0.7712014134275619

Test: f_classif - 10

accuracy_score : 0.547833967626939
precision_score : 0.590274651058082
recall_score : 0.7657710280373832

Test: f_classif - 11

accuracy_score : 0.5456807518388342
precision_score : 0.5844912595248767
recall_score : 0.7675103001765744
px.line(li)
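The values printed as accuracy_score above are produced by `roc_auc_score` on hard class predictions. As a hedged sketch of how plain accuracy and a probability-based ROC AUC could be reported side by side, the example below trains on synthetic data; `X_toy`, `y_toy` and the random seed are assumptions, not the article's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the first feature carries the signal, the rest is noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=400)) > 0

x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
model = LogisticRegression().fit(x_train, y_train)

# Accuracy uses the hard class predictions
accuracy = accuracy_score(y_test, model.predict(x_test))
# ROC AUC is usually computed from the predicted probability of the positive class
auc_from_scores = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
# For comparison: ROC AUC on hard predictions, as in build_model
auc_from_labels = roc_auc_score(y_test, model.predict(x_test))

print(accuracy, auc_from_scores, auc_from_labels)
```

Reporting the metrics under their own names makes runs with different k easier to compare.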
from sklearn.feature_selection import SelectKBest, f_classif

select_features = SelectKBest(f_classif, k=7)
X_new = select_features.fit_transform(X, Y)
X_new = pd.DataFrame(X_new)
selected_features = []
# Match each selected column back to its name in X by comparing values
for i in range(len(X_new.columns)):
    for j in range(len(X.columns)):
        if X_new.iloc[:, i].equals(X.iloc[:, j]):
            selected_features.append(X.columns[j])

selected_features
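The names of the selected columns can also be read directly from the fitted selector through `get_support()`, and the ANOVA F-scores through `scores_`. The sketch below runs on synthetic data; `X_toy`, `y_toy` and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: one informative column and two pure-noise columns
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 100)
X_toy = pd.DataFrame({
    "signal": y_toy + rng.normal(scale=0.3, size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

selector = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)

# Boolean mask over the input columns, True for the selected ones
selected = list(X_toy.columns[selector.get_support()])
# F-score of every feature, useful for ranking
scores = pd.Series(selector.scores_, index=X_toy.columns)

print(selected)
print(scores.sort_values(ascending=False))
```

This avoids comparing column values pairwise and still works when two columns happen to contain identical values.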

After this, model building can begin. One can train models such as XGBoost or CatBoost on these features.

~By Pritul Dave

Find the whole EDA on my GitHub: https://github.com/pritul2/Exploratory-Data-Analysis

Connect with me on LinkedIn

https://www.linkedin.com/in/prituldave/

About me:-

I am a Computer Science Engineer with an interest in space applications.

I completed my bachelor’s at Charotar University of Science and Technology.

I have done prominent internships in the deep learning field at the Indian Space Research Organisation (ISRO), the Indian Institute of Technology Delhi (IIT-Delhi) and the Military College of Telecommunication Engineering (MCTE).

Moreover, I have published 3 research papers in reputed journals.

Pritul Dave :)

❖ Writes about Data Science ❖ MS CS @UTDallas ❖ Ex Researcher @ISRO , @IITDelhi, @MillitaryCollege-AI COE ❖ 3+ Publications ❖ linktr.ee/prituldave