Exploratory Data Analysis with Feature Engineering of Cristiano Ronaldo’s Goals
About Ronaldo:-
Cristiano Ronaldo is a phenomenon. The football superstar reached another landmark by becoming the first player in the 21st century to score a staggering 500 goals, and he also became the top scorer for a club like Real Madrid. Here we take a look at the Portuguese player’s career and some of his most amazing facts and stats. Statistically, Ronaldo has been unreal.
Of the 502 goals he has scored so far, 334 have come for Real Madrid, 118 for Manchester United, 55 for Portugal and 5 for Sporting Lisbon.
His completeness as a footballer can be seen in the following stats:
Right foot: 328
Left foot: 89
Head: 83
Other: 2
Columns in the dataset:-
import pandas as pd
df = pd.read_csv("yds_data.csv", index_col="Unnamed: 0")
df.columns
Counting null values in the dataset
df.isnull().sum()
The dataset contains many missing values, which makes it difficult to build a reliable predictive model without cleaning it first.
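Raw null counts are easier to compare as a share of all rows. The snippet below is a minimal sketch on a made-up toy frame (the values and the frame name `toy` are assumptions, not the real dataset) that ranks columns by their fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "power_of_shot": [1.0, np.nan, 3.0, 4.0],
    "area_of_shot": ["center", None, None, "left"],
    "is_goal": [1.0, 0.0, np.nan, np.nan],
})

# isnull() gives a boolean frame; the column mean is the share of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same two calls can be applied to `df` directly.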
Performing the data cleaning:-
Based on domain knowledge and a look at the data, some of the columns are not important, so we drop them
df.drop(labels=["match_event_id","knockout_match","shot_id_number","game_season","team_name","date_of_game","match_id","team_id"],axis=1,inplace=True)
For simplicity, we also rename some of the columns
df.rename(columns={"home/away":"home_away","lat/lng":"lat_lng","remaining_min.1":"remaining_min_1","power_of_shot.1":"power_of_shot_1","knockout_match.1":"knockout_match_1","remaining_sec.1":"remaining_sec_1","distance_of_shot.1":"distance_of_shot_1"},inplace=True)
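Here the target column is “is_goal”. Dropping the rows where it is missing would shrink the dataset, so we fill the missing values with the placeholder category “Unknown” and then convert the column to the boolean type
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['is_goal'] = df['is_goal'].fillna(value='Unknown')
# Note: astype("bool") treats the non-empty string 'Unknown' as True
df['is_goal'] = df['is_goal'].astype("bool")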
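It is worth checking what the fill-then-cast step does to the rows that had no label. The sketch below uses a made-up three-row target (an assumption for illustration). It shows that a non-empty placeholder string is cast to `True`, and contrasts this with dropping the unlabeled rows, which is an alternative and not the method used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical target column: 1.0 = goal, 0.0 = no goal, NaN = not recorded.
target = pd.Series([1.0, 0.0, np.nan], name="is_goal")

# Fill-then-cast: the placeholder string is truthy, so the missing row becomes True.
filled_then_cast = target.fillna("Unknown").astype("bool")
print(filled_then_cast.tolist())

# Alternative (an assumption, not the article's approach): keep only labeled rows.
labeled_only = target.dropna().astype("bool")
print(labeled_only.tolist())
```

If rows without a label should not count as goals, the second variant keeps the target clean at the cost of fewer rows.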
Applying pandas-profiling and generating the profile report
import pandas_profiling as pf  # the package is published as ydata_profiling in newer releases
profile = pf.ProfileReport(df=df, explorative=True)
profile.to_file(output_file="Profiling_report.html")
Analysis report from pandas-profiling:-
1. Goal shot x and y location
Conclusion: Most of the goals come from location 0, followed by location 1
2. Area of Ronaldo’s shots
Most of the shots are taken from the center
3. Most of the shots are made from the location (42.98, -71.44)
4. The most common type of shot is shot - 39
5. The most common combined shot type is shot 3
6. Most of the power of shot values are in the range 1 to 5
7. The most common shot distance is 20, whereas a distance of about 40 is, on average, the ideal shot distance
Removing type_of_shot and type_of_combined_shot
because they have the highest number of NaN values
df.drop(labels=['type_of_shot','type_of_combined_shot'],inplace=True,axis=1)
Determining the locations from which goals were scored
Conclusion: If the y location exceeds 250, the chances of a goal are reduced
import plotly.express as px
df_loc = df[['location_x','location_y','is_goal']]
fig = px.scatter(df_loc, x="location_x", y="location_y", color="is_goal", hover_data=['is_goal'])
fig.show()
Determining the relation between time and goal
Goals and misses both occur in every time interval, so it is difficult to draw a conclusion from the remaining time alone
df_time = df[['remaining_min','remaining_sec','is_goal']]
fig = px.scatter(df_time, x="remaining_sec", y="is_goal", color="is_goal", hover_data=['is_goal'])
fig.show()
df_time = df[['remaining_min','remaining_sec','is_goal']]
fig = px.scatter(df_time, x="remaining_min", y="is_goal", color="is_goal", hover_data=['is_goal'])
fig.show()
df['power_of_shot'].unique()
Relation between power of shot and goal
Most of the goals fall in the power_of_shot range of 1 to 3
px.box(df, x="power_of_shot", y="is_goal", points="all")
The best shot distance for a goal is 20 to 40. Moreover, when the distance exceeds 63, the chances of a goal are reduced
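A distance claim like this can be checked numerically by binning the distance and computing the share of goals per bin. The following is a rough sketch on made-up shots (the values and the name `shots` are assumptions). Only the bin edges 20, 40 and 63 come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical shots: distance and outcome (made-up values for illustration).
shots = pd.DataFrame({
    "distance_of_shot": [10, 25, 30, 35, 50, 70, 80],
    "is_goal": [False, True, True, False, False, False, False],
})

# Bin edges follow the thresholds quoted in the text: 20, 40 and 63.
bins = [0, 20, 40, 63, np.inf]
shots["distance_bin"] = pd.cut(shots["distance_of_shot"], bins=bins)

# The mean of a boolean column is the share of goals in each bin.
goal_rate = shots.groupby("distance_bin", observed=False)["is_goal"].mean()
print(goal_rate)
```

Comparing rates rather than raw counts avoids favouring the bins that simply contain more shots.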
Relation between goal and area of shot
- Since the counts per area are very close for goals and non-goals, no clear conclusion can be drawn
# .iplot comes from cufflinks (import cufflinks as cf; cf.go_offline())
df[df['is_goal']][['area_of_shot']].value_counts().iplot(kind='bar')
df[df['is_goal']==False][['area_of_shot']].value_counts().iplot(kind='bar')
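When two bar charts of counts look alike, a normalized cross-tabulation can make the comparison easier to read. This is a small sketch on invented rows (the area names, values and the helper column `outcome` are assumptions) that reports the goal share within each area.

```python
import pandas as pd

# Hypothetical shots with an area label and an outcome (made-up values).
toy = pd.DataFrame({
    "area_of_shot": ["center", "center", "center", "left", "left", "right"],
    "is_goal": [True, False, True, False, False, True],
})

# Readable outcome labels make the resulting columns easy to index.
toy["outcome"] = toy["is_goal"].map({True: "goal", False: "miss"})

# normalize="index" makes every row (area) sum to 1.
rates = pd.crosstab(toy["area_of_shot"], toy["outcome"], normalize="index")
print(rates)
```

Each row then reads as "of the shots taken from this area, what share ended as a goal".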
Area of shot for goals
For distances between 20 and 40 and a power of shot between 1 and 4, the area of shot should be the center, the left side or the right side
t = df[df['area_of_shot'].notna()]
px.scatter(x='power_of_shot',y='distance_of_shot',color='area_of_shot',data_frame=t)
Filling NaN values in numeric data using the KNN imputer
import numpy as np
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=4)
# Impute only the float64 columns, then rebuild a DataFrame with the same column names
num_cols = df.columns[np.where(df.dtypes == 'float64')]
temp_df = imputer.fit_transform(df[num_cols].copy())
temp_df = pd.DataFrame(temp_df, columns=num_cols)
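To see what the imputer does, here is a minimal sketch on a four-row toy frame (made-up values, with `n_neighbors=2` chosen only to keep the arithmetic easy to follow). The missing value is replaced by the mean of that column over the nearest rows, where nearness is measured on the columns that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric frame; one power_of_shot value is missing.
toy = pd.DataFrame({
    "distance_of_shot": [1.0, 2.0, 3.0, 100.0],
    "power_of_shot": [10.0, 20.0, np.nan, 1000.0],
})

imputer = KNNImputer(n_neighbors=2)
# Passing index=toy.index keeps the rows aligned with the original frame.
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled)
```

The two rows closest to the incomplete one have distances 2.0 and 1.0, so the gap is filled with the average of their `power_of_shot` values.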
Finding the outliers
- To find outliers, we first scale the data and then draw a box plot.
- For scaling, I am using the quantile transformer.
- The box plot shows no data points beyond the whiskers, so no outliers are visible in the scaled numeric data. Note that QuantileTransformer maps every feature to a uniform [0, 1] distribution by default, which pulls extreme values in, so this check says little about outliers in the raw values.
from sklearn import preprocessing
df_scale = temp_df.copy()
df_scale = preprocessing.QuantileTransformer().fit_transform(df_scale)
df_scale = pd.DataFrame(df_scale, columns=df.columns[np.where(df.dtypes == 'float64')])
df_scale.iplot(kind='box')
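One way to cross-check for outliers on unscaled values is the usual 1.5 × IQR rule that box plot whiskers are based on. The sketch below applies it to a made-up series (the values are assumptions). The same lines could be run per numeric column of the imputed frame.

```python
import pandas as pd

# Hypothetical raw values with one obvious extreme point.
values = pd.Series([1, 2, 3, 4, 5, 100], name="distance_of_shot")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```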
Correlation of the numeric data
fig = px.imshow(df_scale.corr(),text_auto=True)
fig.update_layout(
    width=1000,
    height=1000,
    paper_bgcolor="LightSteelBlue",
)
Multicollinearity detection
- No multicollinearity is present based on the Pearson correlation coefficients
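Instead of reading the heatmap by eye, the highly correlated pairs can be listed programmatically. This is a sketch on three made-up columns, with a cut-off of 0.8 that is an assumption and not a value from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical features: b is almost a multiple of a, c is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.8  # assumed cut-off for "highly correlated"
high_pairs = [(r, c) for (r, c), v in upper.stack().items() if v > threshold]
print(high_pairs)
```

An empty list on the real data would support the conclusion that no strongly collinear pair exists at the chosen threshold.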
Finding important features using the ANOVA F-test
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn import metrics
from sklearn.model_selection import train_test_split

def apply_f_classif(x, y, k):
    # Keep the k features with the highest ANOVA F-score
    select_features = SelectKBest(f_classif, k=k)
    x_new = select_features.fit_transform(x, y)
    return pd.DataFrame(x_new)

def logistic_fn(x_train, y_train):
    model = LogisticRegression(solver='saga')
    model.fit(x_train, y_train)
    return model

def build_model(Y, features, X, preprocess_fn, *hyperparameters):
    X = preprocess_fn(X, Y, *hyperparameters)
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    model = logistic_fn(x_train, y_train)
    y_pred = model.predict(x_test)
    # is_goal is boolean, so the default positive label (True) is used rather than pos_label=2
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
    print(fpr, tpr)
    # Note: this value is the ROC AUC of the hard predictions, stored under the key 'accuracy'
    acc = metrics.roc_auc_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    return {'accuracy': acc,
            'precision': prec,
            'recall': recall}

FEATURES = list(df_scale.columns[:-1])
# X (feature frame) and Y (target) must exist before this point; the listing does not show
# how they were built. One assumption: X = df_scale[FEATURES], Y = df['is_goal'].to_numpy()
result_dict = {}
for i in range(1, 12):
    result_dict['f_classif - ' + str(i)] = build_model(Y, FEATURES, X, apply_f_classif, i)

li = []
def compare_results(result_dict):
    for key in result_dict:
        print('Test: ', key)
        li.append(result_dict[key]['accuracy'])
        print("accuracy_score : ", result_dict[key]['accuracy'])
        print("precision_score : ", result_dict[key]['precision'])
        print("recall_score : ", result_dict[key]['recall'])
        print()
compare_results(result_dict)
Test: f_classif - 1
accuracy_score : 0.5365195451448821
precision_score : 0.5862383230804283
recall_score : 0.7468795355587808
Test: f_classif - 2
accuracy_score : 0.545386405803008
precision_score : 0.5937211449676824
recall_score : 0.7452912199362504
Test: f_classif - 3
accuracy_score : 0.5440874613027396
precision_score : 0.6010589318600368
recall_score : 0.7453611190408221
Test: f_classif - 4
accuracy_score : 0.5382528095567639
precision_score : 0.5820356578650417
recall_score : 0.7556401992382069
Test: f_classif - 5
accuracy_score : 0.5342773761095279
precision_score : 0.5740615868734547
recall_score : 0.7553978112984324
Test: f_classif - 6
accuracy_score : 0.542633584963786
precision_score : 0.5887640449438202
recall_score : 0.762292697119581
Test: f_classif - 7
accuracy_score : 0.5402292912741782
precision_score : 0.5870997047467635
recall_score : 0.75254730713246
Test: f_classif - 8
accuracy_score : 0.5491327495086207
precision_score : 0.5994550408719346
recall_score : 0.7599309153713298
Test: f_classif - 9
accuracy_score : 0.5421240303289414
precision_score : 0.5814831261101243
recall_score : 0.7712014134275619
Test: f_classif - 10
accuracy_score : 0.547833967626939
precision_score : 0.590274651058082
recall_score : 0.7657710280373832
Test: f_classif - 11
accuracy_score : 0.5456807518388342
precision_score : 0.5844912595248767
recall_score : 0.7675103001765744
px.line(li)
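The scores above differ by less than 0.02 across the values of k, and every run draws a new random train/test split, so part of that difference may be split noise. A possible way to compare k more stably is cross-validation with the selector placed inside a pipeline. The sketch below runs on synthetic stand-in data from `make_classification` (an assumption, not the Ronaldo dataset) and uses the default logistic regression solver instead of `saga`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 11 numeric features, binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=4, random_state=0
)

mean_auc = {}
for k in range(1, 12):
    # The selector sits inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(
        SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000)
    )
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
    mean_auc[k] = scores.mean()

best_k = max(mean_auc, key=mean_auc.get)
print(best_k, round(mean_auc[best_k], 3))
```

With `scoring="roc_auc"` the AUC is computed from predicted scores rather than hard class labels, which is the more common way to report it.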
from sklearn.feature_selection import SelectKBest, f_classif
select_features = SelectKBest(f_classif, k=7)
X_new = select_features.fit_transform(X, Y)
X_new = pd.DataFrame(X_new)
selected_features = []
# Match each selected column back to its name in X by comparing the values
for i in range(len(X_new.columns)):
    for j in range(len(X.columns)):
        if X_new.iloc[:, i].equals(X.iloc[:, j]):
            selected_features.append(X.columns[j])
selected_features
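The names of the selected columns can also be read straight from the fitted selector, since `SelectKBest.get_support()` returns a boolean mask over the input columns. Here is a minimal sketch on made-up data (the column names and values are assumptions) where only one feature separates the two classes.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical features: only "signal" differs between the two classes.
X_demo = pd.DataFrame({
    "signal": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],
    "noise_a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "noise_b": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],
})
y_demo = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(f_classif, k=1).fit(X_demo, y_demo)

# get_support() is a boolean mask aligned with the columns of X_demo.
selected = list(X_demo.columns[selector.get_support()])
print(selected)
```

This avoids comparing column values pairwise and still works when two columns happen to hold identical values.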
After this, model building can begin. One can train models such as XGBoost or CatBoost on these features.
~By Pritul Dave
Find the whole EDA on my GitHub: https://github.com/pritul2/Exploratory-Data-Analysis
Connect with me on LinkedIn
About me:-
I am a Computer Science Engineer with an interest in space applications.
I completed my bachelor’s degree at Charotar University of Science and Technology.
I did prominent internships in the deep learning field at the Indian Space Research Organisation (ISRO), the Indian Institute of Technology Delhi (IIT-Delhi) and the Military College of Telecommunication Engineering (MCTE).
Moreover, I have published 3 research papers in reputed journals.