02 Mar Footballer Names and Python Lists
Earlier today I shit-posted on Twitter about player names after seeing the goalscorers in the United vs Southampton match.
Yann Valery, Dwight Gayle, Ashley Cole, how many other players’ names are a combination of a boy’s and girl’s name?— Peter McKeever Ô_ō (@petermckeever) March 2, 2019
After getting no response I decided to do some sleuthing of my own. This is by no means a deep dive, just a very short look at which players have girls' first names as family names (it's really a chance to take a look at working with strings and lists).
There are two datasets I want to take a look at for this post. The first one will be the player information I have for the Player Elo model, and the second is a list of baby names for girls.
First I load in the player information dataset and do a little cleaning which leaves us with this:
import pandas as pd

p_df = pd.read_csv("master_player_details.csv")
### do boring cleaning etc here ##
print(p_df.head())  # show first 5 rows
FirstName LastName pID
0 Onur Okan 37512
1 Marcelo Caro 164572
2 Daoud Mahli 211677
3 Sun Cheung Lai 92016
4 Alejandro Fabián Prieto Romero 149157
Our dataset is pretty extensive, covering over 370,000 players across 359 competitions from youth level to first team. I’m not really prepared to scour the internet for lists of names to cover the various countries in our data, so for today I am using lists of English, German, and French names.
I got the girls' names from here, here, here, and here if you are interested in following along. I won't bore you with posting massive lists of names. Instead I want to share a pretty neat trick for those who work in Visual Studio. If we go to the first link, copy the data and paste it into Visual Studio, we end up with this:
It's a list of 1000 popular baby names for girls. As it's relatively infrequent, I manually went through the list and deleted the 4 occurrences of "Related Post" and the line that follows each one. This leaves us with a bunch of names; however, as is, it is not formatted to be readable in Python (each name would be treated as an undefined variable). So what can we do here? We can either play a fool's game and go to each line, adding a " before and after the name and a comma on the end, OR we can use multi-cursor:
Multi-cursor can be accessed by selecting the lines you want (select all in our case) and pressing ALT+SHIFT+I. We can then jump between the start and end of each line with CMD + LEFT/RIGHT ARROW.
Once we add in our " marks and commas, all we need to do is give it a name, add [ at the beginning of the list, and ] at the end (remembering to remove the final comma). On a small list it looks like this:
gList1 = ["Emma", "Sarah", "Rosie", "Christina", "Julia"]
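If multi-cursor isn't to hand, the quoting can also be left to Python itself. This is only a minimal sketch of an alternative, assuming the pasted names sit one per line inside a triple-quoted string; the `raw_names` variable is made up for illustration.

```python
# Hypothetical pasted text: one name per line, as copied from the web page
raw_names = """
Emma
Sarah
Rosie
Christina
Julia
"""

# splitlines() breaks the block into lines; strip() trims stray whitespace,
# and the `if` drops any blank lines
gList1 = [line.strip() for line in raw_names.splitlines() if line.strip()]
print(gList1)  # ['Emma', 'Sarah', 'Rosie', 'Christina', 'Julia']
```

The result is the same kind of list as the hand-built one, without typing a single quote mark or comma.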
Our second list presents us with a different challenge. After we turn it into a list we’re left with a list of names and numbers.
"Areli", "980", "Fabiola", "981", "Lina", "982", "Hillary", "983", "Mireya", "984", "Christiana", "985", "Dania",
1000 numbers are far too many to manually go through and remove, so we'll create our list (gList2) and deal with it another way:
gList2 = [item for item in gList2 if not item.isdigit()]
gList2
Here we are using a list comprehension to go through our list: any string made up entirely of digits is removed, which leaves us with only the names.
["Areli", "Fabiola", "Lina", "Hillary", "Mireya", "Christiana", "Dania"]
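To see what the comprehension is doing, here is a small sketch on a short sample in the same shape as the pasted data. The `sample` list and the variable names are just for illustration; the explicit loop underneath is the long-hand equivalent.

```python
# Short sample in the same shape as the pasted data: names mixed with rank numbers
sample = ["Areli", "980", "Fabiola", "981", "Lina", "982", "Hillary"]

# str.isdigit() is True only when every character in the string is a digit,
# so "980" is dropped and "Areli" is kept
names_only = [item for item in sample if not item.isdigit()]

# The same filter written as an ordinary loop
names_loop = []
for item in sample:
    if not item.isdigit():
        names_loop.append(item)

print(names_only)                 # ['Areli', 'Fabiola', 'Lina', 'Hillary']
print(names_only == names_loop)   # True
```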
After doing the same with the other two lists, we are left with four lists. Let’s combine them and make sure there are no duplicates by using set.
name_list = list(set(gList1 + gList2 + gList3 + gList4))
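A quick sketch of what the set step does, using two tiny made-up lists rather than the real ones. One thing worth knowing: a set does not keep the original order, so `sorted()` is used here purely to get a predictable printout.

```python
# Two tiny made-up lists with some overlap
list_a = ["Emma", "Sarah", "Julia"]
list_b = ["Sarah", "Lina", "Emma"]

combined = list_a + list_b          # 6 items, duplicates included
unique_names = list(set(combined))  # duplicates removed, order not guaranteed

print(len(combined), len(unique_names))  # 6 4
print(sorted(unique_names))              # ['Emma', 'Julia', 'Lina', 'Sarah']
```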
Now we have our list of girls' names, let's compare it against our football player list to see how many matches we get. Again, I am only using a limited sample of girls' names, so an overall result here will not be valid, as we have players from Asia, the Middle East, Africa, South America etc. whose names will not be well represented in it. I'll take a look at the last names for our purposes today. There are two ways we can do this. The first way, you guessed it, is playing a fool's game, and we don't want that:
df = p_df[(p_df['lastName'] == "Amy") | (p_df['lastName'] == "Sarah") | (p_df['lastName'] == "Claire") ... ]
This is a multiple-criteria filter we can use in pandas, which can be useful when filtering over multiple columns. However, we are looking at just one column, and we have far too many names to look for. So, instead of doing that, we can just ask Python if each last name in our list of players is in our list of girls' names:
import numpy as np
p_df['isMatch'] = np.where(p_df.lastName.isin(name_list),1,0)
Here I am creating a new column in our existing dataset called "isMatch". Using numpy, we go through each row: if the player's last name is in our list of girls' names, that row gets a value of 1, otherwise it gets a value of 0. Comparing the number of matches against our player list, we see that 0.88% of players have a girl's first name as their last name. Again, huge pinch of salt here (in a meaningless article). We can then use pandas groupby to count how many matches we got and sort the values:
res = p_df[p_df['isMatch'] == 1]
res = pd.DataFrame(res.groupby(['lastName'])['isMatch'].count())
res.sort_values("isMatch",ascending=False,inplace=True)
Here’s a look at the top hits
Of course there are some obvious ones here courtesy of the USA, but I didn’t expect so many Laras, Karas, or Veras.
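For anyone who wants to try the matching and counting steps without the full dataset, here is a small self-contained sketch. The players and names are made up for illustration (they are not rows from the real data, and the percentage is not the 0.88% figure above); `toy_df`, `toy_names`, `pct_match` and `counts` are names invented for this example.

```python
import numpy as np
import pandas as pd

# Made-up players, for illustration only
toy_df = pd.DataFrame({
    "lastName": ["Lara", "Smith", "Vera", "Lara", "Jones", "Kara", "Lara", "Vera"],
})
toy_names = ["Lara", "Vera", "Kara", "Emma"]

# isin() gives a True/False value per row; np.where turns that into 1/0
toy_df["isMatch"] = np.where(toy_df["lastName"].isin(toy_names), 1, 0)

# The mean of a 1/0 column is the share of rows that matched
pct_match = toy_df["isMatch"].mean() * 100

# Keep the matches, count them per surname, and sort largest first
counts = (
    toy_df[toy_df["isMatch"] == 1]
    .groupby("lastName")["isMatch"]
    .count()
    .sort_values(ascending=False)
)

print(pct_match)  # 75.0
print(counts)     # Lara 3, Vera 2, Kara 1
```

Taking the mean of the 1/0 column is one simple way to get the percentage of matches, which is a handy side effect of storing the flag as a number rather than True/False.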
Because I love a good viz, here’s a little bar chart using the top 50 names we got a match on:
import matplotlib.pyplot as plt

x = res.index[0:50]
y = res.isMatch[0:50]
fig,ax = plt.subplots(figsize=(14,10))
#plt.plot(x,y)
plt.bar(x,y,edgecolor="black",zorder=2,alpha=0.8)
plt.xticks(x,rotation=90)
plt.xlabel("Player Name")
plt.ylabel("Count of Players")
plt.title("Players with a Female's name as a Surname",fontsize=18)
plt.grid(alpha=0.3,zorder=1)
#plt.legend()
plt.tight_layout()
plt.savefig("shitpost", bbox_inches="tight",dpi=300)
plt.show()
As a final note, WHO IN THE WORLD IS NAMING THEIR DAUGHTERS DAVID, DANIEL, PAUL, OR GABRIEL???
Any questions or thoughts, you can contact me as always through the contact page or on Twitter here.