Transfer Recruitment Methodology

Explaining the methodology used to identify transfer targets in our recruitment pieces.

Throughout the recruitment process, several methods were used to identify and narrow down transfer targets for each of the teams we looked at. Below is a step-by-step breakdown of our methodology:

  1. Step 1: Similarities. Similarity scores were calculated for teams, leagues and players using Euclidean distances. Euclidean distance was chosen over an alternative such as cosine similarity because the magnitude of the difference between two profiles was essential, whereas cosine similarity only compares their direction. Team and league similarities were used to gauge how easily a player would be able to adapt to a new league and team, based on how stylistically similar they are to where the player is coming from. In theory, signing players from stylistically similar teams and leagues would mean a higher chance of success, so we attempted to incorporate that into our method. Player similarities were only used when trying to identify like-for-like replacements (e.g. Thomas Partey’s replacement at Atletico Madrid).

    • Team Similarity:

      • Euclidean distances were used to calculate the similarity of teams compared to the subject team. The teams with the lowest distances were considered the most similar. Offensive and defensive similarities were calculated.

      • Metrics used for Defensive style similarity calculation:

        • Expected Goals Conceded per Game

        • Shots Against per Game

        • PPDA (Passes per Defensive Action)

        • Shot Quality Against (Expected Goals Conceded per Shot)

      • Metrics used for Attacking style similarity calculation:

        • Expected Goals per Game

        • Average Possession

        • Shots per Game

        • Crosses per Game

        • Passes per Game

        • Directness (Passes per Shot)

        • Shot Quality (Expected Goals per Shot)
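The team comparison above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the pieces: the team names and figures are made up, and the z-score standardisation step is an assumption, since the text does not say how the team metrics were scaled before the distances were calculated.

```python
import math
from statistics import mean, pstdev

# Hypothetical defensive-style metrics per team (illustrative numbers only):
# xG conceded per game, shots against per game, PPDA, xG conceded per shot.
teams = {
    "Subject FC": [1.10, 11.0, 9.5, 0.100],
    "Team A": [1.15, 11.5, 9.0, 0.100],
    "Team B": [1.60, 15.0, 14.0, 0.107],
    "Team C": [0.90, 9.0, 8.0, 0.100],
}


def standardise(rows):
    """Z-score each metric column so no metric dominates through its units."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def euclidean(a, b):
    """Straight-line distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(data, subject):
    """Return (team, distance) pairs, most similar (smallest distance) first."""
    names = list(data)
    scaled = dict(zip(names, standardise([data[n] for n in names])))
    distances = {
        n: euclidean(scaled[n], scaled[subject]) for n in names if n != subject
    }
    return sorted(distances.items(), key=lambda item: item[1])


ranking = rank_by_similarity(teams, "Subject FC")
for name, dist in ranking:
    print(f"{name}: {dist:.2f}")
```

Because Euclidean distance keeps magnitude, a team with the same shape of profile but a much higher volume of shots or passes still ends up far from the subject team, which a purely directional measure would not capture.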

    • League Similarity:

      • A similarity matrix was calculated using Euclidean distances between leagues. In this example we had 12 leagues, giving us a 12x12 matrix. Each distance was then converted into a similarity score (a percentage) based on how it compared with the other distances in the matrix. Two matrices were eventually computed: one based on the relative rank of each distance within the entire data set, the other based on how each distance compares to the 5th and 95th percentiles of the data set. The results from both were very similar.

      • Metrics used for Euclidean Distance calculation:

        • Expected Goals per Game

        • Chance Quality (Expected Goals per Shot)

        • Headed Shots per Game

        • Shots per Game

        • Crosses per Game

        • Dribbles per Game

        • Touches in Penalty Area per Game

        • Defensive Duels per Game

        • Aerial Duels per Game

        • Challenge Intensity (Defensive Actions per Minute)

        • PPDA

        • Passes per Game

        • Long Passes per Game

        • Final Third Passes per Game

        • Passes per Shot

        • Through Passes per Game
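As a rough sketch of the league matrix and the two scoring schemes: the league profiles below are invented, and the two scoring formulas are assumptions, since the text only says that one score was rank-based and the other was based on the 5th and 95th percentiles.

```python
import math

# Hypothetical, already-normalised league profiles (illustrative values only).
leagues = {
    "League A": [0.2, 0.8, 0.5],
    "League B": [0.3, 0.7, 0.5],
    "League C": [0.9, 0.1, 0.4],
    "League D": [0.6, 0.4, 0.9],
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


names = list(leagues)
# Symmetric N x N distance matrix (12 x 12 in the article, 4 x 4 here).
matrix = [[euclidean(leagues[a], leagues[b]) for b in names] for a in names]

# Pool every league pair once; scores are relative to this pool of distances.
pairs = sorted(
    matrix[i][j] for i in range(len(names)) for j in range(i + 1, len(names))
)


def rank_score(distance):
    """Assumed rank-based score: share of pairs that are at least as far apart."""
    return 100.0 * sum(1 for p in pairs if p >= distance) / len(pairs)


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, with q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


p5, p95 = percentile(pairs, 0.05), percentile(pairs, 0.95)


def band_score(distance):
    """Assumed percentile score: 100 at the 5th percentile, 0 at the 95th."""
    clipped = min(max(distance, p5), p95)
    return 100.0 * (p95 - clipped) / (p95 - p5)


for j in range(1, len(names)):
    d = matrix[0][j]
    print(f"{names[0]} vs {names[j]}: rank {rank_score(d):.0f}%, band {band_score(d):.0f}%")
```

Clipping to the 5th and 95th percentiles keeps a single extreme pair of leagues from stretching the scale for everyone else, which may be one reason the two versions gave similar results.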

    • Player Similarity:

      • Players were split into 6 main position groups using Wyscout. The position groups were: Full-Back, Centre-Back, Central Midfielder, Attacking Midfielder, Wide Midfielder, Striker. Metrics were then chosen for each position group based on how relevant each metric was to measuring style in that position. For example, touches in the penalty area was used for strikers but not for centre-backs. Metrics that reflected player style (not performance) were then normalized and used to calculate the Euclidean distance.

      • As an example, let’s take Thomas Partey at Atletico Madrid. To replace Thomas Partey, we calculated the Euclidean distance between him and every central midfielder in the dataset; the players with the smallest distances were considered the most similar.

      • Based on the previous example, here are the metrics used to measure the similarity of central midfielders:

        • Defensive: PAdj Successful Defensive Actions p90, PAdj Interceptions, PAdj Defensive Duels p90, Ratio of Ground to Aerial Duels, PAdj Aerial Duels, PAdj Shots Blocked, Total Ground Duels (Offensive + Defensive) p90.

        • Passing: % of Passes Forward, % of Passes Lateral, % of Passes Backwards, % of Passes Short/Medium, % of Passes Long, % of Passes Progressive, % of Passes into Final Third, % of Passes into Penalty Area, % of Passes Smart, % of Passes Through.

        • Attacking: Expected Assists p90, Dribbles per Pass, Ratio of Progressive Runs to Passes (Carrier or Passer), Expected Goals p90, Shots per Pass.

      • As you can see, most stats were adjusted per pass or possession in an attempt to minimize the influence of a team’s style/performance.

      • A similarity ‘score’ was then given based on the distance between every player and the subject player, in this case Thomas Partey. The score was calculated using a percentile rank of the distances.
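The like-for-like search could look something like the sketch below. The players and values are hypothetical, only three of the listed metrics are used, and min-max scaling is an assumption for the "normalized" step; the score follows the text in being a percentile rank of the distances to the subject player.

```python
import math

# Hypothetical central midfielders with three style metrics (made-up values):
# % of passes forward, PAdj interceptions, dribbles per pass.
players = {
    "Subject": [30.0, 6.0, 0.020],
    "Player A": [31.0, 5.8, 0.022],
    "Player B": [45.0, 3.0, 0.060],
    "Player C": [28.0, 7.0, 0.015],
    "Player D": [38.0, 4.5, 0.040],
}


def min_max(rows):
    """Scale every metric column to the 0-1 range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid dividing by zero
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def similarity_scores(data, subject):
    """Percentile-rank similarity (0-100) of every player to the subject."""
    names = list(data)
    scaled = dict(zip(names, min_max([data[n] for n in names])))
    dist = {n: math.dist(scaled[n], scaled[subject]) for n in names if n != subject}
    # Score = share of candidates that are at least as far away as this one.
    return {
        n: 100.0 * sum(1 for other in dist.values() if other >= d) / len(dist)
        for n, d in dist.items()
    }


scores = similarity_scores(players, "Subject")
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.0f}")
```

A percentile rank only says how a candidate compares with the rest of the pool, so the closest player always scores 100 even if nobody in the data set is truly similar.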

  2. Step 2: Transfer Prices. A straightforward linear regression, using several variables, was used to estimate a potential transfer cost, allowing us to work under budgetary constraints more accurately. The historical data covered only transfers made by the team in question and excluded any Bosman (free) transfers. All data was post-2010 and collected from Transfermarkt. Variables used:

    • Transfermarkt Market Value (in Euros).

    • Player Age at time of transfer.

    • Contract length: A dummy variable, set to 1 when the player has more than 1 year remaining on their contract and 0 when their contract expires within a year.

    • League the player would be arriving from: This is subjective, but a score from 1-3 was given to each league. Top 5 leagues were given a score of 1, while the other leagues were given scores of 2 or 3 based on their perceived prestige/global size.

    The final R-squared differed because the historical data was different for each team; the lowest R-squared was 0.71 and the highest was 0.84. Coefficients and p-values also differed somewhat in magnitude between clubs, but the trends remained similar (age & contract length being inversely correlated to price, etc.). The historical data for each club only contained that club’s transfers because of the premiums certain clubs pay.

    For example, it is well known that some of the bigger English clubs have to spend a larger amount of money to sign a player compared to other clubs. By only using historical data relevant to that particular club, we attempted to factor that premium into our calculations.
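To make the price model concrete, here is a small ordinary least squares sketch using only the standard library. The ten transfers are invented for illustration (values and fees in millions of euros), and the hand-rolled solver is an assumption made for self-containment; in practice a statistics package would do the fitting and also report p-values.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(features, target):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(row) for row in features]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def r_squared(coef, features, target):
    mean_y = sum(target) / len(target)
    ss_res = sum((y - predict(coef, row)) ** 2 for row, y in zip(features, target))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical past transfers for one club:
# (market value in EUR m, age, contract dummy, league score 1-3).
features = [
    (20, 24, 1, 1), (10, 27, 0, 2), (35, 23, 1, 1), (15, 29, 1, 3), (50, 25, 1, 1),
    (8, 22, 0, 2), (25, 30, 0, 1), (12, 26, 1, 2), (40, 28, 1, 1), (18, 21, 0, 3),
]
fees = [25, 7, 45, 12, 60, 8, 18, 13, 42, 16]  # fee paid, EUR m (made up)

coef = fit_ols(features, fees)
r2 = r_squared(coef, features, fees)
estimate = predict(coef, (30, 24, 1, 1))  # a hypothetical target
print(f"R-squared: {r2:.2f}, estimated fee: {estimate:.1f}m")
```

Fitting one model per club, as described above, means the intercept and coefficients absorb whatever premium that club has historically paid.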

  3. Step 3: Performance Score: After identifying the needs for each position in Part 1, a performance score was calculated by normalizing the data per position using min-max normalization. The player with the highest value in a given metric would have a score of 1, and the player with the lowest a score of 0. A final performance score was then calculated as the arithmetic average of the normalized scores for the relevant metrics, to be used as a benchmark and an additional filtering tool. The performance score was not used to make a final decision, but rather as a filtering tool to help narrow down a shortlist.
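A minimal sketch of the performance score, assuming every metric is "higher is better" (a metric where lower is better would need to be inverted first). The striker names, metric names and values are hypothetical.

```python
def min_max_normalise(values):
    """Map the lowest value to 0 and the highest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread, so no information
    return [(v - lo) / (hi - lo) for v in values]


def performance_scores(players, metrics):
    """Arithmetic mean of the normalised metrics for each player."""
    names = list(players)
    normalised = {
        m: min_max_normalise([players[n][m] for n in names]) for m in metrics
    }
    return {
        n: sum(normalised[m][i] for m in metrics) / len(metrics)
        for i, n in enumerate(names)
    }


# Hypothetical strikers, all from the same position group.
strikers = {
    "Striker A": {"xg_p90": 0.60, "shots_p90": 3.5},
    "Striker B": {"xg_p90": 0.30, "shots_p90": 2.0},
    "Striker C": {"xg_p90": 0.45, "shots_p90": 3.5},
}

scores = performance_scores(strikers, ["xg_p90", "shots_p90"])
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Since the scaling is relative to the players in the group, a single outlier in one metric compresses everyone else's score in that metric, which is worth keeping in mind when the score is used as a filter.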

  4. Step 4: Creating initial shortlists: With price, similarity and performance score now calculated, the data was filtered to eventually create a 10-man shortlist using the three aforementioned variables.
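The filtering step might be sketched as below. The thresholds, the field names and the choice to sort the survivors by performance score are all assumptions for illustration; the text only says that the three variables were used to filter down to ten players.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    estimated_fee_m: float  # output of the price model, EUR m
    similarity: float  # team/league/player similarity, 0-100
    performance: float  # performance score, 0-1


def build_shortlist(candidates, budget_m, min_similarity, size=10):
    """Drop players over budget or too dissimilar, then keep the top performers."""
    eligible = [
        c
        for c in candidates
        if c.estimated_fee_m <= budget_m and c.similarity >= min_similarity
    ]
    eligible.sort(key=lambda c: c.performance, reverse=True)
    return eligible[:size]


# Hypothetical candidates.
candidates = [
    Candidate("Player A", 20.0, 85.0, 0.70),
    Candidate("Player B", 60.0, 95.0, 0.90),  # over budget
    Candidate("Player C", 15.0, 40.0, 0.80),  # not similar enough
    Candidate("Player D", 25.0, 75.0, 0.82),
]

shortlist = build_shortlist(candidates, budget_m=30.0, min_similarity=70.0)
print([c.name for c in shortlist])
```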

  5. Step 5: 3-man shortlist and final pick: Players were then video scouted further, and a final decision was made using each player’s data as well as our assessment of them from video analysis.

Final Thoughts & Limitations:

Despite attempting to factor in as much information as possible, our methods were not perfect. There were several limitations which could and arguably should be improved on.

  1. Leagues chosen. Ideally we would’ve liked to have included more leagues. In the end, the leagues included in the dataset were: Premier League, Ligue 1, La Liga, Bundesliga, Serie A, Scottish Premiership, Liga NOS, Eredivisie, Russian Premier League, English Championship, Belgian First Division and the Austrian Bundesliga. This, evidently, severely limited the pool of available and scouted players.

  2. The data was only for the 2019/20 season. Consistency is one of the most important things in football, and being able to perform season after season is crucial. Choosing only one season’s worth of data does not factor that in. Besides that, players who suffered major injuries in 2019/20 or had an uncharacteristically poor season would not show up well, whilst players who had their first breakout season or their first genuinely great season would score highly. In the future, we believe it is important to have 2-3 years of data to better judge a player’s quality.

  3. Minutes filter. As mentioned above, only players with 1000+ minutes according to Wyscout were included. This meant that a lot of good and, at times, suitable players would not have been considered.

  4. Positions. Football is much more dynamic than it used to be, with players now being defined by their roles rather than positions. Filtering and bucketing players by positions no doubt created certain biases when evaluating players to solve team issues. Players are capable of playing in multiple positions and even permanently changing their positions (see Alphonso Davies at Bayern Munich) if their skillset allows them to do so, so only considering players with similar positions to the incumbent could have meant that suitable players went unnoticed.

  5. Leveling the playing field. Team styles and performances no doubt affect a player’s performance in certain metrics. We attempted to level that by possession adjusting all metrics where possible. Defensive metrics were possession adjusted using Wyscout’s possession adjustment factor, while attacking metrics were adjusted on a per-pass basis. This meant that only time on the ball was accounted for, which gives all players a somewhat more equal fighting chance. Despite this, adjusting metrics does have a downside. In certain cases, it benefits players with fewer touches and skews their figures upwards significantly. Not adjusting them could do the same for players with more touches. There is no perfect way of doing this, but we decided to go with adjusted metrics as we felt it gave players from relatively smaller teams more of a fighting chance. Team style, though, was more difficult and was not adjusted for, severely limiting our ability to pick the optimal player.

  6. Although we looked at and factored in the similarities between teams and leagues, it must be noted that the similarities should also extend to roles. For example, a striker moving from the Bundesliga to the Premier League might have an easier adaptation than a defender. This was not factored in; a more general approach was taken.

  7. Team and league similarity were used as restrictive tools, alongside price and performance score. Perhaps league and team similarity should’ve been used at a later stage when evaluating players rather than as a restrictive tool.

  8. A weighted average score was calculated for team similarity. For example, for a central midfielder, both phases would be weighted equally. However, for a striker that weight shifted to 80-20 in favor of attacking style. The weights chosen were arbitrary and simple, and were mainly used to get a more relevant style similarity rating.

  9. Metrics chosen for each position to calculate performance scores and Euclidean distances were subjective. In all honesty, most of the available advanced data metrics were included, however some were excluded if deemed irrelevant to that position group (e.g. shots blocked for strikers) based on our understanding of the metric and the requirements of that role.

  10. As mentioned previously, the metrics chosen were slightly subjective, which would no doubt limit the credibility of both the similarity score and the performance score. After conducting several tests and taking random samples, we do believe that the results on the whole were quite realistic and useful. Perhaps a more scientific approach, factoring in a team’s style and needs to come up with the relevant metrics, would be more useful in the future.

  11. Transfer prices. Use of historical data post-2010 reduced the sample size significantly and affected the reliability of the results. Despite this, price was mainly used to allow us to recreate a more realistic scenario that would factor in budgetary constraints, so the results did not matter much to the reliability of the method.
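The weighted team-similarity score described in limitation 8 can be sketched as follows. Only the two weightings stated in the text are included (equal weights for a central midfielder, 80-20 in favor of attacking style for a striker); the function name and the example scores are hypothetical.

```python
# Share of the blended score given to attacking-style similarity.
# Only the two weightings stated in the text are listed here.
ATTACKING_WEIGHT = {
    "Central Midfielder": 0.5,
    "Striker": 0.8,
}


def team_style_similarity(attacking, defensive, position):
    """Blend attacking and defensive team similarity for a position group."""
    weight = ATTACKING_WEIGHT[position]
    return weight * attacking + (1.0 - weight) * defensive


# A hypothetical selling club that attacks like the subject team (90)
# but defends quite differently (60).
print(team_style_similarity(90.0, 60.0, "Central Midfielder"))
print(round(team_style_similarity(90.0, 60.0, "Striker"), 1))
```

With the same two inputs, the striker weighting rates this club as a closer stylistic match than the midfielder weighting does, because the defensive mismatch counts for less.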
