Comparing machine learning and conventional statistical approaches for injury prediction in young professional soccer players
Abstract
with preventive measures. Consequently, injury prediction and prevention
are also increasingly addressed from a statistical perspective. In a pilot
study, several machine learning algorithms and conventional statistical approaches
were compared regarding their potential to predict time-loss
non-contact lower-body injuries in professional youth soccer players, using
data from a prospective cohort study with 56 players, of whom 22 were injured.
The covariates considered here include basic soccer-related as well
as neuromuscular and biomechanical features derived from physical testing.
Lasso regularized logistic regression, naive Bayes, linear discriminant analysis,
k-nearest neighbors, classification trees, random forests, XGBoost, and
support vector machines are considered for binary classification and prediction
of an injury occurrence. The prediction results from a cross-validated
procedure are compared regarding multiple quality measures. Post-Lasso
logistic regression with a reduced penalty gives the best results, with an accuracy
of 0.625, a predictive likelihood of 0.593, and a Brier score of 0.228.
The respective sensitivity and specificity are 0.773 and 0.529, with an AUC of
0.672. Three features were identified as particularly relevant: the
concentric extensor peak torque of the knee, the transversal plane moment
of the hip in a single-leg drop landing task, and the sway in postural control
under static conditions. Moreover, an XGBoost model that primarily uses the
first two of these covariates slightly outperforms the Lasso model in
terms of accuracy (0.661), but the Lasso dominates it on the other
performance measures.
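The reported classification rates can be related back to the cohort size with a short calculation. The following is a minimal sketch, not the authors' code: it assumes that sensitivity, specificity, and accuracy each refer to a single 2x2 confusion matrix over all 56 players, and it infers whole-number counts from the rounded rates. The helper `implied_count` and all variable names are hypothetical.

```python
# Assumption: the reported rates describe one confusion matrix over the full
# cohort (56 players, 22 injured). The counts below are inferred from the
# rounded rates in the abstract; they are not figures reported by the study.
n_players, n_injured = 56, 22
n_uninjured = n_players - n_injured  # 34 players without an injury


def implied_count(rate, total):
    """Nearest whole number of players consistent with a reported rate."""
    return round(rate * total)


# Post-Lasso logistic regression: sensitivity 0.773, specificity 0.529
tp = implied_count(0.773, n_injured)    # injured players classified as injured
tn = implied_count(0.529, n_uninjured)  # uninjured players classified as uninjured
fn = n_injured - tp                     # injured players missed
fp = n_uninjured - tn                   # uninjured players flagged

sensitivity = tp / n_injured
specificity = tn / n_uninjured
accuracy = (tp + tn) / n_players

# XGBoost: accuracy 0.661, expressed as a number of correct classifications
xgb_correct = implied_count(0.661, n_players)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"accuracy={accuracy:.3f}")
print(f"XGBoost correct classifications: {xgb_correct} of {n_players}")
```

Under these assumptions, the inferred counts reproduce the reported accuracy of 0.625, and the accuracy gap between the two models corresponds to only a few players. This illustrates how coarse accuracy is at this sample size.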