Abstract
Objectives Occupational noise-induced hearing loss (ONIHL) represents a prevalent occupational health condition, traditionally necessitating multiple pure-tone audiometry assessments. We have developed and validated a machine learning model leveraging routine haematological and biochemical parameters, thereby offering novel insights into the risk prediction of ONIHL.
Design, setting and participants This study analysed data from 3297 noise-exposed workers in Shenzhen, including 160 ONIHL cases, with the data set divided into D1 (2868 samples, 107 ONIHL cases) and D2 (429 samples, 53 ONIHL cases). The inclusion criteria were formulated based on the GBZ49-2014 Diagnosis of Occupational Noise-Induced Hearing Loss. Model training was performed using D1, and model validation was conducted using D2. Routine blood and biochemical indicators were extracted from the case data, and a range of machine learning algorithms including extreme gradient boosting (XGBoost) were employed to construct predictive models. The model underwent refinement to identify the most representative variables, and decision curve analysis was conducted to evaluate the net benefit of the model across various threshold levels.
Primary outcome measures Model creation data set and validation data sets: ONIHL.
Results The prediction model, developed using XGBoost, demonstrated exceptional performance, achieving an area under the receiver operating characteristic curve (AUC) of 0.942, a sensitivity of 0.875 and a specificity of 0.936 on the validation data set. On the test data set, the model achieved an AUC of 0.900. After implementing feature selection, the model was refined to include only 16 features, while maintaining strong performance on a newly acquired independent data set, with an AUC of 0.872, a balanced accuracy of 0.798, a sensitivity of 0.755 and a specificity of 0.840. The analysis of feature importance revealed that serum albumin (ALB), platelet distribution width (PDW), coefficient of variation in red cell distribution width (RDW-CV), serum creatinine (Scr) and lymphocyte percentage (LYMPHP) are critical factors for risk stratification in patients with ONIHL.
Conclusion The analysis of feature importance identified ALB, PDW, RDW-CV, Scr and LYMPHP as pivotal factors for risk stratification in patients with ONIHL. The machine learning model, using XGBoost, effectively distinguishes patients with ONIHL among individuals exposed to noise, thereby facilitating early diagnosis and intervention.
- Machine Learning
- Audiology
- Blood bank & transfusion medicine
- Risk Factors
Data availability statement
Data are available upon reasonable request. Original data collected within this study are not publicly available as they might contain sensitive information. De-identified data can be shared based on a reasonable request by sending an email to szpcr@126.com.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
STRENGTHS AND LIMITATIONS OF THIS STUDY
The model predicts occupational noise-induced hearing loss (ONIHL) using routine blood and biochemical indicators, eliminating the need for audiometric tests or direct noise exposure data.
It simplifies the diagnostic process, reducing time, costs and manpower requirements.
It provides an accessible and efficient alternative for early screening and prevention of ONIHL.
The study is limited to the Shenzhen population, and the model’s generalisability to other groups and settings remains uncertain.
The positive-to-negative sample ratio exceeds 1:20, mirroring real-world conditions but limiting predictive accuracy; future integration of additional biomarkers, such as DNA methylation, may improve performance.
Introduction
Occupational noise-induced hearing loss (ONIHL) is characterised as a progressive sensorineural hearing impairment predominantly attributed to damage of the hair cells within the inner ear, consequent to prolonged exposure to high-intensity noise environments.1 As reported by WHO, approximately 10% of the global workforce is impacted by elevated noise levels, with occupational noise exposure accounting for 7–21% of hearing loss among workers.2 A national occupational research agenda reports that ONIHL has the highest prevalence of occupational diseases in the USA.3 About 22 million US workers are currently exposed to hazardous occupational noise.4 This incidence is notably higher in developing countries.5 As the largest developing nation, China has witnessed an increasing trend in the incidence of ONIHL in recent years. The prevalence of ONIHL has been reported to be over 20% among noise-exposed workers in China.6 Such hearing loss can result in communication challenges, social isolation, loneliness and depression, thereby adversely impacting patients' quality of life and leading to indirect economic losses for society.7 However, despite being a major global public health issue, early screening methods for ONIHL remain limited.
Currently, pure-tone audiometry (PTA) is regarded as the gold standard for diagnosing ONIHL.8 However, its reliance on costly audiological equipment and the necessity for highly trained professionals restrict its practicality for large-scale ONIHL screening among noise-exposed occupational groups.9 Additionally, PTA relies on subjective auditory feedback and may be influenced by individual auditory adaptation. Consequently, there is a pressing need to develop a practical and user-friendly screening tool specifically designed for patients with ONIHL to prevent the advancement to clinically significant ONIHL. Numerous instances of ONIHL are characterised by an initial deterioration in high-frequency hearing, which gradually progresses to impairments in low-frequency or speech frequency hearing.10 The early identification of individuals at high risk is essential for effective prevention and intervention strategies.
Consequently, the development of predictive models to screen high-risk populations for further evaluation represents a viable alternative approach. The growing volume of data has facilitated the application of machine learning (ML) techniques in the context of ONIHL. At present, a variety of methodologies employing either traditional statistical analysis or ML techniques are used to predict the risk of ONIHL. These methodologies frequently necessitate substantial human resources and present challenges in manual definition.11 The integration of ML within the field of audiology has demonstrated potential, particularly in its capacity to effectively analyse non-linear relationships within data, such as forecasting hearing thresholds for individuals exposed to specific risk factors.12 Abdollahi et al13 constructed eight ML models to forecast sensorineural hearing loss following radiotherapy and chemotherapy, with five of these models demonstrating accuracy and precision exceeding 70%. Comparable levels of accuracy have been reported in other investigations employing ML models to predict sudden sensorineural hearing loss (SSNHL) and ototoxic hearing loss.14 15 Additionally, various studies have documented accuracy rates between 0.64 and 0.99 when using diverse ML algorithms and input parameters to predict ONIHL risk factors.16–20 Among the diverse array of ML techniques, support vector machines (SVM) models, random forest (RF) models and extreme gradient boosting (XGBoost) models have demonstrated superior performance in classification tasks.10 Although these studies demonstrate that ML can effectively predict various types of hearing loss, most existing models primarily rely on audiometric data rather than non-invasive biomarkers.
Established risk factors for ONIHL encompass age, medical history (including conditions such as hypertension and diabetes), history of noise exposure, tinnitus and behavioural factors such as smoking and physical activity.21–24 Furthermore, several biomarkers associated with inflammation, including elevated levels of white blood cells (WBCs), neutrophils (NE), monocytes (MO) and lymphocytes (LY), alongside metabolic parameters such as low-density lipoprotein (LDL) and high-density lipoprotein (HDL), are recognised as risk indicators for hearing loss.25 The chronic alterations in the inflammatory state that occur with ageing, a phenomenon known as inflammaging, may contribute to or expedite long-term auditory system damage.26 Red cell distribution width (RDW), a parameter traditionally used for the classification of anaemia, has recently been identified as being associated with inflammation and microcirculatory disorders.27 HDL and LDL have been reported to influence blood supply, thereby potentially affecting SSNHL.25 While numerous studies have explored the relationship between hearing loss and various blood inflammatory and metabolic parameters, there is a paucity of research employing these parameters to predict ONIHL.
It is noteworthy that individuals exposed to occupational noise are subject to annual medical evaluations, which routinely include blood tests comprising both standard and biochemical analyses.15 Physicians often extract limited information from these routine blood test results. In light of this, our study seeks to comprehensively leverage routine haematological and biochemical indicators, in conjunction with ML methodologies, to construct a risk prediction model for ONIHL. The objective is to facilitate early detection and intervention for ONIHL using data from standard medical examinations.
Methods
Data collection and processing
The medical examination data were obtained from the Shenzhen Prevention and Treatment Center for Occupational Diseases from January 2023 to July 2024. The data were divided into two parts in chronological order: D1 and D2. The first step involved data cleaning, removing samples with erroneous or abnormal values. The inclusion criteria were formulated based on the GBZ49-2014 Diagnosis of Occupational Noise-Induced Hearing Loss: (1) noise exposure duration ≥3 years and (2) bilateral high-frequency (3000 Hz, 4000 Hz, 6000 Hz) average hearing threshold ≥40 dB. Exclusion criteria included pseudohypacusis, exaggerated hearing impairment, drug-induced hearing loss, traumatic hearing loss, infectious hearing loss, hereditary hearing loss, Ménière’s disease, sudden deafness, acoustic neuroma and auditory neuropathy. We divided the samples into two groups: the ONIHL group and the noise-exposed normal hearing group. After preprocessing, a total of 3297 samples were retained, with D1 and D2 consisting of 2868 and 429 samples, respectively. Of these, 107 samples in D1 and 53 samples in D2 were cases of noise-induced hearing loss, representing the positive samples. We then applied random sampling to split D1 into a training set and a validation set at a 7:3 ratio. D2 was used as an independent test set.
All data sets included the following variables: sex, age, total protein (TP), albumin (ALB), glucose (GLU), cholesterol (CHO), triglycerides (TG), HDL, LDL, total bilirubin (TBIL), direct bilirubin (DBIL), indirect bilirubin (IBIL), alanine aminotransferase (ALT), aspartate aminotransferase (AST), blood urea nitrogen (BUN), serum creatinine (Scr), uric acid (UA), globulin (GLB), haemoglobin (Hb), red blood cell count (RBC), haematocrit (HCT), mean corpuscular volume (MCV), mean corpuscular haemoglobin (MCH), mean corpuscular haemoglobin concentration (MCHC), WBC, eosinophil count (EOC), basophil count (BAC), lymphocyte count (LYMPHC), monocyte count (MOC), platelet count (PLT), neutrophil count (GRANC), eosinophil percentage (EOP), basophil percentage (BAP), coefficient of variation in red cell distribution width (RDW-CV), mean platelet volume (MPV), platelet distribution width (PDW), plateletcrit (PCT), neutrophil percentage (GRANP), lymphocyte percentage (LYMPHP), monocyte percentage (MOP), SD in red cell distribution width (RDW-SD), platelet-to-HDL ratio (PLT/HDL), glucose-to-HDL ratio (GLU/HDL), platelet-to-lymphocyte ratio (PLT/LYMPHC), albumin-to-globulin ratio (A/G), neutrophil-to-lymphocyte ratio (S/L), triglyceride-glucose index (TyG) and estimated glomerular filtration rate (eGFR) (the calculation formulas for TyG and eGFR are detailed in online supplemental additional 1).
In light of the pronounced class imbalance present across all data sets, we employed oversampling of the positive instances within the training set using the ‘ovun.sample()’ function from the ROSE package. This function randomly replicates samples from the minority class, thereby equalising the number of positive and negative samples in the training set and achieving a balanced class distribution.28 This approach effectively increases the sample size of the minority class, mitigating the effects of class imbalance during model training. All data sets underwent Z-score normalisation, using the mean and SD derived from the training set data.
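The two preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code and not a port of `ovun.sample()`: the function names `oversample_minority`, `fit_zscore` and `apply_zscore` are hypothetical, the sketch balances the classes exactly, and it assumes the normalisation statistics are taken from the training data and then reused unchanged for the validation and test sets.

```python
import numpy as np

def oversample_minority(X, y, seed=123):
    """Randomly replicate minority-class rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    # Draw the missing minority rows with replacement (pure replication, no synthesis).
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    rng.shuffle(keep)
    return X[keep], y[keep]

def fit_zscore(X_train):
    """Column-wise mean and sample SD, estimated on the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # guard against constant columns
    return mean, sd

def apply_zscore(X, mean, sd):
    """Apply training-set statistics to any data set (training, validation or test)."""
    return (np.asarray(X, dtype=float) - mean) / sd
```

Only the training set is oversampled in this sketch; the validation and test sets keep their original class ratio so that reported metrics reflect the real-world imbalance.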
Framework
Employing occupational health examination data, we introduce an integrated framework for the identification of patients with noise-induced hearing loss, as illustrated in figure 1. Initially, we preprocessed two data sets, designated as D1 and D2. Data set D1 was partitioned into training and validation subsets in a 7:3 ratio, while data set D2 served as an independent test set for the evaluation of the final model. Due to the class imbalance present in the data set, we employed an oversampling technique on the training set. Subsequently, we used a comprehensive array of ML algorithms, including XGBoost, logistic regression (LR), RF, SVM and k-nearest neighbour (KNN), to construct predictive models. We then applied feature selection methods to the best-performing predictor among the five to enhance the tool’s feasibility. The performance of the refined model was evaluated using an independent test set. We conducted a feature importance analysis to identify variables correlated with the incidence of noise-induced hearing loss. Additionally, we optimised the model to select the most representative variables and employed decision curve analysis (DCA) to evaluate the net benefit of the model across various threshold levels.
A combined framework for identifying patients with occupational noise-induced hearing loss. AUC, area under the receiver operating characteristic curve; KNN, k-nearest neighbours; LR, logistic regression; mRMR, maximum relevance minimum redundancy; PCA, principal component analysis; PR-AUC, area under the precision–recall curve; RF, random forest; SVM, support vector machine; XGBoost, extreme gradient boosting.
Model construction
In order to construct predictive models, we employed five ML algorithms: LR, RF, SVM, KNN and XGBoost. LR is a generalised linear model that uses the sigmoid function to convert a linear combination of the predictors into class probabilities.29 RF comprises an ensemble of independently trained decision trees, with the ultimate prediction being derived through a voting mechanism among these trees, thereby mitigating the risk of overfitting.30 The SVM algorithm classifies samples by identifying an optimal hyperplane within the feature space, and it is capable of managing nonlinearly separable data.31 The KNN algorithm, an instance-based learning method, classifies each sample according to the classes of its k nearest neighbours, making it particularly suitable for small data sets and straightforward to implement.32 XGBoost is an ensemble method based on decision trees that enhances model performance through a gradient boosting framework. It constructs decision trees in an iterative manner to minimise model error, demonstrating particular efficacy in handling large-scale, high-dimensional data sets due to its robust generalisation capabilities and computational efficiency.33 All models were developed in R (V.4.3.1) using a standardised fivefold cross-validation framework, with performance evaluated by the area under the receiver operating characteristic curve (AUC). Hyperparameter optimisation was performed via grid search to maximise validation AUC, supported by a heatmap illustrating key parameter interactions in XGBoost (online supplemental figure S1) and boxplots comparing cross-validation stability across models (online supplemental figure S2). For XGBoost, critical parameters included tree depth (max_depth), learning rate (eta) and subsampling ratios, optimised to max_depth=7, eta=0.1 and subsampling ratios of 0.6. RF, SVM and KNN employed targeted tuning strategies (such as feature subset selection, regularisation balancing and dynamic neighbour selection), while LR used L2 regularisation.
To ensure reproducibility, data splitting and randomisation were controlled by a global seed (set.seed(123)), with parallel processing (four threads) accelerating computations. We chose the model with the best performance on the validation set for further optimisation.
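As a rough illustration of seeded fivefold cross-validation combined with a grid search, the following Python sketch shows the control flow only. It is an assumption-laden outline rather than the study's R implementation: `stratified_folds` and `grid_search` are hypothetical helpers, `cv_score` stands for any user-supplied function that trains a model with the given parameters and returns its mean cross-validated AUC, and a NumPy seed of 123 does not reproduce the random stream of R's `set.seed(123)`.

```python
import itertools
import numpy as np

def stratified_folds(y, k=5, seed=123):
    """Assign each sample a fold id in 0..k-1 with a similar class ratio per fold."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        # Shuffle the samples of this class, then deal them out to folds in turn.
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold[idx] = np.arange(len(idx)) % k
    return fold

def grid_search(param_grid, cv_score):
    """Return the parameter combination with the highest score from cv_score(params)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)  # hypothetical callable: mean validation AUC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A grid such as `{"max_depth": [3, 5, 7], "eta": [0.01, 0.1, 0.3], "subsample": [0.6, 0.8, 1.0]}` would include the reported optimum, but the candidate values here are illustrative; the grid actually searched is not given in the text.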
Model evaluation
To account for the class imbalance in the validation and test sets, we assessed model performance using the following metrics: sensitivity, specificity, balanced accuracy, AUC, area under the precision–recall curve (PR-AUC), F1 score and precision. The confusion-matrix metrics are defined as follows:
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Balanced accuracy = (Sensitivity+Specificity)/2
Precision = TP/(TP+FP)
F1 score = 2×Precision×Sensitivity/(Precision+Sensitivity)
In these definitions, TP denotes true positives (not total protein), that is, the number of ONIHL cases correctly predicted as having ONIHL. FP, that is, false positive, denotes the number of normal subjects incorrectly predicted as having ONIHL. TN, that is, true negative, indicates the number of healthy subjects correctly classified as normal. FN, that is, false negative, refers to the number of cases with ONIHL incorrectly classified as normal. All the above metrics range from 0 to 1.
AUC and PR-AUC values for all models were calculated using the 'pROC' package in R.
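For readers who want to check reported figures, the confusion-matrix metrics can be computed with a short Python sketch using their standard definitions. The function name `classification_metrics` is hypothetical, and the example counts below are illustrative values chosen to be consistent with the D2 class sizes (53 cases, 376 controls); the actual confusion matrix is not given in the text.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; each lies between 0 and 1."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on ONIHL cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on normal-hearing workers
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "precision": precision,
        "f1": f1,
    }

# Illustrative counts only (assumed, not taken from the study's tables).
example = classification_metrics(tp=40, fp=60, tn=316, fn=13)
```

Balanced accuracy and PR-AUC are more informative than raw accuracy here because a classifier that labels every worker as normal would still be more than 85% accurate on a data set with so few positive cases.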
Feature selection and feature importance analysis
Despite the relatively high performance of the prediction model using 48 features, there remains the possibility of redundant information or noise features that could adversely affect the decision-making process. To enhance the effective utilisation of features and streamline the model, we employed a combination of manual curation, principal component analysis (PCA) and maximum relevance minimum redundancy (mRMR) methods to extract essential features for the final model.34 In the manual curation process, we initially identified features that exhibited significant differences between positive and negative samples. To improve the stability of the predictive model, we eliminated features that contributed to significant collinearity.35 As a result, 16 features were retained. To ensure consistency, the size of the feature subset was also fixed at 16 during the application of PCA and mRMR analysis. PCA selected principal components based on cumulative explained variance, retaining those accounting for up to 80% of the variance to balance dimensionality reduction and information preservation. Meanwhile, mRMR leveraged mutual information to maximise feature relevance while minimising redundancy, ensuring an optimal feature subset. Furthermore, feature selection was conducted on the training set to mitigate the risk of overfitting. The analysis of feature importance facilitates the interpretation of the predictive model and aids in identifying the features most closely associated with ONIHL. In this context, feature importance was assessed using XGBoost’s weight coefficients, with Gain (an internal statistical metric of XGBoost) highlighting features that maximally improve model performance while minimising redundancy.
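The mRMR idea can be sketched in Python as a greedy search that, at each step, adds the feature with the largest mutual information with the outcome minus its mean mutual information with the features already chosen. This is a simplified illustration under stated assumptions, not the implementation used in the study: it uses the difference form of the criterion, estimates mutual information after quantile binning into five bins, and the names `discretise`, `mutual_information` and `mrmr_select` are hypothetical.

```python
import numpy as np

def discretise(x, bins=5):
    """Code a continuous vector into quantile bins (integer codes starting at 0)."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.searchsorted(edges, x)

def mutual_information(a, b):
    """Mutual information (in nats) between two non-negative integer-coded vectors."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)  # contingency table of co-occurrences
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))

def mrmr_select(X, y, k, bins=5):
    """Greedy mRMR (difference criterion): returns the indices of k selected columns."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    n_features = X.shape[1]
    Xd = np.column_stack([discretise(X[:, j], bins) for j in range(n_features)])
    relevance = np.array([mutual_information(Xd[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < min(k, n_features):
        best_j, best_score = -1, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_information(Xd[:, j], Xd[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

Calling `mrmr_select(X_train, y_train, k=16)` on training data only would mirror the subset size and the train-only selection described above; a duplicated or highly collinear column is penalised by its redundancy term and tends to be picked late or not at all.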
Patient and public involvement
Patients and the public were not involved in the design, conduct, reporting, or dissemination plans of this research.
Results
We initially gathered occupational health examination data from the Shenzhen Occupational Disease Prevention and Control Institute for the period spanning 2023–2024, with subgroup D1 comprising 2868 noise-exposed workers. Of these, 107 participants were diagnosed with ONIHL. Table 1 provides a detailed description of the characteristics of both noise-exposed individuals and patients with ONIHL. The five most prominent features exhibiting significant differences between the ONIHL and non-ONIHL samples include ALB, TP, Age, RDW-CV and PDW (online supplemental figure S3).
Statistical characteristics of noise-exposed hearing normal individuals and patients with ONIHL
Performance comparison of the five machine learning methods
Table 2 and online supplemental figure S4A, B show that the XGBoost algorithm has the highest AUC (0.942) and PR-AUC (0.791), as well as high recall (0.875) and balanced accuracy (0.905). Its F1 score is second only to that of RF, so its overall performance is excellent. The RF algorithm also performs well, with an AUC of 0.921, a PR-AUC of 0.690 and a high balanced accuracy (0.872). To maximise the identification of patients with ONIHL, the XGBoost algorithm was ultimately selected to further build the prediction model.
Performance comparison of the five machine learning methods
The results of the fivefold cross-validation on the training set show an AUC of 0.999, sensitivity of 0.995 and balanced accuracy of 0.998. Additionally, the XGBoost model demonstrates reliable performance on the test set (AUC=0.900, PR-AUC=0.648), as shown in figure 2A,B.
Performance of the prediction model on the validation set of data set D1 and the test set of data set D2. (A) receiver operating characteristic (ROC) curves; (B) precision–recall curves. AUC, area under the ROC curve; PR-AUC, area under the precision–recall curve.
Feature selection for the final model
Several pairs of features were observed to have high correlations, such as MCH and RDW-CV, and GRANP and LYMPHP, which may introduce redundant information and affect the model’s decision-making and stability (see online supplemental figure S5 for the related heat maps). Therefore, we used manual curation, PCA and mRMR methods to identify the optimal features. As a result, 16 features from each of manual curation, PCA and mRMR were used to reconstruct the XGBoost model (online supplemental table S1). The PCA and mRMR feature selection methods identified seven shared features, with three of the top five selected features overlapping: ALB, RDW-CV and Scr. The models constructed using the 16 features selected by mRMR and by PCA showed a slight improvement on the validation set compared with the model built with all 48 features. Specifically, the PCA model achieved an AUC of 0.957 and a PR-AUC of 0.741, while the mRMR model had an AUC of 0.957 and a PR-AUC of 0.720. In contrast, the model based on manual feature selection exhibited a decline in performance, with an AUC of 0.919 and a PR-AUC of 0.540 (figure 3A,B). Similarly, the models constructed using mRMR and PCA demonstrated improvements in sensitivity and balanced accuracy on the validation set, with maximum increases of 29.2% and 3.7%, respectively, under the condition that the threshold of the full features model was set to 0.1. In comparison, the model based on manual curation showed suboptimal performance across all evaluation metrics (figure 3C). We further evaluated the models using an independent test set (D2). In this test set, the manual curation model achieved an AUC of 0.830, the PCA model had an AUC of 0.837 and the mRMR model outperformed both with an AUC of 0.872 (figure 3D). The PR-AUC values were 0.524, 0.540 and 0.594 for the manual curation, PCA and mRMR models, respectively, with mRMR again demonstrating the best performance (figure 3E).
Regarding sensitivity, specificity and balanced accuracy, the mRMR model exhibited the highest performance in the test set evaluation (figure 3F). Notably, the lowest sensitivity observed for the mRMR model on D2 was 75.5%, while all specificity scores remained above 78.0%. Overall, the model demonstrated strong performance on the independent test set, indicating that the selected core features are sufficient for detecting noise-induced hearing loss among noise-exposed workers.
Feature selection for the final model using principal component analysis (PCA), manual curation and maximum relevance minimum redundancy (mRMR). (A) Receiver operating characteristic (ROC) curves for models constructed with PCA-selected, manual curation-selected and mRMR-selected features on the validation set of data set D1. (B) Precision–recall curves of the above models. (C) Comparison of sensitivity, specificity and balanced accuracy on the validation set of data set between the model constructed before and after feature selection. (D) ROC curve of models using selected features on the test set of data set D2. (E) Precision–recall curve of the above models. (F) Comparison of other metrics on the independent test set D2. AUC, area under the ROC curve; PR-AUC, area under the precision–recall curve.
Feature importance ranking
To investigate which features contribute the most to the risk of ONIHL, we first used mRMR to select 16 important features and then built an XGBoost model based on these features. Subsequently, we ranked these features according to their weights in the XGBoost model, as shown in figure 4. The feature importance of the predictors based on PCA and manual curation is shown in online supplemental figures S6 and S7. The results indicated that the top five features, in order of importance, were ALB, PDW, RDW-CV, Scr and LYMPHP. Further comparisons between the ONIHL and normal samples revealed significant differences in ALB, TP, Age, RDW-CV and PDW. These findings are largely consistent with the top-ranked features in the XGBoost model, with ALB, PDW and RDW-CV appearing in both lists, and they indicate an association between the top-ranked indicators (ALB, PDW, RDW-CV, Scr and LYMPHP) and ONIHL.
Feature importance ranking for the model built using features selected by maximum relevance minimum redundancy.
Decision curve analysis
DCA (figure 5) revealed distinct net benefit patterns across threshold probabilities for the three models. The mRMR model demonstrated superior performance, achieving the highest net benefit over a broad threshold range (0.0–0.6), with particularly pronounced advantages at lower thresholds (0.0–0.2). In contrast, the PCA model exhibited competitive efficacy within the moderate threshold interval (0.2–0.4). Notably, both the manual curation model and the ‘All/None’ strategy underperformed: manual curation yielded consistently lower net benefits across all thresholds, and ‘All/None’ resulted in negative net benefits at thresholds below 0.3, indicating clinical impracticality. These findings support a threshold-adaptive selection strategy: prioritising mRMR for thresholds ≤0.4 (maximising robustness) and PCA for thresholds of 0.2–0.4 (balancing accuracy and efficiency). This approach optimises clinical utility by aligning model strengths with context-specific decision risks.
Decision curve analysis curves for models built using three different feature selection methods. mRMR, maximum relevance minimum redundancy; PCA, principal component analysis.
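The net benefit plotted in a decision curve is conventionally computed as TP/n − FP/n × pt/(1 − pt), where pt is the threshold probability and n is the number of individuals. A minimal Python sketch of that standard formula follows; the function names are hypothetical, and the sketch does not reproduce the R tooling used to draw figure 5.

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging everyone whose predicted risk is >= threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    flagged = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: refer every worker for further assessment."""
    prevalence = np.mean(np.asarray(y_true) == 1)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

def decision_curve(y_true, y_prob, thresholds):
    """Net benefit of the model and of the treat-all strategy at each threshold."""
    model = [net_benefit(y_true, y_prob, t) for t in thresholds]
    treat_all = [net_benefit_treat_all(y_true, t) for t in thresholds]
    return np.array(model), np.array(treat_all)
```

Under this definition the treat-none strategy has a net benefit of zero at every threshold, so a model is only useful at thresholds where its curve lies above both zero and the treat-all curve.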
Discussion
ONIHL represents a significant global public health concern.2 36 Despite its complexity, ONIHL is a preventable condition.37 38 The Occupational Safety and Health Administration requires the implementation of hearing conservation programmes for workers exposed to noise levels of 85 decibels or higher, with the objective of safeguarding auditory health in noisy occupational environments.39 Consequently, the development of a risk screening tool for ONIHL is crucial as a primary strategy for screening and prevention among workers exposed to occupational noise. In this study, we employed five ML algorithms using haematological test results to construct an ONIHL risk screening model. The models demonstrated AUC values exceeding 0.85, with accuracy and sensitivity surpassing 0.75 in both validation and independent test data sets. These results suggest that ML models are capable of accurately identifying patients with ONIHL within the population of noise-exposed workers.
In an evaluation of model performance on the validation set, the XGBoost model exhibited superior efficacy compared with all other algorithms assessed, achieving an AUC of 0.942 and a PR-AUC of 0.791. The specificity and balanced accuracy metrics for the XGBoost model all exceeded 0.8 on the validation set. Furthermore, the XGBoost model maintained consistent performance on the test set, with an AUC of 0.900 and a PR-AUC of 0.648. XGBoost is recognised as an ML technique that efficiently and flexibly manages missing data and integrates weak predictive models into a robust predictive framework.40 As an open-source package, XGBoost has gained significant recognition in various ML and data mining competitions. For example, in 2015, 17 out of the 29 winning solutions featured on Kaggle’s blog used XGBoost, and all of the top 10 winning teams in the 2015 KDD Cup also incorporated XGBoost into their solutions.41 In neurology, XGBoost achieved AUC values of 0.950 (mortality) and 0.958 (functional outcomes) in patients with aneurysmal subarachnoid haemorrhage, outperforming LR.42 XGBoost has been applied to predict 5-year survival in elderly patients with intrahepatic cholangiocarcinoma (AUC=0.713, SEER database) and type 2 diabetes risk (accuracy=89.09%, AUC=0.9182 in Beijing residents).43 44 These advancements highlight XGBoost’s utility in high-dimensional clinical data sets with interpretable feature insights. Furthermore, our findings indicate that the predictive efficacy of the XGBoost model surpasses that of LR, RF, SVM and KNN. This aligns with previous research demonstrating that traditional LR frequently exhibits comparatively lower AUC values in ROC curve analyses, alongside higher prediction errors and inferior performance relative to more contemporary methodologies.45 46
Screening for ONIHL is important, and various methods have been explored for this purpose. Otoacoustic emissions (OAE) testing, particularly distortion product OAE, is a sensitive tool for detecting early cochlear damage before significant hearing threshold shifts appear in PTA.47 It can identify subtle outer hair cell dysfunction in noise-exposed individuals, making it valuable for early intervention and monitoring.48 Auditory brainstem response (ABR) testing, another physiological method, assesses neural integrity and can detect hidden hearing loss even when audiometric thresholds remain normal.49 Despite their advantages, the large-scale application of OAE and ABR in occupational screening is limited by high costs, equipment availability and the need for trained operators. In contrast, our model predicts ONIHL risk solely from routine blood and biochemical indicators, eliminating the need for specialised audiometric assessments or noise exposure data. By analysing markers linked to inflammation, oxidative stress and immune response, it provides a cost-effective, scalable alternative for early screening. Integrating this approach with existing methods like OAE or ABR could further enhance ONIHL risk assessment, enabling earlier interventions before irreversible damage occurs.
Consequently, early identification of the risk factors highlighted by our model, and early intervention on them, could have substantial implications for preventing ONIHL among noise-exposed workers. The risk factors contributing to ONIHL are varied. We developed an ML-based risk assessment model for ONIHL using clinical data and routine physical examination indicators. This approach contrasts with most existing methods for predicting ONIHL risk, which depend predominantly on variables such as age, sex, medical history (including conditions like hypertension and diabetes), history of noise exposure and behavioural factors such as smoking and physical activity.16 50 51 For instance, prior research has developed risk models for noise-exposed workers with favourable predictive performance, but these models primarily incorporate risk factors such as industry type, duration of noise exposure and median peak intensity rather than the physical examination indicators used in our study.20 Wang et al10 formulated an ML-based risk assessment model for high-frequency hearing loss using routine physical examination data, attaining an AUC of 0.868. That model, however, was designed principally for community residents, a population that differs from the noise-exposed workers in our study, and it incorporated risk factors including 13 blood test indicators, demographic characteristics, disease-related features, behavioural factors, environmental exposure and auditory cognitive factors. Our model offers a more comprehensive approach than previous research by integrating a wide range of biochemical and routine blood indicators to assess ONIHL risk from multiple dimensions. Unlike models that rely on hearing assessments and direct noise exposure measurements, it focuses on routine blood and biochemical indicators, reducing the need for specialised equipment and resources. This makes it an efficient, cost-effective option for early detection and prevention of ONIHL, offering personalised risk assessments without reliance on extensive testing.
Routine blood tests at occupational disease prevention clinics are typically conducted annually. Applying the indicators from these tests can enhance early screening and provide warnings for prevalent occupational diseases. In our study, the developed model demonstrates the significance of haematological test data in screening for ONIHL. The model's variables include age, sex, inflammatory and immune markers (eg, WBC, LYMPHP, MOC, BAP, EOC and GRANC), as well as oxidative stress and metabolic markers (eg, ALB, Scr, RDW-CV and RDW-SD). Noise exposure influences haematological parameters through complex immunoinflammatory pathways, which may both reflect and exacerbate cochlear damage. Studies have shown that ONIHL is closely associated with systemic immune and inflammatory responses, with WBC serving as a key inflammatory marker linked to ONIHL. A study analysing health examination data from 3508 noise-exposed workers found that WBC levels were significantly higher in the NIHL group than in workers with normal hearing.52 This suggests that noise exposure may trigger chronic inflammatory responses in the body. At the cellular level, noise activates resident macrophages in the cochlea, triggering the release of pro-inflammatory cytokines such as interleukin-1β and tumour necrosis factor-α.53 54 This leads to increased permeability of the blood–labyrinth barrier, facilitating the infiltration of systemic immune cells (including MOC, GRANC and adaptive immune lymphocytes) into the inner ear.55 This immune influx amplifies local inflammation, creating a microenvironment that promotes sensory cell apoptosis and spiral ganglion degeneration. Notably, our study identified significant elevations in WBC, MOC and GRANC, aligning with these immunological responses.
Chronic noise stress further disrupts systemic immune homeostasis, as demonstrated in animal models.55 Prolonged exposure induces immunosuppressive changes, including a decrease in LYMPHP and a reduced CD4+/CD8+ T cell ratio, which may impair anti-inflammatory responses and regenerative capacity. This systemic immune imbalance is consistent with our findings of decreased LYMPHP in noise-exposed individuals. Additionally, our findings indicate that increased EOC levels may also serve as a risk factor for ONIHL. EOC may play a pathogenic role in noise-induced inner ear vasculitis, a process increasingly recognised as a critical mediator of sensorineural damage. Elevated EOC levels have been associated with SSNHL, suggesting their potential as prognostic indicators in inflammatory hearing disorders.56
Oxidative stress is a key mechanism underlying ONIHL.57 RDW (CV and SD) is a crucial marker of RBC oxidative damage, with increased RDW levels indicating decreased membrane stability.58 59 This instability shares a common pathological basis with noise-induced hair cell apoptosis. A positive correlation between RDW parameters (CV and SD) and the average hearing threshold further highlights the role of oxidative stress in ONIHL.60 These findings underscore the need to identify inflammatory conditions when screening workers at risk for chronic inflammation and ONIHL. ALB, a critical antioxidant protein, plays a protective role in maintaining blood–labyrinth barrier integrity.61 Low ALB levels may weaken antioxidant defences and increase susceptibility to hearing loss. Notably, ALB levels in patients with SSNHL were significantly lower than those in the control group (p<0.001).62 Furthermore, studies have shown an association between reduced eGFR and hearing loss, suggesting that impaired kidney function may contribute to cochlear microcirculatory dysfunction via inflammation-mediated mechanisms.63 Since Scr is a key component in calculating eGFR, our observed reduction in Scr could reflect a broader metabolic or physiological shift rather than direct renal impairment alone. PDW, an indicator of platelet activation, may also play a role in ONIHL by promoting microvascular inflammation. Research has demonstrated a significant association between PDW and the severity of SSNHL.64 This is consistent with our study, in which PDW was also associated with ONIHL. Our results further support the notion that vascular and inflammatory responses play a crucial role in noise-induced cochlear damage.
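Because the interpretation of Scr above rests on its role in calculating eGFR, a short sketch may help. It uses the 2021 CKD-EPI creatinine equation as an assumption; the text does not state which eGFR equation the cited work used, and the function name and example values are illustrative only.

```python
# Conventional conversion factor: 1 mg/dL creatinine = 88.4 umol/L.
UMOL_PER_L_PER_MG_PER_DL = 88.4


def egfr_ckd_epi_2021(scr_umol_l: float, age_years: float, female: bool) -> float:
    """Estimate eGFR (mL/min/1.73 m^2) from serum creatinine.

    Assumption: 2021 CKD-EPI creatinine equation (race-free version).
    """
    scr_mg_dl = scr_umol_l / UMOL_PER_L_PER_MG_PER_DL
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.241 if female else -0.302    # exponent used when Scr <= kappa
    ratio = scr_mg_dl / kappa
    egfr = (
        142.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.200
        * 0.9938 ** age_years
    )
    return egfr * 1.012 if female else egfr


if __name__ == "__main__":
    # Hypothetical 40-year-old male at three illustrative Scr values (umol/L).
    for scr in (60.0, 80.0, 106.0):
        value = egfr_ckd_epi_2021(scr, age_years=40, female=False)
        print(f"Scr {scr:5.1f} umol/L -> eGFR {value:6.1f} mL/min/1.73 m^2")
```

At fixed age and sex, the estimate rises as Scr falls, so a lower Scr on its own does not indicate reduced filtration under this equation.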
Age and male sex have been identified as risk factors for hearing loss.65 With artificial intelligence and big data analysis, haematological parameters can serve as predictive markers for ONIHL. Our XGBoost-based ML model integrates inflammatory, oxidative stress and metabolic indicators to enhance risk assessment. Feature importance analysis highlights ALB, PDW, RDW-CV, Scr and LYMPHP as key predictors of ONIHL, reinforcing their potential role in early detection and risk stratification.
Although haematological indicators provide a low-cost and accessible approach for ONIHL prediction, their specificity remains limited, necessitating integration with objective auditory assessments such as ABR and OAE to enhance predictive accuracy. Additionally, the generalisability of our model requires further validation as it is currently based on a Shenzhen population and may not fully represent other demographic and occupational groups. The model’s precision and F1 score are also relatively low, primarily due to the severe class imbalance, with ONIHL cases being far less frequent than noise-exposed individuals with normal hearing. Despite these limitations, future studies can address these challenges by conducting large-scale, multicentre validations, employing advanced data-balancing techniques, and incorporating multi-omics data—such as metabolomics and transcriptomics—to unravel the molecular mechanisms linking inflammation, oxidative stress and immune dysregulation in ONIHL. Such advancements will not only optimise predictive models but also facilitate their clinical application in occupational health screening and early intervention, ultimately improving hearing loss prevention strategies.
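The effect of class imbalance on precision and F1 can be illustrated with simple arithmetic. The sketch below is not part of the study's analysis. It reconstructs approximate confusion-matrix counts for D2 (429 samples, 53 ONIHL cases) from the sensitivity (0.755) and specificity (0.840) reported for the refined model, assuming that rounding to whole counts recovers the underlying table. All function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # ONIHL cases flagged by the model
    fn: int  # ONIHL cases missed
    tn: int  # normal-hearing workers correctly cleared
    fp: int  # normal-hearing workers flagged


def confusion_from_rates(n_total, n_cases, sensitivity, specificity):
    """Reconstruct approximate counts from reported rates (an assumption)."""
    n_controls = n_total - n_cases
    tp = round(sensitivity * n_cases)
    tn = round(specificity * n_controls)
    return Confusion(tp=tp, fn=n_cases - tp, tn=tn, fp=n_controls - tn)


def precision(c):
    return c.tp / (c.tp + c.fp)


def recall(c):
    return c.tp / (c.tp + c.fn)


def f1(c):
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def balanced_accuracy(c):
    return 0.5 * (recall(c) + c.tn / (c.tn + c.fp))


def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Precision implied by fixed sensitivity/specificity at a given case fraction."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    d2 = confusion_from_rates(429, 53, sensitivity=0.755, specificity=0.840)
    print(d2)
    print("precision:", round(precision(d2), 3))
    print("F1:", round(f1(d2), 3))
    print("balanced accuracy:", round(balanced_accuracy(d2), 3))
    # Same rates applied to the case fraction of D1 (107 of 2868).
    print("precision at D1 prevalence:",
          round(precision_at_prevalence(0.755, 0.840, 107 / 2868), 3))
```

Under these assumptions, the same sensitivity and specificity imply a precision of about 0.40 at the D2 case fraction and about 0.15 at the D1 case fraction, which shows how a rare outcome depresses precision and F1 even when discrimination is good.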
Conclusion
In this study, we developed five ML models to construct a risk screening model for ONIHL, with the XGBoost-based model demonstrating superior performance. By integrating biochemical and haematological indicators with ML techniques, this model effectively identifies individuals at high risk for ONIHL. This approach not only introduces a novel tool for the early screening of hearing loss but also lays the groundwork for the development of personalised intervention strategies. In the future, the integration of additional biological data is anticipated to further augment the model’s predictive capabilities. Furthermore, this model holds potential for extension to forecast risks associated with other occupational or chronic diseases, thereby offering substantial support for the maintenance and enhancement of public health.
Data availability statement
Data are available upon reasonable request. Original data collected within this study are not publicly available as they might contain sensitive information. De-identified data can be shared based on a reasonable request by sending an email to szpcr@126.com.
Ethics statements
Patient consent for publication
Ethics approval
This study was approved by the Ethics Committee of Shenzhen Prevention and Treatment Center for Occupational Diseases (approval number: LL2020-34; date: 14 December 2020). All methods were carried out in accordance with relevant ethical guidelines and regulations.
Acknowledgments
We thank the Shenzhen Prevention and Treatment Center for Occupational Diseases for the approval of the ethical clearance. We also extend our warm gratitude to the different hospital stakeholders and participants for their valuable contribution during data collection.
References
Footnotes
CL and LS are joint first authors.
Contributors The authors made substantial contributions to the acquisition, analysis and interpretation of the data and the drafting and revision of the manuscript. All authors also approved the final version of the article and agreed to be accountable for all aspects of the work. CL and DW: writing—original draft, investigation, data curation and conceptualisation. CL, LS and LC: methodology and data curation. DL: data curation, writing—original draft, supervision, project administration, formal analysis and conceptualisation. XY and LZ: investigation. PL: validation and investigation. WZ: validation. YG and NZ: supervision, project administration and conceptualisation. CL is the guarantor of this article.
Funding This work was supported by the Science and Technology Planning Project of Shenzhen Municipality (No.KCXFZ20201221173602007, No.JCYJ20240813162306009) and the Shenzhen Fund for Guangdong Provincial High-level Clinical Key Specialties (No.SZGSP015).
Disclaimer The content of this study is solely the responsibility of the authors and does not necessarily represent the official views of the Science and Technology Planning Project of Shenzhen Municipality or the Shenzhen Fund for Guangdong Provincial High-Level Clinical Key Specialties.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.