An Analysis of Machine Learning for Detecting Depression, Anxiety, and Stress of Recovered COVID-19 Patients

Objectives: This study explores different machine learning models (KNN: k-nearest neighbor, MLP: Multilayer Perceptron, SVM: Support Vector Machine) to identify the optimal model for accurate and rapid mental health detection among the recovered COVID-19 patients. Other techniques are also investigated, such as feature selection (Recursive Feature Elimination (RFE) and Extra Trees (ET) methods) and hyper-parameter tuning, to achieve a system that could effectively and quickly indicate mental health. Method/Analysis: To achieve the objectives, the study employs a dataset collected from recovered COVID-19 patients, encompassing information related to depression, anxiety


Introduction
The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, has left an indelible mark on the global population, affecting millions of lives physically, emotionally, and mentally. Numerous studies have been dedicated to understanding and mitigating the acute health risks of the virus. More recently, a growing body of research has focused on the long-term implications of COVID-19 recovery for mental and physical health, a topic of increasing significance because of the effects of COVID-19 on recovered patients [1][2][3].

Data Collection
This study uses survey data from [36], and the experiments were performed under the pertinent guidelines and regulations (refer to the "Materials and Methods" Section in [36]). The survey was undertaken after obtaining approval from the Human Research Ethics Committee, Walailak University (WU-EC-PU-0-214-65). The dataset comprises information from 549 participants in Dong Thap province, Vietnam, all of whom were previously infected with COVID-19 and had recovered, having been discharged from the hospital for more than six months.

The Proposed Method
Figure 1 illustrates the ML-based framework for depression, anxiety, and stress detection in recovered COVID-19 patients. The framework has four main phases: (1) data pre-processing, (2) feature selection, (3) hyper-parameter tuning, and (4) optimal prediction model selection. The input of the framework is the mental health dataset (comprising the depression dataset (Data-D), the anxiety dataset (Data-A), and the stress dataset (Data-S)), and the output is one of five classes.

Data Pre-processing
The dataset under examination in this study pertains to the responses to 21 questions (DASS-21) and sociodemographic details of recovered COVID-19 patients, containing 549 rows and 27 columns [36]. The dataset encompasses different types of variables, including categorical and ordinal. In this step, these variables are transformed into numerical values using the encoding and normalizing techniques provided by the Scikit-learn library in Google Colab.
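The text does not state which Scikit-learn encoders or scalers were used, so the following is only a minimal sketch under assumptions: it uses `OrdinalEncoder` and `MinMaxScaler`, and the column values (`"male"`, `"never"`, and so on) are hypothetical toy data rather than entries from the survey in [36].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy rows: [gender, response to one questionnaire item].
# These values are illustrative only and are not taken from the dataset.
raw = np.array(
    [
        ["male", "never"],
        ["female", "sometimes"],
        ["female", "often"],
        ["male", "almost always"],
    ],
    dtype=object,
)

# Give the category order explicitly so the ordinal answers keep their ranking.
encoder = OrdinalEncoder(
    categories=[
        ["female", "male"],
        ["never", "sometimes", "often", "almost always"],
    ]
)
encoded = encoder.fit_transform(raw)  # strings -> numeric codes

# Rescale every column to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(encoded)
print(scaled)
```

Passing the category order explicitly matters for ordinal variables: without it, the encoder sorts labels alphabetically, which would not preserve the intended ordering of the responses.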

Feature Selection
Feature selection contributes considerably to improving model performance by removing unnecessary features [40]. After data pre-processing, the framework proceeds to the feature selection phase, employing the RFE [41] and ET [42] methods. In this stage, the pre-processed data are passed to each method to identify the optimal subset of features, with the aim of enhancing prediction accuracy.
For the RFE method, features are ranked in descending order of importance (R = {r_1, r_2, …, r_n} = {r_i}, where r_i is the i-th ranked feature, n is the number of features, n = 27, and r_{i-1} < r_i for i = 2, 3, …, n). This process yields 26 subsets of features. The rule for building each subset is as follows: "Each subset contains at least two features, and the features with higher rankings are selected first". For example, the first subset includes the first two features (R_2 = {r_1, r_2}), and the second subset contains the first three features (R_3 = {r_1, r_2, r_3}). Subsequently, each subset is evaluated using the DT algorithm with k-fold cross-validation (k = 10). The subset achieving the highest mean accuracy is then chosen for the subsequent phase.
For the ET method, the procedure for selecting subsets of features is the same as for the RFE method, except that it is based on feature scores rather than ranks: features are ordered by descending importance score (F = {f_1, f_2, …, f_n} = {f_i}, where f_i is the i-th feature and f_{i-1} < f_i).
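The nested-subset procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic (`make_classification`), "DT" is taken to mean a decision tree classifier, the estimator inside RFE is assumed to be a decision tree as well, and the helper name `best_subset` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)


def best_subset(order, X, y, cv=10):
    """Score the nested subsets (top 2, top 3, ..., all) and return the best."""
    scores = {}
    for size in range(2, len(order) + 1):
        cols = order[:size]  # highest-ranked features are taken first
        tree = DecisionTreeClassifier(random_state=0)
        scores[size] = cross_val_score(tree, X[:, cols], y, cv=cv).mean()
    best_size = max(scores, key=scores.get)
    return order[:best_size], scores


# RFE: removing one feature per step down to a single survivor gives every
# feature a distinct rank (1 = most important).
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=1)
rfe.fit(X, y)
rfe_order = np.argsort(rfe.ranking_)

# ET: order features by descending impurity-based importance score.
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
et_order = np.argsort(et.feature_importances_)[::-1]

rfe_subset, rfe_scores = best_subset(rfe_order, X, y)
et_subset, et_scores = best_subset(et_order, X, y)
print("RFE subset size:", len(rfe_subset))
print("ET subset size:", len(et_subset))
```

With n features, the loop evaluates n - 1 subsets, which matches the 26 subsets reported for n = 27.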

Hyper-parameter Tuning
Three machine learning models (KNN, MLP, SVM) are deployed in this phase to assess and optimize hyper-parameters. The evaluation process uses k-fold cross-validation (k = 10). The hyper-parameters for the three machine learning models are presented in Table 1, with the values assigned to each option derived from existing studies. Each machine learning model has multiple options, each with several candidate values, and all possible combinations of these values are generated and evaluated automatically. For KNN, the 'metric' parameter is the distance function, 'n_neighbors' determines the number of neighbors, and the 'algorithm' parameter specifies the algorithm for computing the nearest neighbors. In the case of MLP, 'hidden_layer_sizes' is the number of neurons in each hidden layer, 'activation' is the activation function used in the hidden layers, and 'solver' is the optimization algorithm used for weight optimization during training. In SVM, 'C' is a regularization parameter that controls the trade-off between a smooth decision boundary and accurate classification of training points, 'kernel' defines the type of kernel function, 'gamma' determines how far the influence of a single training example reaches, and 'degree' applies only to the polynomial kernel function. For example, one option for KNN is n_neighbors: [5, 10, 20, 50].
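An exhaustive search over all combinations with 10-fold cross-validation can be sketched with Scikit-learn's `GridSearchCV`. Whether the authors used this class is an assumption. The `n_neighbors` values come from the example in the text; the `algorithm` and `metric` values are illustrative placeholders and do not reproduce Table 1, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the selected features (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

param_grid = {
    "n_neighbors": [5, 10, 20, 50],        # values quoted in the text
    "algorithm": ["auto", "brute"],        # illustrative, not from Table 1
    "metric": ["euclidean", "manhattan"],  # illustrative, not from Table 1
}

# Every combination (4 x 2 x 2 = 16) is scored by mean accuracy over 10 folds.
search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy"
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern would apply to MLP and SVM by swapping the estimator and the parameter grid.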

Optimal Prediction Model Selection
Six machine learning models from the previous stage, comprising three with the best hyper-parameters derived from the RFE method-based features and three with the best hyper-parameters derived from the ET method-based features, are evaluated in this phase. The assessment is conducted as follows: "Models utilizing the same machine learning algorithm are compared, and the model with the higher mean accuracy is selected. Subsequently, the optimal prediction model is chosen based on the highest accuracy from the selected models". This process aims to identify the most effective prediction model.
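The two-step selection rule quoted above can be written as a short sketch. The accuracy values below are hypothetical placeholders chosen only to exercise the logic; they are not results from this study, and the function name `select_optimal` is likewise an assumption.

```python
# Hypothetical mean accuracies keyed by (algorithm, feature-selection method).
# These numbers are placeholders, not values reported in the paper.
results = {
    ("KNN", "RFE"): 0.80, ("KNN", "ET"): 0.85,
    ("MLP", "RFE"): 0.95, ("MLP", "ET"): 0.93,
    ("SVM", "RFE"): 0.96, ("SVM", "ET"): 0.97,
}


def select_optimal(results):
    """Apply the two-step rule: best variant per algorithm, then best overall."""
    # Step 1: for each algorithm, keep the variant with the higher accuracy.
    per_algorithm = {}
    for (algo, method), acc in results.items():
        if algo not in per_algorithm or acc > per_algorithm[algo][1]:
            per_algorithm[algo] = (method, acc)
    # Step 2: among the retained models, pick the highest accuracy.
    best_algo = max(per_algorithm, key=lambda a: per_algorithm[a][1])
    return per_algorithm, (best_algo, *per_algorithm[best_algo])


per_algorithm, optimal = select_optimal(results)
print(per_algorithm)
print(optimal)
```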
The evaluation metrics used in this study (accuracy, precision, recall, and F1-score) are computed from the following quantities. TP (true positive): the observation is actually positive and is predicted positive. FP (false positive): the observation is actually negative and is predicted positive. TN (true negative): the observation is actually negative and is predicted negative. FN (false negative): the observation is actually positive and is predicted negative.
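The metric equations themselves are not present in this text, so the sketch below assumes the standard definitions of accuracy, precision, recall, and F1-score in terms of TP, FP, TN, and FN. For a five-class problem these counts would typically be taken per class (one class versus the rest); the counts in the example are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (assumed definitions)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for a single class, for illustration only.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(m)
```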

Depression Prediction for Recovered COVID-19 Patients
Figure 2 illustrates the importance of features in the depression dataset based on the RFE and ET methods. The RFE method identifies the top ten features, including low self-worth (Q17), difficulty initiating tasks (Q5), lack of anticipation (Q10), sense of life being (Q21), feeling down-hearted (Q13), lack of positive feelings (Q3), lack of enthusiasm (Q16), diabetes, hypertension, and cancer, as the most important, while gender and age are ranked as the least important features. Meanwhile, the ET method returns the top ten features: difficulty initiating tasks (Q5), low self-worth (Q17), lack of positive feelings (Q3), lack of enthusiasm (Q16), sense of life being (Q21), lack of anticipation (Q10), feeling down-hearted (Q13), non-communicable diseases (No_NCDs), diabetes, and hypertension, as the most important, while kidney disease and cancer are identified as the least important features.
Figure 3. Accuracy with number of features for the depression dataset; panel (a): accuracy with number of features based on RFE
In the depression dataset, 14 features based on the RFE method and 11 features based on the ET method were selected to tune the hyper-parameters of three machine learning models (KNN, MLP, and SVM). After tuning, the features derived from the ET method yielded a higher mean accuracy than the RFE method-based features. The respective mean accuracies for KNN, MLP, and SVM were 0.880, 0.980, and 0.984. The best hyper-parameters for the three machine learning models were: KNN with algorithm: 'brute', n_neighbors: 5; MLP with activation: 'identity', hidden_layer_sizes: 100, solver: 'lbfgs'; and SVM with C: 1, kernel: 'linear'. The results of hyper-parameter tuning and the mean accuracy for these machine learning models are summarized in Table 3. In terms of accuracy, SVM with ET method-based feature selection and the best hyper-parameters emerged as the optimal model for depression prediction in recovered COVID-19 patients (accuracy = 0.984). Meanwhile, MLP, whose features were selected by the ET method, performed well (accuracy = 0.980 and F1-score = 0.915) in predicting each level of depression in the recovered COVID-19 patients (see Figures 4 and 5).

Anxiety Prediction for Recovered COVID-19 Patients
Figure 6 illustrates the importance of features in the anxiety dataset based on the RFE and ET methods. The RFE method identifies the top ten features, comprising proximity to panic (Q15), heart awareness energy (Q19), trembling (Q7), unexplained fear (Q20), mouth dryness awareness (Q2), breathing difficulty (Q4), worry about social panic (Q9), respiratory disease, diabetes, and cancer, as the most important, while gender and BMI are ranked as the least important features. The ET method returns the top ten features with the highest scores, including breathing difficulty (Q4), trembling (Q7), mouth dryness awareness (Q2), heart awareness energy (Q19), worry about social panic (Q9), unexplained fear (Q20), proximity to panic (Q15), non-communicable diseases (No_NCDs), hypertension, and heart disease, as the most important, while kidney disease and cancer are identified as the least important features.
Among the features ranked by the RFE method, the first eight ranked features, comprising proximity to panic (Q15), heart awareness energy (Q19), trembling (Q7), unexplained fear (Q20), mouth dryness awareness (Q2), breathing difficulty (Q4), worry about social panic (Q9), and respiratory disease, returned the highest mean accuracy (0.805). Meanwhile, the best performance (0.751) of the features scored by the ET method was observed in the first two: breathing difficulty (Q4) and trembling (Q7) (see Figure 7). In the anxiety dataset, eight features based on the RFE method and two features based on the ET method were selected to tune the hyper-parameters of KNN, MLP, and SVM. After tuning, the features derived from the RFE method yielded a higher mean accuracy than the ET method-based features for all three machine learning models (KNN with 0.778, MLP with 1.00, and SVM with 1.00). The best hyper-parameters for the three machine learning models were: KNN with algorithm: 'brute', n_neighbors: 10; MLP with activation: 'identity', hidden_layer_sizes: 20, solver: 'lbfgs'; and SVM with C: 1, kernel: 'linear'. Table 3 presents the details of the tuned hyper-parameters and the accuracy of these models. Both SVM and MLP, with RFE method-based feature selection and the best hyper-parameters, achieved the best results in terms of accuracy (accuracy = 1.00 for both) and F1-score (F1-score > 0.99 for both) in predicting anxiety levels of the recovered COVID-19 patients (see Figures 5 and 8).

Stress Prediction for Recovered COVID-19 Patients
Figure 9 illustrates the importance of features in the stress dataset based on the RFE and ET methods. The RFE method identifies the top ten features, including agitation meaningless (Q11), intolerance to interruptions (Q14), feeling of using nervous energy (Q8), difficulty relaxing (Q12), difficulty winding down (Q1), tendency to over-react (Q6), sensitivity or touchiness (Q18), other disease, respiratory disease, and cancer, as the most important, while gender and age are ranked as the least important features. The ET method returns the top ten features, comprising the feeling of using nervous energy (Q8), intolerance to interruptions (Q14), tendency to over-react (Q6), difficulty relaxing (Q12), sensitivity or touchiness (Q18), difficulty winding down (Q1), agitation meaningless (Q11), non-communicable diseases (No_NCDs), hypertension, and diabetes, as the most important, while cancer and heart disease are identified as the least important features.
Among the features ranked by the RFE method, the first nine ranked features, including agitation meaningless (Q11), intolerance to interruptions (Q14), feeling of using nervous energy (Q8), difficulty relaxing (Q12), difficulty winding down (Q1), tendency to over-react (Q6), sensitivity or touchiness (Q18), other disease, and respiratory disease, returned the highest mean accuracy (0.874). Meanwhile, the best performance (0.746) of the features scored by the ET method was observed in the first three features, comprising the feeling of using nervous energy (Q8), intolerance to interruptions (Q14), and tendency to over-react (Q6) (see Figure 10). In the stress dataset, nine features based on the RFE method and three features based on the ET method were selected to tune the hyper-parameters of three machine learning models (KNN, MLP, and SVM). After tuning, the features derived from the RFE method yielded a higher mean accuracy than the ET method-based features. The respective mean accuracies for KNN, MLP, and SVM were 0.893, 0.989, and 0.991. The best hyper-parameters for the three machine learning models were: KNN with algorithm: 'brute', n_neighbors: 5; MLP with activation: 'identity', hidden_layer_sizes: 50, solver: 'lbfgs'; and SVM with C: 1, kernel: 'linear'. Table 3 presents the details of the tuned hyper-parameters and the accuracy of these models. SVM, with RFE method-based feature selection and the best hyper-parameters, achieved the best results in terms of accuracy and F1-score (accuracy = 0.991 and F1-score = 0.920) in predicting stress levels of recovered COVID-19 patients (see Figures 5 and 11).

Optimal Machine Learning Models for Depression, Anxiety, and Stress of Recovered COVID-19 Patients
In terms of precision, recall, and F1-score, MLP achieved the highest F1-score (0.915) in the depression dataset compared to SVM and KNN. On the other hand, SVM achieved the highest F1-score (1.00) in the anxiety dataset, while both SVM and MLP shared the top F1-score (0.992) in the stress dataset (see Figure 12). Across all three datasets, SVM achieved the highest accuracy (0.984, 1.00, and 0.991, respectively) (see Figure 5). The optimal model for each of the depression, anxiety, and stress datasets in recovered COVID-19 patients was SVM with hyper-parameters C: 1 and kernel: 'linear'. The optimal models for the depression, anxiety, and stress datasets used 11 features (ET method), eight features (RFE method), and nine features (RFE method), respectively (see Table 4).

Discussion
This study undertook a comprehensive exploration of mental health issues among recovered COVID-19 patients, with a specific focus on depression, anxiety, and stress. Leveraging a dataset that encompassed sociodemographic factors, underlying diseases, and mental health attributes, our analysis utilized machine learning models, including KNN, MLP, and SVM, and revealed promising results in accurately predicting the mental health conditions of recovered COVID-19 patients. SVM emerged as the most effective model across the three datasets. Our findings agree with prior studies on depression, anxiety, and stress, corroborating the importance of understanding mental health among individuals recovered from COVID-19 [25,27,32].
In terms of disease conditions, our results highlighted associations between depression, anxiety, and stress and underlying diseases, such as non-communicable diseases, hypertension, diabetes, heart disease, and respiratory disease. This aligns with existing research emphasizing that underlying diseases are significant risk factors contributing to the severity of symptoms related to depression, anxiety, and stress [36][37][38], underscoring the need for heightened mental health awareness, particularly among those with underlying health issues.
While sociodemographic and COVID-related details did not prove essential for the optimal machine learning models detecting depression, anxiety, and stress in recovered COVID-19 patients, feature subsets that included them still reached high accuracies (all > 0.700). This observation, reflected in the relationship between the number of features and accuracy in the depression, anxiety, and stress datasets, suggests that sociodemographic information and COVID-19-related details may indeed influence the mental well-being of recovered COVID-19 patients, aligning with findings from other studies [28][36][37][38].
However, our study has notable limitations. Relying on a single-phase data collection approach may overlook the temporal dynamics of mental health conditions, potentially missing nuances in symptom progression. Additionally, the exclusive focus on depression, anxiety, and stress neglects other crucial dimensions of mental health, potentially limiting the model's comprehensive applicability. These limitations should be considered when interpreting the existing findings and planning future research.

Conclusion
This study proposed an ML-based framework for depression, anxiety, and stress (DAS) detection from a dataset of recovered COVID-19 patients (e.g., sociodemographic factors, underlying diseases, and mental health attributes) with machine learning models (KNN, MLP, and SVM), which predicted mental health conditions accurately. The exploration of feature selection methods, namely RFE and ET, underscored their pivotal role in refining the models for accurate mental health predictions. In the experiment, SVM emerged as the optimal model, with an accuracy of at least 0.984 on all three datasets, highlighting its robustness in predicting mental health disorders among recovered COVID-19 patients. The RFE method was the most effective feature selection method for the anxiety and stress datasets, while the ET method performed better on the depression dataset. There are intriguing opportunities to use markers, such as physiological and biochemical indicators, to provide a more comprehensive understanding of mental health conditions. In the future, we plan to integrate these markers with survey data to enhance mental health support. This integration holds the potential for personalized intervention strategies tailored to individuals based on machine learning predictions.

Figure 1. Workflow of the proposed method for depression, anxiety, and stress detection based on machine learning

Figure 4. Confusion matrices of three machine learning models for the depression dataset

Figure 6. Feature importance based on RFE and ET methods for the anxiety dataset

Figure 7. Accuracy with number of features for the anxiety dataset

Figure 9. Feature importance based on RFE and ET methods for the stress dataset

Figure 10. Accuracy with number of features for the stress dataset

Table 3. Hyper-parameter tuning of three machine learning models for the depression, anxiety, and stress datasets
Note: Accuracy is based on the best hyper-parameters