Feature Weighting for Parkinson's Identification using Single Hidden Layer Neural Network

Machine learning has made the diagnosis of Parkinson's disease easier. A model uses features extracted from a person's biometric data to identify whether that person has Parkinson's disease. The features differ in their discrimination capability and suffer from redundancy. Hence, researchers have recommended feature selection for Parkinson's identification. Feature selection aims to find the most important and relevant features in order to produce an efficient and effective model. In this article, we present entropy-based Parkinson classification. The goal is to select only the 50% most relevant features for Parkinson prediction. Two variants of neural networks are used for evaluation: the first is a feed-forward Extreme Learning Machine (ELM) and the second is a Fast Learning Machine (FLN). The K-Nearest Neighbor (KNN) algorithm is also used for evaluation. The results show the superiority of ELM and FLN when feature selection is used, with an accuracy of 80% compared with only 78% when it is not used.


I. INTRODUCTION
Parkinson's disease is regarded as one of the major diseases affecting the population, with a prevalence of 2-3% among people aged 65 and older [1]. According to the Parkinson Disease (PD) foundation, about 7-10 million people worldwide suffer from Parkinson's. The cause of Parkinson's is the depletion of dopaminergic nigrostriatal neurons [2,3]. It affects not only the articulators but also the voice and speech of patients. Some of the symptoms of Parkinson disease are tremors, loss of gestures while talking or communicating, difficulty with fast movements and independent joint control, dragging or small steps in walking, and others. It also affects language, cognition, and mood. The disease is classified as a neurodegenerative disease caused by a genetic mutation. Diagnosing the disease is essential to prevent it from progressing to an advanced stage. The voice signal is the most important element for early Parkinson diagnosis [4].
The diagnosis of Parkinson disease is becoming easier with the existence of Neural Network (NN) models that are trained on existing datasets of the disease [5][6][7]. The datasets can be built from various types of biometric signals recorded from the patients. Considering that voice signals play an important role in discriminating the disease [8][9][10], it is possible to train models on this data and use them for prediction. The features differ in their level of discrimination and suffer from redundancy. Hence, selecting the most powerful features is more effective for building an accurate model and makes computation more efficient. Feature selection has been applied in many fields for identification or classification, such as Intrusion Detection Systems (IDS) [11], ear recognition [12,13], face recognition [14][15][16], and gait classification [17,18]. In the area of Parkinson disease, applying feature selection to increase accuracy is promising.
A Single Hidden Layer Feed-Forward Neural Network (SLFN) is a neural network composed of an input layer, an output layer, and a single hidden layer. At the output of the hidden layer, various types of activation functions can be used. This NN has two forms, namely, a simple form without connections between the input and output layers and a parallel form with connections between the input and output layers [19]. The former is trained using Extreme Learning Machine (ELM) [20][21][22][23] and the latter is trained using Fast Learning Machine (FLN) [24,25]. The two training approaches are more effective than the classical Back-Propagation (BP) training algorithm that uses the error gradient.
The goal of this article is to explore two classification models with the assistance of feature weighting and selection for Parkinson disease identification. The approach is based on a new method that uses entropy for feature selection and tests two models: the first model uses an extreme learning machine with whole/half features, and the second model uses fast learning and KNN with whole/half features. The remainder of the text is organized as follows. We outline the literature review in Section II. The methodology is presented in Section III. The experimental strategy and outcomes are presented in Section IV. Finally, Section V provides the conclusion and future work.

II. RELATED WORK
In [26], a non-invasive sensing approach for Parkinson disease was proposed based on gait analysis. The approach is based on the wavelet transform of spatiotemporal gait variables. For features, the algorithm uses computationally simple features such as minimum and maximum values, mean, variance, and energy variables. Furthermore, the approach considers the evaluation of various gait parameters. For classification, the approach uses a support vector machine. However, the approach does not incorporate feature selection. In some approaches, genetic optimization of the number of neurons and the parameters of a Wavelet Kernel Extreme Learning Machine (WKELM) was considered [27]. A comparison of this work with an earlier work that used the feature weighting method called Subtractive Clustering Features Weighting (SCFW) showed improved performance [28]. Using voice patterns for the diagnosis of Parkinson disease was considered in [29]. The approach is based on eight different pattern ranking methods using a support vector machine. A Bayesian optimization technique was used to optimize the radial basis function of the support vector machine. Some researchers focused on developing a hybrid algorithm for Parkinson disease. For example, a Multi-Layer Perceptron (MLP) was integrated with feature selection and their moduli for feature ranking [30].
In [31], the authors used ensemble methods, namely categorical boosting, extreme gradient boosting, and random forest, to select the most discriminating features for identifying PD. To obtain the best results in PD prediction, the impact of these features at various thresholds was investigated.
In the study of Yuvaraj et al. [32], Electroencephalography (EEG) signals were used and the usage of Higher-Order Spectra (HOS) for the diagnosis of Parkinson Disease was investigated. The obtained features were sorted and ranked using the t value, and highly ranked features were selected. The t value is a single value that can distinguish the two classes. In addition, the ranked features were fed one by one to multiple classifiers, namely Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), Probabilistic Neural Network, and Decision Tree (DT).
Overall, feature selection using meta-heuristics for Parkinson identification has not yet been tackled in the literature. This article proposes an entropy-based approach for selecting the most important features for Parkinson diagnosis.

III. RESEARCH METHOD
This section presents the developed methodology for Parkinson identification using optimized feature selection and classifier structure. We provide the framework in subsection A. Next, the data description is presented in subsection B. Afterwards, we describe the extracted features in subsection C. The classifiers ELM and FLN are explained in subsection D. Next, feature weighting and selection are covered in subsection E. The evaluation metrics are presented in subsection F.

A. FRAMEWORK
The framework of Parkinson identification is provided in Fig. 1. The received data are fed into a feature reduction block, which reduces the features and projects them from the original space to a reduced space. Next, the reduced space of important features is provided to the classifier.

B. DATA DESCRIPTION
A dataset for Parkinson disease created by the University of Oxford in collaboration with the National Center for Voice and Speech [33] is used in this article. The data cover 31 people, 23 of whom have PD. Each column corresponds to one particular voice measure. In addition, the label of each record takes one of two values: 0 for healthy and 1 for non-healthy.

C. EXTRACTED FEATURES
Several features were extracted from the dataset, including: (1) fundamental frequency (F0) and its variability (standard deviation of F0); (2) jitter and shimmer (measures of frequency and amplitude variability); (3) harmonics-to-noise ratio (HNR), which measures the ratio of harmonic to non-harmonic components in the voice signal; (4) spectral tilt, which measures the slope of the spectral envelope; (5) formant frequencies, which represent the resonant frequencies of the vocal tract; and (6) pause and voice onset time (VOT), which measure the duration of silence before speaking and the duration between the onset of voicing and the release of a plosive sound. The obtained features were sorted and ranked using the t value, and highly ranked features were selected. All these features were calculated from the digitally normalized voice signals recorded from the subjects in an IAC sound-treated booth using a head-mounted microphone (AKG C420) positioned at 8 cm from the lips, and processed using CSL 4300B hardware (Kay Elemetrics).

D. EXTREME LEARNING MACHINE
Extreme Learning Machine (ELM) is a machine learning algorithm used for classification, regression, and feature learning tasks. ELM is a type of feedforward neural network with a single hidden layer, and its learning is based on randomly generated input weights and biases. ELM is a popular algorithm in the machine learning community due to its simplicity, fast learning speed, and high accuracy.
The ELM algorithm can be described in terms of three main stages: the input layer, the hidden layer, and the output layer. In the input layer, data is fed into the network. In the hidden layer, the input weights and biases are randomly generated. The output of the hidden layer is then fed into the output layer, which produces the final output. The hidden layer of the ELM is responsible for the non-linear mapping of the input data into a higher-dimensional space, where linear regression is performed.
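The hidden-layer mapping described above can be illustrated with a minimal NumPy sketch. It covers the four activation functions evaluated later in Section IV. The function names, the `ACTIVATIONS` table, and `hidden_output` are hypothetical helpers introduced only for illustration, and "tribs" is assumed here to denote the triangular basis function; the source does not define it.

```python
import numpy as np

def hardlim(z):
    """Hard limit: 1 where z >= 0, otherwise 0."""
    return (z >= 0).astype(float)

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def sine(z):
    """Sine activation."""
    return np.sin(z)

def tribs(z):
    """Triangular basis (assumed meaning of 'tribs'): 1 - |z| on [-1, 1], else 0."""
    return np.maximum(1.0 - np.abs(z), 0.0)

# Hypothetical lookup table keyed by the names used in Section IV.
ACTIVATIONS = {"hardlim": hardlim, "sigmoid": sigmoid, "sin": sine, "tribs": tribs}

def hidden_output(X, W, b, activation="sigmoid"):
    """Map inputs X (n_samples, n_features) to hidden outputs (n_samples, n_hidden)."""
    return ACTIVATIONS[activation](X @ W + b)
```

Keeping the activation as a lookup makes it straightforward to repeat the same experiment once per activation function, as the evaluation in Section IV does.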
One of the most significant advantages of ELM is its fast learning speed. Unlike traditional neural networks, where the weights and biases are adjusted through iterative training, ELM randomly generates the input weights and biases and trains the output layer using a linear regression approach. This means that ELM avoids iterative tuning of the hidden layer and can handle large datasets with high computational efficiency.
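The training scheme just described (random input weights, output weights from a linear least-squares fit) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `ELMClassifier`, the sigmoid activation, the uniform range [-1, 1] for the random weights, and the one-hot target encoding are all assumptions.

```python
import numpy as np

class ELMClassifier:
    """Minimal ELM sketch: random hidden layer, least-squares output layer."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid hidden layer (assumed; any activation could be swapped in).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Input weights and biases are drawn once and never updated.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: least-squares solution via the pseudoinverse.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        H = self._hidden(np.asarray(X, dtype=float))
        return self.classes_[np.argmax(H @ self.beta, axis=1)]
```

Because the only fitted quantity is `beta`, training reduces to one matrix pseudoinverse, which is where the speed advantage over gradient-based training comes from.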
Another significant advantage of ELM is its ability to avoid overfitting. Overfitting occurs when a machine learning model is too complex and is tailored too closely to the training data, leading to poor performance on new data. ELM overcomes this problem by generating random input weights and biases, which increases the diversity of the model and reduces the likelihood of overfitting.
ELM has been successfully applied in various fields, including computer vision, speech recognition, and finance. ELM has also been used in medical diagnosis, where it has been shown to have high accuracy in the classification of diseases such as breast cancer, diabetes, and heart disease [34,35].
Fast Learning Machine (FLN) is a newer variant of the extreme learning machine that adds weights connecting the input layer directly to the output layer of a single hidden layer neural network. Together with the hidden layer, these connections allow the network to learn complex nonlinear relationships between the input and output variables. This is because the hidden layer serves as a sort of "feature extractor", transforming the input data into a new representation that is better suited for predicting the output.
In particular, the connections between the input layer and the hidden layer allow the network to learn important features of the input data that are relevant to the prediction task, while the connections between the hidden layer and the output layer allow the network to use these features to make accurate predictions. The benefit of this architecture is that it can capture highly nonlinear relationships between the input and output variables, which may not be possible with simpler models like linear regression [36,37]. The process of calculating the weights, that is, training, is as follows:
1. Generate the input-to-hidden weights randomly.
2. Calculate the hidden output matrix using the activation function of the hidden layer and the input data.
3. Calculate the parallel weights using the Moore-Penrose generalized inverse.
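The three training steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names `fln_fit` and `fln_predict`, the sigmoid activation, the output bias column, and the linear output layer are all assumptions. The "parallel weights" are taken to be the input-to-output and hidden-to-output weights solved jointly.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fln_fit(X, y, n_hidden=20, seed=0):
    """Fit a parallel-form SLFN; returns a dict of learned parameters."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    T = (y[:, None] == classes[None, :]).astype(float)  # one-hot targets

    # Step 1: random input-to-hidden weights and biases.
    W_in = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)

    # Step 2: hidden output matrix.
    H = _sigmoid(X @ W_in + b)

    # Step 3: solve the input-to-output and hidden-to-output weights jointly
    # with the Moore-Penrose pseudoinverse of [X, H, 1].
    A = np.hstack([X, H, np.ones((X.shape[0], 1))])
    W_out = np.linalg.pinv(A) @ T
    return {"classes": classes, "W_in": W_in, "b": b, "W_out": W_out}

def fln_predict(model, X):
    """Predict class labels with a model returned by fln_fit."""
    X = np.asarray(X, dtype=float)
    H = _sigmoid(X @ model["W_in"] + model["b"])
    A = np.hstack([X, H, np.ones((X.shape[0], 1))])
    return model["classes"][np.argmax(A @ model["W_out"], axis=1)]
```

The only structural difference from the ELM sketch is the design matrix: stacking `X` next to `H` gives the output layer direct access to the raw inputs, so linear structure in the data does not have to be recovered through the random hidden layer.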

E. FEATURE WEIGHTING and SELECTION
In feature weighting, each feature is associated with a weight. The weight is calculated based on the entropy of the feature. The entropy is calculated using Eq. (1) and is used for weighting the features. Features with higher entropy are prioritized over features with lower entropy. This leads to the concept of a half-feature classifier, that is, a classifier that works on the most important 50% of the features instead of the whole feature set. Hence, the most important 50% of the features are selected. This is intended to avoid over-fitting and improve accuracy.
$$E = -\sum_{i=1}^{N} P_i \log P_i \qquad (1)$$
where $i$ denotes the index of the sample, $N$ denotes the number of samples, and $P_i$ denotes the probability of occurrence of sample $i$. The features are then ordered in descending order of entropy, and the most important half of the features is selected.
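The weighting and selection step can be sketched as follows, assuming Eq. (1) is the standard Shannon entropy $-\sum_i P_i \log_2 P_i$. The source does not say how the probabilities are estimated, so this sketch estimates them from a histogram of each feature's values; the bin count, the base-2 logarithm, and the function names `feature_entropy` and `select_top_half` are assumptions made for illustration.

```python
import numpy as np

def feature_entropy(x, bins=10):
    """Shannon entropy (bits) of one feature, estimated from a histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()  # drop empty bins: 0 * log(0) := 0
    return float(-(p * np.log2(p)).sum())

def select_top_half(X, bins=10):
    """Return the column indices of the higher-entropy half and all scores."""
    X = np.asarray(X, dtype=float)
    scores = np.array([feature_entropy(X[:, j], bins) for j in range(X.shape[1])])
    order = np.argsort(-scores, kind="stable")   # descending entropy
    n_keep = max(1, X.shape[1] // 2)             # half, rounded down
    keep = np.sort(order[:n_keep])               # keep original column order
    return keep, scores
```

A half-feature classifier is then trained on `X[:, keep]` instead of `X`. Note that this ranking is unsupervised: it uses only the spread of each feature and never looks at the class labels.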

IV. RESULTS AND ANALYSIS
The evaluation has been conducted based on four activation functions, namely, hardlim, sigmoid, sin, and tribs. These activation functions were selected because they are commonly used in ELM classification. In Fig. 2, FLN-half provides the highest performance in terms of accuracy, recall, and G-mean. However, ELM-half produces higher precision, which indicates a bias toward negative predictions that elevates its precision. For further confirmation, we have included KNN-half and KNN-full, which correspond to the k-nearest neighbor classifier based on the half and full feature sets, respectively. The results reveal that KNN-half slightly outperforms KNN-full, which confirms the effectiveness of our feature weighting and selection.
The behavior observed for hardlim is also observed for the sigmoid activation function. However, Fig. 3 shows that ELM-half improves its performance compared with the hardlim activation function. This shows the role of deselecting non-relevant features in increasing the performance of the classifier. The behavior observed for hardlim and sigmoid is also observed for sin, where FLN-half provides the highest accuracy, recall, F-measure, and G-mean in Fig. 4. We also observed that its precision was lower because of some false positive predictions.
The only case where ELM-half outperforms FLN-half is the tribs function, depicted in Fig. 5. This is explained by the effect of the activation function on the learning capability of the classifier. Our finding is that sigmoid, hardlim, and sin are more suitable for FLN, while tribs is more suitable for ELM, when the best half of the features is used.
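The measures discussed above (accuracy, precision, recall, F-measure, and G-mean) can be computed from the confusion matrix as in the sketch below. The text does not state the exact definitions used, so the common ones are assumed here; in particular, G-mean is taken as the geometric mean of recall and specificity. The function name `binary_metrics` is hypothetical, and label 1 (non-healthy) is treated as the positive class.

```python
import math

def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(y_true, y_pred, positive=1):
    """Compute binary classification measures from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)        # sensitivity
    specificity = _safe_div(tn, tn + fp)
    return {
        "accuracy": _safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f_measure": _safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),  # assumed definition
    }
```

Under these definitions, a classifier biased toward negative predictions produces few false positives, which raises precision while lowering recall. This matches the interpretation given above for ELM-half.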

Figure 1. A framework for feature weighting and selection for Parkinson classification.

Figure 2. Classification measures for evaluation of Parkinson based on ELM, KNN, and FLN for both full and half features using hardlim.

Figure 3. Classification measures for evaluation of Parkinson based on ELM, KNN, and FLN for both full and half features using sigmoid.

Figure 4. Classification measures for evaluation of Parkinson based on ELM, KNN, and FLN for both full and half features using sin.

Figure 5. Classification measures for evaluation of Parkinson based on ELM, KNN, and FLN for both full and half features using tribs.

V. CONCLUSIONS
This article has presented feature selection for finding the most important and relevant features to produce an efficient and effective model. The model is based on entropy-based Parkinson classification. It selects the top 50% of the features. Two variants of neural networks are used for evaluation: the first is ELM and the second is FLN. In addition, the K-Nearest Neighbor (KNN) algorithm is applied for evaluation. The results show the superiority of ELM and FLN when feature selection is used, with an accuracy of 80% compared with only 78% when it is not used. Future work will be devoted to exploring the effect of the percentage of selected features on the predicted results.