Schizophrenia is a chronic and serious mental disorder which affects how a person thinks, feels, and behaves. Although there have been many studies about psychological and behavioral manifestations of schizophrenia, neuroscientists have yet to determine a set of corresponding neurological biomarkers for this disorder. A functional magnetic resonance imaging (fMRI) can help determine non-invasive biomarkers for schizophrenia in brain function[1][2] and one such fMRI analysis technique called dynamic functional network connectivity (dFNC)[3][4][10] uses K-Means clustering to characterize time-varying connectivity between functional networks. Researchers have worked on finding correlation between schizophrenia and dFNC[1][2][5][6]. Indeed, dFNC features are well-established as features for schizophrenia classification [14].
Despite the utility of dFNC analysis, little work has been done analyzing the choice of clustering algorithm, which in most applications is simply K-Means[7][9]. It is possible that the reliance of K-Means assumptions (sphericality of clusters, reliability of elbow-criterion, etc) have introduced bias into previous results using dFNC, and restricted both biological discovery, and reliability of classifiers used for diagnosis.
Therefore, in this project, we have studied how modifying the clustering technique in the dFNC pipeline can yield dynamic states from fMRI data, and how that choice impacts the accuracy of classifying schizophrenia[8].
We experimented with DBSCAN, Hiearcharial Clustering, Gaussian Mixture Models, and Bayesian Gaussian Mixture Models clustering methods on subject connectivity matrices produced from fMRI data, and use each algorithm’s cluster assignments as features for SVMs, MLP, Nearest Neighbor, and other supervised classification algorithms to classify schizophrenia.
Section II describes the fMRI data used in our experimentation, while Section III summarizes the aforementioned clustering and classification algorithms used in the pipeline. Section IV compares the accuracy of these classifiers, along with presenting a series of charts that analyze the cluster assignments produced on the fMRI data.
All datasets used in this project are derivatives of fMRI data (functional magnetic resonance imaging), which is data measuring brain activity to track a patient’s thought processs over a predefined time period.
The BOLD signal in fMRI data tracks the movement of blood throughout a subject's brain during the scan period. These changes in blood flow can be further analyzed to make conclusions about flow between different regions, regions of concentrated activity, and other conclusions, which can be useful for the diagnosis of mental disorders, among other applications. The full 4-Dimensional volumes are often not used for classification in their raw form, and instead derivatives such as Functional Network Connectivity states can be used for diagnosis.
Data Set Name | Number of Subjects | Number of Healthy Controls | Number of Patients |
---|---|---|---|
FBIRN | 362 | 186 | 176 |
UCLA | 332 | 272 | 50 |
Simulations | 300 | 150 | 150 |
We generated synthetic FNC data which were used for debugging code, and evaluating our method. The details of the simulation are described in the appendix.
We use derivatives from the Phase 3 Dataset of FBIRN (Functional Biomedical Infromatics Research Network Data Repository), which specifically focuses on brain activity maps from patients with schizophrenia. This dataset includes 186 healthy controls and 176 indivduals from schizophrenia from around the United States. Subject participants in this dataset are between the ages of 18-62.
We also use derivatives from the UCLA Consortium for Neuropsychiatric Phenomics archives, which includes neuroimages for roughly 272 participants. The subject population consists of roughly 272 healthy controls, as well as participants with a diagnosis of schizophrenia (50 subjects). Subject participants in this dataset range from 21 to 50 years
As is standard in Dynamic Functional Network Connectivity, Group Independent Component Analysis of subject images is used to compute a set of M statistically independent spatial maps from the data, along with a set of M time-courses, which are used for the dFNC analysis. For the FBIRN dataset, we used pre-computed ICA timecourses provided by Damaraju et al. 2014. These timecourses were computed using Infomax ICA [11] with 100 components, which were then manually selected based on biological relevance for a total of 47 remaining components.
For the UCLA data, because no ICA derivatives were readily available, we used Group Information Guided ICA [12] which is included as part of the GIFT NeuroImaging Analysis Toolbox [15]. As a template for GIG-ICA, we used the NeuroMark template [16], which has been shown to provide reliable estimations of components in many different settings [13]. This provides us with 53 independent components for the UCLA dataset.
For all data-sets we chose a sliding window of size W = 22, following the precedent in Damaraju et al. 2014 [1]. Time-Series data X with size M × T, where M is the number of independent components, and T is the number of timepoints are used. Time series were normalized and convolved with a gaussian window with a kernel of size 0.015, again following the precedent in Damaraju et al. 2014 [1]. For each window on each time-series a number of exemplar data-points were selected by taking the local maximum of the variance from the smoothed time-series.
We computed correlation coefficients across the components, within each time series window to form the FNC matrix with entries i and j given as
Through this process, we generate a total of N × (T − W) window instances, which are used as the input for clustering.
We implemented 5 different clustering algorithms as part of the dFNC pipeline:
KMeans [24.]
DBSCAN [25.]
Gaussian Mixture Models
Bayesian Gaussian Mixture Models
Agglomerative Hierarchical Clustering
For each of these algorithms, we used them for both exemplar clustering, and clustering on the full data. In the case of KMeans, Gaussian Mixture Models, and Bayesian Gaussian Mixture Models, exemplar cluster-centers were used as input for initialization in the second stage of clustering using the same algorithm. In the case of DBSCAN and Agglomerative clustering, the exemplar stage was only used for elbow criterion, and other hyper-parameter tuning. We provide a more exhaustive analysis of our elbow-criteria experiments in Appendix B.
For GMMs and Bayesian GMMs, we elected to use hard-assignment for determining cluster membership, because it lent to simpler integration with other steps in the pipeline. In future work, we would like to try to use the more rich information provided by soft-assignment to somehow improve feature generation, which is further discussed below.
Our chosen task is the binary classification of Schizophrenic Patients from Healthy Controls using Functional Network Connectivity states. For patients with a schizophrenic diagnosis, we assign the label 1, and for all other subjects, we assign the label 0. Given the set of FNC features for all subjects, our goal is to find the classifier which provides the most accurate diagnosis.
For our evaluation metric, we used the Area-Under the “Receiving Operating Characteric” (ROC) curve, or ROC-AUC. This metric, which is often used for binary classifiers, describes the probability that a given classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. The curve is creative by plotting the fraction of true positive out of the positive vs the fraction of false pasitive out of the negatives over various thresholds T. In general, a higher AUC indicates a better binary classification performance.
Classification over raw dynamic FNC states is often poor, and further feature generation is required.
We first experimented with the use of subject state-vectors for classification. Because each time point for each subject is assigned a state during clustering, we can generate a vector of T-W state assignments for each subject to be used as features.
Secondly, we follow the precedent in [14], and compute means of clusters for each class, and use these centers to form a regression matrix of size 2k × (T − W). For each subject, we then linearly regress out these cluster centers from each FNC window, and collect the beta coefficients. The mean of the β-coefficients over all timepoints are then used as features for the final classification.
We computed two-tailed T-Tests over each of the feature spaces used for classification, and also over the centroids for each class prior to regression, in order to visualize statistically significant features which may or may not improve the performance of the classifiers.
For supervised learning, we used a modified version of the Polyssifier classification toolbox, which wraps the majority of the classification algorithms from SKlearn into an end-to-end framework. The modified version of the toolbox, which is being maintained by Brad Baker, . Within polyssifier, we implemented the following binary classifiers:
Multilayer Perceptron
K-Nearest Neighbors Classifier
Support Vector Machines
Decision Tree Classifier
Random Forests [17]
Extra Trees [18]
Ada-Boost Decision Trees [19]
Bagging Decision Trees [20]
Gradient Boosted Decision Trees [21]
Logistic Regression
Passive Aggressive Classifier [22]
Perceptron
Naive Bayes
Voting Classifer
We performed a Grid-Search over all available parameters for each of the classifiers available in Polyssifier. For each set of training data used for cross-validation, we trained the grid-search separately, and took the parameters with the highest AUC.
For the sake of space, we have ommitted the details of the grid-search, but the results have been included in the results section of the repo.
For brevity, we only present results from modifying the dFNC pipeline with the aforementioned clustering and classification algorithms on the FBIRN dataset in this section. We leave our results on the gaussian simulated and UCLA datasets in Appendix D and C respectively, if the reader is interested in viewing them.
For all results, the Area Under Curve (AUC) metric was used since we were trying to classify two possible outcomes. Patients were with either healthy or had schizophrenia. AUC scores near 0.50 meant that the models randomly diagnosed patients as healthy or schizophrenic. This represented the worst possible outcome. Any model with AUC scores between 0.40 and 0.60 was deemed a failed model. AUC scores near 1.0 meant that diagnoses were accurate with few false positives (high specificity) and few false negatives (high recall). There were no observed models that completely reversed the diagnoses with AUC scores near 0.0.
Each classification algorithm was evaluated with 10-fold cross-validation, with a grid-search for classifier parameters performed separately on each validation set.
We initially trained the supervised learning model on the unclustered FBIRN FNC matrices in order to establish a baseline accuracy that we could compare our beta feature generation approach against. All the classifiers failed to consistently achieve AUC scores outside of the 0.40-0.60 range suggesting that all the models produced random diagnoses. This confimed our hypothesis that clustering and feature reduction was paramount to succesful classfication of healthy and schizoprhenic patients. The baseline results are displayed below.
After establishing our baseline results we trained the models on the FBIRN dataset using only cluster assignments. No beta features were generated to reduce the dimensionality of the training dataset. The results are displayed below.
The results indicated that it was possible for some of the clustering algorithms to lift the AUC scores to an average of 0.70 for most supervised models with the exclusion of the Voting learner. The one exception was DBSCAN which failed to move the AUC from 0.50. Regardless, none of these models achieved a high enough score to be used in a clinical environment. It was surmised that the number of features in the datasets had to be reduced from cluster assignments over time to beta features in order for the supervised learning models to accurately diagnose patients.
After performing beta features generation with clustering, the final results for the FBIRN dataset were obtained and are displayed below.
Clustering Algorithm | SVM | Multilayer Perceptron | Logistic Regression | Passive Aggressive Classifier | Perceptron | Random Forest | Extra Trees |
---|---|---|---|---|---|---|---|
kmeans | 0.952 ± 0.036 | 0.92 ± 0.065 | 0.944 ± 0.039 | 0.945 ± 0.035 | 0.902 ± 0.043 | 0.871 ± 0.038 | 0.853 ± 0.04 |
gmm | 0.936 ± 0.054 | 0.946 ± 0.038 | 0.943 ± 0.038 | 0.929 ± 0.031 | 0.882 ± 0.04 | 0.885 ± 0.022 | 0.874 ± 0.026 |
bgmm | 0.955 ± 0.037 | 0.932 ± 0.042 | 0.945 ± 0.038 | 0.939 ± 0.038 | 0.896 ± 0.074 | 0.86 ± 0.039 | 0.87 ± 0.056 |
dbscan | 0.883 ± 0.027 | 0.893 ± 0.031 | 0.892 ± 0.033 | 0.884 ± 0.027 | 0.828 ± 0.064 | 0.805 ± 0.064 | 0.806 ± 0.058 |
hierarchical | 0.957 ± 0.032 | 0.954 ± 0.038 | 0.953 ± 0.038 | 0.951 ± 0.032 | 0.891 ± 0.098 | 0.881 ± 0.032 | 0.872 ± 0.048 |
The reduced number of training features increased the accuracy of all the learners with Support Vector Machines achieving the highest accuracy across all clustering algorithms. Only DBCAN failed to get above an average AUC score of 0.90 for support vector machines and Hierarchical clusterting obtained this highest score of 0.957.
In the following section, we display both 2D and 3D clustering visualizations for KMeans, GMMs, BGMMs, DBSCAN, and Hiearcharial Clustering after performing PCA on the FBIRN dataset. We mark the centroids of each clustering algorithm with a bolded X , and take the average of points of each cluster in the DBSCAN visualizations to represent its cluster's centers.
We also present FNC state diagrams that analyze how each clustering method clustered data related to healthy patients and schizophrenic patients differently. We hope these diagrams can help neuroscience researchers find interesting biomarkers for further work.
Lastly, we present clustering visualizations and FNC State diagrams for the UCLA and gaussian simulated dataset in Appendix C and D respectively.
As a summary of the results posted above, we see that 4 out of 5 tested clustering methods (KMeans, Hiearcharial Clustering, GMM, BGMM) are able to find 4-5 distinct clusters within the FBIRN data, as evidence by their respective 2D and 3D visualizations shown in this section. Interestingly, each of these clustering algorithms produced clusters that are fairly segmented, which indicates that applying clustering on fMRI data is valuable for finding features that more easily separable in classification. DBSCAN only appears to find 2 clusters, with one of them being marked as "noise," and accordingly produces features that perform worse in comparison to other clustering algorithms on the Schizophrenia prediction task.
In terms of the Schizophrenia prediction task, Hiearcharial Clustering appears to produce assignments that give the greatest lift to all classifiers, with SVMs benefitting the most from its cluster assignments. However, the a few of the other clustering algorithms also perform strongly with GMMs and BGMMs performing comparatively, KMeans slightly behind, and DBSCAN performing slightly worse in all experiments.
More biological analysis is required to make any concrete recommendations based off these results, but our results appear to signal that the choice of clustering and classification algorithm in the dFNC pipeline does make a significant difference in the quality of the output states and diagnosis prediction accuracy.
Cluster-Based features greatly improve classification performance for the schizophrenia classification task.
Classification of dFNC states performs comporably when using GMM, Bayesian GMM, Hierarchical Clustering, and KMeans
GMM, bGMM, and Hierarchical clustering all detect states not detected with KMeans, which provide either comparable or better classification of schizophrenia, indicating these states may have biological significance for diagnosis.
Hierarchical Clustering provides the highest classification results for FBIRN, and GMM/bGMM provide the highest results for UCLA data, while also providing slightly different states than are detected with KMeans. These states may be useful for further neurological analysis.
DBSCAN's sensitivity to parameters requires exhaustive hyper-parameter searching, without clear improvements in performance, but because states collected by DBSCAN in FBIRN data provided decent results (~80% accuracy), they may provide some biological insight.
As established in literature [14], SVM classifiers perform most reliably for classifying dFNC states, which is consistent for both real data sets, and all clustering algorithms.
Further neurological analysis of detected states is required for more thorough interpretation of results
Elbow criterion for different clustering algorithms revealed some inconsistency in the use of k = 5 for the number of states. Applying the full classification pipeline to other numbers of states may provide further insight into states, and improve classification even further.
More advanced clustering and classification algorithms using Deep Learning may provide significant lifts classification. Particularly the use of Variational AutoEncoders for clustering seems a particularly promising avenue of future investigation.
[1]. Eswar Damaraju et al. “Dynamic functional connectivity analysis reveals
transient
states
of dyscon-nectivity in schizophrenia”. In:NeuroImage: Clinical5 (2014), pp. 298–308.
[2]. Mustafa S Salman et al. “Group ICA for identifying biomarkers in
schizophrenia:‘Adaptive’networks viaspatially constrained ICA show more sensitivity to group differences
than
spatio-temporal regression”.In:NeuroImage: Clinical22 (2019), p. 101747.
[3]. Elena A Allen et al. “Tracking whole-brain connectivity dynamics in the resting
state”.
In:Cerebralcortex24.3 (2014), pp. 663–676.
[4]. D. Zhi et al., “Abnormal Dynamic Functional Network Connectivity and Graph
Theoretical
Analysis in Major Depressive Disorder,” 2018 40th Annual International Conference of the IEEE Engineering
in
Medicine and Biology Society (EMBC), Honolulu, HI, 2018, pp. 558-561.
[5]. U Sakoglu, AM Michael, and VD Calhoun. “Classification of schizophrenia patients
vs
healthy controlswith dynamic functional network connectivity”. In:Neuroimage47.1 (2009), S39–41.
[6]. Unal Sako ̆glu et al. “A method for evaluating dynamic functional network
connectivity
and task-modulation: application to schizophrenia”. In:Magnetic Resonance Materials in Physics, Biology
andMedicine23.5-6 (2010), pp. 351–366.
[7]. V. M. Vergara, A. Abrol, F. A. Espinoza and V. D. Calhoun, "Selection of
Efficient
Clustering Index to Estimate the Number of Dynamic Brain States from Functional Network
Connectivity*,"
2019
41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC),
Berlin,
Germany, 2019, pp. 632-635.
[8]. D. K. Saha, A. Abrol, E. Damaraju, B. Rashid, S. M. Plis and V. D. Calhoun,
“Classification As a Criterion to Select Model Order For Dynamic Functional Connectivity States in
Rest-fMRI
Data,” 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 2019,
pp. 1602-1605.
[9]. Pedregosa et al. “2.3. Clustering.” Scikit,
scikit-learn.org/stable/modules/clustering.html.
[10]. Rashid, Barnaly, et al. “Classification of Schizophrenia and Bipolar Patients Using Static and Dynamic Resting-State FMRI Brain Connectivity.” NeuroImage, vol. 134, 2016, pp. 645–657., doi:10.1016/j.neuroimage.2016.04.051.
[11]. Bell, Anthony J., and Terrence J. Sejnowski. "An information-maximization approach to blind separation and blind deconvolution." Neural computation 7.6 (1995): 1129-1159.
[12]. Du, Yuhui, and Yong Fan. "Group information guided ICA for fMRI data analysis." Neuroimage 69 (2013): 157-197.
[13]. Salman, Mustafa S., et al. "Group information guided ICA shows more sensitivity to group differences than dual-regression." 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). IEEE, 2017.
[14]. Rashid, Barnaly, et al. "Classification of schizophrenia and bipolar patients using static and dynamic resting-state fMRI brain connectivity." Neuroimage 134 (2016): 645-657.
[15]. Rachakonda, Srinivas, et al. "Group ICA of fMRI toolbox (GIFT) manual." Dostupnez [cit 2011-11-5] (2007).
[16]. Du, Yuhui, et al. "NeuroMark: an adaptive independent component analysis framework for estimating reproducible and comparable fMRI biomarkers among brain disorders." medRxiv (2019): 19008631.
[17]. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
[18]. Simm J, de Abril I and Sugiyama M (2014). Tree-Based Ensemble Multi-Task Learning Method for Classification and Regression, volume 97 number 6
[19]. Y. Freund, and R. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”, 1997.
[20]. Breiman, Leo. "Bagging predictors." Machine learning 24.2 (1996): 123-140.
[21]. Friedman, Jerome H. "Stochastic gradient boosting." Computational statistics & data analysis 38.4 (2002): 367-378.
[22] “Online Passive-Aggressive Algorithms” K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR 7 (2006)
[23]. Wager, Tor D., and Martin A. Lindquist. "Principles of fMRI." New York: Leanpub (2015).
[24]. Lloyd, Stuart P. "Least squares quantization in PCM." Information Theory, IEEE Transactions on 28.2 (1982): 129-137.
[25]. Ester, Martin, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” Kdd. Vol. 96. No. 34. 1996.
We derive our own synthetic dataset in order to test out our clusterers and classifiers on a simulated dataset for sanity checking their implementation.
The data set was generated by a seed set of M = 50 source gaussian signals, which were organized into k = 5 “connectivity states”. Each of the 5 connectivity states is a block-diagonal matrix with a block of size M/k which is located on the i * M/k + M/kth entry with a size of i * M/k. For each state, the set the pairs of relevant source signals to come from the same distribution, so that their correlation without additional additive noise is 1.
For each subject, we generate a source signal of sime M × T − W where T is the length of the time-course, and W is the size of the sliding window. Over each window, we randomly select which state the window belongs to, drawing from a prior distribution of state probabilities for a subject’s class, Pc, k. Additionally, for each class we simulate a transition probability Qc, k for each state, with higher probabilities meaning a higher chance of transitioning out of that state into some other state. Finally, we restrict the simulation such that transitions can only occur after a state has existed for a W-many timepoints. This restrictions allows us to assert with some certainty that each of the seed states will occur within some window for that subject.
For each class, we simulate baseline signal noise from a normal distribution, which the relevant source signal being added to that baseline.
Formally a single subjects signal within a window of size W is given as:
Xi = 𝒩(θc) + 𝒩(θk)
where θc and θk are the parameters for the baseline class signal for class c and the source signal for state k respectively. Because all source signals for state k will have the same distribution, their correlation will be high, providing the same block-matrix correlation for the dFNC analysis.
The probability for entering from state k into a new state
k′ at a timepoint T,
given that
the
last transition occured W time-points ago, is given as the joint
probability of the prior distribution for the class ever entering into state k′, as well as the probability of transitioning out of state k. i.e.
Zc, k, k′(Pc, k′, Qc, k)
The addition of noise, and the fact that windows are created with a size W at each timepoint means that there is a high chance for bleed-over between the actual states detected for an individual subject.
For example the following connectivity matrices were computed from randomly selected subjects from each class, for each state:
Class | State 0 | State 1 | State 2 | State 3 | State 4 |
---|---|---|---|---|---|
0 | |||||
1 |
For our simulation we generate C = 2 classes, K = 5 states, for N = 300 subjects with M = 50 source signals. The parameters Q were set at 0.5 for each class and each state, so that transitioning out of states was equally likely for all classes and states. Baseline noise was set at σc = 1 × 10 − 2 for each class, and the σk for all source signals was set at σk = 1 × 10 −1.
The parameters for P for each class were selected as:
Class Parameter | State 0 | State 1 | State 2 | State 3 | State 4 |
---|---|---|---|---|---|
P(c = 0, k) | 0.4 | 0.1 | 0.1 | 0.3 | 0.1 |
P(c = 1, k) | 0.1 | 0.1 | 0.3 | 0.1 | 0.4 |
For K-Means, GMM, bGMM, and Agglomerative clustering, we measured the elbow criterion on a range of 2-9 components. We measured both the correlation-distance dispersion, as is recommended in Damaraju et al. 2014 [1], as well as the silhouette measure. The results from this analysis are included below. In general, we found either unclear or multiple elbows in the range K = 3, 4, 5, 6 for each data set, with the location of the elbow varying depending on the clustering algorithm used and the data set. Thus, we decided to compromise, and use the choice of K = 5 from Damaraju et al., which is an established result for the FBIRN data set, and which is a common number of clusters chosen elswhere in the literature.
Clustering Algorithm/ Data Set | Gaussian | FBIRN | UCLA |
---|---|---|---|
KMeans | |||
GMM | |||
bGMM | |||
Agglomerative |
We initially trained the UCLA data on unclustered FNC data to establish an accuracy baseline The results are displayed below.
Surprisingly, the UCLA was much easier to classify without any clustering nor beta feature generation. The AUC scores were within 0.70 and 0.80 range for most classifiers which meant that using FNC matrices alone enabled supervised learning models to classify healthy and schizophrenic patients with some efficacy. We noticed that the PCA feature space of the UCLA dataset was much more separable than the FBIRN dataset which be believed led to the relatively high baseline AUC scores. After analyzing the relatively high AUC scores of the UCLA datasets, we performed our usual analysis. We applied our clustering algorithms and trained our supervised learning models on the beat features. The final results are displayed below.
Clustering Algorithm | SVM | Multilayer Perceptron | Logistic Regression | Passive Aggressive Classifier | Perceptron | Extra Trees | Random Forest |
---|---|---|---|---|---|---|---|
kmeans | 0.907 ± 0.057 | 0.907 ± 0.057 | 0.904 ± 0.06 | 0.896 ± 0.08 | 0.799 ± 0.116 | 0.724 ± 0.168 | 0.746 ± 0.133 |
gmm | 0.91 ± 0.059 | 0.909 ± 0.07 | 0.908 ± 0.071 | 0.885 ± 0.087 | 0.886 ± 0.058 | 0.795 ± 0.095 | 0.785 ± 0.108 |
bgmm | 0.909 ± 0.075 | 0.907 ± 0.081 | 0.908 ± 0.08 | 0.877 ± 0.105 | 0.879 ± 0.081 | 0.741 ± 0.157 | 0.705 ± 0.166 |
dbscan | 0.409 ± 0.118 | 0.467 ± 0.131 | 0.69 ± 0.096 | 0.667 ± 0.122 | 0.5 ± 0.0 | 0.643 ± 0.171 | 0.649 ± 0.125 |
hierarchical | 0.886 ± 0.054 | 0.889 ± 0.07 | 0.9 ± 0.069 | 0.883 ± 0.071 | 0.826 ± 0.122 | 0.829 ± 0.099 | 0.792 ± 0.114 |
Accuracy from the baseline cases with AUC scores on average increasing from 0.70. to 0.90 for multiple learners such as Support Vector Machines and Multilayer Perceptron. Interestingly not all learners improved their AUC scores even after performing the clustering and beta feature generation. For instance, the DBSCAN reduced the AUC scores from the baseline 0.70 to 0.50. We believe that the DBSCAN's sensitivity to hyperparameter changes led it to over classify different clusters leading to poor classification results. Furthermore, there was increased variability in the AUC scores especially for Decision Tree, Random Forest, and Bagging. In brief, all the clustering algorithms with the exclusion of DBSCAN showed improvement over KMeans clustering and the baseline case of no clustering. The Gaussian Mixture Model had the highest accuracy beating out Hierarchical Clustering which performed the best on the FBIRN dataset.
We initially trained the supervised learning algorithms using simulated Gaussian datasets since we did not have access to the real patient data until later in the semester. We first trained the supervised learning models without any clustering in order to establish a baseline AUC scores for our clustering algorithms to improve upon. These baseline results are displayed below.
The key result from this experiment was that without any clustering, no supervised learning algorithm could consistently achieve AUC score above 0.60. Some learners such as the Multilayer Perceptron, Passive Aggressive Classifier, and Bernoulli Naive Bayes surprisingly scored below 0.40 suggesting that they consistently misdiagnose patients. We attributed these results as being due to random error.
After establishing the baseline AUC scores, we performed clustering and beta feature generation for the simulated Gaussian datasets. Using these clustered and reduced datasets, we trained across the same classifiers to for each clustering algorithm (K-Means, Gaussian Mixture Model, Bayesian Gaussian Mixture Model, DBSCAN, and Hierarchical). The results are displayed below.
Clustering Algorithm | Multilayer Perceptron | Nearest Neighbors | SVM | Random Forest | Extra Trees | Gradient Boost | Logistic Regression | Passive Aggressive Classifier |
---|---|---|---|---|---|---|---|---|
kmeans | 0.972 ± 0.026 | 0.96 ± 0.021 | 0.972 ± 0.023 | 0.962 ± 0.027 | 0.966 ± 0.024 | 0.954 ± 0.031 | 0.971 ± 0.022 | 0.974 ± 0.02 |
gmm | 0.963 ± 0.028 | 0.954 ± 0.03 | 0.972 ± 0.025 | 0.967 ± 0.022 | 0.976 ± 0.017 | 0.928 ± 0.046 | 0.962 ± 0.028 | 0.97 ± 0.024 |
bgmm | 0.966 ± 0.023 | 0.949 ± 0.028 | 0.974 ± 0.027 | 0.966 ± 0.026 | 0.963 ± 0.028 | 0.962 ± 0.03 | 0.974 ± 0.02 | 0.972 ± 0.024 |
dbscan | 0.971 ± 0.024 | 0.952 ± 0.025 | 0.972 ± 0.022 | 0.961 ± 0.024 | 0.965 ± 0.023 | 0.962 ± 0.022 | 0.967 ± 0.025 | 0.968 ± 0.026 |
hierarchical | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.001 | 1.0 ± 0.0 | 1.0 ± 0.0 |
Clustering Algorithm | Perceptron | Gaussian Process | Ada Boost | Voting | Bernoulli Naive Bayes | Bagging | Decision Tree |
---|---|---|---|---|---|---|---|
kmeans | 0.947 ± 0.062 | 0.955 ± 0.027 | 0.957 ± 0.032 | 0.92 ± 0.037 | 0.948 ± 0.022 | 0.941 ± 0.035 | 0.938 ± 0.036 |
gmm | 0.962 ± 0.028 | 0.952 ± 0.029 | 0.955 ± 0.031 | 0.92 ± 0.045 | 0.938 ± 0.028 | 0.917 ± 0.036 | 0.923 ± 0.033 |
bgmm | 0.969 ± 0.029 | 0.955 ± 0.029 | 0.958 ± 0.029 | 0.923 ± 0.045 | 0.949 ± 0.028 | 0.931 ± 0.043 | 0.933 ± 0.026 |
dbscan | 0.957 ± 0.035 | 0.96 ± 0.032 | 0.967 ± 0.028 | 0.913 ± 0.034 | 0.93 ± 0.031 | 0.93 ± 0.036 | 0.943 ± 0.022 |
hierarchical | 1.0 ± 0.0 | 1.0 ± 0.001 | 0.999 ± 0.002 | 0.993 ± 0.014 | 0.991 ± 0.009 | 0.983 ± 0.016 | 0.969 ± 0.035 |
The clustering combined with the beta feature generation dramatically improved the AUC scores on the simulated Gaussian datasets. All clustering algorithms produced AUC scores above 0.90 with standard deviations below 0.05 across all supervised models. The hierarchical clustering even produced perfect predictions. These results confirmed our suspicions that in order to accurately diagnose patients, we needed to perform clustering and reduce the number of features we trained our models on. These initial results using simulated data helped us tremendously when deploying our models on real patient data.
In order to evaluate the predictive capacities of the features produced by each clustering algorithm, a two-tailed t-test was performed comparing the healthy control and the schizophrenic patients.
The first t-test comparision was performed using the cluster assignments for each patient across the time domain where each time slot represented a feature of the training data. For each time slot (feature), the average cluster assignment for all the healthy control patients was compared to the average cluster assignment for all the schizophrenic patients. The corresponding p-values were then tested at a significance level of 0.10.
The results displayed below highlight the points in time when there was less that a 10% chance that the observed difference in a healthy control’s cluster assignment and a schizophrenic’s cluster assignment was due to normal random variation. In theory, the more points of significance across time the more likely a trained model would accurately diagnose a subject. The results indicated that both K-Means and Gaussian Mixture Models failed to produce statistically different cluster assignments across time. The Bayesian Gaussian Mixture Model produced some significant differences while the Hierarchical clustering was significant at every time point.
These results initially suggested that Hierarchical clustering would outperform all the other clustering algorithms, but the subsequent testing disporved this hypothesis.
Given the lack of improvement in accuracy across all clustering algorithms, it was believed that training supervised models using time points as features would require much more data for successful classification. Using time slots as features meant that there were 130 features in the data. Since there were only 267 patients in the UCLA data set, it was surmised that the dimensionality of the data was too high. To reduce the dimensionality of the datasets, the beta coefficients were calculated reducing the number of training features form 130 to 10.
Each clustering algorithm produced statistically different beta coefficients between the health control and schizophrenic patients. With the reduced dimenstionality of the data, we successfully improved the accuracy of out diagnoses across all supervised learning algorithms for each clustering algorithm. These results are discussed later in the report.