Contents

Appendices

A: Simulation Details
B: Elbow Criterion
C: UCLA Data Results
D: Simulated Data Results
E: Statistical Analysis of Features

Introduction

**Figure:** Overview of our contribution with this project. The standard pipeline for dFNC uses only KMeans clustering as part of the pipeline, which introduces implicit assumptions on the organization of the FNC feature-space. We test four other clustering algorithms, and evaluate how they produce FNC states, and how these states aid in schizophrenia classification.

Schizophrenia is a chronic and serious mental disorder which affects how a person thinks, feels, and behaves. Although there have been many studies about psychological and behavioral manifestations of schizophrenia, neuroscientists have yet to determine a set of corresponding neurological biomarkers for this disorder. A functional magnetic resonance imaging (fMRI) can help determine non-invasive biomarkers for schizophrenia in brain function[1][2] and one such fMRI analysis technique called dynamic functional network connectivity (dFNC)[3][4][10] uses K-Means clustering to characterize time-varying connectivity between functional networks. Researchers have worked on finding correlation between schizophrenia and dFNC[1][2][5][6]. Indeed, dFNC features are well-established as features for schizophrenia classification [14].

Despite the utility of dFNC analysis, little work has been done analyzing the choice of clustering algorithm, which in most applications is simply K-Means[7][9]. It is possible that the reliance of K-Means assumptions (sphericality of clusters, reliability of elbow-criterion, etc) have introduced bias into previous results using dFNC, and restricted both biological discovery, and reliability of classifiers used for diagnosis.

Therefore, in this project, we have studied how modifying the clustering technique in the dFNC pipeline can yield dynamic states from fMRI data, and how that choice impacts the accuracy of classifying schizophrenia[8].

We experimented with DBSCAN, Hiearcharial Clustering, Gaussian Mixture Models, and Bayesian Gaussian Mixture Models clustering methods on subject connectivity matrices produced from fMRI data, and use each algorithm’s cluster assignments as features for SVMs, MLP, Nearest Neighbor, and other supervised classification algorithms to classify schizophrenia.

Section II describes the fMRI data used in our experimentation, while Section III summarizes the aforementioned clustering and classification algorithms used in the pipeline. Section IV compares the accuracy of these classifiers, along with presenting a series of charts that analyze the cluster assignments produced on the fMRI data.

Section II: Data

All datasets used in this project are derivatives of fMRI data (functional magnetic resonance imaging), which is data measuring brain activity to track a patient’s thought processs over a predefined time period.

The BOLD signal in fMRI data tracks the movement of blood throughout a subject's brain during the scan period. These changes in blood flow can be further analyzed to make conclusions about flow between different regions, regions of concentrated activity, and other conclusions, which can be useful for the diagnosis of mental disorders, among other applications. The full 4-Dimensional volumes are often not used for classification in their raw form, and instead derivatives such as Functional Network Connectivity states can be used for diagnosis.

**Figure:** Visualization of a functional MRI time-series volume. Image from Chapter 2 of "Principles of fMRI" by Wager and Lindquist [23]

Data Set Name	Number of Subjects	Number of Healthy Controls	Number of Patients
FBIRN	362	186	176
UCLA	332	272	50
Simulations	300	150	150

Gaussian Simulated Dataset

We generated synthetic FNC data which were used for debugging code, and evaluating our method. The details of the simulation are described in the appendix.

FBIRN Dataset

We use derivatives from the Phase 3 Dataset of FBIRN (Functional Biomedical Infromatics Research Network Data Repository), which specifically focuses on brain activity maps from patients with schizophrenia. This dataset includes 186 healthy controls and 176 indivduals from schizophrenia from around the United States. Subject participants in this dataset are between the ages of 18-62.

UCLA Dataset

We also use derivatives from the UCLA Consortium for Neuropsychiatric Phenomics archives, which includes neuroimages for roughly 272 participants. The subject population consists of roughly 272 healthy controls, as well as participants with a diagnosis of schizophrenia (50 subjects). Subject participants in this dataset range from 21 to 50 years

Section III: Methods

Overview

Preprocessing

As is standard in Dynamic Functional Network Connectivity, Group Independent Component Analysis of subject images is used to compute a set of M statistically independent spatial maps from the data, along with a set of M time-courses, which are used for the dFNC analysis. For the FBIRN dataset, we used pre-computed ICA timecourses provided by Damaraju et al. 2014. These timecourses were computed using Infomax ICA [11] with 100 components, which were then manually selected based on biological relevance for a total of 47 remaining components.

For the UCLA data, because no ICA derivatives were readily available, we used Group Information Guided ICA [12] which is included as part of the GIFT NeuroImaging Analysis Toolbox [15]. As a template for GIG-ICA, we used the NeuroMark template [16], which has been shown to provide reliable estimations of components in many different settings [13]. This provides us with 53 independent components for the UCLA dataset.

Sliding Window Analysis

For all data-sets we chose a sliding window of size W = 22, following the precedent in Damaraju et al. 2014 [1]. Time-Series data X with size M × T, where M is the number of independent components, and T is the number of timepoints are used. Time series were normalized and convolved with a gaussian window with a kernel of size 0.015, again following the precedent in Damaraju et al. 2014 [1]. For each window on each time-series a number of exemplar data-points were selected by taking the local maximum of the variance from the smoothed time-series.

We computed correlation coefficients across the components, within each time series window to form the FNC matrix with entries i and j given as

Through this process, we generate a total of N × (T − W) window instances, which are used as the input for clustering.

Clustering Details

We implemented 5 different clustering algorithms as part of the dFNC pipeline:

KMeans [24.]
DBSCAN [25.]
Gaussian Mixture Models
Bayesian Gaussian Mixture Models
Agglomerative Hierarchical Clustering

For each of these algorithms, we used them for both exemplar clustering, and clustering on the full data. In the case of KMeans, Gaussian Mixture Models, and Bayesian Gaussian Mixture Models, exemplar cluster-centers were used as input for initialization in the second stage of clustering using the same algorithm. In the case of DBSCAN and Agglomerative clustering, the exemplar stage was only used for elbow criterion, and other hyper-parameter tuning. We provide a more exhaustive analysis of our elbow-criteria experiments in Appendix B.

For GMMs and Bayesian GMMs, we elected to use hard-assignment for determining cluster membership, because it lent to simpler integration with other steps in the pipeline. In future work, we would like to try to use the more rich information provided by soft-assignment to somehow improve feature generation, which is further discussed below.

Classification Details

Task Formulation

Our chosen task is the binary classification of Schizophrenic Patients from Healthy Controls using Functional Network Connectivity states. For patients with a schizophrenic diagnosis, we assign the label 1, and for all other subjects, we assign the label 0. Given the set of FNC features for all subjects, our goal is to find the classifier which provides the most accurate diagnosis.

Evaluation Metric

For our evaluation metric, we used the Area-Under the “Receiving Operating Characteric” (ROC) curve, or ROC-AUC. This metric, which is often used for binary classifiers, describes the probability that a given classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. The curve is creative by plotting the fraction of true positive out of the positive vs the fraction of false pasitive out of the negatives over various thresholds T. In general, a higher AUC indicates a better binary classification performance.

Feature Generation

Classification over raw dynamic FNC states is often poor, and further feature generation is required.

We first experimented with the use of subject state-vectors for classification. Because each time point for each subject is assigned a state during clustering, we can generate a vector of T-W state assignments for each subject to be used as features.

Secondly, we follow the precedent in [14], and compute means of clusters for each class, and use these centers to form a regression matrix of size 2k × (T − W). For each subject, we then linearly regress out these cluster centers from each FNC window, and collect the beta coefficients. The mean of the β-coefficients over all timepoints are then used as features for the final classification.

We computed two-tailed T-Tests over each of the feature spaces used for classification, and also over the centroids for each class prior to regression, in order to visualize statistically significant features which may or may not improve the performance of the classifiers.

Chosen Algorithms

For supervised learning, we used a modified version of the Polyssifier classification toolbox, which wraps the majority of the classification algorithms from SKlearn into an end-to-end framework. The modified version of the toolbox, which is being maintained by Brad Baker, . Within polyssifier, we implemented the following binary classifiers:

Multilayer Perceptron
K-Nearest Neighbors Classifier
Support Vector Machines
Decision Tree Classifier
Random Forests [17]
Extra Trees [18]
Ada-Boost Decision Trees [19]
Bagging Decision Trees [20]
Gradient Boosted Decision Trees [21]
Logistic Regression
Passive Aggressive Classifier [22]
Perceptron
Naive Bayes
Voting Classifer

Grid Search

We performed a Grid-Search over all available parameters for each of the classifiers available in Polyssifier. For each set of training data used for cross-validation, we trained the grid-search separately, and took the parameters with the highest AUC.

For the sake of space, we have ommitted the details of the grid-search, but the results have been included in the results section of the repo.

Section IV: Results & Discussion

For brevity, we only present results from modifying the dFNC pipeline with the aforementioned clustering and classification algorithms on the FBIRN dataset in this section. We leave our results on the gaussian simulated and UCLA datasets in Appendix D and C respectively, if the reader is interested in viewing them.

Evaluation Setup

For all results, the Area Under Curve (AUC) metric was used since we were trying to classify two possible outcomes. Patients were with either healthy or had schizophrenia. AUC scores near 0.50 meant that the models randomly diagnosed patients as healthy or schizophrenic. This represented the worst possible outcome. Any model with AUC scores between 0.40 and 0.60 was deemed a failed model. AUC scores near 1.0 meant that diagnoses were accurate with few false positives (high specificity) and few false negatives (high recall). There were no observed models that completely reversed the diagnoses with AUC scores near 0.0.

Each classification algorithm was evaluated with 10-fold cross-validation, with a grid-search for classifier parameters performed separately on each validation set.

Quantitative Analysis: AUC Analysis on FBIRN Dataset

We initially trained the supervised learning model on the unclustered FBIRN FNC matrices in order to establish a baseline accuracy that we could compare our beta feature generation approach against. All the classifiers failed to consistently achieve AUC scores outside of the 0.40-0.60 range suggesting that all the models produced random diagnoses. This confimed our hypothesis that clustering and feature reduction was paramount to succesful classfication of healthy and schizoprhenic patients. The baseline results are displayed below.

**Figure:** AUC scores for 10-fold cross-validation of multiple classifiers using only the concatenated FNC matrices as features. The y axis is the AUC metric for each classifier, and the x-axis separates the classifiers.

After establishing our baseline results we trained the models on the FBIRN dataset using only cluster assignments. No beta features were generated to reduce the dimensionality of the training dataset. The results are displayed below.

**Figure:** FBIRN Schizophrenia Classification AUC scores using subject state-vectors as features. The y-axis provides the AUC metric and the x-axis separates the scores obtained over 10-fold cross-validation for each classifier.

The results indicated that it was possible for some of the clustering algorithms to lift the AUC scores to an average of 0.70 for most supervised models with the exclusion of the Voting learner. The one exception was DBSCAN which failed to move the AUC from 0.50. Regardless, none of these models achieved a high enough score to be used in a clinical environment. It was surmised that the number of features in the datasets had to be reduced from cluster assignments over time to beta features in order for the supervised learning models to accurately diagnose patients.

After performing beta features generation with clustering, the final results for the FBIRN dataset were obtained and are displayed below.

Clustering Algorithm	SVM	Multilayer Perceptron	Logistic Regression	Passive Aggressive Classifier	Perceptron	Random Forest	Extra Trees
kmeans	0.952 ± 0.036	0.92 ± 0.065	0.944 ± 0.039	0.945 ± 0.035	0.902 ± 0.043	0.871 ± 0.038	0.853 ± 0.04
gmm	0.936 ± 0.054	0.946 ± 0.038	0.943 ± 0.038	0.929 ± 0.031	0.882 ± 0.04	0.885 ± 0.022	0.874 ± 0.026
bgmm	0.955 ± 0.037	0.932 ± 0.042	0.945 ± 0.038	0.939 ± 0.038	0.896 ± 0.074	0.86 ± 0.039	0.87 ± 0.056
dbscan	0.883 ± 0.027	0.893 ± 0.031	0.892 ± 0.033	0.884 ± 0.027	0.828 ± 0.064	0.805 ± 0.064	0.806 ± 0.058
hierarchical	0.957 ± 0.032	0.954 ± 0.038	0.953 ± 0.038	0.951 ± 0.032	0.891 ± 0.098	0.881 ± 0.032	0.872 ± 0.048

**Figure:** FBIRN Schizophrenia classification AUC scores using \beta-coefficients as features. The y-axis provides the AUC metric and the x-axis separates the scores obtained over 10-fold cross-validation for each classifier.

The reduced number of training features increased the accuracy of all the learners with Support Vector Machines achieving the highest accuracy across all clustering algorithms. Only DBCAN failed to get above an average AUC score of 0.90 for support vector machines and Hierarchical clusterting obtained this highest score of 0.957.

Qualitative Analysis: FNC State and Clustering Visualization Analysis

In the following section, we display both 2D and 3D clustering visualizations for KMeans, GMMs, BGMMs, DBSCAN, and Hiearcharial Clustering after performing PCA on the FBIRN dataset. We mark the centroids of each clustering algorithm with a bolded X , and take the average of points of each cluster in the DBSCAN visualizations to represent its cluster's centers.

We also present FNC state diagrams that analyze how each clustering method clustered data related to healthy patients and schizophrenic patients differently. We hope these diagrams can help neuroscience researchers find interesting biomarkers for further work.

Lastly, we present clustering visualizations and FNC State diagrams for the UCLA and gaussian simulated dataset in Appendix C and D respectively.

Kmeans Clustering

**Figure:** Visualization of FBIRN cluster-assignments to FNC windows for PCA-reduced windows in 2-D and 3-D. Each color corresponds to one of the k = 5 states.

GMM Clustering

Bayesian GMM

DBSCAN Clustering

Agglomerative Hierarchical Clustering

**Figure:** The FNC states detected for the FBIRN data using hierarchical clustering. We observe that although some similarity in connectivity is shared between these states and the states resulted from KMeans, the centers are still quite different than the standard states detected for this data set. Since the resulting states provide the best classification of the data, further neurological interpretation may confirm that these connectivity patterns are useful biomarkers for schizophrenia diagnosis.

**Figure:** Visualization of clusters with Hierarchical clustering in 2-d and 3-d with FBirn Data. We observe that the partitioning of the clusters is not clear under PCA reduction, like it is for the other clustering algorithms. Because hierarchical clustering provides the best states for producing classification features, this may mean it is better partitioning distinctive states in the higher-dimensional space.

As a summary of the results posted above, we see that 4 out of 5 tested clustering methods (KMeans, Hiearcharial Clustering, GMM, BGMM) are able to find 4-5 distinct clusters within the FBIRN data, as evidence by their respective 2D and 3D visualizations shown in this section. Interestingly, each of these clustering algorithms produced clusters that are fairly segmented, which indicates that applying clustering on fMRI data is valuable for finding features that more easily separable in classification. DBSCAN only appears to find 2 clusters, with one of them being marked as "noise," and accordingly produces features that perform worse in comparison to other clustering algorithms on the Schizophrenia prediction task.

In terms of the Schizophrenia prediction task, Hiearcharial Clustering appears to produce assignments that give the greatest lift to all classifiers, with SVMs benefitting the most from its cluster assignments. However, the a few of the other clustering algorithms also perform strongly with GMMs and BGMMs performing comparatively, KMeans slightly behind, and DBSCAN performing slightly worse in all experiments.

More biological analysis is required to make any concrete recommendations based off these results, but our results appear to signal that the choice of clustering and classification algorithm in the dFNC pipeline does make a significant difference in the quality of the output states and diagnosis prediction accuracy.

Section V: Conclusion

Cluster-Based features greatly improve classification performance for the schizophrenia classification task.
Classification of dFNC states performs comporably when using GMM, Bayesian GMM, Hierarchical Clustering, and KMeans
GMM, bGMM, and Hierarchical clustering all detect states not detected with KMeans, which provide either comparable or better classification of schizophrenia, indicating these states may have biological significance for diagnosis.
Hierarchical Clustering provides the highest classification results for FBIRN, and GMM/bGMM provide the highest results for UCLA data, while also providing slightly different states than are detected with KMeans. These states may be useful for further neurological analysis.
DBSCAN's sensitivity to parameters requires exhaustive hyper-parameter searching, without clear improvements in performance, but because states collected by DBSCAN in FBIRN data provided decent results (~80% accuracy), they may provide some biological insight.
As established in literature [14], SVM classifiers perform most reliably for classifying dFNC states, which is consistent for both real data sets, and all clustering algorithms.

Further Work

Further neurological analysis of detected states is required for more thorough interpretation of results
Elbow criterion for different clustering algorithms revealed some inconsistency in the use of k = 5 for the number of states. Applying the full classification pipeline to other numbers of states may provide further insight into states, and improve classification even further.
More advanced clustering and classification algorithms using Deep Learning may provide significant lifts classification. Particularly the use of Variational AutoEncoders for clustering seems a particularly promising avenue of future investigation.

References

[1]. Eswar Damaraju et al. “Dynamic functional connectivity analysis reveals transient states of dyscon-nectivity in schizophrenia”. In:NeuroImage: Clinical5 (2014), pp. 298–308.

[2]. Mustafa S Salman et al. “Group ICA for identifying biomarkers in schizophrenia:‘Adaptive’networks viaspatially constrained ICA show more sensitivity to group differences than spatio-temporal regression”.In:NeuroImage: Clinical22 (2019), p. 101747.

[3]. Elena A Allen et al. “Tracking whole-brain connectivity dynamics in the resting state”. In:Cerebralcortex24.3 (2014), pp. 663–676.

[4]. D. Zhi et al., “Abnormal Dynamic Functional Network Connectivity and Graph Theoretical Analysis in Major Depressive Disorder,” 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, 2018, pp. 558-561.

[5]. U Sakoglu, AM Michael, and VD Calhoun. “Classification of schizophrenia patients vs healthy controlswith dynamic functional network connectivity”. In:Neuroimage47.1 (2009), S39–41.

[6]. Unal Sako ̆glu et al. “A method for evaluating dynamic functional network connectivity and task-modulation: application to schizophrenia”. In:Magnetic Resonance Materials in Physics, Biology andMedicine23.5-6 (2010), pp. 351–366.

[7]. V. M. Vergara, A. Abrol, F. A. Espinoza and V. D. Calhoun, "Selection of Efficient Clustering Index to Estimate the Number of Dynamic Brain States from Functional Network Connectivity*," 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 2019, pp. 632-635.

[8]. D. K. Saha, A. Abrol, E. Damaraju, B. Rashid, S. M. Plis and V. D. Calhoun, “Classification As a Criterion to Select Model Order For Dynamic Functional Connectivity States in Rest-fMRI Data,” 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 2019, pp. 1602-1605.

[9]. Pedregosa et al. “2.3. Clustering.” Scikit, scikit-learn.org/stable/modules/clustering.html.

[10]. Rashid, Barnaly, et al. “Classification of Schizophrenia and Bipolar Patients Using Static and Dynamic Resting-State FMRI Brain Connectivity.” NeuroImage, vol. 134, 2016, pp. 645–657., doi:10.1016/j.neuroimage.2016.04.051.

[11]. Bell, Anthony J., and Terrence J. Sejnowski. "An information-maximization approach to blind separation and blind deconvolution." Neural computation 7.6 (1995): 1129-1159.

[12]. Du, Yuhui, and Yong Fan. "Group information guided ICA for fMRI data analysis." Neuroimage 69 (2013): 157-197.

[13]. Salman, Mustafa S., et al. "Group information guided ICA shows more sensitivity to group differences than dual-regression." 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). IEEE, 2017.

[14]. Rashid, Barnaly, et al. "Classification of schizophrenia and bipolar patients using static and dynamic resting-state fMRI brain connectivity." Neuroimage 134 (2016): 645-657.

[15]. Rachakonda, Srinivas, et al. "Group ICA of fMRI toolbox (GIFT) manual." Dostupnez [cit 2011-11-5] (2007).

[16]. Du, Yuhui, et al. "NeuroMark: an adaptive independent component analysis framework for estimating reproducible and comparable fMRI biomarkers among brain disorders." medRxiv (2019): 19008631.

[17]. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

[18]. Simm J, de Abril I and Sugiyama M (2014). Tree-Based Ensemble Multi-Task Learning Method for Classification and Regression, volume 97 number 6

[19]. Y. Freund, and R. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”, 1997.

[20]. Breiman, Leo. "Bagging predictors." Machine learning 24.2 (1996): 123-140.

[21]. Friedman, Jerome H. "Stochastic gradient boosting." Computational statistics & data analysis 38.4 (2002): 367-378.

[22] “Online Passive-Aggressive Algorithms” K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR 7 (2006)

[23]. Wager, Tor D., and Martin A. Lindquist. "Principles of fMRI." New York: Leanpub (2015).

[24]. Lloyd, Stuart P. "Least squares quantization in PCM." Information Theory, IEEE Transactions on 28.2 (1982): 129-137.

[25]. Ester, Martin, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” Kdd. Vol. 96. No. 34. 1996.

Appendices

Class Parameter	State 0	State 1	State 2	State 3	State 4
P_{(c = 0, k)}	0.4	0.1	0.1	0.3	0.1
P_{(c = 1, k)}	0.1	0.1	0.3	0.1	0.4

Appendix A: Gaussian Simulation Details

We derive our own synthetic dataset in order to test out our clusterers and classifiers on a simulated dataset for sanity checking their implementation.

The data set was generated by a seed set of M = 50 source gaussian signals, which were organized into k = 5 “connectivity states”. Each of the 5 connectivity states is a block-diagonal matrix with a block of size M/k which is located on the i * M/k + M/kth entry with a size of i * M/k. For each state, the set the pairs of relevant source signals to come from the same distribution, so that their correlation without additional additive noise is 1.

For each subject, we generate a source signal of sime M × T − W where T is the length of the time-course, and W is the size of the sliding window. Over each window, we randomly select which state the window belongs to, drawing from a prior distribution of state probabilities for a subject’s class, P_c, k. Additionally, for each class we simulate a transition probability Q_c, k for each state, with higher probabilities meaning a higher chance of transitioning out of that state into some other state. Finally, we restrict the simulation such that transitions can only occur after a state has existed for a W-many timepoints. This restrictions allows us to assert with some certainty that each of the seed states will occur within some window for that subject.

For each class, we simulate baseline signal noise from a normal distribution, which the relevant source signal being added to that baseline.

Formally a single subjects signal within a window of size W is given as:

X_i = 𝒩(θ_c) + 𝒩(θ_k)

where θ_c and θ_k are the parameters for the baseline class signal for class c and the source signal for state k respectively. Because all source signals for state k will have the same distribution, their correlation will be high, providing the same block-matrix correlation for the dFNC analysis.

The probability for entering from state k into a new state k′ at a timepoint T, given that the last transition occured W time-points ago, is given as the joint probability of the prior distribution for the class ever entering into state k′, as well as the probability of transitioning out of state k. i.e.

Z_{c, k, k′}(P_c, k′, Q_c, k)

The addition of noise, and the fact that windows are created with a size W at each timepoint means that there is a high chance for bleed-over between the actual states detected for an individual subject.

For example the following connectivity matrices were computed from randomly selected subjects from each class, for each state:

Class State 0 State 1 State 2 State 3 State 4

0

1

Fig: Examples of the 5 states for our simulated data set. Each of the states is a block matrix, with some mixture from adjacent states depending on the probability for that subject. E.g. in class one state 0 is highly probably, so it tends to appear mixed into other states, while in class 2, state 4 is highly probable, so it tends to mix in with other states.

For our simulation we generate C = 2 classes, K = 5 states, for N = 300 subjects with M = 50 source signals. The parameters Q were set at 0.5 for each class and each state, so that transitioning out of states was equally likely for all classes and states. Baseline noise was set at σ_c = 1 × 10^− 2 for each class, and the σ_k for all source signals was set at σ_k = 1 × 10⁻¹.

The parameters for P for each class were selected as:

Class Parameter State 0 State 1 State 2 State 3 State 4

P_{(c = 0, k)} 0.4 0.1 0.1 0.3 0.1

P_{(c = 1, k)} 0.1 0.1 0.3 0.1 0.4

Appendix B: Elbow Criterion

For K-Means, GMM, bGMM, and Agglomerative clustering, we measured the elbow criterion on a range of 2-9 components. We measured both the correlation-distance dispersion, as is recommended in Damaraju et al. 2014 [1], as well as the silhouette measure. The results from this analysis are included below. In general, we found either unclear or multiple elbows in the range K = 3, 4, 5, 6 for each data set, with the location of the elbow varying depending on the clustering algorithm used and the data set. Thus, we decided to compromise, and use the choice of K = 5 from Damaraju et al., which is an established result for the FBIRN data set, and which is a common number of clusters chosen elswhere in the literature.

Clustering Algorithm/ Data Set Gaussian FBIRN UCLA

KMeans

GMM

bGMM

Agglomerative

Fig 3: Table of elbow-criterion evaluations for all clustering algorithms using the correlation distance distortion, and the silhouette coefficients. In general, we see either no clear elbow, or multiple elbows, depending on the choice of clustering algorithm or data set.

Appendix C: UCLA Dataset Results

We initially trained the UCLA data on unclustered FNC data to establish an accuracy baseline The results are displayed below.

Surprisingly, the UCLA was much easier to classify without any clustering nor beta feature generation. The AUC scores were within 0.70 and 0.80 range for most classifiers which meant that using FNC matrices alone enabled supervised learning models to classify healthy and schizophrenic patients with some efficacy. We noticed that the PCA feature space of the UCLA dataset was much more separable than the FBIRN dataset which be believed led to the relatively high baseline AUC scores. After analyzing the relatively high AUC scores of the UCLA datasets, we performed our usual analysis. We applied our clustering algorithms and trained our supervised learning models on the beat features. The final results are displayed below.

Clustering Algorithm SVM Multilayer Perceptron Logistic Regression Passive Aggressive Classifier Perceptron Extra Trees Random Forest

kmeans 0.907 ± 0.057 0.907 ± 0.057 0.904 ± 0.06 0.896 ± 0.08 0.799 ± 0.116 0.724 ± 0.168 0.746 ± 0.133

gmm 0.91 ± 0.059 0.909 ± 0.07 0.908 ± 0.071 0.885 ± 0.087 0.886 ± 0.058 0.795 ± 0.095 0.785 ± 0.108

bgmm 0.909 ± 0.075 0.907 ± 0.081 0.908 ± 0.08 0.877 ± 0.105 0.879 ± 0.081 0.741 ± 0.157 0.705 ± 0.166

dbscan 0.409 ± 0.118 0.467 ± 0.131 0.69 ± 0.096 0.667 ± 0.122 0.5 ± 0.0 0.643 ± 0.171 0.649 ± 0.125

hierarchical 0.886 ± 0.054 0.889 ± 0.07 0.9 ± 0.069 0.883 ± 0.071 0.826 ± 0.122 0.829 ± 0.099 0.792 ± 0.114

Accuracy from the baseline cases with AUC scores on average increasing from 0.70. to 0.90 for multiple learners such as Support Vector Machines and Multilayer Perceptron. Interestingly not all learners improved their AUC scores even after performing the clustering and beta feature generation. For instance, the DBSCAN reduced the AUC scores from the baseline 0.70 to 0.50. We believe that the DBSCAN's sensitivity to hyperparameter changes led it to over classify different clusters leading to poor classification results. Furthermore, there was increased variability in the AUC scores especially for Decision Tree, Random Forest, and Bagging. In brief, all the clustering algorithms with the exclusion of DBSCAN showed improvement over KMeans clustering and the baseline case of no clustering. The Gaussian Mixture Model had the highest accuracy beating out Hierarchical Clustering which performed the best on the FBIRN dataset.

UCLA FNC State Analysis

KMeans Clustering

Visualization of states with KMeans clustering with UCLA Data

Visualization of clusters with KMeans clustering in 2-d and 3-d with UCLA Data

GMM Clustering

Visualization of states with GMM clustering with UCLA Data

Visualization of clusters with GMM clustering in 2-d and 3-d with UCLA Data

Bayesian GMM Clustering

Visualization of states with BGMM clustering with UCLA Data

Visualization of clusters with BGMM clustering in 2-d and 3-d with UCLA Data

DBSCAN Clustering

Visualization of clusters with DBSCAN clustering in 2-d and 3-d with UCLA Data

Agglomerative Hierarchical Clustering

Visualization of states with Hierarchical clustering with UCLA Data

Visualization of clusters with Hierarchical clustering in 2-d and 3-d with UCLA Data

Appendix D: Gaussian Simulated Dataset Results

We initially trained the supervised learning algorithms using simulated Gaussian datasets since we did not have access to the real patient data until later in the semester. We first trained the supervised learning models without any clustering in order to establish a baseline AUC scores for our clustering algorithms to improve upon. These baseline results are displayed below.

The key result from this experiment was that without any clustering, no supervised learning algorithm could consistently achieve AUC score above 0.60. Some learners such as the Multilayer Perceptron, Passive Aggressive Classifier, and Bernoulli Naive Bayes surprisingly scored below 0.40 suggesting that they consistently misdiagnose patients. We attributed these results as being due to random error.

After establishing the baseline AUC scores, we performed clustering and beta feature generation for the simulated Gaussian datasets. Using these clustered and reduced datasets, we trained across the same classifiers to for each clustering algorithm (K-Means, Gaussian Mixture Model, Bayesian Gaussian Mixture Model, DBSCAN, and Hierarchical). The results are displayed below.

Clustering Algorithm Multilayer Perceptron Nearest Neighbors SVM Random Forest Extra Trees Gradient Boost Logistic Regression Passive Aggressive Classifier

kmeans 0.972 ± 0.026 0.96 ± 0.021 0.972 ± 0.023 0.962 ± 0.027 0.966 ± 0.024 0.954 ± 0.031 0.971 ± 0.022 0.974 ± 0.02

gmm 0.963 ± 0.028 0.954 ± 0.03 0.972 ± 0.025 0.967 ± 0.022 0.976 ± 0.017 0.928 ± 0.046 0.962 ± 0.028 0.97 ± 0.024

bgmm 0.966 ± 0.023 0.949 ± 0.028 0.974 ± 0.027 0.966 ± 0.026 0.963 ± 0.028 0.962 ± 0.03 0.974 ± 0.02 0.972 ± 0.024

dbscan 0.971 ± 0.024 0.952 ± 0.025 0.972 ± 0.022 0.961 ± 0.024 0.965 ± 0.023 0.962 ± 0.022 0.967 ± 0.025 0.968 ± 0.026

hierarchical 1.0 ± 0.0 1.0 ± 0.0 1.0 ± 0.0 1.0 ± 0.0 1.0 ± 0.0 1.0 ± 0.001 1.0 ± 0.0 1.0 ± 0.0

Clustering Algorithm Perceptron Gaussian Process Ada Boost Voting Bernoulli Naive Bayes Bagging Decision Tree

kmeans 0.947 ± 0.062 0.955 ± 0.027 0.957 ± 0.032 0.92 ± 0.037 0.948 ± 0.022 0.941 ± 0.035 0.938 ± 0.036

gmm 0.962 ± 0.028 0.952 ± 0.029 0.955 ± 0.031 0.92 ± 0.045 0.938 ± 0.028 0.917 ± 0.036 0.923 ± 0.033

bgmm 0.969 ± 0.029 0.955 ± 0.029 0.958 ± 0.029 0.923 ± 0.045 0.949 ± 0.028 0.931 ± 0.043 0.933 ± 0.026

dbscan 0.957 ± 0.035 0.96 ± 0.032 0.967 ± 0.028 0.913 ± 0.034 0.93 ± 0.031 0.93 ± 0.036 0.943 ± 0.022

hierarchical 1.0 ± 0.0 1.0 ± 0.001 0.999 ± 0.002 0.993 ± 0.014 0.991 ± 0.009 0.983 ± 0.016 0.969 ± 0.035

The clustering combined with the beta feature generation dramatically improved the AUC scores on the simulated Gaussian datasets. All clustering algorithms produced AUC scores above 0.90 with standard deviations below 0.05 across all supervised models. The hierarchical clustering even produced perfect predictions. These results confirmed our suspicions that in order to accurately diagnose patients, we needed to perform clustering and reduce the number of features we trained our models on. These initial results using simulated data helped us tremendously when deploying our models on real patient data.

Simulated State Analysis

Kmeans Clustering

Visualization of states with KMeans clustering with Gaussian Simulated Data

Visualization of clusters with KMeans clustering in 2-d and 3-d with Gaussian Simulated Data

GMM Clustering

Visualization of states with GMM clustering with Gaussian Simulated Data

Visualization of clusters with GMM clustering in 2-d and 3-d with Gaussian Simulated Data

Bayesian GMM Clustering

Visualization of states with BGMM clustering with Gaussian Simulated Data

Visualization of clusters with BGMM clustering in 2-d and 3-d with Gaussian Simulated Data

DBSCAN Clustering

Visualization of states with DBSCAN clustering with Gaussian Simulated Data

Visualization of clusters with DBSCAN clustering in 2-d and 3-d with Gaussian Simulated Data

Agglomerative Hierarchical Clustering

Visualization of states with Hierarchical clustering with Gaussian Simulated Data

Visualization of clusters with Hierarchical clustering in 2-d and 3-d with Gaussian Simulated Data

Appendix E: Statistical Analysis of Features

In order to evaluate the predictive capacities of the features produced by each clustering algorithm, a two-tailed t-test was performed comparing the healthy control and the schizophrenic patients.

The first t-test comparision was performed using the cluster assignments for each patient across the time domain where each time slot represented a feature of the training data. For each time slot (feature), the average cluster assignment for all the healthy control patients was compared to the average cluster assignment for all the schizophrenic patients. The corresponding p-values were then tested at a significance level of 0.10.

The results displayed below highlight the points in time when there was less that a 10% chance that the observed difference in a healthy control’s cluster assignment and a schizophrenic’s cluster assignment was due to normal random variation. In theory, the more points of significance across time the more likely a trained model would accurately diagnose a subject. The results indicated that both K-Means and Gaussian Mixture Models failed to produce statistically different cluster assignments across time. The Bayesian Gaussian Mixture Model produced some significant differences while the Hierarchical clustering was significant at every time point.

These results initially suggested that Hierarchical clustering would outperform all the other clustering algorithms, but the subsequent testing disporved this hypothesis.

Figure: Significance testing results for the four clustering algorithms on the UCLA data when using subject state-vectors as features. The x-axis is time, and the y-axis is whether or not a given timepoint was significant.

Given the lack of improvement in accuracy across all clustering algorithms, it was believed that training supervised models using time points as features would require much more data for successful classification. Using time slots as features meant that there were 130 features in the data. Since there were only 267 patients in the UCLA data set, it was surmised that the dimensionality of the data was too high. To reduce the dimensionality of the datasets, the beta coefficients were calculated reducing the number of training features form 130 to 10.

Figure: Significance testing on the UCLA dataset using \beta-coefficients for four of the clustering algorithms. We notice that all clustering algorithms find significant features, though which features are significant varies with the choice of algorithm. The x axis is the index of the feature in the UCLA data set, it was surmised that the dimensionality of the data was too high. To reduce the dimensionality of the datasets, the beta coefficients were calculated reducing the number of training features form 130 to 10.

Figure: Significance testing on the UCLA dataset using \beta-coefficients for four of the clustering algorithms. We notice that all clustering algorithms find significant features, though which features are significant varies with the choice of algorithm. The x axis is the index of the feature in the 2 * k \beta-space. The y axis is whether or not a given \beta-feature was significant.

Each clustering algorithm produced statistically different beta coefficients between the health control and schizophrenic patients. With the reduced dimenstionality of the data, we successfully improved the accuracy of out diagnoses across all supervised learning algorithms for each clustering algorithm. These results are discussed later in the report.

Clustering Algorithm	SVM	Multilayer Perceptron	Logistic Regression	Passive Aggressive Classifier	Perceptron	Extra Trees	Random Forest
kmeans	0.907 ± 0.057	0.907 ± 0.057	0.904 ± 0.06	0.896 ± 0.08	0.799 ± 0.116	0.724 ± 0.168	0.746 ± 0.133
gmm	0.91 ± 0.059	0.909 ± 0.07	0.908 ± 0.071	0.885 ± 0.087	0.886 ± 0.058	0.795 ± 0.095	0.785 ± 0.108
bgmm	0.909 ± 0.075	0.907 ± 0.081	0.908 ± 0.08	0.877 ± 0.105	0.879 ± 0.081	0.741 ± 0.157	0.705 ± 0.166
dbscan	0.409 ± 0.118	0.467 ± 0.131	0.69 ± 0.096	0.667 ± 0.122	0.5 ± 0.0	0.643 ± 0.171	0.649 ± 0.125
hierarchical	0.886 ± 0.054	0.889 ± 0.07	0.9 ± 0.069	0.883 ± 0.071	0.826 ± 0.122	0.829 ± 0.099	0.792 ± 0.114

Clustering Algorithm	Multilayer Perceptron	Nearest Neighbors	SVM	Random Forest	Extra Trees	Gradient Boost	Logistic Regression	Passive Aggressive Classifier
kmeans	0.972 ± 0.026	0.96 ± 0.021	0.972 ± 0.023	0.962 ± 0.027	0.966 ± 0.024	0.954 ± 0.031	0.971 ± 0.022	0.974 ± 0.02
gmm	0.963 ± 0.028	0.954 ± 0.03	0.972 ± 0.025	0.967 ± 0.022	0.976 ± 0.017	0.928 ± 0.046	0.962 ± 0.028	0.97 ± 0.024
bgmm	0.966 ± 0.023	0.949 ± 0.028	0.974 ± 0.027	0.966 ± 0.026	0.963 ± 0.028	0.962 ± 0.03	0.974 ± 0.02	0.972 ± 0.024
dbscan	0.971 ± 0.024	0.952 ± 0.025	0.972 ± 0.022	0.961 ± 0.024	0.965 ± 0.023	0.962 ± 0.022	0.967 ± 0.025	0.968 ± 0.026
hierarchical	1.0 ± 0.0	1.0 ± 0.0	1.0 ± 0.0	1.0 ± 0.0	1.0 ± 0.0	1.0 ± 0.001	1.0 ± 0.0	1.0 ± 0.0

Clustering Algorithm	Perceptron	Gaussian Process	Ada Boost	Voting	Bernoulli Naive Bayes	Bagging	Decision Tree
kmeans	0.947 ± 0.062	0.955 ± 0.027	0.957 ± 0.032	0.92 ± 0.037	0.948 ± 0.022	0.941 ± 0.035	0.938 ± 0.036
gmm	0.962 ± 0.028	0.952 ± 0.029	0.955 ± 0.031	0.92 ± 0.045	0.938 ± 0.028	0.917 ± 0.036	0.923 ± 0.033
bgmm	0.969 ± 0.029	0.955 ± 0.029	0.958 ± 0.029	0.923 ± 0.045	0.949 ± 0.028	0.931 ± 0.043	0.933 ± 0.026
dbscan	0.957 ± 0.035	0.96 ± 0.032	0.967 ± 0.028	0.913 ± 0.034	0.93 ± 0.031	0.93 ± 0.036	0.943 ± 0.022
hierarchical	1.0 ± 0.0	1.0 ± 0.001	0.999 ± 0.002	0.993 ± 0.014	0.991 ± 0.009	0.983 ± 0.016	0.969 ± 0.035