# General Linear Model

Second-level analyses allow researchers to make inferences about properties of groups or populations, by generalizing from the observations of only a subset of subjects in a study. CONN uses the General Linear Model (GLM) for all second-level analyses of functional connectivity data. This section describes the mathematics behind the General Linear Model, including model definition, parameter estimation, and hypothesis testing framework. It also includes several practical examples and general guidelines aimed at helping researchers use this method to answer their specific research questions.

## Definition and estimation

The General Linear Model defines a multivariate linear association between a set of explanatory/independent measures X, and a set of outcome/dependent measures Y. In the context of functional connectivity MRI analyses, an outcome variable y[n] will typically take the form of a row vector encoding functional connectivity values recorded from the n-th subject in a study across one or multiple experimental conditions, and the explanatory variable x[n] will be a row vector encoding one or several group, behavioral, or demographic variables for that same subject (see numerical examples section below).

The (generally unknown) effective association in the population between each of the explanatory measures in X and each of the outcome measures in Y is characterized by the matrix B in the GLM equation. The term epsilon represents the cumulative contribution to the outcome measures Y of all other unspecified factors beyond those that can be predicted from knowledge of X (i.e. the model error term). Based on the Central Limit Theorem, it is often reasonable to model this cumulative contribution as a normally distributed random term. The GLM assumes that this term is independent across subjects and follows a multivariate normal distribution with mean zero and an arbitrary variance-covariance structure across outcome measures.
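Written out with the symbols used in this section, the model described above takes the standard multivariate regression form shown below. The dimension labels are introduced here only for illustration: N is the number of subjects, X is N-by-p, Y is N-by-q, B is p-by-q, epsilon[n] is the n-th row of the N-by-q error matrix, and W denotes the error covariance across outcome measures.

```latex
Y = X B + \epsilon, \qquad \epsilon_{n} \sim \mathcal{N}(0,\, W) \ \text{ independently for } n = 1, \dots, N
```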

We are typically interested in quantifying or estimating the values of the matrix B, characterizing the net effect of each individual explanatory measure in X on each individual outcome measure in Y. Because this matrix is constant across subjects, acquiring enough subjects' data enables us to compute a reasonable unbiased estimate of B using an Ordinary Least Squares (OLS) solution.
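Assuming that the design matrix X has full column rank (an assumption made here for simplicity; a pseudo-inverse is commonly used otherwise), the textbook OLS solution reads:

```latex
\hat{B} = (X' X)^{-1} X' Y
```

Here the prime denotes matrix transposition, as in the rest of this section.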

For example, from N subjects' data, we would typically construct the design and data matrices, respectively, as X = [x1' x2' x3' ... xN']' and Y = [y1' y2' y3' ... yN']', resulting from vertically concatenating the corresponding x[n] and y[n] row vectors across all relevant subjects, and then use the OLS equation above to compute the best linear unbiased estimator of the unknown matrix B from the observed X and Y data.
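The concatenation and estimation steps just described can be sketched in a few lines of Matlab. This is a minimal illustration, not CONN's implementation: the four subjects, their values, and the choice of columns (a constant term plus age, and pre/post connectivity) are all made up for the example.

```matlab
% Hypothetical row vectors for N = 4 subjects (values made up for illustration)
x1 = [1 23];  y1 = [0.21 0.35];     % x[n] = [AllSubjects, age], y[n] = [pre, post]
x2 = [1 31];  y2 = [0.18 0.30];
x3 = [1 27];  y3 = [0.25 0.41];
x4 = [1 45];  y4 = [0.10 0.22];

% Vertical concatenation, written as in the text
X = [x1' x2' x3' x4']';             % N-by-2 design matrix
Y = [y1' y2' y3' y4']';             % N-by-2 data matrix

% OLS estimate of B; X \ Y solves the least-squares problem directly
B = X \ Y;                          % same as (X'*X) \ (X'*Y) when X has full column rank
disp(B)                             % rows: explanatory measures, columns: outcome measures
```

Each column of B holds the intercept and age slope for one outcome measure.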

## Hypothesis testing

In addition to estimating an approximation of the matrix B, we would often also like to evaluate specific hypotheses about this unknown matrix B given our available data. In particular, GLM allows us to use a standard Likelihood Ratio Test to specify and evaluate any hypothesis of the form "CBM'=D" for any user-defined arbitrary contrast matrices C, M and D.

Choosing different forms of the matrix C allows us to construct hypotheses that address specific combinations of explanatory measures X, as each column in the vector/matrix C is paired with the same column of the design matrix X. Similarly, choosing different forms of the matrix M allows us to construct hypotheses that address specific combinations of outcome measures Y, as each column in the contrast vector/matrix M is paired with the same column of the data matrix Y. Last, the choice of contrast matrix D determines the hypothesized net effect of the selected combination of explanatory measures on the selected combination of outcome measures (e.g. D is set to zero in many standard null-hypothesis scenarios).
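As a small worked illustration (the layout of B is an assumption made for this example), consider a design with two groups as rows of B and two conditions as columns, so that the entry in row i and column j is the effect of group i in condition j. Choosing C = [-1 1] and M = [-1 1] then gives:

```latex
B = \begin{pmatrix} \beta_{11} & \beta_{12} \\ \beta_{21} & \beta_{22} \end{pmatrix},
\qquad
C B M' = (\beta_{22} - \beta_{21}) - (\beta_{12} - \beta_{11})
```

That is, C selects a difference across rows (explanatory measures), M selects a difference across columns (outcome measures), and the hypothesis CBM' = 0 states that the between-condition change is the same in both groups.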

For any user-defined contrast matrices C, M and D, the associated CBM'=D hypothesis is evaluated using a Wilks' Lambda statistic, defined in the context of a Likelihood Ratio Test by comparing a model that is constrained by this hypothesis (i.e. a model where CBM' equals precisely D) to an unconstrained model (i.e. a model where CBM' may take any value). In particular, Wilks' Lambda values range between 0 and 1, and are computed as the ratio of the residual errors of the unconstrained model over those of the constrained model. Low values (close to 0) indicate that the tested hypothesis may be false (i.e. it is appropriate to conclude from our observations that CBM' is likely not equal or close to D), while high values (close to 1) typically indicate that there is not enough evidence in our data to reject the tested hypothesis (CBM' might be precisely D or, perhaps more likely, simply close enough so that we still need more data if we hope to find a significant departure from D).
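In the standard textbook formulation (see Mardia et al. 1979), and assuming X has full column rank and C and M have full row rank, this ratio can be written in terms of two sums-of-squares-and-cross-products matrices. The names E and H are labels introduced here; CONN's internal notation may differ.

```latex
\begin{aligned}
E &= M\,(Y - X\hat{B})'\,(Y - X\hat{B})\,M' \\
H &= (C\hat{B}M' - D)'\,\bigl[\,C\,(X'X)^{-1}\,C'\,\bigr]^{-1}\,(C\hat{B}M' - D) \\
\Lambda &= \frac{|E|}{|E + H|}
\end{aligned}
```

E measures the residual error of the unconstrained model, while E + H is the residual error of the model constrained by CBM' = D, so Lambda approaches 1 when the constraint costs little in terms of fit.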

In order to define more precisely whether a particular value of lambda (e.g. 0.1) is low enough to warrant rejecting the tested hypothesis CBM'=D, the observed lambda value is typically compared to the distribution of lambda values that we could expect if the tested hypothesis were actually true (a Wilks' Lambda distribution). We then reject the hypothesis if the probability of observing a value this low or lower under that distribution (the p-value) is below a pre-specified false-positive level (e.g. using a p < 0.05 threshold means that we will reject our hypothesis if the observed lambda value falls below the 5th percentile of the Wilks' Lambda distribution).

Wilks' Lambda distributions have three parameters: the number of dimensions a, the error degrees of freedom b, and the hypothesis degrees of freedom c, which are fully determined by the dimensionality and rank of the original data and the choice of contrast matrices. Unfortunately, Wilks' Lambda distributions are only tabulated for a limited number of scenarios/dimensions, so CONN's GLM implementation uses the following transformations in order to derive appropriate statistics and p-values for any tested hypothesis, depending on the specific values of a, b, and c:

Case 1. Statistics based on Student's t-distribution, when a=1 and c=1 (e.g. both M and C are vectors, and D is a scalar)

Examples: two-sample t-test, linear regression

Case 2. Statistics based on F-distribution, when a>1 and c=1 (e.g. M is a matrix, and both C and D are vectors)

Examples: Hotelling's two-sample T-squared test, repeated measures ANOVA, multivariate regression

Case 3. Statistics based on F-distribution, when a=1 and c>1 (e.g. C is a matrix, and both M and D are vectors)

Examples: ANOVA, ANCOVA, multiple regression omnibus test

Case 4. Statistics based on Rao's approximating F-distribution, when a>1 and c>1 (e.g. all M, C, and D are matrices)

Examples: MANOVA, MANCOVA, multivariate regression omnibus test
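For reference, the textbook forms of these four transformations are sketched below. They assume a = rank(M), c = rank(C), and b = N - rank(X), which agrees with the degrees of freedom quoted in the numerical examples later in this section. The helper symbols d, d1 and d2 are introduced here for illustration, and the details of CONN's own implementation may differ.

```latex
\begin{aligned}
\text{Case 1 } (a = 1,\ c = 1):\quad & T = \pm\sqrt{\,b\,\frac{1-\Lambda}{\Lambda}\,} \;\sim\; t_{b} \\
\text{Case 2 } (a > 1,\ c = 1):\quad & F = \frac{b-a+1}{a}\,\frac{1-\Lambda}{\Lambda} \;\sim\; F_{a,\;b-a+1} \\
\text{Case 3 } (a = 1,\ c > 1):\quad & F = \frac{b}{c}\,\frac{1-\Lambda}{\Lambda} \;\sim\; F_{c,\;b} \\
\text{Case 4 } (a > 1,\ c > 1):\quad & F = \frac{d_2}{d_1}\,\frac{1-\Lambda^{1/d}}{\Lambda^{1/d}} \;\sim\; F_{d_1,\;d_2} \ \text{ (approximately)}
\end{aligned}
\qquad
\begin{aligned}
d_1 &= a\,c \\
d &= \sqrt{\frac{a^2 c^2 - 4}{a^2 + c^2 - 5}} \\
d_2 &= d\left(b - \frac{a - c + 1}{2}\right) - \frac{a\,c - 2}{2}
\end{aligned}
```

In Case 1 the sign of T is the sign of the estimated effect CBM' - D. Cases 1 to 3 are exact, and Case 4 reduces to Case 2 when c = 1 and to Case 3 when a = 1.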

note-1: it could be argued that two-sided hypotheses of the form "CBM'=0" are almost surely false in real world data, where an effect may be arbitrarily small but almost never precisely zero. Because of this, failure to reject a hypothesis of this form often simply indicates that the effect being evaluated (e.g. difference in connectivity between two groups) is too small to be detectable with the current experimental setup (e.g. with the current acquisition parameters and number of subjects) rather than truly non-existent. In this context, it is generally recommended to always quantify and report the measured effects (e.g. report the estimated B values) instead of relying only on the significance of the hypotheses being evaluated. This can help build increasingly better model-based estimates of these effects, going beyond the initial but limited question of whether they "exist" (i.e. whether they are non-zero) or not. Last, testing one-tailed hypotheses of the form "CBM'>D" (these are available in CONN for univariate effects using LRT case 1) can also be used as a tool to combine "practical" significance (is an effect large?) with "statistical" significance (how confident are we of this, given the available data?).

note-2: in some cases, when the number of conditions is large and the sample size is too small (a>b), it is not possible to use statistics based on Wilks' Lambda, since the unknown error covariance W cannot be properly estimated. This is common, for example, in the context of omnibus tests, where we wish to evaluate whether some effect is present over a potentially large number of individual cases. In those scenarios, one alternative available in CONN is to use statistics based on a conservative Satterthwaite F-distribution approximation.

Another common alternative in these scenarios is to apply first a linear dimensionality reduction step by projecting the original data Y over the subspace spanned by the first few singular vectors from a model-agnostic Singular Value Decomposition (Strang 2007). As long as the dimensionality of this subspace is chosen to be smaller than the error degrees of freedom b, the full GLM error covariance over this subspace can be properly estimated, allowing the evaluation of Likelihood Ratio Tests using the standard Wilks' Lambda statistics (LRT case 2 or case 4).
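A minimal sketch of this dimensionality reduction step is shown below. The data are random numbers generated only for illustration, the group sizes and the number of retained components k are arbitrary choices, and the sketch makes no claim about how CONN performs this step internally.

```matlab
% Hypothetical sizes: N = 10 subjects, K = 50 outcome measures, 2 groups of 5
rng(0);                                   % reproducible illustration data
N = 10; K = 50;
Y = randn(N, K);                          % original data, too many columns for b = 8
X = [ones(5,1) zeros(5,1); zeros(5,1) ones(5,1)];

b = N - rank(X);                          % error degrees of freedom
k = 4;                                    % retained components, chosen so that k < b

[U, S, V] = svd(Y, 'econ');               % model-agnostic: uses Y only, never X
Yr = Y * V(:, 1:k);                       % N-by-k projection onto the first k singular vectors

fprintf('reduced data: %d subjects x %d components (b = %d)\n', size(Yr, 1), size(Yr, 2), b);
```

The reduced matrix Yr then takes the place of Y in the GLM, with a between-conditions contrast M defined over its k columns (e.g. the k-by-k identity matrix for an omnibus test).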

## GLM model and contrast specification

Using the same General Linear Model framework it is possible to specify a very large array of classical analyses, including bivariate, multiple, and multivariate regression models, one-sample, two-sample, and paired t-tests, mixed within- and between- subject n-way ANOVAs, MANOVAs, etc. To define any particular analysis it is only necessary to specify four items, associated with X, Y, C, and M matrices, respectively, in the GLM framework:

1. Subject-effects: what is the list of explanatory/independent measures that we would like to include in this analysis? (i.e. what are the columns of X?)

This is typically defined simply by listing a series of subject-level covariates (e.g. age, IQ). In CONN these variables are defined in Setup.Covariates (2nd-level). In addition to continuous variables, such as age or IQ, dummy-coded group variables are often useful to identify groups of subjects in our studies (e.g. a Patients covariate may take a value of 1 for patients, and 0 for controls). One such dichotomous variable identifying the entire group of subjects in our study is often used in simple designs where constant effects across all subjects are needed (in CONN this variable is automatically created and named AllSubjects, containing the value 1 for every subject).

2. Between-subjects contrasts: among these explanatory/independent measures, which one(s) or which combination of them do we want to evaluate/test? (i.e. what is the C vector/matrix?)

In the simplest scenario this contrast is just a vector, with as many elements as explanatory/independent measures, and having a value of 1 for the individual effect that we would like to evaluate/test and 0's for all other elements (e.g. if we have entered two subject-effects, characterizing Patient and Control subjects, respectively, a contrast vector with values [1, 0] would specify that we would like to evaluate/test the effect in Patients only). Other simple scenarios involve a contrast that compares two effects, which is defined simply by entering a 1 and a -1 in the two elements that we would like to compare, and 0's in all other elements, if there are any (e.g. a contrast vector with values [-1, 1] in the previous example would compare the effect in Controls to that in Patients). More complex contrast vectors can be specified simply as the weights of any desired linear combination of our model subject-effects (e.g. a contrast vector with values [0.5, 0.5] in the previous example would estimate the average effect across both Patients and Control subjects). Last, contrast matrices can be used to evaluate/test multiple effects jointly, where each individual effect is defined in the regular manner as a contrast vector, and those vectors are simply concatenated into a matrix (e.g. in a model with three groups instead of two, a contrast matrix [1, -1, 0; 0, 1, -1] evaluates/tests the presence of any differences between the three groups; a third row [1, 0, -1] would be redundant, since it is the sum of the other two).

3. Conditions (also known as measures or outcomes): what is the list of outcome/dependent measures that we would like to include in this analysis? (i.e. what are the columns of Y?)

This is typically defined by listing the individual outcome variables that we would like to investigate (e.g. SBC maps during a single rest condition in a standard resting state analysis, or SBC maps during Pre and Post conditions in an intervention design). In CONN this is defined by choosing which particular first-level functional connectivity measures, and which experimental conditions (if applicable), we would like to evaluate (e.g. SBC maps with one or several seeds during rest).

4. Between-conditions contrasts (also known as between-measures or within-subjects contrasts): among these outcome/dependent measures, which one(s) or which combination of them do we want to evaluate/test? (i.e. what is the M vector/matrix?)

These contrasts are defined in the same way as the between-subjects contrasts above, now spanning across conditions/sources instead of across subject-effects (e.g. if we have selected two conditions, characterizing SBC maps pre- and post- intervention, a contrast vector with values [-1, 1] would compare the connectivity values across these two conditions).

Behind the apparent simplicity of these choices, a perhaps surprising array of different analyses can be specified using this framework. Some examples are shown in the table below.
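As a hedged illustration of how the four items combine (a sketch of a few textbook designs, not a reproduction of CONN's own table), the following Matlab snippet lists the subject-effects, conditions, and contrasts C and M for several classical analyses. All effect and condition names other than AllSubjects are hypothetical placeholders.

```matlab
% Columns: analysis name, subject-effects (columns of X), conditions (columns of Y), C, M
designs = { ...
  'one-sample t-test',         {'AllSubjects'},          {'Rest'},       1,                1; ...
  'two-sample t-test',         {'Patients','Controls'},  {'Rest'},       [1 -1],           1; ...
  'paired t-test',             {'AllSubjects'},          {'Pre','Post'}, 1,                [-1 1]; ...
  'simple regression',         {'AllSubjects','Age'},    {'Rest'},       [0 1],            1; ...
  'one-way ANOVA (3 groups)',  {'G1','G2','G3'},         {'Rest'},       [1 -1 0; 0 1 -1], 1; ...
  'group-by-time interaction', {'Patients','Controls'},  {'Pre','Post'}, [1 -1],           [-1 1]};

for k = 1:size(designs, 1)
    [name, effects, conditions, C, M] = designs{k, :};
    % C needs one column per subject-effect, and M one column per condition
    assert(size(C, 2) == numel(effects) && size(M, 2) == numel(conditions));
    fprintf('%-26s C = %-18s M = %s\n', name, mat2str(C), mat2str(M));
end
```

The same two-group, two-condition data can therefore answer quite different questions depending only on the choice of C and M.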

## Numerical examples

While generally, in the context of fcMRI analyses, we are interested in simultaneously evaluating or testing thousands of individual measures (e.g. SBC maps containing one measure of interest at each voxel), in this section we will consider, for simplicity and illustration purposes, just a single measure, and proceed to manually define a General Linear Model and use it to test some simple properties of this measure across subjects.

For these examples, imagine we have 10 subjects, and for each subject we have computed two functional connectivity measures of interest (e.g. connectivity strength between two a priori ROIs, estimated pre- and post- treatment). The data matrix would look something like the example to the right (Y matrix), where the first column in Y represents the connectivity values pre-treatment and the second column the values post-treatment for each of the 10 subjects.

Imagine also that these 10 subjects were divided into two groups (e.g. patients from two different clinics undergoing different types of treatments). In order to encode this information we would create a design matrix like the one in the example to the right (X matrix), where the first column indicates those subjects from clinic #1, and the second column those subjects from clinic #2.

Example 1: Imagine now we would like to quantify and evaluate potential differences in functional connectivity values between the patients from the two clinics, either before or after intervention. To do this we want to define the matrices C, M, and D as shown to the right. In particular the contrast C is defined as [-1 1] in order to compare the effect of the two explanatory measures (the two clinics), and the contrast M is defined as the identity matrix in order to evaluate the effect on any of the outcome measures (either pre- or post- conditions). Rejecting this hypothesis would allow us to conclude that mean connectivity values in the two clinics are unlikely to be the same at both timepoints (i.e. they differ pre-intervention, post-intervention, or both).

In order to evaluate this hypothesis we could manually compute the lambda value, and compare that to the Wilks' Lambda distribution with 2, 8, and 1 degrees of freedom, or we could, for example, use the syntax:

[h, f, p, dof] = conn_glm( X, Y, C, M, D )

and CONN will use case-2 transformations to evaluate this hypothesis, returning the F-statistics and associated p-values shown here.

These results indicate that the functional connectivity trajectories before and after intervention in the two groups are significantly different (with higher connectivity in clinic #1 compared to clinic #2; 0.15 higher pre-intervention and 0.26 higher post-intervention). Note that GLM analyses in this context are exactly equivalent to those from a mixed-model two-way ANCOVA evaluating potential main effects of clinic (a between-subjects factor).

Example 2: Let's say now that we would like to quantify and evaluate potential differences in functional connectivity values between the two timepoints (pre- vs. post- intervention) in either of the two groups/clinics. To do this, we could define the matrices C, M, and D as shown to the right. In particular the matrix C is defined as the identity matrix in order to evaluate the effect of any of the two explanatory measures (the two clinics), while the matrix M is defined as [-1 1] in order to compare the two different outcome measures (post- minus pre- conditions). This has the net effect of evaluating whether the connectivity values are different pre- vs. post- intervention in either of the two groups.

As before, we could evaluate this hypothesis by manually computing the lambda value, and comparing that to the Wilks' Lambda distribution with now 1, 8, and 2 degrees of freedom. Equivalently, if we use instead a conn_glm call, CONN will use case-3 transformations to evaluate this hypothesis, returning the F-statistics and associated p-values shown here.

These results indicate that there are significant functional connectivity changes post- vs. pre- intervention in our subjects (with general increases in connectivity post- intervention; a 0.23 increase in clinic #1 and a 0.12 increase in clinic #2). Note that GLM analyses in this context are exactly equivalent to those from a mixed-model two-way ANCOVA evaluating potential main effects of treatment (a within-subjects factor).

Example 3: Last, let's evaluate whether these increases in connectivity with intervention (a 0.23 increase in clinic #1 vs. a 0.12 increase in clinic #2) are significantly different between the two clinics; or equivalently whether the difference in connectivity between the two clinics (a 0.15 difference pre-intervention vs. a 0.26 difference post-intervention) is significantly different between the two timepoints. To do this we would now want to define the matrices C, M, and D as shown to the right. In this case the matrix C is set to [-1 1] as in Example 1 in order to compare the contribution of the two predictor measures (the two clinics), and the matrix M is set to [-1 1] as in Example 2 in order to compare the effect across the two outcome measures (pre- and post- conditions). This has the net effect of comparing the between-group differences in connectivity between the two timepoints.

As before, we could evaluate this hypothesis by manually computing the lambda value, and comparing that to the Wilks' Lambda distribution with now 1, 8, and 1 degrees of freedom, or simply use a conn_glm call. CONN will use case-1 transformations to evaluate this hypothesis, returning the T-statistics and associated p-values shown here.

These results indicate that, if there is a clinic by treatment interaction, the effect is relatively small and cannot be detected with this study sample size. Note that, as before, GLM analyses in this context are exactly equivalent to those from a mixed-model two-way ANCOVA evaluating potential interactions between clinic (a between-subjects factor) and treatment (a within-subjects factor).
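The manual route mentioned in the three examples above can be sketched in base Matlab as follows. The data below are hypothetical: they are not the matrices from the original figures, although they were chosen so that the group means match the values quoted in the text (0.51 and 0.74 in clinic #1, 0.36 and 0.48 in clinic #2), so the statistics printed will differ from those reported. The helper names E, H, wilks and Fpval are introduced here for illustration, and the formulas are the textbook Wilks' Lambda expressions rather than CONN's own code.

```matlab
% Hypothetical data: 10 subjects, first 5 from clinic #1, last 5 from clinic #2
Y = [0.52 0.92; 0.41 0.46; 0.60 0.90; 0.47 0.57; 0.55 0.85; ...  columns: [pre post]
     0.36 0.61; 0.31 0.26; 0.43 0.63; 0.29 0.29; 0.41 0.61];
X = [ones(5,1) zeros(5,1); zeros(5,1) ones(5,1)];      % columns: [clinic1 clinic2]

B = X \ Y;                        % OLS estimate; rows = clinics, columns = pre/post
R = Y - X * B;                    % residuals of the unconstrained model
b = size(X, 1) - rank(X);         % error degrees of freedom (10 - 2 = 8)

% Error (E) and hypothesis (H) sums of squares and cross-products, and Wilks' Lambda
E = @(M) M * (R' * R) * M';
H = @(C, M, D) (C*B*M' - D)' / (C / (X'*X) * C') * (C*B*M' - D);
wilks = @(C, M, D) det(E(M)) / det(E(M) + H(C, M, D));

% Upper-tail probability of an F(d1, d2) variable, via the incomplete beta function
Fpval = @(F, d1, d2) betainc(d2 / (d2 + d1*F), d2/2, d1/2);

% Example 1 (case 2: a = 2, c = 1): clinic difference at either timepoint
a = 2;
L1 = wilks([-1 1], eye(2), [0 0]);
F1 = (b - a + 1) / a * (1 - L1) / L1;
p1 = Fpval(F1, a, b - a + 1);

% Example 2 (case 3: a = 1, c = 2): post vs. pre change in either clinic
c = 2;
L2 = wilks(eye(2), [-1 1], [0; 0]);
F2 = b / c * (1 - L2) / L2;
p2 = Fpval(F2, c, b);

% Example 3 (case 1: a = 1, c = 1): clinic-by-treatment interaction
L3 = wilks([-1 1], [-1 1], 0);
h  = [-1 1] * B * [-1 1]';                    % estimated interaction effect
T3 = sign(h) * sqrt(b * (1 - L3) / L3);
p3 = Fpval(T3^2, 1, b);                       % two-sided p-value

fprintf('Example 1: F(%d,%d) = %.2f, p = %.4f\n', a, b - a + 1, F1, p1);
fprintf('Example 2: F(%d,%d) = %.2f, p = %.4f\n', c, b, F2, p2);
fprintf('Example 3: T(%d) = %.2f, p = %.4f\n', b, T3, p3);
```

Using betainc for the F-distribution tail keeps the sketch free of toolbox dependencies; with the Statistics Toolbox available, fcdf or tcdf could be used instead.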

## References

Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. Probability and Mathematical Statistics. London: Academic Press.

Rao, C. R. (1951). An asymptotic expansion of the distribution of Wilks’ criterion. Bulletin of the International Statistical Institute, 33(2), 177-180.

Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6), 110-114.

## How to run CONN General Linear Model analyses

CONN's second-level analyses can be run using any of the following options:

### Option 1: using CONN's gui

If you have analyzed your data in CONN, in the Results window you may define a new second-level analysis by: a) selecting in the Subject effects list the desired set of independent measures X in your model (e.g. group variables MALES and FEMALES in the example below); b) entering or selecting an associated between-subjects contrast C across the selected measures (e.g. [1 -1] in the example below to compare the two groups); c) selecting in the Conditions list the desired set of dependent measures Y in your model (e.g. REST condition in the example below); and d) entering or selecting an associated between-conditions contrast M across the selected measures (e.g. 1 in the example below to look only at connectivity measures during the REST condition). After this, simply click on 'Results explorer' to have CONN compute these analyses and display the results across the entire brain (see cluster-level inferences for additional details). Optionally, select 'user-defined 2nd-level model' and rename this model for easy access to this analysis from CONN's gui in the future.

### Option 2: using CONN's batch commands

Similarly, if you have analyzed your data in CONN, you may also run the same second-level analysis shown in the example above using Matlab command syntax:

conn_batch( 'filename',                                '/data/Cambridge/conn_Cambridge.mat', ... 
            'Results.analysis_number',                 'SBC', ...
            'Results.between_subjects.effect_names',   {'MALE','FEMALE'}, ...
            'Results.between_subjects.contrast',       [-1 1], ...
            'Results.between_conditions.effect_names', {'REST'}, ...
            'Results.between_conditions.contrast',     [1], ...
            'Results.between_sources.effect_names',    {'networks.DefaultMode.MPFC'}, ...
            'Results.between_sources.contrast',        [1], ...
            'Results.display',                         true )

optionally adding to this command any desired alternative field name/value pairs (see doc conn_batch for additional details)

### Option 3: using CONN's modular functions

If you have not analyzed your data in CONN but still would like to run CONN's second-level GLM analyses on data from other sources (e.g. to analyze a set of nifti contrast images computed using any arbitrary software package), you may do so using the following Matlab command syntax:

conn module glm

to manually specify a new second-level GLM analysis. Optionally add to this command any desired field name/value pairs (see doc conn_module for additional options). For example:

conn_module( 'glm' , ...
             'data', Y, ...
             'design_matrix', X, ...
             'contrast_between', C, ...
             'contrast_within', M, ...
             'folder', outputfoldername )

where X is the GLM design matrix (a matrix with one row per subject and one column per modeled effect), Y lists the input NIfTI image files (a cell matrix of filenames, with one row per subject and one column per measure/condition), C is the between-subjects contrast (a vector/matrix with the same number of elements/columns as X), and M is the within-subjects contrast (a vector/matrix with the same number of elements/columns as Y).

note: input NIFTI images can be standard 3d volumes (e.g. .nii files) for voxel-based analyses, fsaverage volumes (e.g. .surf.nii files) for surface-based analyses, or 2d matrices (e.g. .mtx.nii files) for ROI-to-ROI analyses. See conn_surf_write, conn_surf_curv2nii, or conn_surf_gii2nii documentation for help creating fsaverage nifti files from your data. See conn_mtx_write documentation for help creating matrix nifti files from your data.

See cluster-level inferences for additional information on how to make inferences and interpret second-level GLM results.