# General Linear Model

Second-level analyses allow researchers to make inferences about properties of groups or populations by generalizing from the observations of only a subset of subjects in a study. CONN uses the General Linear Model (GLM) for all second-level analyses of functional connectivity data.

## Definition and estimation

The General Linear Model defines a multivariate linear association between a subject's "outcome/dependent" measures Y and a set of "explanatory/independent" measures X for this same subject. In the context of functional connectivity MRI analyses, an outcome variable y[n] will typically take the form of a row vector encoding functional connectivity values recorded from the n-th subject in a study across several experimental conditions, and the explanatory variable x[n] will be a row vector encoding several group, behavioral, or demographic variables for that same subject (see the numerical examples section below).

The (generally unknown) association in the population between each of the explanatory measures in X and each of the outcome measures in Y is characterized by the matrix B in the model equation y[n] = x[n]·B + ε[n]. The vector ε[n] represents the cumulative contribution to the outcome measures of all other unspecified factors beyond those that can be predicted from the knowledge of x[n] (i.e. the model error term). Because of the Central Limit Theorem, it is often reasonable to model this cumulative contribution as a Normally distributed unknown term: the GLM assumes that this term is independent across subjects and follows a multivariate normal distribution with mean zero and an arbitrary variance-covariance structure across measures.

We are typically interested in estimating the values of the matrix B. Because this matrix is constant across subjects (unlike x[n], y[n], and ε[n], which differ for each subject n), acquiring enough subjects' data enables us to compute an accurate unbiased estimate of B using an Ordinary Least Squares (OLS) solution.

For example, from N subjects' data, we construct the design and data matrices X = [x1' x2' x3' ... xN']' and Y = [y1' y2' y3' ... yN']', respectively, resulting from vertically concatenating the corresponding x[n] and y[n] row vectors across all relevant subjects, and then use the OLS equation B̂ = (X'X)⁻¹X'Y in order to compute the best linear unbiased estimator of the unknown matrix B from the observed X and Y data.
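Since the OLS computation itself is language-agnostic, it can be illustrated directly; the following sketch (Python/NumPy with simulated data, not CONN's own MATLAB code) computes B̂ = (X'X)⁻¹X'Y and checks it against a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated study: 10 subjects, 3 explanatory measures, 2 outcome measures
N = 10
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])  # design matrix (N x 3)
B_true = np.array([[1.0, 0.5],
                   [0.3, -0.2],
                   [0.0, 0.8]])                                  # true (unknown) parameters
Y = X @ B_true + 0.1 * rng.standard_normal((N, 2))               # data matrix (N x 2)

# OLS estimate: B_hat = (X'X)^-1 X'Y  (pinv used for numerical robustness)
B_hat = np.linalg.pinv(X.T @ X) @ (X.T @ Y)

# Sanity check: identical to the generic least-squares solver result
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(B_hat, B_lstsq)
```

With enough subjects, B_hat converges to the row-per-predictor, column-per-outcome matrix B_true used to generate the data.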

## Hypothesis testing

In addition to computing an estimate of the matrix B, we would often also like to evaluate specific hypotheses about this unknown matrix from our available data. In particular, the GLM allows us to use a standard Likelihood Ratio Test (LRT) to specify and evaluate any hypothesis of the form "CBM' = D", for arbitrary user-defined contrast matrices C, M, and D.

These hypotheses are evaluated using the Wilks' Lambda statistic. Wilks' Lambda values range between 0 and 1. Low values (close to 0) typically indicate that the tested hypothesis is likely false (i.e. CBM' is likely not equal to D), while high values (close to 1) typically indicate that there is not enough evidence in our data to reject the tested hypothesis (CBM' might be exactly D, or perhaps close enough that we would still need more data to find a significant departure from D).

In order to more precisely define whether a particular value of lambda (e.g. 0.1) is low enough to warrant our rejection of the tested hypothesis CBM' = D, the observed lambda value is typically compared to the distribution of lambda values that we could expect if the tested hypothesis were actually true (a Wilks' Lambda distribution), choosing to reject our hypothesis if the observed lambda value is below a pre-specified false-positive level (e.g. a p < 0.05 threshold means that we will reject our hypothesis if the observed lambda value falls below the 5th percentile of the Wilks' Lambda distribution).
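To make the statistic concrete, the sketch below (Python/NumPy, illustrative only; CONN's implementation is in MATLAB) computes Wilks' Lambda as Λ = det(E)/det(E + H), where E and H are the error and hypothesis sums-of-squares-and-cross-products matrices in the contrast space defined by C and M:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 12 subjects, 2 predictor columns (two-group indicators), 2 outcomes
X = np.kron(np.eye(2), np.ones((6, 1)))          # 12 x 2 design (group membership)
Y = rng.standard_normal((12, 2))                 # 12 x 2 outcomes (pure noise here)
C = np.array([[1.0, -1.0]])                      # predictor contrast (group difference)
M = np.eye(2)                                    # outcome contrast (both outcomes)
D = np.zeros((1, 2))                             # baseline value for the hypothesis

B = np.linalg.pinv(X.T @ X) @ (X.T @ Y)          # OLS estimate of B
h = C @ B @ M.T - D                              # estimated effect C*B*M' - D
R = Y - X @ B                                    # residuals

E = M @ (R.T @ R) @ M.T                          # error SSCP matrix
H = h.T @ np.linalg.inv(C @ np.linalg.pinv(X.T @ X) @ C.T) @ h  # hypothesis SSCP

wilks_lambda = np.linalg.det(E) / np.linalg.det(E + H)
# With pure-noise data the hypothesis C*B*M' = D is true, so lambda
# will typically be high (close to 1) rather than near 0.
```

Because H is positive semi-definite, the ratio always falls between 0 and 1, matching the range described above.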

Since Wilks' Lambda distributions are only tabulated for a limited number of scenarios/dimensions, CONN's GLM implementation uses the following transformations in order to derive appropriate statistics and p-values for any tested hypothesis, depending on the dimensionality/rank of the user-defined matrices C, M, and D (below, c denotes the number of rows of the contrast matrix C, and a the number of rows of M):

Case 1. Statistics based on Student's t-distribution, when a=1 and c=1 (e.g. both M and C are vectors, and D is a scalar)

Examples: two-sample t-test, linear regression

Case 2. Statistics based on F-distribution, when a>1 and c=1 (e.g. M is a matrix, and both C and D are vectors)

Examples: Hotelling's two sample t-square test, repeated measures ANOVA, multivariate regression

Case 3. Statistics based on F-distribution, when a=1 and c>1 (e.g. C is a matrix, and both M and D are vectors)

Examples: ANOVA, ANCOVA, multiple regression omnibus test

Case 4. Statistics based on Rao's approximating F-distribution, when a>1 and c>1 (e.g. all M, C, and D are matrices)

Examples: MANOVA, MANCOVA, multivariate regression omnibus test
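To illustrate Case 1 concretely, the sketch below (Python/NumPy, illustrative rather than CONN code) verifies that the GLM T-statistic for a two-group design with contrast C = [1, -1], M = 1, D = 0 reproduces the classical pooled two-sample t statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two groups (sizes 8 and 10), one outcome variable; group 1 shifted by +0.5
y1 = rng.standard_normal(8) + 0.5
y2 = rng.standard_normal(10)
Y = np.concatenate([y1, y2])[:, None]              # 18 x 1 data matrix
X = np.zeros((18, 2))
X[:8, 0] = 1                                       # column 1: group-1 membership
X[8:, 1] = 1                                       # column 2: group-2 membership
C = np.array([[1.0, -1.0]])                        # group 1 minus group 2 contrast

# GLM route: T = C*B / sqrt(var_error * C*(X'X)^-1*C')
B = np.linalg.pinv(X.T @ X) @ (X.T @ Y)            # OLS estimate (here: group means)
R = Y - X @ B                                      # residuals
dof = 18 - 2                                       # N minus number of predictors
var_e = (R.T @ R).item() / dof                     # pooled error variance
t_glm = (C @ B).item() / np.sqrt(var_e * (C @ np.linalg.pinv(X.T @ X) @ C.T).item())

# Classical route: pooled two-sample t-test computed directly
sp2 = (7 * y1.var(ddof=1) + 9 * y2.var(ddof=1)) / dof
t_classic = (y1.mean() - y2.mean()) / np.sqrt(sp2 * (1 / 8 + 1 / 10))

assert np.isclose(t_glm, t_classic)                # the two routes agree
```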

## Numerical examples

For these examples, imagine we have 10 subjects, and for each subject we have computed two functional connectivity measures of interest (e.g. connectivity strength between two a priori ROIs pre- and post- treatment). The data matrix would look something like the example to the right (Y matrix), where the first column in Y represents the connectivity values pre-treatment and the second column the values post-treatment for each of the 10 subjects.

Imagine also that these 10 subjects were divided into two groups of 5 subjects each (e.g. patients from two different clinics undergoing different types of treatments), and that we also know each subject's clinical score at baseline, which we would like to use as a control covariate in our analyses since we believe it may also influence a subject's functional connectivity values. In order to encode this information we would create a design matrix like the one in the example to the right (X matrix), where the first column indicates those subjects from clinic #1, the second column those subjects from clinic #2, and the third column the subjects' clinical scores.
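The shapes involved can also be sketched programmatically; the block below (Python/NumPy, with simulated placeholder values, since the actual numbers appear only in the figure) constructs Y and X matrices with the structure just described:

```python
import numpy as np

rng = np.random.default_rng(3)

# Y: 10 subjects x 2 conditions (column 1 = pre-treatment, column 2 = post-treatment
# connectivity values). Values here are simulated placeholders, not the figure's data.
Y = np.round(rng.normal(loc=[0.3, 0.5], scale=0.1, size=(10, 2)), 2)

# X: 10 subjects x 3 effects
#   column 1: 1 for the 5 subjects from clinic #1, else 0
#   column 2: 1 for the 5 subjects from clinic #2, else 0
#   column 3: baseline clinical score (control covariate), simulated here
clinic1 = np.r_[np.ones(5), np.zeros(5)]
clinic2 = 1 - clinic1
scores = np.round(rng.normal(loc=20, scale=5, size=10), 1)
X = np.column_stack([clinic1, clinic2, scores])
```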

Example 1: Let's say that we would like to quantify and evaluate potential differences in functional connectivity values between the patients from the two clinics, either before or after intervention. To do this, we could define the matrices C, M, and D as shown to the right. This has the effect of defining our test hypothesis as stating that the second row of B (B21 B22 values, defining connectivity values in clinic #2 pre- and post- intervention, respectively) is equal to the first row of B (B11 B12; equivalent connectivity values in clinic #1). Rejecting this hypothesis would allow us to conclude that mean connectivity values in the two clinics (at the same level of subject clinical scores) are unlikely to be the same at either pre- or post- intervention.

In order to evaluate this hypothesis we could manually compute the lambda value, and compare that to the Wilks' Lambda distribution with 2, 7, and 1 degrees of freedom, or we could, for example, use the syntax:

[h, f, p, dof] = conn_glm( X, Y, C, M, D )

and CONN will use case-2 transformations to evaluate this hypothesis, returning the F-statistic and associated p-value shown here.

These results indicate that the functional connectivity trajectories before and after intervention in the two groups are significantly different (with higher connectivity in clinic #2, perhaps particularly higher post-intervention). Note that GLM analyses in this context are exactly equivalent to those from a mixed-model two-way ANCOVA evaluating potential main effects of clinic (a between-subjects factor).

Example 2: Let's say now that we would like to quantify and evaluate potential differences in functional connectivity values between the two timepoints (pre- vs. post- intervention) in either of the two groups/clinics. To do this, we could define the matrices C, M, and D as shown to the right. This has the effect of evaluating whether the second column of B (B12 B22; connectivity values post-intervention in each of the two groups) is equal to the first column of B (B11 B21; pre-intervention values), or, in other words, evaluating whether the connectivity values are different pre- vs. post- intervention in either of the two groups.

As before, we could evaluate this hypothesis by manually computing the lambda value and comparing it to the Wilks' Lambda distribution, now with 1, 7, and 2 degrees of freedom. Equivalently, if we instead use a conn_glm call, CONN will use case-3 transformations to evaluate this hypothesis, returning the F-statistic and associated p-value shown here.

These results indicate that there are significant functional connectivity changes post- vs. pre- intervention in our subjects (with general increases in connectivity post- intervention, perhaps with larger increases in clinic #1). Note that GLM analyses in this context are exactly equivalent to those from a mixed-model two-way ANCOVA evaluating potential main effects of treatment (a within-subjects factor).

Example 3: Last, let's evaluate whether the difference in connectivity between the two groups is the same pre- vs. post- intervention (or, equivalently, whether the increases in connectivity with intervention are the same in the two clinics). To do this, we would now define the matrices C, M, and D as shown to the right. This has the effect of comparing the between-group differences in connectivity between the two timepoints.

As before, we could evaluate this hypothesis by manually computing the lambda value and comparing it to the Wilks' Lambda distribution, now with 1, 7, and 1 degrees of freedom, or use a conn_glm call. CONN will use case-1 transformations to evaluate this hypothesis, returning the T-statistic and associated p-value shown here.

These results indicate that, if there is a clinic by treatment interaction, the effect is relatively small and cannot be detected with this study sample size. Note that, as before, GLM analyses in this context are exactly equivalent to those from a mixed-model two-way ANCOVA evaluating potential interactions between clinic (a between-subjects factor) and treatment (a within-subjects factor).
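The contrast matrices used in Examples 1-3 can be written out explicitly; the encodings below (Python/NumPy; our reading of the text, with signs chosen so that each contrast tests the stated hypothesis, since reversing a sign tests the same hypothesis) show how the shape of h = C*B*M' - D selects the appropriate statistical case:

```python
import numpy as np

# Example 1: clinic #2 vs clinic #1, jointly at both timepoints (case 2: F-test)
C1 = np.array([[-1.0, 1.0, 0.0]])   # clinic-2 row of B minus clinic-1 row
M1 = np.eye(2)                      # keep both pre- and post- columns
D1 = np.zeros((1, 2))

# Example 2: post- vs pre- intervention, in either group (case 3: F-test)
C2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])    # keep both clinic rows of B
M2 = np.array([[-1.0, 1.0]])        # post-minus-pre contrast
D2 = np.zeros((2, 1))

# Example 3: clinic-by-treatment interaction (case 1: T-test)
C3 = np.array([[-1.0, 1.0, 0.0]])
M3 = np.array([[-1.0, 1.0]])
D3 = np.zeros((1, 1))

# The dimensions of h = C*B*M' - D (rows of C by rows of M) determine the case:
for C, M, D in [(C1, M1, D1), (C2, M2, D2), (C3, M3, D3)]:
    print('size of h:', (C.shape[0], M.shape[0]))
```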

## Software

See these tutorials for a description of the use of second-level analyses through CONN's main GUI.

See the CONN documentation for a description of the use of second-level analyses through CONN's batch scripts.

In addition, you may also use CONN's GLM implementation outside of the usual CONN framework (e.g. for your own numerical or imaging analyses) by directly accessing CONN's conn_glm or conn_module_glm MATLAB functions. See below for additional information about these functions:

### conn_glm (to run a GLM analysis on your own numerical data)

conn_glm performs General Linear Model estimation and hypothesis testing. [h,F,p,dof] = conn_glm(X,Y,C,M,D) estimates a linear model of the form Y = X*B + E, where:

- Y is an observed matrix of response or output variables (rows are observations, columns are output variables)
- X is an observed design or regressor matrix (rows are observations, columns are predictor variables)
- B is a matrix of unknown regression parameters (rows are predictor variables, columns are output variables)
- E is a matrix of unobserved multivariate normally distributed disturbances with zero mean and unknown covariance

and tests a general null hypothesis of the form C*B*M' = D, where:

- C is a matrix or vector of "predictor" contrasts (rows are contrasts, columns are predictor variables; defaults to C = eye(size(X,2)))
- M is a matrix or vector of "outcome" contrasts (rows are contrasts, columns are output variables; defaults to M = eye(size(Y,2)))
- D is a matrix or vector of "baseline" values (rows are predictor contrasts, columns are outcome contrasts; defaults to D = 0)

conn_glm returns the following information:

- h: matrix of estimated contrast effect sizes (h = C*B*M' - D)
- F: test statistic(s) (a T, F, or Chi2 value, depending on whether h is a scalar, a vector, or a matrix; see below)
- p: p-value of the test(s)
- dof: degrees of freedom

Additional information: by default conn_glm will use a T, F, or Chi2 statistic for hypothesis testing depending on the size of h = C*B*M'. The default options are:

- size(h) = [1,1] -> T statistic (note: one-sided t-test). Examples of use: one-sided two-sample t-test, linear regression
- size(h) = [1,Ns] -> F statistic (note: equivalent to two-sided t-test when Ns=1). Examples of use: Hotelling's two-sample t-square test, repeated measures ANOVA, multivariate regression
- size(h) = [Nc,1] -> F statistic (note: equivalent to two-sided t-test when Nc=1). Examples of use: ANOVA, ANCOVA, linear regression omnibus test
- size(h) = [Nc,Ns] -> Wilks' Lambda statistic. Examples of use: MANOVA, MANCOVA, multivariate regression omnibus test, likelihood ratio test

The default option can be changed using the syntax conn_glm(X,Y,C,M,opt), where opt is one of the following character strings:

- 'collapse_none': performs a separate univariate test (T-statistic) on each of the elements of the matrix h = C*B*M'
- 'collapse_outcomes': performs a separate multivariate test (F-statistic) on each of the rows of the matrix h (collapsing across multiple outcome variables or outcome contrasts)
- 'collapse_predictors': performs a separate multivariate test (F-statistic) on each of the columns of the matrix h (collapsing across multiple predictor variables or predictor contrasts)
- 'collapse_all_rao': performs a single omnibus multivariate test on the entire matrix h (Wilks' Lambda statistic, Rao's F approximation; no assumptions on the form of the M*E'*E*M' covariance)
- 'collapse_all_bartlett': performs a single omnibus multivariate test (Wilks' Lambda statistic, Bartlett's Chi2 approximation; no assumptions on the form of the M*E'*E*M' covariance)
- 'collapse_all_satterthwaite': performs a single omnibus univariate test (F-statistic with conservative Satterthwaite dof correction; no assumptions on the form of the M*E'*E*M' covariance)
- 'collapse_all_sphericity': performs a single omnibus univariate test (F-statistic assuming sphericity M*E'*E*M' = sigma*I)
- 'collapse_all': same as 'collapse_all_rao'

Example of use: MANOVA (three groups, two outcome variables)

    % Data preparation
    N1=10; N2=20; N3=30;
    Y1=randn(N1,2)+repmat([0,0],[N1,1]);   % data for group 1 (N1 samples, population mean = [0,0])
    Y2=randn(N2,2)+repmat([0,1],[N2,1]);   % data for group 2 (N2 samples, population mean = [0,1])
    Y3=randn(N3,2)+repmat([1,0],[N3,1]);   % data for group 3 (N3 samples, population mean = [1,0])
    Y=cat(1,Y1,Y2,Y3);
    X=[ones(N1,1),zeros(N1,2); zeros(N2,1),ones(N2,1),zeros(N2,1); zeros(N3,2),ones(N3,1)];

    % Sample data analyses
    [h,F,p,dof]=conn_glm(X,Y,[1,-1,0;0,1,-1]);
    disp('Multivariate omnibus test of non-equality of means across the three groups:');
    disp([' F(',num2str(dof(1)),',',num2str(dof(2)),') = ',num2str(F),' p = ',num2str(p)]);
    [h,F,p,dof]=conn_glm(X,Y,[1,-1,0]);
    disp('Multivariate test of non-equality of means between groups 1 and 2:');
    disp([' F(',num2str(dof(1)),',',num2str(dof(2)),') = ',num2str(F),' p = ',num2str(p)]);
    [h,F,p,dof]=conn_glm(X,Y,[-1,1,0],eye(2),'collapse_none');
    disp('Univariate one-sided test of non-equality of means between groups 1 and 2 on each outcome variable:');
    disp([' T(',num2str(dof),') = ',num2str(F(:)'),' p = ',num2str(p(:)')]);

### conn_module_glm (to run CONN's second-level analysis GLM implementation on your own neuroimaging measures)

conn_module_glm performs second-level model estimation: conn_module_glm(X,Y,c1,c2,folder), where:

- X: design matrix (Nsubjects x Neffects)
- Y: data files (cell array, Nsubjects x Nmeasures)
- c1: between-subjects contrast (Nc1 x Neffects) (default eye(size(X,2)))
- c2: between-measures contrast (Nc2 x Nmeasures) (default eye(size(Y,2)))
- folder: folder where the analyses are stored (default: current folder)

For example, the following performs a one-sample t-test and stores the analysis results in the current folder:

    conn_module_glm( ...
        [1; 1; 1; 1], ...
        {'subject1.img'; 'subject2.img'; 'subject3.img'; 'subject4.img'});

The following performs a two-sample t-test and stores the analysis results in the current folder:

    conn_module_glm( ...
        [1 0; 1 0; 0 1; 0 1; 0 1], ...
        {'subject1_group1.img'; 'subject2_group1.img'; 'subject1_group2.img'; 'subject2_group2.img'; 'subject3_group2.img'}, ...
        [1 -1]);

And the following performs a paired t-test and stores the analysis results in the current folder:

    conn_module_glm( ...
        [1; 1; 1; 1], ...
        {'subject1_time1.img', 'subject1_time2.img'; 'subject2_time1.img', 'subject2_time2.img'; 'subject3_time1.img', 'subject3_time2.img'; 'subject4_time1.img', 'subject4_time2.img'}, ...
        1, ...
        [1 -1]);

## Notes

note-1: it could be argued that two-sided hypotheses of the form "CBM' = 0" are almost surely false in real-world data, where an effect may be arbitrarily small but is almost never precisely zero. Because of this, failure to reject a hypothesis of this form often simply indicates that the effect being evaluated (e.g. a difference in connectivity between two groups) is too small to be detectable with the current experimental setup (e.g. with the current acquisition parameters and number of subjects), rather than truly non-existent. In this context, it is generally recommended to always quantify and report the effects measured (e.g. report the estimated B values) instead of relying solely on the significance of the hypotheses being evaluated. This helps build increasingly better model-based estimates of these effects, going beyond the initial but limited question of whether they "exist" (i.e. are non-zero) or not. Last, testing one-tailed hypotheses of the form "CBM' > D" (available in CONN for univariate effects using LRT case 1) can also be used as a tool to combine "practical" significance (is the effect large?) with "statistical" significance (how confident are we of this, given the available data?).

note-2: in some cases, when the number of conditions is large and the sample size is too small (a > b), it is not possible to use statistics based on Wilks' Lambda, since the unknown error covariance W cannot be properly estimated. In those scenarios CONN also offers the computation of statistics based on a Satterthwaite F-distribution approximation (the 'collapse_all_satterthwaite' option described above).