Examples

From Linear Mixed Models Toolbox
Revision as of 12:50, 2 December 2020 by Boerner (talk | contribs)
Jump to navigation Jump to search

The lmt parameter file is in written in “eXtensible Markeup Language” (xml). For under- standing the examples you may want to get a jump start in xml file structure first. Don’t be scared. Even without any previous knowledge you’ll be able to understand the xml structure used for lmt in less then half an hour. Please consult the Internet to find suitable introduction into xml

Estimating a mean

Consider the linear model $$y=Xb+e$$ where $$y$$ is a vector of $$n$$ observations, $$X$$ is as single column matrix of $$1$$, $$b$$ is a fixed factor (mean), $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is a $$n \times n$$ co-variance matrix. Since the only source of variation in $$y$$ is $$e$$, $$R$$ is equivalent to the residual co-variance matrix. The generalized least square solution for $$b$$ would be obtained from $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$. Note that in simple cases $$R=I\sigma_e^2$$, where $$I$$ is an $$n \times n$$ identity matrix, and estimating $$b$$ simplifies to $$b=(X'X)^{-1}X'y$$, which is equivalent to $$b=\frac{\sum_{i=1}^n y_i}{n}$$. These formulas seem to be a bit overshot for the task of estimating a mean, however, they should raise the awareness that a sound general model always needs a residual co-variance matrix.

Assume we have a data file "data.csv" with content:

#y,mu
25.0,1
33.1,1
36.0,1
28.3,1

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. A valid lmt xml parameter file would look like:

<root>
  <data>
    file: data.csv
  </data>
  <vars>
    <res>
      <sigma>
        <matrix>
          5.0
        </matrix>
      </sigma>
    </res>
  </vars>
  <model>
    <eqn attributes="strings">
      y=mu*b
    </eqn>
  </model>
  <jobs>
    jobs: solve
    <solve>
      solver: mysolver
    </solve>
  </jobs>
  <solvers>
    solvers: mysolver
    <mysolver>
      type: pcgiod
      <pcgiod>
         rounds: 10000
      </pcgiod>
    </mysolver>
  </solvers>
</root>

Note the hierarchical nesting structure in the above parameter file. Tags , <vars>, <model>, <jobs> and <solvers> are all nested inside tag <root>. However, all those tags may contain nested tags as well. It is crucial that nested tags are placed in the right position. The most important aspect is the model definition in tag <eqn>, nested inside tag <model> in line 16, which is $$y=mu*b$$. The variable names used here are either defined by the data file header, or by the user. That is, $$y$$ and $$mu$$ are defined in the data file header, whereas $$b$$ is a user-defined variable name. Translated this means that the content of the data column named $$y$$ should be regressed on the content of the data column named $$mu$$ with the regression coefficient named $$b$$. Since there are no further specifications supplied about $$y$$, $$mu$$ and $$b$$, it is assumed that $$y$$ is continuous , $$mu$$ is a classification variable, and $$b$$ is fixed factor. Since linear mixed models are used to analyse sources of data variation we need to define the causal variances. These variances are defined inside tag <vars>. In our example this tag spans from line 5 to line 13. lmt requires one compulsory variance, the residual variance, which must be specified via tag <res>. This is sufficient for our model as we don’t have any random effects. The actual residual variance is specified inside tag <sigma>. The reason for having an extra tag <sigma> inside <res> is directly related to our model variance $$I\sigma_e^2$$. The variance has two components $$I$$ and $$\sigma_e^2$$. Since $$I$$ is an identity matrix it is omitted from the variance declaration, but $$\sigma_e^2$$ must be declared which happens inside tag <sigma>.

Note that the spelling of all tags used in the above parameter file is determined by lmt and must be abide by. However, red-coloured words are user defined, but once the word is user-defined any subsequent use the spelling must be the same. For example look at the solver tag spanning from line 25 to 33. It defines a solver named "mysolver" being of type <pcgiod> which will run till convergence but not more than 10000 rounds. Subsequently the name "mysolver" is used in job solve to provide the job with a solver to fulfil the task. However, if mysolver would have been spelled MYSOLVER in line 22, lmt would have stopped with an error message.

Estimating a mean and genetic effects

Consider the linear model $$y=Xb+Zu+e$$ where all variables are those declared in ..., $$u$$ is vector of length $$m$$ of random direct genetic effects and $$Z$$ is a design matrix of dimension $$n \times m$$ linking genetic effects to their respective observations. Note that $$u\sim N(0,A\sigma_a^2)$$ where $$A$$ is the pedigree-derived relationship matrix. A possible data file for such mode may look like:

#y,mu,id
25.0,1,5
33.1,1,6
36.0,1,7
28.3,1,8

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. Further assume a pedigree in a file called "ped.csv" with content:

1,0,0
2,0,0
3,1,0
4,0,2
5,3,4
6,0,4
7,5,4
8,5,7


and a valid lmt parameter file:

<root>
  <data>
    file: data.csv
  </data>
  <pedigrees>
    pedigrees: myped
      <myped>
        file: ped.csv
      </myped>
  </pedigrees>
  <vars>
    <res>
      <sigma>
        <matrix>
          5.0
        </matrix>
      </sigma>
    </res>
    vars: myvar
    <myvar>
      <sigma>
        file: mysigma.csv
      </sigma>
      <gamma>
        <A>
          pedigree: myped
        </A>
      </gamma>
    </myvar>
  </vars>
  <model>
    <eqn attributes="strings">
      y = mu*b + id*u(v(myvar(1)))
    </eqn>
  </model>
  <jobs>
    jobs: solve
    <solve>
      solver: mysolver
    </solve>
  </jobs>
  <solvers>
    solvers: mysolver
    <mysolver>
      type: pcgiod
      <pcgiod>
         rounds: 10000
      </pcgiod>
    </mysolver>
  </solvers>
</root>

Compared with the parameter file in example Estimating a mean the one above contains only a few extra elements. One this the <pedigrees> tag spanning from line 5 to 10 and nested inside tag <root>. This tag contains a keystring pedigrees: myped, where the user-defined variables behind pedigrees: are a comma-separated list of tags nested inside <pedigrees>. The provision of that list triggers lmt to search and evaluate those tags. This concept allows to supply several pedigrees to lmt (e.g. a normal pedigree and a genetic group pedigree). In our example we have only one pedigree named myped, with tag <myped> containing the information about this pedigree. Another additional element is the keystring vars: myvar in line 19. Similar to tag <pedigrees>, tags nested inside tag <vars> are only evaluated if nominated behind vars: as a comma-separated list (e.g. vars: a,b,c), except for tag <res>, which is compulsory and evaluated automatically. Tag <myvar> consist of two structural components: <sigma> and <gamma>. We know <sigma> already from tag <res>. To understand the requirement for tag <gamma> we need to acknowledged that the variance for a random factor can be generalized as a Kronecker product


Contrarily to the residual variance $$I\sigma_e^2$$, where $$I$$ can be safely omitted, $$A\sigma_a^2$$ has two components $$A$$ and $$\sigma_a^2$$ which both need to be declared. The section in the parameter file where the variances are declared spans from line 11 to line 30. It contains two variances, <res> and <myvar>. Since <res> is the residual variance it is compulsory and if missing lmt will stop. All other variances are only evaluated if the appear in in keystring vars: var1,var2,...,varN which in the above example is vars: myvar. Variance "myvar" contains two components: <sigma> and <gamma>