The lmt parameter file is written in the “eXtensible Markup Language” (xml). For understanding the examples you may want to get a jump start on the xml file structure first. Don’t be scared: even without any previous knowledge you will be able to understand the xml structure used for lmt in less than half an hour. Please consult the Internet to find a suitable introduction to xml.

Estimating a mean

Consider the linear model $$y=Xb+e$$ where $$y$$ is a vector of $$n$$ observations, $$X$$ is a single-column matrix of ones, $$b$$ is a fixed factor (the mean), $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is an $$n \times n$$ co-variance matrix. Since the only source of variation in $$y$$ is $$e$$, $$R$$ is equivalent to the residual co-variance matrix. The generalized least squares solution for $$b$$ is obtained from $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$. Note that in simple cases $$R=I\sigma_e^2$$, where $$I$$ is an $$n \times n$$ identity matrix, and estimating $$b$$ simplifies to $$b=(X'X)^{-1}X'y$$, which is equivalent to $$b=\frac{\sum_{i=1}^n y_i}{n}$$. These formulas may seem a bit excessive for the task of estimating a mean; however, they should raise awareness that a sound general model always needs a residual co-variance matrix.
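As a quick numeric illustration of this equivalence, the following sketch (plain numpy, not part of lmt; all data values are purely illustrative) shows that the generalized least squares estimate and the simple mean coincide when $$R=I\sigma_e^2$$:

# Minimal sketch: with R = I * sigma_e^2 the generalized least squares
# estimate b = (X'R^-1 X)^-1 X'R^-1 y equals the simple mean of y.
# All numbers below are illustrative only.
import numpy as np

y = np.array([3.0, 5.0, 4.0])   # observations (illustrative)
n = y.size
X = np.ones((n, 1))             # single column of ones
R = np.eye(n) * 2.0             # residual co-variance, I * sigma_e^2

Rinv = np.linalg.inv(R)
b_gls = np.linalg.solve(X.T @ Rinv @ X, X.T @ Rinv @ y)

print(b_gls[0])   # 4.0
print(y.mean())   # 4.0, identical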

Assume we have a data file "data.csv" with content:

#y,mu
25.0,1
33.1,1
36.0,1
28.3,1

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. A valid lmt xml parameter file would look like:

<root>
  <data>
    file: data.csv
  </data>
  <vars>
    <res>
      <sigma>
        <matrix>
          5.0
        </matrix>
      </sigma>
    </res>
  </vars>
  <model>
    <eqn attributes="strings">
      y=mu*b
    </eqn>
  </model>
  <jobs>
    jobs: solve
    <solve>
      solver: mysolver
    </solve>
  </jobs>
  <solvers>
    solvers: mysolver
    <mysolver>
      type: pcgiod
      <pcgiod>
        rounds: 10000
      </pcgiod>
    </mysolver>
  </solvers>
</root>

Note the hierarchical nesting structure in the above parameter file. Tags <data>, <vars>, <model>, <jobs> and <solvers> are all nested inside tag <root>, and each of them may contain further nested tags. It is crucial that nested tags are placed in the right position.

The most important aspect is the model definition in tag <eqn>, nested inside tag <model> in line 16 of the parameter file, which is $$y=mu*b$$. The variable names used here are either defined by the data file header or by the user. That is, $$y$$ and $$mu$$ are defined in the data file header, whereas $$b$$ is a user-defined variable name. Translated, this means that the content of the data column named $$y$$ should be regressed on the content of the data column named $$mu$$, with the regression coefficient named $$b$$. Since no further specifications are supplied for $$y$$, $$mu$$ and $$b$$, it is assumed that $$y$$ is continuous, $$mu$$ is a classification variable, and $$b$$ is a fixed factor.

Since linear mixed models are used to analyse sources of data variation, we need to define the causal variances. These variances are defined inside tag <vars>, which in our example spans from line 5 to line 13. lmt requires one compulsory variance, the residual variance, which must be specified via tag <res>. This is sufficient for our model as we don’t have any random effects. The actual residual variance is specified inside tag <sigma>. The reason for having an extra tag <sigma> inside <res> is directly related to our model variance $$I\sigma_e^2$$.
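Since $$R=I\sigma_e^2$$ with $$\sigma_e^2=5.0$$ in this example and there are no random effects, the solution for $$b$$ can be checked by hand with the simple-mean formula given above: $$b=\frac{25.0+33.1+36.0+28.3}{4}=30.6$$.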

Note that the spelling of all tags used in the above parameter file is determined by lmt and must be adhered to. However, red-coloured words are user-defined, but once a word has been defined, all subsequent uses must keep the same spelling. For example, look at the solver tag spanning from line 25 to line 33 of the parameter file. It defines a solver named "mysolver" of type <pcgiod>, which will run until convergence but for no more than 10000 rounds. Subsequently, the name "mysolver" is used in the job solve to provide that job with a solver to fulfil the task. However, had mysolver been spelled MYSOLVER in line 22, lmt would have stopped with an error message.
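To make this cross-referencing concrete, here is where the user-defined name "mysolver" appears in the parameter file above (an excerpt, with unrelated lines elided); all occurrences must agree in spelling:

  <jobs>
    jobs: solve
    <solve>
      solver: mysolver
    </solve>
  </jobs>
  <solvers>
    solvers: mysolver
    <mysolver>
      ...
    </mysolver>
  </solvers>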

Estimating a mean and genetic effects

Consider the linear model $$y=Xb+Zu+e$$ where all variables are those declared in ..., $$u$$ is a vector of length $$m$$ of genetic effects and $$Z$$ is a design matrix of dimension $$n \times m$$ linking genetic effects to their respective observations. Note that the variation in $$y$$ is caused by