Difference between revisions of "Examples"

From Linear Mixed Models Toolbox
Jump to navigation Jump to search
Line 9: Line 9:
Consider the linear model $$y=Xb+e$$ where $$y$$ is a vector of $$n$$ observations, $$X$$ is as single column matrix of $$1$$, $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is a $$n \times n$$ co-variance matrix. Since the only source of variation in $$y$$ is $$e$$, $$R$$ is equivalent to the residual co-variance matrix. The generalized least square solution for $$b$$ would be obtained from $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$. Note that in simple cases $$R=I\sigma_e^2$$, where $$I$$ is an $$n \times n$$ identity matrix, and estimating $$b$$ simplifies to $$b=(X'X)^{-1}X'y$$, which is equivalent to $$b=\frac{\sum_{i=1}^n y_i}{n}$$. These formulas seem to be a bit overshot for the task of estimating a mean, however, they should raise the awareness that a sound general model always needs a residual co-variance matrix.
Consider the linear model $$y=Xb+e$$ where $$y$$ is a vector of $$n$$ observations, $$X$$ is as single column matrix of $$1$$, $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is a $$n \times n$$ co-variance matrix. Since the only source of variation in $$y$$ is $$e$$, $$R$$ is equivalent to the residual co-variance matrix. The generalized least square solution for $$b$$ would be obtained from $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$. Note that in simple cases $$R=I\sigma_e^2$$, where $$I$$ is an $$n \times n$$ identity matrix, and estimating $$b$$ simplifies to $$b=(X'X)^{-1}X'y$$, which is equivalent to $$b=\frac{\sum_{i=1}^n y_i}{n}$$. These formulas seem to be a bit overshot for the task of estimating a mean, however, they should raise the awareness that a sound general model always needs a residual co-variance matrix.


Assume we have a data file:
Assume we have a data file "data.csv" with content:
  #y,mu
  #y,mu
  25.0 ,1
  25.0,1
  33.1 ,1
  33.1,1
  36.0 ,1
  36.0,1
  28.3 ,1
  28.3,1
where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records.
where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records.
 
A valid {{lmt}} xml parameter file would look like:
<code>
<syntaxhighlight lang="xml" line>
<root>
<root>
   <data>
   <data>
Line 52: Line 52:
   </solvers>
   </solvers>
</root>
</root>
</syntaxhighlight>
Note the hierarchical nesting structure in the above parameter file. Tags <data>, <vars>, <model>, <jobs> and <solvers> are all nested inside tag <root>. However, all those tags may contain nested tags as well. It is crucial that nested tags are placed in the right position.
The most important aspect is the model definition in tag <eqn>, nested inside tag
<model> in line 16, which is $$y=mu*b$$. The variable names used here are either defined
by the data file header, or by the user. That is, $$y$$ and $$mu$$ are defined in the data
file header, whereas $$b$$ is a user-defined variable name. Translated this means that the
content of the data column named $$y$$ should be regressed on the content of the data
column named $$mu$$ with the regression coefficient named $$b$$. Since there are no further
specifications supplied about $$y$$, $$mu$$ and $$b$$, it is assumed that $$y$$ is continuous , $$mu$$ is a
classification variable, and $$b$$ is fixed factor.
Since linear mixed models are used to analyse sources of data variation we need to
define the causal variances. These variances are defined inside tag <vars>. In our
example this tag spans from line 5 to line 13. {{lmt}} requires one compulsory variance, the
residual variance, which must be specified via tag <res>. This is sufficient for our model
as we don’t have any random effects. The actual residual variance is specified inside tag
<sigma>. The reason for having an extra tag <sigma> inside <res> will become clear
in the next example.
Note that the spelling of all tags used in the above parameter file is determined by
{{lmt}} and must be abide by. However, red-coloured words are user defined, but once the
word is user-defined any subsequent use the spelling must be the same. For example
look at the solver tag spanning from line 25 to 33. It defines a solver named "mysolver"
being of type <pcgiod> which will run till convergence but not more than 10000 rounds.
Subsequently the name "mysolver" is used in job solve to provide the job with a solver to
fulfil the task. However, if mysolver would have been spelled MYSOLVER in line 22,
{{lmt}} would have stopped with an error message.


</code>
== Estimating a mean and genetic effects ==

Revision as of 11:56, 2 December 2020

The lmt parameter file is in written in “eXtensible Markeup Language” (xml). For under- standing the examples you may want to get a jump start in xml file structure first. Don’t be scared. Even without any previous knowledge you’ll be able to understand the xml structure used for lmt in less then half an hour. Please consult the Internet to find suitable introduction into xml

Estimating a mean

Consider the linear model $$y=Xb+e$$ where $$y$$ is a vector of $$n$$ observations, $$X$$ is as single column matrix of $$1$$, $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is a $$n \times n$$ co-variance matrix. Since the only source of variation in $$y$$ is $$e$$, $$R$$ is equivalent to the residual co-variance matrix. The generalized least square solution for $$b$$ would be obtained from $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$. Note that in simple cases $$R=I\sigma_e^2$$, where $$I$$ is an $$n \times n$$ identity matrix, and estimating $$b$$ simplifies to $$b=(X'X)^{-1}X'y$$, which is equivalent to $$b=\frac{\sum_{i=1}^n y_i}{n}$$. These formulas seem to be a bit overshot for the task of estimating a mean, however, they should raise the awareness that a sound general model always needs a residual co-variance matrix.

Assume we have a data file "data.csv" with content:

#y,mu
25.0,1
33.1,1
36.0,1
28.3,1

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. A valid lmt xml parameter file would look like:

<root>
  <data>
    file: data.csv
  </data>
  <vars>
    <res>
      <sigma>
        <matrix>
          5.0
        </matrix>
      </sigma>
    </res>
  </vars>
  <model>
    <eqn attributes="strings">
      y=mu*b
    </eqn>
  </model>
  <jobs>
    jobs: solve
    <solve>
      solver: mysolver
    </solve>
  </jobs>
  <solvers>
    solvers: mysolver
    <mysolver>
      type: pcgiod
      <pcgiod>
         rounds: 10000
      </pcgiod>
    </mysolver>
  </solvers>
</root>

Note the hierarchical nesting structure in the above parameter file. Tags , <vars>, <model>, <jobs> and <solvers> are all nested inside tag <root>. However, all those tags may contain nested tags as well. It is crucial that nested tags are placed in the right position. The most important aspect is the model definition in tag <eqn>, nested inside tag <model> in line 16, which is $$y=mu*b$$. The variable names used here are either defined by the data file header, or by the user. That is, $$y$$ and $$mu$$ are defined in the data file header, whereas $$b$$ is a user-defined variable name. Translated this means that the content of the data column named $$y$$ should be regressed on the content of the data column named $$mu$$ with the regression coefficient named $$b$$. Since there are no further specifications supplied about $$y$$, $$mu$$ and $$b$$, it is assumed that $$y$$ is continuous , $$mu$$ is a classification variable, and $$b$$ is fixed factor. Since linear mixed models are used to analyse sources of data variation we need to define the causal variances. These variances are defined inside tag <vars>. In our example this tag spans from line 5 to line 13. lmt requires one compulsory variance, the residual variance, which must be specified via tag <res>. This is sufficient for our model as we don’t have any random effects. The actual residual variance is specified inside tag <sigma>. The reason for having an extra tag <sigma> inside <res> will become clear in the next example. Note that the spelling of all tags used in the above parameter file is determined by lmt and must be abide by. However, red-coloured words are user defined, but once the word is user-defined any subsequent use the spelling must be the same. For example look at the solver tag spanning from line 25 to 33. It defines a solver named "mysolver" being of type <pcgiod> which will run till convergence but not more than 10000 rounds. Subsequently the name "mysolver" is used in job solve to provide the job with a solver to fulfil the task. However, if mysolver would have been spelled MYSOLVER in line 22, lmt would have stopped with an error message.

Estimating a mean and genetic effects