Examples

From Linear Mixed Models Toolbox
Jump to navigation Jump to search

The examples provided in this section are meant to provide a practical examples about the lmt facilities and the parameter file syntax. It is assumed that the reader is familiar with section

Solving linear mixed model equations

Estimating a mean in a uni-variate model

Estimating a mean is equivalent to obtaining the generalized least square solution $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$ for model $$y=Xb+e$$, where $$y$$ is a vector of $$n$$ observations, $$X$$ is as single column matrix of $$1$$, $$b$$ is a fixed factor (mean), $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is a $$n \times n$$ co-variance matrix.

From the above it follows that for task of solving for $$b$$ lmt needs following information:

the data
the residual variance $$R$$
the model
the solver

Assume we have a data file "data.csv" with content:

#y,mu
25.0,1
33.1,1
36.0,1
28.3,1

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. A valid lmt xml parameter file would look like:

<root>
  <models>
    <eqn attributes="strings">
      y=mu*b
    </eqn>
  </models>
  <data>
    datafile: data.csv
    missingthreshold: -50.0
  </data>
  <vars>
    <res>
      <sigma>
        <matrix attributes="matrix">
          5.0
        </matrix>
      </sigma>
    </res>
  </vars>
</root>

Following the introduced parameterfile terminology tags , <vars> and <model> are automatic-compulsory. Since solve is the default job and we are using the default solver in default parameterization no further information about the job or solver is required.

The most important aspect is the model definition in tag <eqn> , nested inside tag <model> $$y=mu*b$$. The variable names used here are either defined by the data file header, or by the user. That is, $$y$$ and $$mu$$ are defined in the data file header, whereas $$b$$ is a user-defined factor name. Translated this means that the content of the data column named $$y$$ should be regressed on the content of the data column named $$mu$$ with the regression coefficient named $$b$$.

Since there are no further specifications supplied about $$y$$, $$mu$$ and $$b$$, it is assumed that $$y$$ is a continuous variable, $$mu$$ is a classification variable, and $$b$$ is fixed factor. The necessary variances are defined by the content of the automatic-compulsory tag <vars> . lmt requires one compulsory variance, the residual variance, which must be specified via tag <res> . Therefore tag res is automatic-compulsory.

The default lmt variance structure is $$\Gamma\otimes\Sigma$$, where $$\Gamma$$ and $$\Sigma$$ are specified inside tags <gamma> and <sigma> , respectively. However, only tag <sigma> is automatic-compulsory, whereas tag <gamma> is automatic-optional. A missing <gamma> tag implies that $$\Gamma = I$$. Note that for lmt $$\Sigma$$ is always a matrix, that is a scalar $$\sigma^2$$ is treated as a matrix $$1 \times 1$$ matrix.

For the above example, the variance specification inside <res> implies that $$\Gamma\otimes \Sigma \equiv I\otimes \Sigma$$. Since $$\Sigma$$ is a $$1\times 1$$ matrix with $$\Sigma[1,1]=\sigma_e^2$$, $$R$$ reduces to $$I\sigma_e^2$$.

Note tag <matrix> nested in tag <sigma> . The content of tag <matrix> does not comply with the formatting rules as pointed o ut above. That is 5.0 is not a valid key string. To let lmt know that the content of tag <matrix> should not be evaluated as a key string, with a subsequent error message, the tag must have attributes. In this example matrix attributes="matrix" escapes the content of tag <matrix> from the formatting rules.

Further, tag <matrix> is automatic-optional. This might be confusing because, as pointed out above, $$\Sigma$$ forms an indispensable part of $$\Gamma\otimes \Sigma$$. However, tag <matrix> belongs to a group of mutually exclusive information sources of which members are tag <matrix> and key string file: yourfilename . That is, $$\Sigma$$ maybe either embedded in the parameter file or supplied via an external file.

Note that the spelling of most tags used in the above parameter file is determined by lmt and must be abide by.

Estimating a fixed mean and a random genetic effect in a uni-variate model

Consider the linear model $$y=Xb+Zu+e$$ where all variables are those declared in #Estimating a mean, $$u$$ is vector of length $$m$$ of random direct genetic effects and $$Z$$ is a design matrix of dimension $$n \times m$$ linking genetic effects to their respective observations. Note that $$u\sim N(0,A\sigma_a^2)$$ where $$A$$ is the pedigree-derived relationship matrix and forms the $$\Gamma$$ part in $$\Gamma\otimes\Sigma$$. A possible data file for such mode may look like:

#y,mu,id
25.0,1,5
33.1,1,6
36.0,1,7
28.3,1,8

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. Further assume a pedigree in a file called "ped.csv" with content:

1,0,0
2,0,0
3,1,0
4,0,2
5,3,4
6,0,4
7,5,4
8,5,7


and a valid lmt parameter file:

<root>
  <data>
    file: data.csv
  </data>
  <pedigrees>
    pedigrees: my_ped
      <my_ped>
        file: ped.csv
      </my_ped>
  </pedigrees>
  <vars>
    <res>
      <sigma>
        <matrix attributes="matrix">
          5.0
        </matrix>
      </sigma>
    </res>
    vars: my_var
    <my_var>
      <sigma>
        file: sigma.csv
      </sigma>
      <gamma>
        <A>
          pedigree: my_ped
        </A>
      </gamma>
    </my_var>
  </vars>
  <model>
    <eqn attributes="strings">
      y = mu*b + id*u(v(my_var(1)))
    </eqn>
  </model>
</root>

Compared with the parameter file in example #Estimating a mean the one above contains only a few extra elements. One this the automatic-optional <pedigrees> nested inside tag <root> . This tag contains a keystring pedigrees: myped , where the user-defined variable behind pedigrees: is the name of a nominated-compulsory tag nested inside tag {cc|<pedigrees>}}. This concept allows to supply several pedigrees to lmt (e.g. a normal pedigree and a genetic group pedigree). In our example we have only one pedigree named my_ped, with tag <my_ped> containing the information about this pedigree. Another additional element is the key string vars: my_var nested in tag <vars> where the variable of key string vars: my_var provides the tag names of nominated-compulsory tags, in this example tag <my_var> .

Tag <myvar> consist of two structural components: the automatic-compulsory tag <sigma> and the automatic-optional <gamma> . Since the the variance of $$u=A\sigma_a^2$$, where $$A=\Gamma$$ and $$\sigma_a^2=\Sigma$$, a <gamma> tag must be supplied to fully specify the variance. Note that if the <gamma> tag is missing or miss-spelled lmt will assume that the variance of $$u=I\sigma_a^2$$. Tag <gamma> contains a automatic-compulsory tag <A> which specifies the $$\Gamma=A$$. Since $$A$$ is build from a pedigree, tag <A> contains a compulsory key string pedigree: my_ped which nominates pedigree in tag <my_ped> to be used for building $$A$$.

Note the difference between the tags <sigma> nested in tag <res> and <my_var> . The former specifies <sigma> to be provided by tag <matrix attributes="matrix"> , whereas the latter specifies <sigma> to be provided by a file.

The model section in the above parameter file need to communicate to to lmt that $$u$$ is a random factor with a variance $$A\sigma_a^2$$. This is done by extending the u.d. factor name u in y = mu*b + id*u(v(my_var(1))) by a specifier (v(my_var(1))) . Note that without a specifier u would be regarded as a fixed factor. The specifier u(v) communicates that u has a variance assigned. Further, v has a specifier assigned via v(my_var) which communicates that the name of the variance is my_var . The variance in tag <my_var> contains a <gamma> and a <sigma> component. The integer number inside bracket my_var(1) communicates that $$\sigma_a^2$$ of u is located in the first diagonal element of $$\Sigma$$.

In summary construct u(v(my_var(1))) communicates that

  • u has a variance assigned
  • the variance is named my_var
  • the variance is located in the first diagonal element of the $$\Sigma$$ matrix specified in tag <sigma> nested in tag <my_var>>

Estimating fixed means and a random genetic effects in a multi-variate model

Consider the linear model

$$ \left( \begin{array}{c} y_1 \\ y_2 \end{array} \right)= \left( \begin{array}{cc} X_1 & 0 \\ 0 & X_2 \\ \end{array} \right) \left( \begin{array}{c} b_1 \\ b_2 \end{array} \right) + \left( \begin{array}{cc} Z & 0 \\ 0 & Z \end{array} \right) \left( \begin{array}{c} u_1 \\ u_2 \end{array} \right) + \left( \begin{array}{cc} I & 0 \\ 0 & I \end{array} \right) \left( \begin{array}{c} e_1 \\ e_2 \end{array} \right) $$

where all variables are those declared in above, and subscripts $$1$$ and $$2$$ index trait $$1$$ and $$2$$, respectively.

Note that $$[u_1,u_2]\sim N([0,0],A\otimes \Sigma_a)$$ where $$A$$ is the pedigree-derived relationship matrix and $$ \Sigma_a= \left(\begin{array}{cc} \sigma_{a_1}^2 & \sigma_{a_1,a_2}\\ \sigma_{a_2,a_1} & \sigma_{a_2}^2 \end{array}\right) $$ Further, $$[e_1,e_2]\sim N([0,0],I\otimes \Sigma_e)$$ with $$ \Sigma_e= \left(\begin{array}{cc} \sigma_{e_1}^2 & \sigma_{e_1,e_2}\\ \sigma_{e_2,e_1} & \sigma_{e_2}^2 \end{array}\right) $$.

A possible data file for such mode may look like:

#y1,y2,mu,id
25.0,0.8,1,5
33.1,0.5,1,6
36.0,1.5,1,7
28.3,3.6,1,8

and the pedigree files is that provided in example #Estimating a mean and a random genetic effect in a uni-variate model.


and a valid lmt parameter file:

<root>
  <data>
    file: data.csv
  </data>
  <pedigrees>
    pedigrees: my_ped
      <my_ped>
        file: ped.csv
      </my_ped>
  </pedigrees>
  <vars>
    <res>
      <sigma>
        <matrix attributes="matrix">
          5.0,0.8
          0.8,1.2
        </matrix>
      </sigma>
    </res>
    vars: my_var
    <my_var>
      <sigma>
        file: sigma.csv
      </sigma>
      <gamma>
        <A>
          pedigree: my_ped
        </A>
      </gamma>
    </my_var>
  </vars>
  <model>
    <eqn attributes="strings">
      y1 = mu*b1 + id*u1(v(my_var(1)))
      y2 = mu*b2 + id*u2(v(my_var(2)))
    </eqn>
  </model>
</root>

Example code chunks

The following code chunks are only subset of a full parameter file. It is assumed that all other parts of the instruction file are functional and all necessary input data available.

Providing several pedigrees

<root>
  ...
  <pedigrees>
    pedigrees: a,b
    <a>
      ...
    </a>
    <b>
      ...
    </b>
  </pedigrees>
 ...
</root>

Providing a pedigree containing genetic groups

<root>
  ...
  <pedigrees>
    pedigrees: a
    <a>
      phantomparents: 2
      ...
    </a>
  </pedigrees>
 ...
</root>

Providing a probabilistic pedigree

<root>
  ...
  <pedigrees>
    pedigrees: a
    <a>
      switch: probabilistic
      ...
    </a>
  </pedigrees>
 ...
</root>

Providing several genotype files

<root>
  ...
  <genotypes>
    genotypes: a,b
    <a>
      ...
    </a>
    <b>
      ...
    </b>
  </genotypes>
 ...
</root>

Constructing several GRMs

<root>
  ...
  <genotypes>
    genotypes: a,b
    <a>
      ...
    </a>
    <b>
      ...
    </b>
  </genotypes>
  <grms>
    grms: x,y
    <x>
      genotype: a
      ...
    </x>
    <y>
      genotype: b
      ...
    </y>
  </grms>
 ...
</root>

Running a model with two ssGBLUP factors

<root>
  <pedigrees>
    pedigrees: a
    <a>
      ...
    </a>
  <pedigrees>
  <genotypes>
    genotypes: a,b
    <a>
      ...
      pedigree: a
    </a>
    <b>
      ...
      pedigree: a
    </b>
  </genotypes>
  <grms>
    grms: x,y
    <x>
      genotype: a
      ...
    </x>
    <y>
      genotype: b
      ...
    </y>
  </grms>
  <vars>
    ...
    vars: g1,g2
    <g1>
     ...
     <gamma>
       <H>
         ...
         grm: x
       </H>
     </gamma>
    </g1>
    <g2>
     ...
     <gamma>
       <H>
         ...
         grm: y
       </H>
     </gamma>
    </g2>
  </vars>
  ...
  <models>
    <eqn>
      y1=mu*b1+id*u11(v(g1(1))+id*u21(v(g2(1))
      y2=mu*b2+id*u12(v(g1(2))+id*u22(v(g2(2))
    </eqn>
  </models>
</root>

Defining a ssSNPBLUP variance

<root>
  <pedigrees>
    pedigrees: a
    <a>
      ...
    </a>
  <pedigrees>
  <genotypes>
    genotypes: a
    <a>
      ...
      pedigree: a
    </a>
  <vars>
    ...
    vars: g
    <g>
      type: snpblup1
      genotype: a
      aweight: 0.05
      switch: adjustg2a
      <sigma>
    	file: covG.csv
      </sigma>
      <marker_sb1>
	    <sigma>
	      file: covG.csv
	    </sigma>
      </marker_sb1>
    </g>
  </vars>
  ...
  <models>
    <eqn>
      y1=mu*b1+individual*u1(v(g(1))+dam*m1(v(g(2))
      y2=mu*b2+individual*u2(v(g(3))+dam*m2(v(g(4))
    </eqn>
  </models>
</root>

Defining a random-regression test-day model

Override the default job parameters

<root>
  ...
  <jobs>
    jobs: default
    <default>
      conv: -9.21 <! log(10e-5)>
    </default>
  </jobs>
 ...
</root>

Use job "solve" instead of "default"

<root>
  ...
  <jobs>
    jobs: solve
    <solve>
      solver: x
    </solve>
  </jobs>
  <solvers>
    solvers: x
    <x>
    </x>
  </solvers>
 ...
</root>

Use a direct solver in stead of the default solver

<root>
  ...
  <jobs>
    jobs: solve
    <solve>
      solver: x
    </solve>
  </jobs>
  <solvers>
    solvers: x
    <x>
      <direct>
      </direct>
    </x>
  </solvers>
 ...
</root>