Examples

From Linear Mixed Models Toolbox
Revision as of 04:25, 25 December 2020 by Boerner (talk | contribs)

The lmt parameter file must be written in “eXtensible Markup Language” (xml). To understand the examples you may want to get a jump start in xml file structure first. Don’t be scared. Even without any previous knowledge you’ll be able to understand the xml structure used for lmt in less than half an hour. Please consult the Internet to find a suitable introduction to xml. Bear in mind that the lmt parameter file is case sensitive, that is, <Hello> is not the same as <hello>.

Unsupported xml features

Currently unsupported xml features are character entities and complete empty-element tags. Further, start tags and end tags must not occur in the same line of the parameter file.
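The restrictions above can be checked mechanically before a parameter file is handed to lmt. The following is a minimal sketch in Python, not part of lmt: the function name lint_parameter_file is hypothetical, and it assumes that "empty-element tag" means the xml short form such as <x/>.

```python
import re

# Hypothetical helper, not part of lmt.
# TAG captures: leading "/" (end tag), the tag name, trailing "/" (empty-element tag).
TAG = re.compile(r"<(/?)([^<>\s/]+)[^<>]*?(/?)>")
# Character entities such as &amp; or &#160; (assumed meaning of "character entities").
ENTITY = re.compile(r"&(#\d+|#x[0-9A-Fa-f]+|\w+);")

def lint_parameter_file(text):
    """Return a list of (line_number, message) for unsupported constructs."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        tags = TAG.findall(line)
        if any(empty for _, _, empty in tags):
            problems.append((number, "empty-element tag"))
        has_start = any(slash == "" and empty == "" for slash, _, empty in tags)
        has_end = any(slash == "/" for slash, _, _ in tags)
        if has_start and has_end:
            problems.append((number, "start and end tag on the same line"))
        if ENTITY.search(line):
            problems.append((number, "character entity"))
    return problems
```

An empty result means that none of the three constructs was found; it does not mean that the file is otherwise valid for lmt.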

Parameter file terminology

The lmt parameter file has only two major structural components: tags and key strings. An easy to follow explanation of tags can be found under the above link. The content of each tag can be another tag or a key string.

Tag names

The tag name is the character string between the angle brackets of a tag. For example, the tag name of <hello> is hello . Depending on the specific location and function of a tag, the name can be hard-coded or user-defined (u.d.). U.d. tag names can be as short as a single character, but must not contain any white space. While a tag name may contain any ASCII character, it is strongly recommended to use only alpha-numeric characters and underscores.

Key strings

Key strings always have the format keyword:variable . Keyword is a character string which is either hard-coded or user-defined. Its spelling is therefore fixed either by the hard-coded value or by the user, and must be abided by. Variable refers to a character string or a comma-separated list of strings, which may be hard-coded or user-defined, or to a single numeric value or a comma-separated list of numeric values.
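To make the keyword:variable format concrete, here is a small Python sketch that splits a key string into its keyword and its list of values. The function parse_key_string is a hypothetical illustration and says nothing about how lmt itself reads key strings; values are kept as strings and the case of the keyword is preserved.

```python
def parse_key_string(line):
    """Split 'keyword: v1,v2' into (keyword, [v1, v2]); values stay strings."""
    keyword, separator, variable = line.strip().partition(":")
    if not separator or not keyword.strip():
        raise ValueError(f"not a key string: {line!r}")
    values = [item.strip() for item in variable.split(",") if item.strip()]
    return keyword.strip(), values

print(parse_key_string("    others: y,z"))
print(parse_key_string("missingthreshold: -50.0"))
```

A line such as 5.0 contains no colon and is rejected, which is why tag content of that kind needs special treatment.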

Parsing logic

lmt parses the parameter file starting at the compulsory <root> tag. Nested tags can be either automatic-compulsory, automatic-optional or nominated. Automatic-optional tags will be searched for by default and are evaluated if they are present. Their absence will not cause an error message at parsing time. However, that does not mean that the information they should provide is not necessary; later on, its absence may cause an error stop. Automatic-compulsory tags will be searched for by default and their absence will cause an error stop. Nominated tags are always compulsory but are only searched for if they have been nominated by a key string variable, where the key string variable becomes the tag name.

Depending on the host tag, key strings can be optional or compulsory.

<root>
  <nest_1>
  </nest_1>
  <nest_2>
    <x>
    </x>
    others: y,z
    <y>
    </y>
    <z>
    </z>
  </nest_2>
</root>

In the above example nested tags <nest_1> and <nest_2> may be optional or compulsory, but both are automatic, that is, lmt searches for them by default and evaluates them if they are present. Tag <x> , nested in tag <nest_2> , is automatic and may be optional or compulsory as well. However, tags <y> and <z> are nominated and therefore compulsory. The nomination is triggered by providing key string others: y,z where the variable y,z provides a comma-separated list of names of the nominated tags. That is, after evaluating key string others: y,z lmt will search for tags <y> and <z> , and their absence will cause an error stop.
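The nomination rule can be sketched in a few lines of Python. This is an illustration under stated assumptions and not lmt's parser: the functions parse_tree and missing_nominated are hypothetical, every tag is assumed to sit on its own line, and tag content that is not a key string is simply ignored.

```python
import re

def parse_tree(text):
    """Parse one-tag-per-line text into nested dicts {"tags": {...}, "keys": {...}}."""
    root = {"tags": {}, "keys": {}}
    stack = [("", root)]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        end = re.fullmatch(r"</([^\s>]+)>", line)
        start = re.fullmatch(r"<([^/\s>]+)[^>]*>", line)
        if end:
            name, _ = stack.pop()
            if name != end.group(1):
                raise ValueError(f"unexpected end tag: {line}")
        elif start:
            node = {"tags": {}, "keys": {}}
            stack[-1][1]["tags"][start.group(1)] = node
            stack.append((start.group(1), node))
        elif ":" in line:
            keyword, _, variable = line.partition(":")
            values = [v.strip() for v in variable.split(",")]
            stack[-1][1]["keys"][keyword.strip()] = values
    return root

def missing_nominated(node, keyword):
    """Names listed behind `keyword` that have no matching tag nested in `node`."""
    return [name for name in node["keys"].get(keyword, []) if name not in node["tags"]]

example = """\
<root>
  <nest_2>
    <x>
    </x>
    others: y,z
    <y>
    </y>
  </nest_2>
</root>
"""
nest_2 = parse_tree(example)["tags"]["root"]["tags"]["nest_2"]
print(missing_nominated(nest_2, "others"))  # ['z'], since <z> is nominated but absent
```

In this variant of the example tag <z> has been left out on purpose, so the check reports it. According to the rules above, lmt would stop with an error in that situation.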


Solving linear mixed model equations

Estimating a mean

Estimating a mean is equivalent to obtaining the generalized least squares solution $$b=(X'R^{-1}X)^{-1}X'R^{-1}y$$ for model $$y=Xb+e$$, where $$y$$ is a vector of $$n$$ observations, $$X$$ is a single-column matrix of ones, $$b$$ is a fixed factor (mean), $$e$$ is the residual and $$y\sim N(Xb,R)$$ where $$R$$ is an $$n \times n$$ co-variance matrix.

From the above it follows that for the task of solving for $$b$$ lmt needs the following information:

the data
the residual variance $$R$$
the model
the solver

Assume we have a data file "data.csv" with content:

#y,mu
25.0,1
33.1,1
36.0,1
28.3,1

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. A valid lmt xml parameter file would look like:

<root>
  <jobs>
    jobs: solve
    <solve>
      solver: my_solver
    </solve>
  </jobs>
  <models>
    <eqn attributes="strings">
      y=mu*b
    </eqn>
  </models>
  <data>
    datafile: data.csv
    missingthreshold: -50.0
  </data>
  <vars>
    <res>
      <sigma>
        <matrix attributes="matrix">
          5.0
        </matrix>
      </sigma>
    </res>
  </vars>
  <solvers>
    solvers: my_solver
    <my_solver>
    </my_solver>
  </solvers>
</root>

Note the hierarchical nesting structure in the above parameter file. Tags <data> , <vars> , <models> , <jobs> and <solvers> are all nested inside tag <root> . However, all those tags may contain nested tags as well. It is crucial that nested tags are placed in the right position. The most important aspect is the model definition in tag <eqn> in line 9, nested inside tag <models> , which is $$y=mu*b$$. The variable names used here are either defined by the data file header, or by the user. That is, $$y$$ and $$mu$$ are defined in the data file header, whereas $$b$$ is a user-defined variable name. In other words, the content of the data column named $$y$$ should be regressed on the content of the data column named $$mu$$ with the regression coefficient named $$b$$. Since there are no further specifications supplied about $$y$$, $$mu$$ and $$b$$, it is assumed that $$y$$ is continuous, $$mu$$ is a classification variable, and $$b$$ is a fixed factor. The necessary variances are defined inside tag <vars> . lmt requires one compulsory variance, the residual variance, which must be specified via tag <res> . This is sufficient for our model as we don’t have any random effects.

The default lmt variance structure is $$\Gamma\otimes\Sigma$$, where $$\Gamma$$ and $$\Sigma$$ are specified inside tags <gamma> and <sigma> , respectively. However, only tag <sigma> is compulsory, as a missing <gamma> tag implies that $$\Gamma = I$$. For the above example, the variance specification inside <res> implies $$R=I\sigma_e^2$$ with $$\sigma_e^2=5.0$$. Note tag <matrix> nested in tag <sigma> . The content of tag <matrix> does not comply with the formatting rules described above, that is, 5.0 is not a key string. To let lmt know that the content of tag <matrix> should not be evaluated as a key string, which would result in an error message, the tag must have attributes, in this example <matrix attributes="matrix"> .
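With $$R=I\sigma_e^2$$ the generalized least squares solution from the beginning of this section reduces to the arithmetic mean of $$y$$, which for the four records in "data.csv" is 30.6. The following Python sketch works through that arithmetic by hand. It is not lmt output, and the variable names are chosen for illustration only.

```python
y = [25.0, 33.1, 36.0, 28.3]   # column y of data.csv
x = [1.0, 1.0, 1.0, 1.0]       # column mu: one level, so X is a single column of ones
sigma2_e = 5.0                 # residual variance given in <res>, <sigma>, <matrix>

# With R = I * sigma2_e, R^{-1} is diagonal with entries 1 / sigma2_e.
xt_rinv_x = sum(xi * xi / sigma2_e for xi in x)               # X'R^{-1}X
xt_rinv_y = sum(xi * yi / sigma2_e for xi, yi in zip(x, y))   # X'R^{-1}y

b = xt_rinv_y / xt_rinv_x      # GLS solution, here equal to the mean of y
var_b = 1.0 / xt_rinv_x        # (X'R^{-1}X)^{-1} = sigma2_e / n

print(round(b, 6), round(var_b, 6))
```

Note that the value of $$\sigma_e^2$$ cancels in $$b$$ but not in $$(X'R^{-1}X)^{-1}$$, which equals $$\sigma_e^2/n = 1.25$$ here.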

The number of nested tags that lmt searches for inside a host tag depends on the nature of the host tag.

Note that the spelling of all tags used in the above parameter file is determined by lmt and must be abided by. However, all words starting with my_ are user-defined; once such a word has been defined, its spelling must be the same in every subsequent use. For example, consider the tag <my_solver> nested inside tag <solvers> . It defines a solver named "my_solver". Here the tag is left empty; the parameter file in #Estimating a mean and genetic effects shows a solver declared to be of type pcgiod, which will run until convergence but for no more than 10000 rounds. Subsequently the name "my_solver" is used in job solve to provide the job with a solver to fulfil the task. However, if my_solver had been spelled MY_SOLVER in line 5, lmt would have stopped with an error message.
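Because user-defined names are case sensitive, a quick consistency check can catch a misspelled solver name before lmt does. The sketch below is a hypothetical helper, not an lmt feature: it only looks at the key strings solver: and solvers: and compares the names literally.

```python
import re

def undeclared_solvers(text):
    """Return names used via 'solver:' that are not listed behind 'solvers:'."""
    declared, used = set(), []
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(solvers?)\s*:\s*(.*?)\s*", line)
        if not match:
            continue
        names = [n.strip() for n in match.group(2).split(",") if n.strip()]
        if match.group(1) == "solvers":
            declared.update(names)   # names nominated inside <solvers>
        else:
            used.extend(names)       # names requested by a job
    return [name for name in used if name not in declared]

print(undeclared_solvers("solver: MY_SOLVER\nsolvers: my_solver"))  # ['MY_SOLVER']
```

The comparison is deliberately exact, so my_solver and MY_SOLVER count as two different names, in line with the case sensitivity of the parameter file.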

Estimating a mean and genetic effects

Consider the linear model $$y=Xb+Zu+e$$ where all variables are as declared in #Estimating a mean, $$u$$ is a vector of length $$m$$ of random direct genetic effects and $$Z$$ is a design matrix of dimension $$n \times m$$ linking genetic effects to their respective observations. Note that $$u\sim N(0,A\sigma_a^2)$$ where $$A$$ is the pedigree-derived relationship matrix. A possible data file for such a model may look like:

#y,mu,id
25.0,1,5
33.1,1,6
36.0,1,7
28.3,1,8

where the columns are comma-separated, the first row is commented out with “#” but contains the header, and all other rows contain data records. Further assume a pedigree in a file called "ped.csv" with content:

1,0,0
2,0,0
3,1,0
4,0,2
5,3,4
6,0,4
7,5,4
8,5,7


and a valid lmt parameter file:

<root>
  <data>
    file: data.csv
  </data>
  <pedigrees>
    pedigrees: myped
    <myped>
      file: ped.csv
    </myped>
  </pedigrees>
  <vars>
    <res>
      <sigma>
        <matrix attributes="matrix">
          5.0
        </matrix>
      </sigma>
    </res>
    vars: myvar
    <myvar>
      <sigma>
        file: mysigma.csv
      </sigma>
      <gamma>
        <A>
          pedigree: myped
        </A>
      </gamma>
    </myvar>
  </vars>
  <model>
    <eqn attributes="strings">
      y = mu*b + id*u(v(myvar(1)))
    </eqn>
  </model>
  <jobs>
    jobs: solve
    <solve>
      solver: mysolver
    </solve>
  </jobs>
  <solvers>
    solvers: mysolver
    <mysolver>
      type: pcgiod
      <pcgiod>
        rounds: 10000
      </pcgiod>
    </mysolver>
  </solvers>
</root>

Compared with the parameter file in example #Estimating a mean, the one above contains only a few extra elements. One is the <pedigrees> tag spanning lines 5 to 10 and nested inside tag <root> . This tag contains a key string pedigrees: myped , where the user-defined variables behind pedigrees: are a comma-separated list of tags nested inside <pedigrees>. The provision of that list triggers lmt to search for and evaluate those tags. This concept allows several pedigrees to be supplied to lmt (e.g. a normal pedigree and a genetic group pedigree). In our example we have only one pedigree named myped, with tag <myped> containing the information about this pedigree. Another additional element is the key string vars: myvar in line 19. Similar to tag <pedigrees> , tags nested inside tag <vars> are only evaluated if nominated behind vars: as a comma-separated list (e.g. vars: a,b,c), except for tag <res> , which is compulsory and evaluated automatically. Tag <myvar> consists of two structural components: <sigma> and <gamma> . We know <sigma> already from tag <res> . To understand the requirement for tag <gamma> we need to acknowledge that the variance for a random factor can be generalized as a Kronecker product $$\Gamma\otimes\Sigma$$, which for the genetic effects is $$A\otimes\sigma_a^2=A\sigma_a^2$$.


In contrast to the residual variance $$I\sigma_e^2$$, where $$I$$ can be safely omitted, $$A\sigma_a^2$$ has two components, $$A$$ and $$\sigma_a^2$$, which both need to be declared. The section in the parameter file where the variances are declared spans from line 11 to line 30. It contains two variances, <res> and <myvar> . Since <res> is the residual variance it is compulsory, and if it is missing lmt will stop. All other variances are only evaluated if they appear in key string vars: var1,var2,...,varN , which in the above example is vars: myvar. Variance "myvar" contains two components: <sigma> and <gamma> .
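To give an idea of what the component $$A$$ looks like for the pedigree in "ped.csv", the Python sketch below builds it with the classical tabular method. This is an illustration only and makes no claim about how lmt handles $$A$$ internally. It assumes that the three pedigree columns are the individual followed by its two parents, that 0 denotes an unknown parent, that ids run from 1 to n, and that parents are listed before their offspring; the function name relationship_matrix is hypothetical.

```python
def relationship_matrix(pedigree):
    """Tabular method; pedigree rows are (id, parent1, parent2), 0 = unknown."""
    n = len(pedigree)
    # Row and column 0 stay zero and stand for the unknown parent.
    a = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i, p1, p2 in pedigree:
        for j in range(1, i):
            # Relationship with an older individual: average over both parents.
            a[i][j] = a[j][i] = 0.5 * (a[j][p1] + a[j][p2])
        # Diagonal: one plus half the relationship between the parents.
        a[i][i] = 1.0 + 0.5 * a[p1][p2]
    return a

ped = [(1, 0, 0), (2, 0, 0), (3, 1, 0), (4, 0, 2),
       (5, 3, 4), (6, 0, 4), (7, 5, 4), (8, 5, 7)]
A = relationship_matrix(ped)
print(A[7][7], A[8][8], A[8][7])  # 1.25 1.375 1.0
```

Individuals 7 and 8 have diagonal elements above 1 because their parents are related, which is the kind of information that the residual structure $$I\sigma_e^2$$ does not need.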