Difference between revisions of "The Linear Mixed Models Toolbox"

From Linear Mixed Models Toolbox
 
(69 intermediate revisions by 2 users not shown)
Line 2: Line 2:


The <b>L</b>inear mixed <b>M</b>odels <b>T</b>oolbox ({{lmt}}) is stand-alone, single-executable software for large-scale linear mixed model analysis.
It is the successor of DMU, the well-known and
widely used software package for linear mixed model analysis developed and maintained
by Per Madsen and Just Jensen.


Since the early days of software development in statistics and quantitative genetics
time has moved on in terms of what programming languages are capable of and therefore
DMU has been given a thorough overhaul.

{{lmt}} supports all models commonly used in genetic evaluation and has various options to handle genomic markers.


One result of the overhaul is the new name, {{lmt}}, chosen because the acronym DMU is difficult to translate
into something which remains generally meaningful over time.
Those who prefer the acronym DMU may refer to {{lmt}} as <b>DMU-next</b>.

{{lmt}} has been used successfully for genetic evaluation data sets with >>200k genotyped animals, >>15m animals and >>500m equations.


The second area of the overhaul is the parameter file interface. {{lmt}} now comes with
an xml style parameter file, which is meant to be much easier for the user to understand.
Furthermore, xml is supported by almost every editor with automated commenting, uncommenting,
indentation, code-folding and syntax highlighting, which makes it easier to follow the structure
of the parameter file even if it spans several tens of lines of code.

{{lmt}} is only available for 64-bit Linux operating systems, is run from the Linux command line, and uses an [https://www.w3schools.com/xml/ xml] style parameter file, which is meant to be easy for the user to understand. Furthermore, [https://www.w3schools.com/xml/ xml] is supported by almost every editor with automated commenting, uncommenting, indentation, code-folding and syntax highlighting,
which makes it easier to follow the structure of the parameter file even if it spans several tens of lines of code.


The third area of the overhaul is the program structure. DMU was structured into
several programs (<i>DMU1, DMU4, DMU5, DMUAI, RJMC</i>). In contrast, {{lmt}} is meant
to provide the functionalities of all those programs via a single parameter file and a single
executable.

While {{lmt}} is ultimately meant to be a full-scale successor of DMU, it does not yet provide
all of DMU's functionalities in some areas, whereas in others it already provides more. More specifically,
there are no REML facilities available yet, but large-scale linear mixed model solving
provides Single-Step-T-BLUP facilities, uploading of genotypes, building of genomic
relationship matrices on the fly, etc.

== Conditions of use ==
{{lmt}} can be used by the scientific community free of charge, but users must credit {{lmt}}
in any publications.
Commercial users must obtain the explicit approval of the author before using {{lmt}} and must credit {{lmt}} in any publication in scientific journals.
If {{lmt}} cannot be credited via citation, the author must become a co-author.
 
== How to get it ==
{{lmt}} can be obtained '''on request''' from the [mailto:vinzent.boerner@qgg.au.dk author].

== Supported features ==

=== Supported operations ===

Currently {{lmt}} supports the following operations on linear mixed models:

*solving for BLUE and BLUP solutions of fixed and random factors, respectively, conditional on supplied variances
*Gibbs sampling of variance components in single pass and blocked mode
*MC-EM-REML estimation of variance components
*sampling elements of the inverse of the mixed model coefficient matrix

=== Supported factors and variables ===
{{lmt}} supports
*fixed factors
*random factors
*classification variables
*continuous co-variables, which can be nested. For continuous co-variables {{lmt}} supports user-defined polynomials and hard-coded [https://en.wikipedia.org/wiki/Legendre_polynomials Legendre polynomials] up to order 6.
*genetic group co-variables
 
All classification variables and co-variables can be associated with a fixed or random factor.
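
To illustrate what a Legendre polynomial expansion of a continuous co-variable looks like, the following is a minimal sketch in Python using NumPy. It is not {{lmt}} code: the function name <code>legendre_covariables</code>, the rescaling of the co-variable to the interval [-1, 1] and the use of unnormalized polynomials are assumptions made for illustration only.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_covariables(x, order):
    """Return an (n, order + 1) matrix whose columns are the Legendre
    polynomials P_0 .. P_order evaluated at x rescaled to [-1, 1].

    Assumes x contains at least two distinct values (otherwise the
    rescaling divides by zero)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    t = 2.0 * (x - lo) / (hi - lo) - 1.0   # map [lo, hi] onto [-1, 1]
    return legendre.legvander(t, order)


# Hypothetical co-variable, e.g. days at which records were taken.
days = np.array([5.0, 50.0, 155.0, 305.0])
Phi = legendre_covariables(days, 2)        # columns: P_0, P_1, P_2
```

Each column of <code>Phi</code> would then play the role of one co-variable in the model.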
 
=== Supported variance structures ===
For random factors, {{lmt}} supports variance structures of the form
*[https://en.wikipedia.org/wiki/Kronecker_product $$\Gamma\otimes\Sigma$$], where $$\Sigma$$ is a dense symmetric positive definite matrix, and
*$$\Theta_L(\Gamma\otimes I_{\Sigma})\Theta_L^{'}$$, where $$\Theta$$ is a symmetric positive definite [https://en.wikipedia.org/wiki/Block_matrix#Block_diagonal_matrices block-diagonal matrix] of $$n$$ symmetric positive definite matrices $$\Sigma_i, i=1,..,n$$, $$\Theta_L$$ is the lower [https://en.wikipedia.org/wiki/Cholesky_decomposition Cholesky factor] of $$\Theta$$ and $$I_{\Sigma}$$ is an identity matrix with the dimension of $$\Sigma_i$$.
 
When solving linear mixed models, $$\Sigma$$ and $$\Gamma$$ are user-determined constants, whereas when estimating variances, $$\Gamma$$ is a user-determined constant and $$\Sigma$$ is a function of the data.
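
The two variance structures can be made concrete with a small numerical sketch in Python using NumPy. All matrices below are made-up toy values and <code>block_diag</code> is a hypothetical helper, not part of {{lmt}}. The sketch also checks that the second structure reduces to $$\Gamma\otimes\Sigma$$ when all $$\Sigma_i$$ are equal.

```python
import numpy as np


def block_diag(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    pos = 0
    for b in blocks:
        m = b.shape[0]
        out[pos:pos + m, pos:pos + m] = b
        pos += m
    return out


# Toy constants (assumed values): Gamma is 2 x 2, each Sigma_i is 2 x 2.
Gamma = np.array([[1.0, 0.5], [0.5, 1.0]])
Sigma_1 = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_2 = np.array([[1.5, -0.2], [-0.2, 0.8]])

# Structure 1: Gamma (x) Sigma
V1 = np.kron(Gamma, Sigma_1)

# Structure 2 with identical blocks: Theta = blockdiag(Sigma_1, Sigma_1)
Theta_L = np.linalg.cholesky(block_diag([Sigma_1, Sigma_1]))  # lower factor
V2 = Theta_L @ np.kron(Gamma, np.eye(2)) @ Theta_L.T

# Structure 2 with different blocks: Theta = blockdiag(Sigma_1, Sigma_2)
Theta_L_het = np.linalg.cholesky(block_diag([Sigma_1, Sigma_2]))
V3 = Theta_L_het @ np.kron(Gamma, np.eye(2)) @ Theta_L_het.T
```

With identical blocks <code>V2</code> equals <code>V1</code>, while with different blocks the $$i$$-th diagonal block of <code>V3</code> is $$\Gamma_{ii}\Sigma_i$$.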
 
Supported types for $$\Gamma$$ are
*an [https://en.wikipedia.org/wiki/Identity_matrix identity matrix]
*an arbitrary positive definite [https://en.wikipedia.org/wiki/Diagonal_matrix diagonal matrix]
*a pedigree-based numerator relationship matrix $$A$$ which may contain meta-founders
*a pedigree- and genotype-based relationship matrix $$H$$ which may contain meta-founders
*a user-defined (u.d.) symmetric positive definite matrix whose inverse is supplied
**as a sparse upper-triangular matrix stored in [https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format) csr format]
**as a dense matrix
*a co-variance matrix of a selected auto-regressive process
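
As an illustration of the csr storage mentioned above, the following Python sketch converts the upper triangle of a small symmetric matrix into the three csr arrays. The function name <code>upper_csr</code> and the zero-based indices are assumptions for illustration; the exact file layout and index base expected by {{lmt}} are not described here.

```python
import numpy as np


def upper_csr(M, tol=0.0):
    """Return (values, col_idx, row_ptr) for the upper triangle of a
    symmetric matrix M, diagonal included, using zero-based indices."""
    n = M.shape[0]
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        for j in range(i, n):               # upper triangle only
            if abs(M[i, j]) > tol:
                values.append(float(M[i, j]))
                col_idx.append(j)
        row_ptr.append(len(values))         # end of row i in values
    return values, col_idx, row_ptr


# Toy inverse of a user-defined matrix (assumed values).
Gamma_inv = np.array([[4.0, -1.0, 0.0],
                      [-1.0, 4.0, -1.0],
                      [0.0, -1.0, 4.0]])
values, col_idx, row_ptr = upper_csr(Gamma_inv)
```

Row $$i$$ of the upper triangle is recovered from <code>values[row_ptr[i]:row_ptr[i+1]]</code> together with the matching slice of <code>col_idx</code>.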
 
=== Supported linear mixed model solvers ===
{{lmt}} supports
 
*a direct solver, which requires explicitly building the left-hand-side coefficient matrix ($$C$$) of the linear mixed model equations
*an iteration-on-data pre-conditioned gradient solver which '''does not''' require $$C$$
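
The idea of solving without building $$C$$ can be sketched as follows in Python using NumPy. This is not the {{lmt}} implementation: it assumes a preconditioned conjugate gradient iteration with a diagonal (Jacobi) preconditioner, a single-trait model with an identity $$\Gamma$$, and made-up toy data held in memory, whereas a real iteration-on-data solver would read the data records in each iteration. The names <code>pcg</code>, <code>matvec</code> and <code>lam</code> are hypothetical.

```python
import numpy as np


def pcg(matvec, rhs, diag, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned conjugate gradient for C x = rhs, where C is
    symmetric positive definite and only available through matvec(v) = C v."""
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)              # initial residual
    z = r / diag                     # apply the diagonal preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Cp = matvec(p)
        alpha = rz / (p @ Cp)
        x += alpha * p
        r -= alpha * Cp
        if np.linalg.norm(r) <= tol * np.linalg.norm(rhs):
            break
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x


# Toy data: 5 records, one fixed level (mu) and 3 random levels (id).
y = np.array([10.0, 12.0, 11.0, 9.0, 13.0])
X = np.ones((5, 1))
Z = np.zeros((5, 3))
Z[np.arange(5), [0, 1, 2, 0, 1]] = 1.0
W = np.hstack([X, Z])
Gamma_inv = np.eye(3)                # assumed identity Gamma
lam = 2.0                            # assumed residual-to-factor variance ratio
n_fixed = X.shape[1]


def matvec(v):
    """Compute C v from the design matrices without ever forming C."""
    out = W.T @ (W @ v)
    out[n_fixed:] += lam * (Gamma_inv @ v[n_fixed:])
    return out


# Diagonal of C, also obtained without forming C.
diag = (W ** 2).sum(axis=0)
diag[n_fixed:] += lam * np.diag(Gamma_inv)

solution = pcg(matvec, W.T @ y, diag)
```

The first element of <code>solution</code> is the estimate for the fixed level and the remaining three are the predictions for the random levels.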
 
=== Supported features related to genomic data ===
*direct use of genomic marker data
*building of genomic relationship matrices ($$G$$) from supplied genomic data
*uploading of a u.d. $$G$$
*adjustment of $$G$$ to $$A_{gg}$$
*solving Single-Step-G-BLUP models
*sampling variances for Single-Step-G-BLUP models
*solving Single-Step-T-BLUP models
*solving Single-Step-SNP-BLUP models
*all Single-Step models can be run "bottom-up", that is, the user supplies the genotypes and all necessary ingredients (e.g. $$G$$) are built on the fly.
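
This page does not state which construction {{lmt}} uses for $$G$$. As a minimal sketch of what building a genomic relationship matrix from genotypes involves, the Python code below assumes VanRaden's first method, genotypes coded as 0, 1 or 2 copies of an allele, and allele frequencies estimated from the supplied genotypes. The function name <code>vanraden_g</code> and the data are hypothetical.

```python
import numpy as np


def vanraden_g(M):
    """Genomic relationship matrix from an (animals x markers) genotype
    matrix coded 0/1/2, following VanRaden's first method.

    Assumes at least one marker is polymorphic (otherwise the scaling
    factor is zero)."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0             # allele frequency per marker
    Z = M - 2.0 * p                      # centre each marker column
    scale = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / scale


# Toy genotypes: 3 animals, 3 markers (assumed values).
genotypes = np.array([[0, 1, 2],
                      [1, 1, 0],
                      [2, 1, 1]])
G = vanraden_g(genotypes)
```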
 
=== Supported pedigree types ===
*ordinary pedigrees
*probabilistic pedigrees with an unlimited number of parent pairs per individual
*genetic group pedigrees
*meta-founders
 
== {{lmt}} Linear mixed model terminology ==
=== Matrix notation, factors and sub-factors ===
 
Consider the multi-variate linear mixed model
 
$$
\left(
\begin{array}{c}
y_1 \\
y_2 \\
y_3
\end{array}
\right)
=
\left(
\begin{array}{ccc}
X_1 & 0 & 0 \\
0 & X_2 & 0 \\
0 & 0 & X_3
\end{array}
\right)
\left(
\begin{array}{c}
b_1 \\
b_2 \\
b_3
\end{array}
\right)
+
\left(
\begin{array}{ccc}
Z_1 & 0 & 0\\
0 & Z_2 & 0\\
0 & 0 & Z_3
\end{array}
\right)
\left(
\begin{array}{c}
u_1 \\
u_2 \\
u_3
\end{array}
\right)
+
\left(
\begin{array}{c}
e_1 \\
e_2 \\
e_3
\end{array}
\right)
$$
 
where $$(y_1,y_2,y_3)'$$, $$(b_1,b_2,b_3)'$$, $$(u_1,u_2,u_3)'$$ and $$(e_1,e_2,e_3)'$$ are vectors of response variables, effects of fixed factors, effects of random factors and residuals, respectively, and matrices
$$\left(
\begin{array}{ccc}
X_1 & 0 & 0 \\
0 & X_2 & 0 \\
0 & 0 & X_3
\end{array}
\right)$$, and
$$
\left(
\begin{array}{ccc}
Z_1 & 0 & 0\\
0 & Z_2 & 0\\
0 & 0 & Z_3
\end{array}
\right)
$$ are block-diagonal design matrices linking effects in the respective vectors to their related response variables. In usual mixed model terminology, $$b_1$$, $$b_2$$ and $$b_3$$ are called fixed factors, and $$u_1$$, $$u_2$$ and $$u_3$$ are called random factors. Ignoring the residual, the above model has 6 factors in total.
 
However, the model may be rewritten in matrix formulation as
 
$$vec(Y)=Xvec(B)+Zvec(U)+vec(E)$$,
 
where $$vec$$ is the [https://en.wikipedia.org/wiki/Vectorization_(mathematics) vectorization operator], $$Y=[y_1,y_2,y_3]$$, $$B=[b_1,b_2,b_3]$$, $$U=[u_1,u_2,u_3]$$ and $$E=[e_1,e_2,e_3]$$ are column matrices of response variables, the effects of the fixed and random factors, and the residuals, respectively, and
$$X=\left(
\begin{array}{ccc}
X_1 & 0 & 0 \\
0 & X_2 & 0 \\
0 & 0 & X_3
\end{array}
\right)$$, and
$$Z=
\left(
\begin{array}{ccc}
Z_1 & 0 & 0\\
0 & Z_2 & 0\\
0 & 0 & Z_3
\end{array}
\right)
$$. The distributional assumptions for the random components of the model are $$vec(U^{'})\sim N((0,0,0)',\Gamma_u \otimes \Sigma_u)$$ and $$vec(E^{'})\sim N((0,0,0)',\Gamma_e \otimes \Sigma_e)$$. Note that the column and row dimensions of $$U$$ are determined by the column dimensions of $$\Sigma_u$$ and $$\Gamma_u$$, respectively.
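
The ordering implied by $$vec(U^{'})$$ and $$\Gamma_u \otimes \Sigma_u$$ can be checked with a small Python sketch using NumPy. The numbers are made-up toy values and the helper <code>cov_entry</code> is hypothetical. Rows of $$U$$ are levels and columns are sub-factors, so $$vec(U^{'})$$ stacks the rows of $$U$$, and the covariance between sub-factor $$a$$ of level $$i$$ and sub-factor $$b$$ of level $$j$$ is $$\Gamma_{ij}\Sigma_{ab}$$.

```python
import numpy as np

n_levels, n_sub = 3, 2
# Toy effects: rows are levels, columns are sub-factors.
U = np.arange(1.0, 7.0).reshape(n_levels, n_sub)

vec_U = U.flatten(order="F")     # vec(U): stack the columns of U
vec_Ut = U.flatten(order="C")    # vec(U'): stack the rows of U

# Assumed toy constants: Gamma is levels x levels, Sigma is sub x sub.
Gamma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
V = np.kron(Gamma, Sigma)        # covariance matrix of vec(U')


def cov_entry(i, a, j, b):
    """Covariance between U[i, a] and U[j, b], read from Gamma (x) Sigma."""
    return V[i * n_sub + a, j * n_sub + b]
```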
 
Slightly different to the above terminology, {{lmt}} refers to $$B$$ and $$U$$ as factors, and therefore the model has only two factors, whereas the columns in $$B$$ and $$U$$ are referred to as '''sub-factors'''.
 
Following the above matrix notation, {{lmt}} will always invoke only one factor for all modelled fixed classification variables and only one factor for all modelled fixed continuous co-variables. Sub-factors are summarized into a single random factor if they share the same $$\Sigma$$ matrix. Thus, {{lmt}} will invoke as many random factors as there are different $$\Gamma \otimes \Sigma$$ constructs. That is, in {{lmt}} terminology the multi-variate model
 
$$
\left(
\begin{array}{c}
y_1 \\
y_2 \\
y_3
\end{array}
\right)
=
\left(
\begin{array}{ccc}
X_1 & 0 & 0 \\
0 & X_2 & 0 \\
0 & 0 & X_3
\end{array}
\right)
\left(
\begin{array}{c}
b_1 \\
b_2 \\
b_3
\end{array}
\right)
+
\left(
\begin{array}{cccccc}
Z_{d,1} & 0 & 0 & Z_{m,1} & 0 & 0\\
0 & Z_{d,2} & 0 & 0 & Z_{m,2} & 0\\
0 & 0 & Z_{d,3} & 0 & 0 & Z_{m,3}\\
\end{array}
\right)
\left(
\begin{array}{c}
u_{d,1} \\
u_{d,2} \\
u_{d,3} \\
u_{m,1} \\
u_{m,2} \\
u_{m,3}
\end{array}
\right)
+
\left(
\begin{array}{ccc}
W_1 & 0 & 0\\
0 & W_2 & 0\\
0 & 0 & W_3
\end{array}
\right)
\left(
\begin{array}{c}
v_1 \\
v_2 \\
v_3
\end{array}
\right)
+
\left(
\begin{array}{c}
e_1 \\
e_2 \\
e_3
\end{array}
\right)
$$
 
with
$$(u_{d,1},u_{d,2},u_{d,3},u_{m,1},u_{m,2},u_{m,3})'\sim N((0,0,0,0,0,0)',\Sigma_u \otimes \Gamma_u)$$ and $$(v_1,v_2,v_3)'\sim N((0,0,0)',\Sigma_v \otimes \Gamma_v)$$, rewritten as $$vec(Y)=Xvec(B)+Zvec(U)+Wvec(V)+vec(E)$$, will have only 3 factors, $$B$$, $$U$$ and $$V$$, with $$b_1,b_2,b_3$$, $$u_{d,1},u_{d,2},u_{d,3},u_{m,1},u_{m,2},u_{m,3}$$ and $$v_1,v_2,v_3$$ being sub-factors of $$B$$, $$U$$ and $$V$$, respectively.
 
=== Model syntax ===
 
The syntax for communicating the model to {{lmt}} is effectively '''just write the model'''.
 
A valid {{lmt}} model string would be {{cc|1=y=mu*b+id*u(v(my_var(1)))}}. The model string consists of
*the response variable {{cc|y}}, which must be a column name in the data file
*variables {{cc|mu}} and {{cc|id}}, which must be column names in the data file
*sub-factors {{cc|b}} and {{cc|u}}, which are user-defined alpha-numeric character strings
*relational operators {{cc|1==}}, {{cc|*}} and {{cc|+}}
*a specifier {{cc|(v(my_var(1)))}} used to specify the nature of {{cc|u}}
 
The rules for using relational operators are
*{{cc|1==}} links the response variable to the model
*{{cc|*}} links a model variable to its sub-factor, which together form a right hand side component
*{{cc|+}} concatenates different right hand side components.
 
Variables and sub-factors may be accompanied by a specifier. A specifier is a [https://en.wikipedia.org/wiki/Tree_structure#Nested_parentheses tree diagram] in [https://en.wikipedia.org/wiki/Newick_format Newick format] in which all nodes are named and the root node is the variable or sub-factor. It provides additional information about a variable or sub-factor. The {{lmt}} version of the above [https://en.wikipedia.org/wiki/Newick_format tree diagram] differs in that
*parent nodes precede their child nodes
*child nodes within the same parent node are separated by semicolons
*leaf nodes can contain a bracket space with additional, possibly comma-separated, information
Without any specifier {{lmt}} assumes that
<ol>
<li>variables are classification variables with the respective columns in the data file containing integer numbers coding for the different levels of the associated sub-factor</li>
<li>sub-factors are fixed</li>
</ol>
==== Sub-factor specifiers ====
Sub-factor specifiers are used to communicate that a sub-factor is random. Following the above example {{cc|u(v(my_var(1)))}}, {{cc|u}} is the root node and {{cc|v}} is a child node of {{cc|u}} with the hard-coded name '''v'''. {{cc|my_var}} is a child node of {{cc|v}} with the user-defined name '''my_var''', which references the user-defined name of a $$\Gamma \otimes \Sigma$$ construct. Finally, {{cc|1}} is additional information attached to {{cc|my_var}}, communicating that the diagonal element of $$\Sigma$$ related to {{cc|u}} is diagonal element #1.
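
To make the structure of a model string and its specifiers tangible, the following Python sketch splits a model string into its response variable and right hand side components and turns each specifier into a nested tree. It is an illustration only and not the parser used by {{lmt}}: the function names are hypothetical, and as a simplification the additional information in a leaf bracket (such as {{cc|1}}) is stored as an ordinary child node.

```python
def split_top_level(text, sep):
    """Split text at sep, ignoring separators inside brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse_node(text):
    """Parse 'name(child;child)' into (name, [children]); the parent
    precedes its children and children are separated by semicolons."""
    text = text.strip()
    if "(" not in text:
        return (text, [])
    name, rest = text.split("(", 1)
    if not rest.endswith(")"):
        raise ValueError("unbalanced brackets in " + repr(text))
    children = [parse_node(c) for c in split_top_level(rest[:-1], ";")]
    return (name, children)


def parse_model(model):
    """Return (response, [(variable_tree, sub_factor_tree), ...])."""
    response, rhs = model.split("=", 1)
    components = []
    for term in split_top_level(rhs, "+"):
        variable, sub_factor = split_top_level(term, "*")
        components.append((parse_node(variable), parse_node(sub_factor)))
    return response.strip(), components


response, components = parse_model("y=mu*b+id*u(v(my_var(1)))")
```

Here <code>components[1][1]</code> holds the tree for {{cc|u}}, with {{cc|v}} as its only child and {{cc|my_var}} below {{cc|v}}.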
==== Variable specifiers ====
Variable specifiers are used to communicate further information, for example that the variable
*is continuous and contains real numbers
*is continuous and contains integer numbers
*is a genetic group regression matrix
*undergoes a polynomial expansion
*is associated with a nesting variable
 
 
Variables as well as sub-factors may be used across traits. That is, a model may read
y1=mu*b1+id*u1(v(sigma(1)))
y2=mu*b2+id*u2(v(sigma(2)))
 
== Disclaimer ==
{{lmt}} is under ongoing development and many of its features have been tested only a few
times on a limited number of models and data sets. Thus, users use {{lmt}} entirely
at their own risk. This also applies to any decisions made based on the results provided
by {{lmt}}.
 
== Conditions of use ==
{{lmt}} can be used by the scientific community free of charge, but users must credit {{lmt}}
in any publications. Commercial users must obtain the explicit approval of the author
before using {{lmt}} and must credit {{lmt}} in any publication in scientific journals.


== Feedback and support ==
Line 316: Line 25:
However, the author appreciates feedback about the program functionality, possible aborts (segmentation faults), usability of output and comprehensiveness of the manual.


For feedback, wish-list items, questions and support, contact [mailto:vinzent.boerner@qgg.au.dk vinzent.boerner@qgg.au.dk] (infrequently checked) or [mailto:vinzent.boerner@gmx.de vinzent.boerner@gmx.de] (frequently checked).


 
* [http://localhost/mediawiki/index.php/Run_It Run It]
* [http://localhost/mediawiki/index.php/Inputfileformats Input file formats]
* [http://localhost/mediawiki/index.php/Parameterfile1 Parameter file terminology part 1]
* [http://localhost/mediawiki/index.php/Jump_Start Jump Start]

== [[supported features| Supported features]] ==
== [[Algorithms|Algorithms]] ==
== [[Parameterfile1| Parameter file terminology]] ==
== [[linear mixed models in lmt| Linear mixed models in lmt]] ==
== [[Genomic data in lmt| Genomic data lmt]] ==
== [[File formats|File formats]] ==
== [[Input files|Input files]] ==
== [[Output files|Output files]] ==
== [[Run_It|How to run it]] ==
== [[Examples|Examples]] ==
== [[Parameter file elements| Parameter file elements]] ==

Latest revision as of 00:58, 12 May 2022

Introduction

The Linear mixed Models Toolbox (lmt) is stand-alone, single-executable software for large-scale linear mixed model analysis.

lmt supports all models commonly used in genetic evaluation and has various options to handle genomic markers.

lmt has been used successfully for genetic evaluation data sets with >>200k genotyped animals, >>15m animals, >>500m equations.

lmt is only available for 64-bit Linux operating systems, is run from the Linux command line, and uses an xml style parameter file, which is meant to be easy for the user to understand. Furthermore, xml is supported by almost every editor with automated commenting, uncommenting, indentation, code-folding and syntax highlighting, which makes it easier to follow the structure of the parameter file even if it spans several tens of lines of code.

Conditions of use

lmt can be used by the scientific community free of charge, but users must credit lmt in any publications. Commercial users must obtain the explicit approval of the author before using lmt and must credit lmt in any publication in scientific journals. If lmt cannot be credited via citation the author must become a co-author.

How to get it

lmt can be obtained on request from the author.

Feedback and support

lmt comes without any guaranteed support and the user is strongly advised to study the manual thoroughly. However, the author appreciates feedback about the program functionality, possible aborts (segmentation faults), usability of output and comprehensiveness of the manual.

For feedback, wish-list items, questions and support, contact vinzent.boerner@qgg.au.dk (infrequently checked) or vinzent.boerner@gmx.de (frequently checked).

Supported features

Algorithms

Parameter file terminology

Linear mixed models in lmt

Genomic data lmt

File formats

Input files

Output files

How to run it

Examples

Parameter file elements