|
|
(49 intermediate revisions by 2 users not shown) |
Line 2: |
Line 2: |
|
| |
|
| The <b>L</b>inear mixed <b>M</b>odels <b>T</b>oolbox ({{lmt}}) is a stand-alone single executable software for large scale linear mixed model analysis. | | The <b>L</b>inear mixed <b>M</b>odels <b>T</b>oolbox ({{lmt}}) is a stand-alone single executable software for large scale linear mixed model analysis. |
| It is the successor of DMU, the well-known and
| |
| widely used software package for linear mixed model analysis developed and maintained
| |
| by Per Madsen and Just Jensen.
| |
|
| |
|
| Since the early days of software development in statistics and quantitative genetics
| | {{lmt}} supports all models commonly used in genetic evaluation and has various options to handle genomic markers. |
| time has moved on in terms of what programming languages are capable of and therefore
| |
| DMU has been given a thorough overhaul.
| |
|
| |
|
| One result of the overhaul is the new name, {{lmt}}, resulting from the difficulty to translate
| | {{lmt}} has been used successfully for genetic evaluation data sets with >>200k genotyped animals, >>15m animals, >>500m equations. |
| the acronym DMU into something which remains generally meaningful over time.
| |
| Those who prefer the acronym DMU may refer to {{lmt}} as <b>DMU-next</b>.
| |
|
| |
|
| The second area of the overhaul is the parameter file interface. {{lmt}} now comes with
| | {{lmt}} is only available for 64-bit Linux operating systems, is run from the Linux command line, and uses an [https://www.w3schools.com/xml/ xml] style parameter file which is intended to be easy for the user to understand. Furthermore, using [https://www.w3schools.com/xml/ xml] comes with support for automated commenting, uncommenting, indentation, code-folding and syntax highlighting in almost every editor, |
| an xml style parameter file which is intended to be much easier for the user to understand. | | which makes it easier to follow the structure of the parameter file even if it spans several tens of lines of code. |
| Furthermore, using xml comes with support for automated commenting, uncommenting, | |
| indentation, code-folding and syntax highlighting in almost every editor,
| |
| which makes it easier to follow the structure of the parameter file even if it spans several tens of | |
| lines of code. | |
|
| |
|
| The third area of the overhaul is the program structure. DMU was structured into
| | == Conditions of use == |
| several programs (<i>DMU1, DMU4, DMU5, DMUAI, RJMC</i>). In contrast, {{lmt}} is meant
| | {{lmt}} can be used by the scientific community free of charge, but users must credit {{lmt}} |
| to provide the functionalities of all those programs via a single parameter file and a single
| | in any publications. |
| executable.
| | Commercial users must obtain the explicit approval of the author before using {{lmt}} and must credit {{lmt}} in any publication in scientific journals. |
| | | If {{lmt}} cannot be credited via citation the author must become a co-author. |
| While {{lmt}} is ultimately meant to be a full scale successor of DMU, it does not yet provide
| |
| all its functionalities in some areas, while in others it already provides more. More specifically,
| |
| there are no REML facilities available yet, but large scale linear mixed model solving
| |
| provides Single-Step-T-BLUP facilities, uploading of genotypes and building of genomic
| |
| relationship matrices on the fly, etc.
| |
| | |
| ==Supported features ==
| |
| | |
| === Supported operations ===
| |
| | |
| Currently {{lmt}} supports the following operations on linear mixed models:
| |
|
| |
|
| *Solving for BLUP and BLUE solutions, conditional on supplied variances, for random and fixed factors, respectively;
| | == How to get it == |
| *Gibbs sampling of variance components in single pass and blocked mode;
| |
| *MC-EM-REML estimation of variance components
| |
| *Sampling elements of the inverse of the mixed model coefficient matrix
| |
|
| |
|
| === Supported factors and variables ===
| | {{lmt}} can be obtained '''on request''' from the [mailto:vinzent.boerner@qgg.au.dk author]. |
| {{lmt}} supports | |
| *fixed
| |
| *random factors
| |
| *classification variables
| |
| *continuous co-variables, which can be nested. For continuous co-variables {{lmt}} supports user-defined polynomials and hard coded [https://en.wikipedia.org/wiki/Legendre_polynomials Legendre polynomials] up to order 6.
| |
| *genetic group co-variables
| |
| | |
| All classification and co-variables can be associated to a fixed or random factor.
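As an illustration of what a Legendre expansion of a continuous co-variable up to order 6 involves, the following minimal Python sketch evaluates $$P_0(x),\dots,P_6(x)$$ with Bonnet's recurrence. The function name <code>legendre_values</code> is hypothetical, and the sketch assumes the co-variable has already been scaled to $$[-1,1]$$; how {{lmt}} scales or normalises the polynomials is not stated here.

```python
def legendre_values(x, max_order=6):
    """Return [P_0(x), ..., P_max_order(x)] for the Legendre polynomials.

    Uses Bonnet's recurrence (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x).
    Assumes x has been scaled to the interval [-1, 1] beforehand.
    """
    values = [1.0, x]  # P_0 and P_1
    for n in range(1, max_order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[:max_order + 1]


# One row of a regression design matrix for a co-variable value of 0.5
row = legendre_values(0.5)
```

Each element of <code>row</code> would be the regression co-variable for one polynomial order.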
| |
| | |
| === Supported variance structures ===
| |
| For random factors {{lmt}} supports variance structures of
| |
| *structure [https://en.wikipedia.org/wiki/Kronecker_product $$\Gamma\otimes\Sigma$$], where $$\Sigma$$ is a dense symmetric positive definite matrix, and
| |
| *$$\Theta_L(\Gamma\otimes I_{\Sigma})\Theta_L^{'}$$, where $$\Theta$$ is a symmetric positive definite [https://en.wikipedia.org/wiki/Block_matrix#Block_diagonal_matrices block-diagonal matrix] of $$n$$ symmetric positive definite matrices $$\Sigma_i, i=1,..,n$$, $$\Theta_L$$ is the lower [https://en.wikipedia.org/wiki/Cholesky_decomposition Cholesky factor] of $$\Theta$$ and $$I_{\Sigma}$$ is an identity matrix of dimension $$\Sigma_i$$.
| |
| | |
| When solving linear mixed models $$\Sigma$$ and $$\Gamma$$ are user determined constants, whereas when estimating variances $$\Gamma$$ is a user determined constant and $$\Sigma$$ is a function of the data.
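To make the first structure concrete, here is a minimal pure-Python sketch that builds $$\Gamma\otimes\Sigma$$ for a toy case with two levels and two sub-factors. The helper <code>kron</code> and all numbers are illustrative assumptions and not part of {{lmt}}.

```python
def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    cols_b = len(b[0])
    return [
        [a_row[j // cols_b] * b_row[j % cols_b]
         for j in range(len(a_row) * cols_b)]
        for a_row in a
        for b_row in b
    ]


# Gamma: relationship between two levels (user-determined constant)
gamma = [[1.0, 0.5],
         [0.5, 1.0]]
# Sigma: co-variance between two sub-factors (supplied, or estimated from the data)
sigma = [[2.0, 1.0],
         [1.0, 4.0]]

cov = kron(gamma, sigma)  # 4 x 4 co-variance matrix of all effects
```

The off-diagonal $$2\times 2$$ block of <code>cov</code> equals $$0.5\,\Sigma$$, that is, the co-variance between the effects of the two levels is $$\Sigma$$ scaled by their relationship in $$\Gamma$$.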
| |
| | |
| Supported types for $$\Gamma$$ are
| |
| *an [https://en.wikipedia.org/wiki/Identity_matrix identity matrix]
| |
| *an arbitrary positive definite [https://en.wikipedia.org/wiki/Diagonal_matrix diagonal matrix]
| |
| *a pedigree-based numerator relationship matrix $$A$$ which may contain meta-founders
| |
| *a pedigree- and genotype-based relationship matrix $$H$$ which may contain meta-founders
| |
| *a user-defined (u.d.) symmetric, positive definite matrix whose inverse is supplied
| |
| **as a sparse upper-triangular matrix stored in [https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format) csr format]
| |
| **as a dense matrix
| |
| *a co-variance matrix of a selected auto-regressive process
| |
| | |
| === Supported linear mixed model solvers ===
| |
| {{lmt}} supports
| |
| | |
| *a direct solver which requires explicitly building the left-hand-side coefficient matrix ($$C$$) of the linear mixed model equations
| |
| *an iteration-on-data pre-conditioned gradient solver which '''does not''' require $$C$$
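The idea of solving without building $$C$$ can be sketched as follows. This is a toy illustration and not {{lmt}}'s implementation: it assumes a preconditioned conjugate gradient method with a diagonal preconditioner, a single trait, one fixed mean and one random factor with $$\Gamma=I$$ and variance ratio <code>lam</code>. All names (<code>make_system</code>, <code>pcg</code>, <code>records</code>) are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def make_system(records, n_eq, random_cols, lam):
    """Build the right-hand side, diag(C) and a function v -> C*v from the data.

    Each record is (equation indices with a 1 in the design row, observation).
    C itself is never stored; C*v is accumulated record by record.
    """
    rhs = [0.0] * n_eq
    diag = [0.0] * n_eq
    for cols, y in records:
        for c in cols:
            rhs[c] += y
            diag[c] += 1.0
    for c in random_cols:
        diag[c] += lam

    def apply_c(v):
        out = [0.0] * n_eq
        for cols, _y in records:
            fitted = sum(v[c] for c in cols)   # x_i' v
            for c in cols:
                out[c] += fitted               # accumulate x_i (x_i' v)
        for c in random_cols:
            out[c] += lam * v[c]               # contribution of the random factor
        return out

    return rhs, diag, apply_c


def pcg(apply_c, rhs, diag, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient with a diagonal preconditioner."""
    x = [0.0] * len(rhs)
    r = list(rhs)
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        cp = apply_c(p)
        alpha = rz / dot(p, cp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ci for ri, ci in zip(r, cp)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Equation 0: overall mean (fixed); equations 1 and 2: two random levels
records = [([0, 1], 5.0), ([0, 1], 7.0), ([0, 2], 9.0)]
rhs, diag, apply_c = make_system(records, n_eq=3, random_cols=[1, 2], lam=1.0)
solution = pcg(apply_c, rhs, diag)
```

Memory use in this sketch is proportional to the number of equations plus the data, rather than to the number of non-zero elements of $$C$$.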
| |
| | |
| === Supported features related to genomic data ===
| |
| *direct use of genomic marker data
| |
| *building of genomic relationship matrices($$G$$) from supplied genomic data
| |
| *uploading of a u.d. $$G$$
| |
| *adjustment of $$G$$ to $$A_{gg}$$
| |
| *solving Single-Step-G-BLUP models
| |
| *sampling variances for Single-Step-G-BLUP models
| |
| *solving Single-Step-T-BLUP models
| |
| *solving Single-Step-SNP-BLUP models
| |
| *all Single-Step models can be run from "bottom-up", that is the user supplies the genotypes and all necessary ingredients(e.g. $$G$$) are built on the fly.
| |
| | |
| === Supported pedigree types===
| |
| *ordinary pedigrees
| |
| *probabilistic pedigrees with an unlimited number of parent pairs per individual
| |
| *genetic group pedigrees
| |
| *meta-founders
| |
| | |
| == {{lmt}} Linear mixed model terminology ==
| |
| === Matrix notation, factors and sub-factors ===
| |
| | |
| Consider the multi-variate linear mixed model
| |
| | |
| $$
| |
| \left(
| |
| \begin{array}{c}
| |
| y_1 \\
| |
| y_2 \\
| |
| y_3
| |
| \end{array}
| |
| \right)
| |
| =
| |
| \left(
| |
| \begin{array}{ccc}
| |
| X_1 & 0 & 0 \\
| |
| 0 & X_2 & 0 \\
| |
| 0 & 0 & X_3
| |
| \end{array}
| |
| \right)
| |
| \left(
| |
| \begin{array}{c}
| |
| b_1 \\
| |
| b_2 \\
| |
| b_3
| |
| \end{array}
| |
| \right)
| |
| +
| |
| \left(
| |
| \begin{array}{ccc}
| |
| Z_1 & 0 & 0\\
| |
| 0 & Z_2 & 0\\
| |
| 0 & 0 & Z_3
| |
| \end{array}
| |
| \right)
| |
| \left(
| |
| \begin{array}{c}
| |
| u_1 \\
| |
| u_2 \\
| |
| u_3
| |
| \end{array}
| |
| \right)
| |
| +
| |
| \left(
| |
| \begin{array}{c}
| |
| e_1 \\
| |
| e_2 \\
| |
| e_3
| |
| \end{array}
| |
| \right)
| |
| $$
| |
| | |
| where $$(y_1,y_2,y_3)'$$, $$(b_1,b_2,b_3)'$$, $$(u_1,u_2,u_3)'$$ and $$(e_1,e_2,e_3)'$$ are vectors of response variables, effects of fixed factors, effects of random factors and effects of residuals respectively, and matrices
| |
| $$\left(
| |
| \begin{array}{ccc}
| |
| X_1 & 0 & 0 \\
| |
| 0 & X_2 & 0 \\
| |
| 0 & 0 & X_3
| |
| \end{array}
| |
| \right)$$, and
| |
| $$
| |
| \left(
| |
| \begin{array}{ccc}
| |
| Z_1 & 0 & 0\\
| |
| 0 & Z_2 & 0\\
| |
| 0 & 0 & Z_3
| |
| \end{array}
| |
| \right)
| |
| $$ are block-diagonal design matrices linking effects in the respective vectors to their related response variables. In usual mixed model terminology $$b_1$$, $$b_2$$ and $$b_3$$ are called fixed factors, and $$u_1$$, $$u_2$$ and $$u_3$$ are called random factors. Ignoring the residual the above model has in total 6 factors.
| |
| | |
| However, the model may be rewritten in matrix formulation as
| |
| | |
| $$vec(Y)=Xvec(B)+Zvec(U)+vec(E)$$,
| |
| | |
| where $$vec$$ is the [https://en.wikipedia.org/wiki/Vectorization_(mathematics) vectorization operator], $$Y=[y_1,y_2,y_3]$$, $$B=[b_1,b_2,b_3]$$, $$U=[u_1,u_2,u_3]$$ and $$E=[e_1,e_2,e_3]$$ are column matrices of the response variables, the effects of the fixed and random factors, and the residuals, respectively, and
| |
| $$X=\left(
| |
| \begin{array}{ccc}
| |
| X_1 & 0 & 0 \\
| |
| 0 & X_2 & 0 \\
| |
| 0 & 0 & X_3
| |
| \end{array}
| |
| \right)$$, and
| |
| $$Z=
| |
| \left(
| |
| \begin{array}{ccc}
| |
| Z_1 & 0 & 0\\
| |
| 0 & Z_2 & 0\\
| |
| 0 & 0 & Z_3
| |
| \end{array}
| |
| \right)
| |
| $$. The distributional assumptions for the random components in the model are $$vec(U^{'})\sim N((0,0,0)',\Gamma_u \otimes \Sigma_u)$$ and $$vec(E^{'})\sim N((0,0,0)',\Gamma_e \otimes \Sigma_e)$$. Note that the column and row dimensions of $$U$$ are determined by the column dimension of $$\Sigma_u$$ and $$\Gamma_u$$ respectively.
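The difference between $$vec(U)$$ and $$vec(U^{'})$$ is only the ordering of the elements, which the following minimal Python sketch illustrates. The helper names and the toy numbers are assumptions for illustration.

```python
def vec(matrix):
    """Stack the columns of a matrix (list of rows) into one flat list."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(n_cols) for i in range(n_rows)]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Toy U: rows are levels (dimension of Gamma), columns are sub-factors (dimension of Sigma)
U = [[11, 12, 13],
     [21, 22, 23]]

by_sub_factor = vec(U)         # all levels of column 1, then column 2, ...
by_level = vec(transpose(U))   # all sub-factors of level 1, then level 2, ...
```

With the elements ordered by level, as in <code>by_level</code>, the effects of one level form a contiguous block, which matches a co-variance matrix of the form $$\Gamma\otimes\Sigma$$.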
| |
| | |
| Slightly different to the above terminology, {{lmt}} refers to $$B$$ and $$U$$ as factors, and therefore the model has only two factors, whereas the columns in $$B$$ and $$U$$ are referred to as '''sub-factors'''.
| |
| | |
| Following the above matrix notation {{lmt}} will always invoke only one factor for all modelled fixed classification variables and only one factor for all modelled fixed continuous co-variables. Sub-factors are summarized into a single random factor if they share the same $$\Sigma$$ matrix. Thus, {{lmt}} will invoke as many random factors as there are different $$\Gamma \otimes \Sigma$$ constructs. That is, in {{lmt}} terminology the multi-variate model
| |
| | |
| $$
| |
| \left(
| |
| \begin{array}{c}
| |
| y_1 \\
| |
| y_2 \\
| |
| y_3
| |
| \end{array}
| |
| \right)
| |
| =
| |
| \left(
| |
| \begin{array}{ccc}
| |
| X_1 & 0 & 0 \\
| |
| 0 & X_2 & 0 \\
| |
| 0 & 0 & X_3
| |
| \end{array}
| |
| \right)
| |
| \left(
| |
| \begin{array}{c}
| |
| b_1 \\
| |
| b_2 \\
| |
| b_3
| |
| \end{array}
| |
| \right)
| |
| +
| |
| \left(
| |
| \begin{array}{cccccc}
| |
| Z_{d,1} & 0 & 0 & Z_{m,1} & 0 & 0\\
| |
| 0 & Z_{d,2} & 0 & 0 & Z_{m,2} & 0\\
| |
| 0 & 0 & Z_{d,3} & 0 & 0 & Z_{m,3}\\
| |
| \end{array}
| |
| \right)
| |
| \left(
| |
| \begin{array}{c}
| |
| u_{d,1} \\
| |
| u_{d,2} \\
| |
| u_{d,3} \\
| |
| u_{m,1} \\
| |
| u_{m,2} \\
| |
| u_{m,3}
| |
| \end{array}
| |
| \right)
| |
| +
| |
| \left(
| |
| \begin{array}{ccc}
| |
| W_1 & 0 & 0\\
| |
| 0 & W_2 & 0\\
| |
| 0 & 0 & W_3
| |
| \end{array}
| |
| \right)
| |
| \left(
| |
| \begin{array}{c}
| |
| v_1 \\
| |
| v_2 \\
| |
| v_3
| |
| \end{array}
| |
| \right)
| |
| +
| |
| \left(
| |
| \begin{array}{c}
| |
| e_1 \\
| |
| e_2 \\
| |
| e_3
| |
| \end{array}
| |
| \right)
| |
| $$
| |
| | |
| with
| |
| $$(u_{d,1},u_{d,2},u_{d,3},u_{m,1},u_{m,2},u_{m,3})'\sim N((0,0,0,0,0,0)',\Sigma_u \otimes \Gamma_u)$$ and $$(v_1,v_2,v_3)'\sim N((0,0,0)',\Sigma_v \otimes \Gamma_v)$$, rewritten as $$vec(Y)=Xvec(B)+Zvec(U)+Wvec(V)+vec(E)$$ will have only 3 factors, $$B$$, $$U$$ and $$V$$, with $$b_1,b_2,b_3$$, $$u_{d,1},u_{d,2},u_{d,3},u_{m,1},u_{m,2},u_{m,3}$$ and $$v_1,v_2,v_3$$ being sub-factors of $$B$$, $$U$$ and $$V$$ respectively.
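The grouping rule can be sketched in a few lines of Python. The sub-factor labels and the variance construct names <code>var_u</code> and <code>var_v</code> are hypothetical, and the function only illustrates the counting of random factors, not how {{lmt}} stores them.

```python
from collections import defaultdict


def group_sub_factors(sub_factors):
    """Map each variance construct name to the sub-factors that reference it."""
    factors = defaultdict(list)
    for name, variance in sub_factors:
        factors[variance].append(name)
    return dict(factors)


# (sub-factor, name of its Gamma-Sigma construct) for the model above
sub_factors = [
    ("u_d1", "var_u"), ("u_d2", "var_u"), ("u_d3", "var_u"),
    ("u_m1", "var_u"), ("u_m2", "var_u"), ("u_m3", "var_u"),
    ("v_1", "var_v"), ("v_2", "var_v"), ("v_3", "var_v"),
]
random_factors = group_sub_factors(sub_factors)
n_factors = 1 + len(random_factors)  # one fixed factor B plus the random factors
```

Here the nine random sub-factors collapse into the two random factors $$U$$ and $$V$$, giving three factors in total.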
| |
| | |
| === Model syntax ===
| |
| | |
| The syntax for communicating the model to {{lmt}} is effectively '''just write the model'''.
| |
| | |
| A valid {{lmt}} model string would be {{cc|1=y=mu*b+id*u(v(my_var(1)))}}. The model string consists of
| |
| *the response variable {{cc|y}}, which must be a column name in the data file
| |
| *variables {{cc|mu}} and {{cc|id}}, which must be column names in the data file
| |
| *sub-factors {{cc|b}} and {{cc|u}} which are user-defined alpha-numeric character strings
| |
| *relational operators {{cc|1==}}, {{cc|*}} and {{cc|+}}
| |
| *a specifier {{cc|(v(my_var(1)))}} used to specify the nature of {{cc|u}}
| |
| | |
| The rules for using relational operators are
| |
| *{{cc|1==}} links the response variable to the model
| |
| *{{cc|*}} links a model variable to its sub-factor, which together form a right hand side component
| |
| *{{cc|+}} concatenates different right hand side components.
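The operator rules above can be sketched as a small Python parser. This is not {{lmt}}'s parser; the function names are hypothetical, and the sketch assumes that each right hand side component contains exactly one {{cc|*}} outside of parentheses.

```python
def split_top_level(text, sep):
    """Split text at sep, ignoring separators nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for pos, char in enumerate(text):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == sep and depth == 0:
            parts.append(text[start:pos])
            start = pos + 1
    parts.append(text[start:])
    return parts


def parse_model(model):
    """Return (response, [(variable, sub_factor_with_specifier), ...])."""
    response, rhs = model.split("=", 1)          # '=' links response and model
    components = []
    for term in split_top_level(rhs, "+"):       # '+' concatenates components
        variable, sub_factor = split_top_level(term, "*")  # '*' links variable and sub-factor
        components.append((variable.strip(), sub_factor.strip()))
    return response.strip(), components


response, components = parse_model("y=mu*b+id*u(v(my_var(1)))")
```

For the example string this yields the response <code>y</code> and the two components <code>mu</code> with <code>b</code>, and <code>id</code> with <code>u(v(my_var(1)))</code>.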
| |
| | |
| Variables and sub-factors may be accompanied by a specifier. A specifier is a [https://en.wikipedia.org/wiki/Tree_structure#Nested_parentheses tree diagram] in [https://en.wikipedia.org/wiki/Newick_format Newick format] with all nodes named, where the root node is the variable or sub-factor. It provides additional information about a variable or sub-factor. The {{lmt}} version of the above [https://en.wikipedia.org/wiki/Newick_format tree diagram] differs in that
| |
| *the parent nodes precede child nodes
| |
| *child nodes within the same parent node are separated by semicolon
| |
| *leaf nodes can contain a bracketed section with additional, possibly comma-separated, information
| |
| '''Without any specifier {{lmt}} assumes that'''
| |
| *'''variables are classification variables with the respective columns in the data file containing integer numbers coding for the different levels of the associated sub-factor'''
| |
| *'''sub-factors are fixed'''
| |
| ==== Sub-factor specifiers ====
| |
| Sub-factor specifiers are used to communicate that a sub-factor is random. Following the above example {{cc|u(v(my_var(1)))}}, {{cc|u}} is the root node, {{cc|v}} is a child node to {{cc|u}} with a hard-coded name '''v''', {{cc|my_var}} is a child node to {{cc|v}} with a user-defined name '''my_var''' which references the user-defined name of a $$\Gamma \otimes \Sigma$$ construct, and {{cc|1}} is additional information for {{cc|my_var}} communicating that the diagonal element in $$\Sigma$$ related to {{cc|u}} is diagonal element #1.
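The nesting of a specifier can be made visible with a small recursive Python sketch. It is an illustration only: the function names are hypothetical, and the sketch does not distinguish child nodes from the bracketed additional information of a leaf node, so {{cc|1}} simply appears as the innermost node and comma-separated information is kept as one string.

```python
def split_children(text):
    """Split sibling nodes at semicolons that are not nested in parentheses."""
    parts, depth, start = [], 0, 0
    for pos, char in enumerate(text):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == ";" and depth == 0:
            parts.append(text[start:pos])
            start = pos + 1
    parts.append(text[start:])
    return parts


def parse_specifier(text):
    """Parse 'parent(child;child(...))' into a (name, children) tuple."""
    text = text.strip()
    if "(" not in text:
        return (text, [])
    name, rest = text.split("(", 1)   # the parent precedes its children
    if not rest.endswith(")"):
        raise ValueError("unbalanced parentheses in " + repr(text))
    children = [parse_specifier(part) for part in split_children(rest[:-1])]
    return (name.strip(), children)


tree = parse_specifier("u(v(my_var(1)))")
```

The resulting <code>tree</code> is nested four levels deep: <code>u</code>, <code>v</code>, <code>my_var</code> and <code>1</code>.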
| |
| ==== Variable specifiers ====
| |
| Variable specifiers are used to communicate further information which may be that the variable
| |
| *is continuous and holds real numbers
| |
| *is continuous and holds integer numbers
| |
| *is a genetic group regression matrix
| |
| *undergoes a polynomial expansion
| |
| *is associated to a nesting variable
| |
| etc.
| |
| | |
| [[File:variabletree.jpg|1500px|]]
| |
| | |
| Variables as well as sub-factors may be used across traits. That is, a model
| |
| y1=mu*b1+id*u1(v(sigma(1)))
| |
| y2=mu*b2+id*u2(v(sigma(2)))
| |
| | |
| == Disclaimer ==
| |
| {{lmt}} is under ongoing development and many of its features have been tested only a few
| |
| times on a limited number of models and data sets. Thus, users use {{lmt}} entirely
| |
| at their own risk. This also applies to any decisions made based on the results provided
| |
| by {{lmt}}.
| |
| | |
| == Conditions of use ==
| |
| {{lmt}} can be used by the scientific community free of charge, but users must credit {{lmt}}
| |
| in any publications. Commercial users must obtain the explicit approval of the author
| |
| before using {{lmt}} and must credit {{lmt}} in any publication in scientific journals.
| |
|
| |
|
| == Feedback and support == | | == Feedback and support == |
Line 315: |
Line 25: |
| However, the author appreciates feedback about the program functionality, possible aborts (segmentation faults), usability of output and comprehensiveness of the manual. | | However, the author appreciates feedback about the program functionality, possible aborts (segmentation faults), usability of output and comprehensiveness of the manual. |
|
| |
|
| | For feedback, wish list, questions and support contact [mailto:vinzent.boerner@qgg.au.dk vinzent.boerner@qgg.au.dk](infrequently checked) or [mailto:vinzent.boerner@gmx.de vinzent.boerner@gmx.de](frequently checked). |
|
| |
|
| | | == [[supported features| Supported features]] == |
| * [http://localhost/mediawiki/index.php/Run_It Run It]
| | == [[Algorithms|Algorithms]] == |
| * [http://localhost/mediawiki/index.php/Inputfileformats Input file formats]
| | == [[Parameterfile1| Parameter file terminology]] == |
| * [http://localhost/mediawiki/index.php/Parameterfile1 Parameter file terminology part 1]
| | == [[linear mixed models in lmt| Linear mixed models in lmt]] == |
| * [http://localhost/mediawiki/index.php/Jump_Start Jump Start]
| | == [[Genomic data in lmt| Genomic data in lmt]] == |
| | == [[File formats|File formats]] == |
| | == [[Input files|Input files]] == |
| | == [[Output files|Output files]] == |
| | == [[Run_It|How to run it]] == |
| | == [[Examples|Examples]] == |
| | == [[Parameter file elements| Parameter file elements]] == |