Difference between revisions of "Input files"

From Linear Mixed Models Toolbox

Revision as of 10:20, 28 December 2020

File content determination via file name extensions

lmt automatically detects the format of input files by the filename extension. Supported extensions are

  • ".csv" for ordinary comma-separated values ASCII text files
  • ".blkcsv" for comma-separated values ASCII text files in block format
  • ".bin" for binary files in block format

Note that this mechanism does not apply to files containing genotypes.
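
The extension lookup described above can be pictured as a small table. The following Python sketch is only an illustration and is not lmt's actual code; the function name `detect_format` and the format labels are hypothetical, and matching the extension case-sensitively is an assumption.

```python
from pathlib import Path

# Extension table copied from the list above; the labels are illustrative only.
FORMATS = {
    ".csv": "comma-separated text",
    ".blkcsv": "comma-separated text, block format",
    ".bin": "binary, block format",
}

def detect_format(filename):
    """Return a format label for a supported extension, else raise ValueError."""
    suffix = Path(filename).suffix  # assumption: the match is case-sensitive
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return FORMATS[suffix]
```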

".csv" files

".csv" files may contain several comment lines, but only at the top of the file. The comment character is "#".

The type of the file content is determined by its intended use, that is

  • the data file is expected to contain only real/float numbers, which are converted to integers if required,
  • a file containing an ordinary pedigree is expected to contain only integer numbers,
  • a file containing a missing value indicator matrix is expected to contain only character strings.

data file

lmt accepts only a single file containing the actual data. A data file in ".csv" format must obey the following formatting rules:

  • the file must have at least one comment line, and the last comment line must contain the column names separated by commas,
  • the column names
    • must be alpha-numeric only
    • must not be quoted
    • must be unique
  • below the header, the data file must contain only numeric values, where the decimal separator is a dot (".").

An example of a data file with three columns is shown below.

#y,mu,id
25.0,1,5
33.1,1,6
36.0,1,7
28.3,1,8
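
To make the formatting rules concrete, the following Python sketch checks them while reading a file of this shape. It is not part of lmt; `read_data_csv` is a hypothetical helper, and skipping blank lines is an assumption not stated in the rules above.

```python
def read_data_csv(stream):
    """Parse a data file into (column_names, rows_of_floats), checking the rules above."""
    header = None
    rows = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("#"):
            if rows:
                raise ValueError("comment lines are only allowed at the top")
            header = line[1:].split(",")  # the last comment line holds the names
            continue
        if header is None:
            raise ValueError("missing commented header line")
        values = [float(v) for v in line.split(",")]  # dot as decimal separator
        if len(values) != len(header):
            raise ValueError("row width does not match the header")
        rows.append(values)
    if header is None:
        raise ValueError("missing commented header line")
    if not all(name.isalnum() for name in header):
        raise ValueError("column names must be alpha-numeric and unquoted")
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    return header, rows
```

Any iterable of text lines works as input, for example an open file handle or an `io.StringIO` object.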

co-variance matrix file

A co-variance matrix file must contain a single full, square, symmetric matrix, for instance

1.5,0.8,0.1
0.8,2.1,1.1
0.1,1.1,1.9
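
A file of this kind can be checked before it is handed to lmt. The sketch below is a hypothetical helper, not lmt functionality; the name `read_covariance` and the tolerance `tol` are assumptions made for illustration.

```python
def read_covariance(text, tol=1e-12):
    """Parse comma-separated rows and check that the matrix is square and symmetric."""
    rows = [[float(v) for v in line.split(",")]
            for line in text.strip().splitlines()]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("matrix is not square")
    for i in range(n):
        for j in range(i):
            # compare each lower-triangle entry with its mirror image
            if abs(rows[i][j] - rows[j][i]) > tol:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    return rows
```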

co-variance mask files

A co-variance mask file tells lmt which co-variances should remain constant when lmt is used to estimate variances. The mask file must contain characters which can be interpreted as a boolean data type, where "T" codes for co-variances which must remain constant and "F" codes for co-variances which are allowed to float. The mask file must have the same dimensions as the associated co-variance matrix. For instance, the above co-variance matrix file may be accompanied by a mask file containing

T,F,F
F,T,F
F,F,F

which tells lmt that the first and second diagonal elements should remain at their original values.
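
The relation between a mask and its matrix can be sketched in Python as follows. The helpers `read_mask` and `fixed_entries` are hypothetical and not part of lmt; accepting only the literal codes "T" and "F" is an assumption, since other boolean spellings are not described here.

```python
def read_mask(text):
    """Parse a T/F mask file into a list of lists of booleans."""
    codes = {"T": True, "F": False}  # assumption: only "T" and "F" are used
    return [[codes[cell.strip()] for cell in line.split(",")]
            for line in text.strip().splitlines()]

def fixed_entries(mask, matrix):
    """Return the (row, column) positions that the mask holds constant."""
    same_shape = len(mask) == len(matrix) and all(
        len(m_row) == len(v_row) for m_row, v_row in zip(mask, matrix))
    if not same_shape:
        raise ValueError("mask and matrix dimensions differ")
    return [(i, j) for i, row in enumerate(mask)
            for j, fixed in enumerate(row) if fixed]

matrix = [[1.5, 0.8, 0.1], [0.8, 2.1, 1.1], [0.1, 1.1, 1.9]]
mask = read_mask("T,F,F\nF,T,F\nF,F,F")
print(fixed_entries(mask, matrix))  # [(0, 0), (1, 1)]
```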

missing observations indicator file

The pattern of missing observations may be communicated via an indicator file. The file must contain characters which can be interpreted as a boolean data type, where "T" codes for an available observation and "F" codes for a missing observation. Further, similar to the data file, the missing value indicator file must contain a header with the same column names as the observation columns in the data file.

For example a data file

#y1,y2,mu,id
25.0,0.0,1,5
0.0,0.8,1,6
36.0,-1.5,1,7
28.3,0.0,1,8

may be accompanied by a missing value indicator file

#y1,y2
T,F
F,T
T,T
T,F
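
The following Python sketch shows how the two files in this example line up. It is an illustration rather than lmt's behaviour: `apply_indicator` is a hypothetical helper, and representing a missing observation as `None` is an assumption.

```python
def apply_indicator(names, rows, ind_names, ind_rows):
    """Return a copy of rows in which observations flagged False become None."""
    if len(rows) != len(ind_rows):
        raise ValueError("data and indicator files differ in length")
    # locate each indicator column in the data file by its column name
    cols = [names.index(name) for name in ind_names]
    result = [list(row) for row in rows]
    for row, flags in zip(result, ind_rows):
        for col, available in zip(cols, flags):
            if not available:
                row[col] = None
    return result

names = ["y1", "y2", "mu", "id"]
rows = [[25.0, 0.0, 1, 5], [0.0, 0.8, 1, 6], [36.0, -1.5, 1, 7], [28.3, 0.0, 1, 8]]
flags = [[True, False], [False, True], [True, True], [True, False]]
masked = apply_indicator(names, rows, ["y1", "y2"], flags)
print(masked[0])  # [25.0, None, 1, 5]
```

The zeros standing in for missing values in the data file are placeholders; only the indicator file says which observations are actually available.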

pedigree file

For all types of pedigree described below it is required that

  • all pedigree ids are positive integer numbers not larger than 9.223372e+18 (i.e. the ids must fit in a 64-bit integer),
  • the pedigree is complete, that is all individuals occurring as parents must have a record as individuals,
  • missing parents are coded with zero

ordinary pedigree file

A file containing an ordinary pedigree must have three numeric columns: individual id, first parent id, second parent id. For instance

1,0,0
2,0,0
3,1,2
4,1,2
5,3,4
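
The three pedigree requirements can be checked with a few lines of Python. This is a sketch only and not lmt code; `check_pedigree` is a hypothetical name, and `MAX_ID = 2**63 - 1` assumes that "64 bit integer" refers to a signed integer.

```python
MAX_ID = 2**63 - 1  # assumption: ids are stored as signed 64-bit integers

def check_pedigree(records):
    """Validate (individual, parent1, parent2) tuples against the rules above."""
    ids = {ind for ind, _, _ in records}
    for ind, p1, p2 in records:
        if not 0 < ind <= MAX_ID:
            raise ValueError(f"id {ind} is not a positive 64-bit integer")
        for parent in (p1, p2):
            # zero codes for a missing parent; any other parent needs a record
            if parent != 0 and parent not in ids:
                raise ValueError(f"parent {parent} has no record of its own")

check_pedigree([(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 1, 2), (5, 3, 4)])
```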

probabilistic pedigree file

Probabilistic pedigrees account for the possibility that an individual originates from more than one pair of parents. That is, an ordinary pedigree is just a special case of a probabilistic pedigree with all probabilities set to 1. In a probabilistic pedigree an individual may have repeated records.

A file containing a probabilistic pedigree must have four numeric columns: individual id, first parent id, second parent id, parentage probability. Within individuals, parentage probabilities must sum to 1. Further, repeated records of the same individual id must be adjacent. For instance

1,0,0,1.0
2,0,0,1.0
3,1,2,1.0
4,1,2,0.5
4,0,0,0.5
5,3,4,0.1
5,1,3,0.2
5,1,4,0.2
5,1,2,0.5
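
The two extra rules for probabilistic pedigrees, adjacency of repeated records and probabilities summing to 1 within an individual, can be sketched in Python as below. `check_probabilistic` is a hypothetical helper and not part of lmt; the use of `math.isclose` with its default tolerance is an assumption about how exactly the sum has to match.

```python
import math
from itertools import groupby

def check_probabilistic(records):
    """Validate (individual, parent1, parent2, probability) tuples."""
    seen = set()
    # groupby only merges consecutive records, so an id that shows up in a
    # second group means its records were not adjacent
    for ind, group in groupby(records, key=lambda rec: rec[0]):
        if ind in seen:
            raise ValueError(f"records of individual {ind} are not adjacent")
        seen.add(ind)
        total = sum(rec[3] for rec in group)
        if not math.isclose(total, 1.0):
            raise ValueError(f"probabilities of individual {ind} sum to {total}")

check_probabilistic([
    (1, 0, 0, 1.0), (2, 0, 0, 1.0), (3, 1, 2, 1.0),
    (4, 1, 2, 0.5), (4, 0, 0, 0.5),
    (5, 3, 4, 0.1), (5, 1, 3, 0.2), (5, 1, 4, 0.2), (5, 1, 2, 0.5),
])
```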


genotype file

A genotype file must be in ASCII text format, where each line contains a single genotype with each marker coded 0, 1 or 2 for homozygous "aa", heterozygous "Aa" and homozygous "AA", respectively, and 3 for missing. A single genotype must not contain any spaces. The file must have as many lines as there are genotypes, and each genotype must have the same number of markers. A file containing 10 genotypes of 40 markers each may look like

0122221210011211221100021210220020021221
1211121121111120110011110201121020111111
0122211210012202222200022120220111021222
0122211210012202222200022120220111021222
0222220220020220220000020200220020022220
1111122111102111111111111211121020110112
2200022002202020000002200202022020200002
1211111121112111111111111111121111111112
2200022012202020000002200202012020200002
1211121121111120110001110201111020111111
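
A minimal Python sketch of the checks implied by this format is given below. It is not lmt's reader; `read_genotypes` is a hypothetical name, and returning the markers as lists of integers is a choice made for illustration.

```python
def read_genotypes(lines):
    """Parse genotype lines into lists of marker codes, checking the rules above."""
    genotypes = [line.rstrip("\n") for line in lines]
    for number, genotype in enumerate(genotypes, start=1):
        # only the codes 0, 1, 2 and 3 (missing) are allowed, and no spaces
        if set(genotype) - set("0123"):
            raise ValueError(f"line {number} contains an invalid character")
    if len({len(genotype) for genotype in genotypes}) > 1:
        raise ValueError("genotypes differ in their number of markers")
    return [[int(code) for code in genotype] for genotype in genotypes]

print(read_genotypes(["0122", "1211"]))  # [[0, 1, 2, 2], [1, 2, 1, 1]]
```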