Difference between revisions of "Input files"
Latest revision as of 23:42, 31 August 2022
lmt does not require the user to provide information about the file content. Depending on the intended use, lmt expects a particular file content and tries to read the file accordingly. For example
- the data file is supposed to contain only real/float numbers, which are converted to integers if required,
- a file containing an ordinary pedigree is supposed to contain only integer numbers,
- a file containing a missing value indicator matrix is supposed to contain only character strings, etc.
data file
lmt accepts only a single file containing the actual data. A data file in ".csv" format must follow these formatting rules:
- the file must have at least one commented line, and the last commented line must contain the column names separated by commas,
- the column names
  - must be alpha-numeric only
  - must not be quoted
  - must be unique
- below the header the data file must contain only numeric values, with a dot (".") as the decimal separator.
The content of a data file with three columns, here called "mydata.csv", may look like:
#y,mu,id
25.0,1,5
33.1,1,6
36.0,1,7
28.3,1,8
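The formatting rules above can be checked in R in the same spirit as the checks given for the other file types. The following is only a sketch: it assumes that commented lines start with "#" (as in the example) and that all commented lines sit at the top of the file. The temporary file merely reproduces the example so that the snippet runs on its own.

```r
# Write the example data file to a temporary location (illustration only)
f <- tempfile(fileext = ".csv")
writeLines(c("#y,mu,id", "25.0,1,5", "33.1,1,6", "36.0,1,7", "28.3,1,8"), f)

lines <- readLines(f)
is_comment <- startsWith(lines, "#")   # assumption: "#" marks a commented line
if (!any(is_comment)) stop("no commented header line found")

# the last commented line carries the column names
header <- sub("^#", "", lines[max(which(is_comment))])
cols <- strsplit(header, ",", fixed = TRUE)[[1]]
if (any(!grepl("^[A-Za-z0-9]+$", cols))) stop("column names must be alpha-numeric and unquoted")
if (anyDuplicated(cols) > 0) stop("column names not unique")

# read.table skips lines starting with "#" by default (comment.char = "#")
d <- read.table(f, sep = ",", col.names = cols)
if (!all(sapply(d, is.numeric))) stop("non-numeric values in data file")
```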
co-variance matrix file
A co-variance matrix file must contain a single full, square, symmetric, positive definite matrix. The content of a co-variance file, here called "sigma.csv", may look like
1.5,0.8,0.1
0.8,2.1,1.1
0.1,1.1,1.9
A co-variance matrix can be checked in R via
s<-as.matrix(read.table("sigma.csv",sep=","))
if(nrow(s)!=ncol(s)) stop("matrix not squared")
if(any(s[lower.tri(s)]!=t(s)[lower.tri(s)])) stop("matrix not symmetric")
if(any(diag(s)<10e-12)) stop("diagonal element near zero")
if(min(eigen(s)$values)<10e-5) stop("matrix near indefinite")
co-variance mask files
A co-variance mask file tells lmt which co-variances should remain constant when lmt is used to estimate variances. The mask file must contain characters which can be interpreted as a boolean data type, where "T" codes for co-variances which must remain constant and "F" codes for co-variances which are allowed to float. The mask file must have the same dimensions as the associated co-variance matrix. For instance, the above co-variance matrix file may be accompanied by a mask file containing
T,F,F
F,T,F
F,F,F
which tells lmt that the first and second diagonal elements should remain at their original values.
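A mask file can be checked against its co-variance matrix in R along the lines of the following sketch. The temporary files only reproduce the two examples above; the checks cover the dimension rule and the "T"/"F" coding, nothing more.

```r
# Reproduce the example co-variance matrix and mask (illustration only)
fs <- tempfile(fileext = ".csv")
fm <- tempfile(fileext = ".csv")
writeLines(c("1.5,0.8,0.1", "0.8,2.1,1.1", "0.1,1.1,1.9"), fs)
writeLines(c("T,F,F", "F,T,F", "F,F,F"), fm)

s <- as.matrix(read.table(fs, sep = ","))
# read as character so that "T"/"F" are not silently converted to logical
m <- as.matrix(read.table(fm, sep = ",", colClasses = "character"))

if (any(dim(m) != dim(s))) stop("mask and co-variance matrix differ in dimension")
if (!all(m %in% c("T", "F"))) stop("mask contains entries other than T or F")

# row/column positions of the elements which are held constant
fixed <- which(m == "T", arr.ind = TRUE)
```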
missing observations indicator file
The pattern of missing observations may be communicated via an indicator file. The file must contain characters which can be interpreted as a boolean data type, where "T" codes for an available observation and "F" codes for a missing observation. Further, similar to the data file, the missing value indicator file must contain a header with the same column names as the observation columns in the data file.
An example data file "mydata.csv"
#y1,y2,mu,id
25.0,0.0,1,5
0.0,0.8,1,6
36.0,-1.5,1,7
28.3,0.0,1,8
may be accompanied by a missing value indicator file "mymiss.csv"
#y1,y2
T,F
F,T
T,T
T,F
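The two example files can be cross-checked in R as sketched below. The helper `header_of` is hypothetical (it is not part of lmt) and assumes that commented lines start with "#". The requirement that both files have the same number of rows is likewise an assumption drawn from the example, not a documented rule.

```r
# Reproduce the example data and indicator files (illustration only)
fd <- tempfile(fileext = ".csv")
fi <- tempfile(fileext = ".csv")
writeLines(c("#y1,y2,mu,id", "25.0,0.0,1,5", "0.0,0.8,1,6",
             "36.0,-1.5,1,7", "28.3,0.0,1,8"), fd)
writeLines(c("#y1,y2", "T,F", "F,T", "T,T", "T,F"), fi)

# hypothetical helper: column names from the last commented line
header_of <- function(f) {
  l <- readLines(f)
  h <- l[max(which(startsWith(l, "#")))]
  strsplit(sub("^#", "", h), ",", fixed = TRUE)[[1]]
}

d <- read.table(fd, sep = ",", col.names = header_of(fd))
i <- read.table(fi, sep = ",", col.names = header_of(fi), colClasses = "character")

if (!all(names(i) %in% names(d))) stop("indicator columns not found in data file")
if (nrow(i) != nrow(d)) stop("indicator and data file differ in number of rows")  # assumed rule
if (!all(as.matrix(i) %in% c("T", "F"))) stop("indicator contains entries other than T or F")

# number of available observations per observation column
n_obs <- colSums(as.matrix(i) == "T")
```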
pedigree file
For all types of pedigree described below it is required that
- all pedigree ids are positive integers not larger than 9.223372e+18 (i.e. the ids must fit in a 64 bit integer),
- the pedigree is complete, that is, all individuals occurring as parents must have a record as individuals,
- missing parents are coded with zero,
- the pedigree does not contain cyclic dependencies.
It is not necessary that the pedigree is sorted or that a sorting variable (e.g. date of birth) is supplied.
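The requirement that a pedigree is free of cycles can be tested without sorting it. The sketch below is one possible approach for an ordinary pedigree (one record per individual), not the method lmt uses internally. The function name `has_cycle` is hypothetical. It repeatedly marks individuals whose parents are both missing or already marked; if a round marks nobody while individuals remain, these must depend on each other.

```r
# Hypothetical helper: TRUE if the pedigree cannot be resolved from founders
# downwards. Note that a parent without an own record also yields TRUE.
has_cycle <- function(id, s, d) {
  done <- 0                          # resolved ids; 0 stands for a missing parent
  todo <- rep(TRUE, length(id))
  while (any(todo)) {
    ok <- todo & (s %in% done) & (d %in% done)
    if (!any(ok)) return(TRUE)       # nobody left whose parents are resolved
    done <- c(done, id[ok])
    todo[ok] <- FALSE
  }
  FALSE
}

# small unsorted example pedigree
p <- data.frame(i = c(5, 3, 4, 1, 2), s = c(3, 1, 1, 0, 0), d = c(4, 2, 2, 0, 0))
if (any(p$i <= 0 | p$i != floor(p$i))) stop("ids must be positive integers")
if (has_cycle(p$i, p$s, p$d)) stop("pedigree contains a cycle")
```

Each round resolves one generation, so the loop runs at most as many times as there are generations in the pedigree.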
ordinary pedigree file
A file containing an ordinary pedigree must have three numeric columns: individual id, first parent id, second parent id, where the number of unique ids in the first column must be equal to the number of rows of the pedigree.
For instance, a file called "myped.csv" may contain
1,0,0
2,0,0
3,1,2
4,1,2
5,3,4
A consistency check for that pedigree in R may be
p<-read.table("myped.csv",sep=",")
colnames(p)<-c("i","s","d")
if(length(unique(p$i))!=nrow(p)) stop("ids not unique")
if(any(!(p$s[p$s!=0] %in% p$i))) stop("some sires don't have records")
if(any(!(p$d[p$d!=0] %in% p$i))) stop("some dams don't have records")
probabilistic pedigree file
Probabilistic pedigrees account for the possibility that an individual originates from more than one pair of parents. That is, an ordinary pedigree is just a special case of a probabilistic pedigree with all probabilities set to 1. In a probabilistic pedigree an individual may have repeated records. A file containing a probabilistic pedigree must have four numeric columns: individual id, first parent id, second parent id, parentage probability. Within individuals, parentage probabilities must sum to 1. Further, repeated records of the same individual id must be adjacent. For instance
1,0,0,1.0
2,0,0,1.0
3,1,2,1.0
4,1,2,0.5
4,0,0,0.5
5,3,4,0.1
5,1,3,0.2
5,1,4,0.2
5,1,2,0.5
A consistency check for that pedigree in R may be
p<-read.table("myped.csv",sep=",")
colnames(p)<-c("i","s","d","p")
if(any(!(p$s[p$s!=0] %in% p$i))) stop("some sires don't have records")
if(any(!(p$d[p$d!=0] %in% p$i))) stop("some dams don't have records")
if(any((abs(aggregate(p$p,by=list(p$i),sum)$x-1.0)>10e-12))) stop("probabilities do not sum up")
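The check above does not cover the rule that repeated records of an individual must be adjacent. A possible additional test, shown here as a sketch on the example pedigree entered directly as a data frame, uses `rle()` to collapse runs of identical neighbouring ids.

```r
# example probabilistic pedigree from above
p <- data.frame(
  i = c(1, 2, 3, 4, 4, 5, 5, 5, 5),
  s = c(0, 0, 1, 1, 0, 3, 1, 1, 1),
  d = c(0, 0, 2, 2, 0, 4, 3, 4, 2),
  p = c(1, 1, 1, 0.5, 0.5, 0.1, 0.2, 0.2, 0.5)
)

# rle() returns one value per run of identical neighbouring ids; an id which
# starts more than one run has records that are not adjacent
runs <- rle(p$i)$values
if (anyDuplicated(runs) > 0) stop("repeated records of an individual are not adjacent")
```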
genetic group pedigree file
Ordinary and probabilistic pedigrees can contain genetic groups. lmt assumes one phantom parent per genetic group, where the phantom parents must be located at the top of the pedigree and must have their parents set to zero. The number of phantom parents is communicated to lmt as an extra parameter at the appropriate location in the parameter file. Adding phantom parents to a pedigree requires shifting the numbering of the original pedigree by the number of phantom parents. For example, the above ordinary pedigree can be transformed into a pedigree with 2 genetic groups, and therefore 2 phantom parents:
1,0,0
2,0,0
3,1,1
4,2,2
5,3,4
6,3,4
7,5,6
where the original founder individuals 1 and 2 are now coded as 3 and 4, and are offspring of phantom parents 1 and 2, respectively. The genetic group methodology requires that in a genetic group pedigree the only individuals with unknown parents are the phantom parents. lmt does not enforce this concept, that is, the user may supply a genetic group pedigree where an individual has one or both parents unknown although the individual is not a phantom parent.
A consistency check for the above pedigree in R may be
p<-read.table("myped.csv",sep=",")
colnames(p)<-c("i","s","d")
if(length(unique(p$i))!=nrow(p)) stop("ids not unique")
if(any(!(p$s[p$s!=0] %in% p$i))) stop("some sires don't have records")
if(any(!(p$d[p$d!=0] %in% p$i))) stop("some dams don't have records")
if(any(p[c(1:2),c("s","d")]!=0)) stop("some phantom parents have known parents")
if(any(p[-c(1:2),c("s","d")]==0)) stop("some ordinary individuals have missing parents")
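The renumbering from the ordinary pedigree to the genetic group pedigree can be sketched in R as follows. The vector `grp`, which assigns a genetic group to each individual with unknown parents, is an assumption made for this illustration; how groups are assigned is up to the user.

```r
# ordinary example pedigree
p <- data.frame(i = 1:5, s = c(0, 0, 1, 1, 3), d = c(0, 0, 2, 2, 4))
ngrp <- 2                      # number of genetic groups = number of phantom parents
grp <- c(1, 2, NA, NA, NA)     # assumed group of each individual; used only for missing parents

# shift all ids by ngrp and replace missing parents by the phantom parent of the group
g <- data.frame(i = p$i + ngrp,
                s = ifelse(p$s == 0, grp, p$s + ngrp),
                d = ifelse(p$d == 0, grp, p$d + ngrp))

# phantom parents go on top with both parents set to zero
g <- rbind(data.frame(i = seq_len(ngrp), s = 0, d = 0), g)
```

For the example this reproduces the seven records of the genetic group pedigree listed above.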
meta-founder pedigree file
The meta-founder concept is very similar to the genetic group concept with the phantom parents becoming meta-founders, and therefore the same format requirements as for the genetic group pedigree apply. The number of meta-founders as well as the meta-founder co-variance matrix are communicated to lmt as extra parameters at the appropriate location in the parameter file.
Genotype cross-reference file
The file contains the pedigree ids of the genotyped individuals as a single column vector. The genotype file and the cross-reference file must have the same number of lines. lmt relates individual id and genotype via the file line number, that is, the pedigree id located in line #5 of the cross-reference file is that of the individual whose genotype is located in line #5 of the genotype file.
Genotype file
A genotype file must be in ASCII text format where each line contains a single genotype coded 0, 1 or 2 for homozygous "aa", heterozygous "Aa" and homozygous "AA", respectively. Currently, missing values are not supported. A single genotype must not contain any spaces. The file must have as many lines as there are genotypes, and each genotype must have the same number of markers. A file containing 10 genotypes of 40 markers each may be
0122221210011211221100021210220020021221
1211121121111120110011110201121020111111
0122211210012202222200022120220111021222
0122211210012202222200022120220111021222
0222220220020220220000020200220020022220
1111122111102111111111111211121020110112
2200022002202020000002200202022020200002
1211111121112111111111111111121111111112
2200022012202020000002200202012020200002
1211121121111120110001110201111020111111
Note that additional information about the genotypes, e.g. the pedigree ids of the related individuals, may be supplied via an additional file.
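A genotype file and its cross-reference file can be checked in R as sketched below. For brevity the temporary genotype file holds only the first 10 markers of the first three genotypes of the example, and the pedigree ids 5, 6 and 7 in the cross-reference file are made up for the illustration.

```r
fg <- tempfile()
fx <- tempfile()
writeLines(c("0122221210", "1211121121", "0122211210"), fg)  # shortened example genotypes
writeLines(c("5", "6", "7"), fx)                             # hypothetical pedigree ids

g <- readLines(fg)
if (any(grepl("[^012]", g))) stop("genotypes may contain only 0, 1 and 2")
if (length(unique(nchar(g))) != 1) stop("genotypes differ in number of markers")

ids <- scan(fx, quiet = TRUE)
if (length(ids) != length(g)) stop("genotype and cross-reference file differ in number of lines")

# marker matrix with one row per genotype and one column per marker;
# rows are labelled by the pedigree id found on the same line number
m <- do.call(rbind, lapply(strsplit(g, ""), as.integer))
rownames(m) <- ids
```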
Allele frequency file
A file containing allele frequencies must be in block format and contain a single block named FREQUENCIES. As an example, for 5 markers and setting the allele frequencies to 0.5 the file content is
BEGIN FREQUENCIES
real,array,64,10
1.0
1.0
1.0
1.0
1.0
END FREQUENCIES
Note that the allele frequencies must be expressed as expected allele content (2p), so an allele frequency of 0.5 is written as 1.0. A pqfile can be generated in R by
x=list(FREQUENCIES=rep(1,nmarker))
writelmtblockfile(x,"pq.blkcsv","txt")
and providing it to lmt.
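If the expected allele content is to be taken from observed genotypes rather than set to a constant, it may be computed as sketched below. The small marker matrix is made up for the illustration. With genotypes coded 0, 1 and 2, the column mean of a marker equals twice the frequency of the counted allele, i.e. 2p.

```r
# hypothetical 0/1/2 genotypes: 3 individuals (rows) x 5 markers (columns)
m <- rbind(c(0, 1, 2, 2, 1),
           c(1, 2, 1, 0, 1),
           c(2, 2, 1, 1, 0))

p <- colMeans(m) / 2      # allele frequency of the counted allele per marker
content <- 2 * p          # expected allele content (2p), one value per marker
```

The vector `content` would then take the place of `rep(1,nmarker)` in the list passed to the block file writer shown above.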