File formats

From Linear Mixed Models Toolbox

File content determination via file name extensions

lmt automatically detects the format of input files from the file name extension. Supported extensions are:

  • ".csv" for ordinary comma-separated values ASCII text files
  • ".blkcsv" for comma-separated values ASCII text files in block format
  • ".bin" for binary files in block format
  • ".coocsv" for comma-separated values ASCII text files storing sparse matrices in coordinate format

Note that this mechanism does not apply to files containing genotypes.

".csv" files

".csv" files must contain rectangular arrays, that is, every line must contain the same number of comma-separated records.

lmt does not support empty records. For example, the empty third record in

1,2,,4

must be filled with some value (e.g. zero).

".csv" files may contain several comment lines, but only at the top of the file. The comment character is "#".

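The rules for ".csv" files stated in this section (rectangular shape, no empty records, "#" comment lines only at the top) can be checked before a file is handed to lmt. The following Python sketch is not part of lmt; the function name check_lmt_csv is hypothetical, and treating a blank line as an empty record is an assumption.

```python
def check_lmt_csv(lines):
    """Return the data rows of a ".csv" file, or raise ValueError.

    `lines` is any iterable of text lines, e.g. an open file.
    Hypothetical helper, not part of lmt.
    """
    rows = []
    width = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            # Comment lines are only allowed before the first data line.
            if rows:
                raise ValueError(f"line {lineno}: comment after data")
            continue
        fields = line.split(",")
        # Assumption: a blank line counts as a single empty record.
        if any(field.strip() == "" for field in fields):
            raise ValueError(f"line {lineno}: empty record")
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            raise ValueError(
                f"line {lineno}: expected {width} records, found {len(fields)}"
            )
        rows.append(fields)
    return rows
```

A file could be checked with `with open("mydata.csv") as f: check_lmt_csv(f)`, where the file name is only a placeholder.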
Block format

Files in block format can contain multiple records of very different data. That is, a single file can contain several different matrices, vectors, and scalars, or a mixture of those. Block files are important for supplying sparse matrices, large arrays (e.g. genomic relationship matrices), block-diagonal matrices (e.g. when residual variances are heterogeneous), etc. Further, lmt may write intermediate output to a block file to keep connected information in the same file.

The general structure of a single block is:

BEGIN MYNAME
DESCRIPTOR
DATA
END MYNAME

where MYNAME is a user-defined block name, which must be unique, DESCRIPTOR is a comma-separated sequence of data descriptors, and DATA are the actual data. DESCRIPTOR contains at least three values (type, kind and size); dimension is added when kind is array:

  • type
  • kind
  • size
  • dimension

where type is either int for integer, real for real, or char for character, and kind is either scalar or array. size is the storage size in bits of a single value: for integer and real it should be 64, and for character it should be the storage size of the longest character string in an array of strings. If kind is set to array, dimension is a vector of comma-separated values containing the dimensions of the array, where the length of the vector determines the number of dimensions. That is, the vector has length one for a one-dimensional array, length two for a two-dimensional array, and length three for a three-dimensional array. Currently only one-, two- and three-dimensional arrays are supported.
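To make the descriptor layout concrete, the following Python sketch splits a DESCRIPTOR line into its parts and applies the constraints described above. It is an illustration only: the names Descriptor and parse_descriptor are hypothetical, and lmt's own parser may accept or reject lines differently.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    type: str         # "int", "real" or "char"
    kind: str         # "scalar" or "array"
    size: int         # storage size of a single value
    dimension: tuple  # empty for scalars, 1 to 3 entries for arrays


def parse_descriptor(line):
    """Parse a line such as "real,array,64,2,2" (hypothetical helper)."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) < 3:
        raise ValueError("descriptor needs at least type, kind and size")
    type_, kind, size = parts[0], parts[1], int(parts[2])
    if type_ not in ("int", "real", "char"):
        raise ValueError(f"unknown type: {type_}")
    if kind not in ("scalar", "array"):
        raise ValueError(f"unknown kind: {kind}")
    dimension = tuple(int(part) for part in parts[3:])
    if kind == "scalar" and dimension:
        raise ValueError("a scalar takes no dimension")
    if kind == "array" and not 1 <= len(dimension) <= 3:
        raise ValueError("an array needs one, two or three dimensions")
    return Descriptor(type_, kind, size, dimension)
```

For example, `parse_descriptor("int,scalar,64")` yields a descriptor with an empty dimension, while `parse_descriptor("real,array,64,2,2")` describes a 2 x 2 real matrix.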

A block file in csv format containing an integer scalar, an integer vector, a real matrix, and a real scalar may look like this:

BEGIN a
int,scalar,64
5
END a
BEGIN b
int,array,64,2
1
2
END b
BEGIN c
real,array,64,2,2
5,5
5,5
END c
BEGIN d
real,scalar,64
5
END d

The above file can be produced in R as follows:

x<-list(a=as.integer(5),b=as.integer(c(1,2)),c=matrix(5,2,2),d=5)
writelmtblockfile(x,"myfile.blkcsv","txt")

The R function writelmtblockfile can be obtained from the author.
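For readers who do not have writelmtblockfile at hand, the text layout of the example can be sketched in Python. This is not a port of the R function: the names format_block and format_block_file are hypothetical, only int and real scalars, vectors and matrices are handled, and the layout (one vector element per line, one matrix row per line) is inferred from the example above. Reals are written as 5.0 rather than 5; whether lmt requires a particular number format is not stated here.

```python
def _type_name(value):
    # bool is rejected on purpose: it is a subclass of int in Python.
    if isinstance(value, int) and not isinstance(value, bool):
        return "int"
    if isinstance(value, float):
        return "real"
    raise TypeError(f"unsupported value: {value!r}")


def format_block(name, value):
    """Format one block; value is a number, a list, or a list of lists."""
    if isinstance(value, (int, float)):
        kind, dims, rows = "scalar", [], [[value]]
    elif value and isinstance(value[0], list):
        kind, dims, rows = "array", [len(value), len(value[0])], value
    elif value:
        kind, dims, rows = "array", [len(value)], [[v] for v in value]
    else:
        raise ValueError("empty arrays are not handled in this sketch")
    types = {_type_name(v) for row in rows for v in row}
    if len(types) != 1:
        raise TypeError("all values in a block must share one type")
    # size is fixed at 64, as recommended for int and real.
    descriptor = ",".join([types.pop(), kind, "64"] + [str(d) for d in dims])
    lines = [f"BEGIN {name}", descriptor]
    lines += [",".join(repr(v) for v in row) for row in rows]
    lines.append(f"END {name}")
    return "\n".join(lines)


def format_block_file(blocks):
    """Format a dict of blocks; dict keys guarantee unique block names."""
    return "\n".join(format_block(n, v) for n, v in blocks.items()) + "\n"
```

The four blocks of the example correspond to `format_block_file({"a": 5, "b": [1, 2], "c": [[5.0, 5.0], [5.0, 5.0]], "d": 5.0})`; the resulting string can then be written to a file with the ".blkcsv" extension.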

In binary format, BEGIN MYNAME, END MYNAME and DESCRIPTOR are stored in 50-byte character strings, which must not be null-terminated.
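A fixed-width, non-null-terminated string field of this kind might be built as in the following Python sketch. Several points are assumptions rather than documented behaviour: that each of the three items occupies exactly one 50-byte field, that the text is ASCII, and that unused bytes are padded with spaces. The function names are hypothetical, and the layout of DATA in binary files is not covered.

```python
HEADER_BYTES = 50  # field width stated for binary block files


def encode_header(text):
    """Return text as a 50-byte field without a terminating null byte.

    Assumption: unused bytes are filled with spaces.
    """
    raw = text.encode("ascii")
    if len(raw) > HEADER_BYTES:
        raise ValueError(f"header longer than {HEADER_BYTES} bytes: {text!r}")
    return raw.ljust(HEADER_BYTES, b" ")


def decode_header(raw):
    """Inverse of encode_header: strip the assumed space padding."""
    if len(raw) != HEADER_BYTES:
        raise ValueError(f"expected {HEADER_BYTES} bytes, got {len(raw)}")
    return raw.decode("ascii").rstrip(" ")
```

With these helpers, `encode_header("BEGIN a")` is 50 bytes long and `decode_header` recovers the original text.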

lmt will assume block format for all files with the file name suffix .bin or .blkcsv.


.coocsv format

Comma-separated coordinate format is used to write out sparse matrices in ASCII. The first line of the respective file contains the number of rows and columns of the sparse matrix. All subsequent lines contain tuples of three values, which are

  • row number
  • column number
  • value
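A file in this format could be read as in the following Python sketch. The function name parse_coocsv is hypothetical, and the sketch assumes that the two values on the first line are comma-separated like the rest of the file. Row and column numbers are kept exactly as written, since the index base is not stated here.

```python
def parse_coocsv(lines):
    """Read coordinate-format lines into (nrows, ncols, entries).

    `lines` is any iterable of text lines, e.g. an open file.
    entries maps (row number, column number) to the stored value.
    Hypothetical helper, not part of lmt.
    """
    it = iter(lines)
    # Assumption: the first line reads "nrows,ncols".
    header = next(it).strip().split(",")
    nrows, ncols = int(header[0]), int(header[1])
    entries = {}
    for line in it:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        row, col, value = line.split(",")
        entries[(int(row), int(col))] = float(value)
    return nrows, ncols, entries
```

For instance, the lines "3,3", "1,1,2.0" and "2,3,-1.5" describe a 3 x 3 matrix with two stored values.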