Difference between revisions of "File formats"




Revision as of 06:27, 4 March 2022

File content determination via file name extensions

lmt automatically detects the format of input files by the filename extension. Supported extensions are

  • ".csv" for ordinary comma-separated values ASCII text files
  • ".blkcsv" for comma-separated values ASCII text files in block format
  • ".bin" for binary files in block format
  • ".coocsv" for comma-separated values ASCII text files storing sparse matrices in coordinate format.

Note that this mechanism does not apply to files containing genotypes.
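
As an illustration only, the extension-based detection described above could be sketched in Python as follows. The function name detect_format and the format labels are hypothetical and not part of lmt; the sketch also assumes that extensions are matched case-sensitively, which the text does not state.

```python
from pathlib import Path

# Extension-to-format table taken from the list above. The labels on the
# right-hand side are hypothetical names chosen for this sketch.
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".blkcsv": "block-csv",
    ".bin": "block-binary",
    ".coocsv": "coordinate-csv",
}


def detect_format(filename: str) -> str:
    """Return the format label implied by the file name extension."""
    suffix = Path(filename).suffix  # text from the last dot, e.g. ".csv"
    try:
        # Assumption: matching is case-sensitive.
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported file name extension: {suffix!r}") from None
```

Calling detect_format("grm.bin") returns "block-binary", while a file name with an unlisted extension raises ValueError.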

".csv" files

"*.csv" files must contain rectangular arrays, that is, every line must contain the same number of comma-separated records.

lmt supports empty records, e.g.

1,2,,4

which will be filled with zero.

".csv" files may contain several comment lines, but only at the top of the file. The comment character is "#".

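The rules for ".csv" files can be illustrated with a small Python reader. This is a sketch under stated assumptions, not lmt's own parser: the function name read_csv_rows is hypothetical, records are kept as strings, and the handling of blank lines and of whitespace around records (ignored and stripped, respectively) is an assumption that the text does not cover.

```python
def read_csv_rows(text: str) -> list[list[str]]:
    """Parse csv text into a rectangular list of string records."""
    lines = text.splitlines()

    # Comment lines start with "#" and are only allowed at the top.
    start = 0
    while start < len(lines) and lines[start].startswith("#"):
        start += 1

    rows = []
    for number, line in enumerate(lines[start:], start=start + 1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line.startswith("#"):
            raise ValueError(f"line {number}: comment line below the top of the file")
        # Empty records are filled with zero; stripping whitespace is an assumption.
        rows.append([field.strip() or "0" for field in line.split(",")])

    # Every line must contain the same number of records.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"file is not rectangular, record counts: {sorted(widths)}")
    return rows
```

For the input line 1,2,,4 the function returns the records "1", "2", "0" and "4".
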
Block format

Files in block format can contain multiple records of very different data. That is, a single file can contain several different matrices, vectors, and scalars, or a mixture of those. Block files are important for supplying sparse matrices, large arrays (e.g. genomic relationship matrices), block-diagonal matrices (e.g. when residual variances are heterogeneous) etc. Further, lmt may write intermediate output to a block file to keep related information in the same file.

The general structure of a single block is:

BEGIN MYNAME
DESCRIPTOR
DATA
END MYNAME

where MYNAME is a user-defined block name, which must be unique, DESCRIPTOR is a comma-separated sequence of data descriptors, and DATA are the actual data. DESCRIPTOR contains at least 3 values (type, kind and size), followed by dimension if kind is array:

  • type
  • kind
  • size
  • dimension (only if kind is array)

where type is either int for integer, real for real, or char for character. kind can be either scalar or array. size is the storage size in bits of a single value: for integer and real it should be 64, for character it should be the storage size of the longest character string in an array of strings. If kind is set to array, dimension is a vector of comma-separated values containing the dimensions of the array, where the length of the vector determines the number of dimensions. That is, for a one-dimensional array the vector is of length one, for a two-dimensional array the length is two, and for a three-dimensional array the length is three. Currently only one-, two- and three-dimensional arrays are supported.
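
To make the descriptor rules concrete, the following Python sketch parses and validates a single DESCRIPTOR line. The names Descriptor and parse_descriptor are hypothetical and chosen for illustration; how strictly lmt itself validates a descriptor is not stated here, so the checks below are assumptions derived from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    type: str                        # "int", "real" or "char"
    kind: str                        # "scalar" or "array"
    size: int                        # storage size in bits of a single value
    dimension: tuple[int, ...] = ()  # empty for scalars


def parse_descriptor(line: str) -> Descriptor:
    """Parse a comma-separated DESCRIPTOR line such as 'real,array,64,2,2'."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 3:
        raise ValueError("a descriptor needs at least type, kind and size")

    type_, kind, size = fields[0], fields[1], int(fields[2])
    dimension = tuple(int(field) for field in fields[3:])

    if type_ not in ("int", "real", "char"):
        raise ValueError(f"unknown type: {type_!r}")
    if kind == "scalar":
        if dimension:
            raise ValueError("a scalar must not have a dimension")
    elif kind == "array":
        # Only one-, two- and three-dimensional arrays are supported.
        if not 1 <= len(dimension) <= 3:
            raise ValueError("an array needs one, two or three dimensions")
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return Descriptor(type_, kind, size, dimension)
```

For example, parse_descriptor("real,array,64,2,2") yields a descriptor with dimension (2, 2), and parse_descriptor("int,scalar,64") yields one with an empty dimension.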

A block file in csv format containing an integer scalar, an integer vector, a real matrix, and a real scalar may look like this:

BEGIN a
int,scalar,64
5
END a
BEGIN b
int,array,64,2
1
2
END b
BEGIN c
real,array,64,2,2
5,5
5,5
END c
BEGIN d
real,scalar,64
5
END d
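
A file like the one above could be split into its blocks with a few lines of Python. This is a minimal sketch and not lmt's reader: the function name split_blocks is hypothetical, blank lines and surrounding whitespace are ignored by assumption, and the sketch assumes that no DATA line is identical to the block's own END line.

```python
def split_blocks(text: str) -> dict[str, tuple[str, list[str]]]:
    """Map each block name to its (descriptor line, list of data lines)."""
    # Assumption: blank lines and leading/trailing whitespace carry no meaning.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    blocks: dict[str, tuple[str, list[str]]] = {}
    i = 0
    while i < len(lines):
        keyword, _, name = lines[i].partition(" ")
        name = name.strip()
        if keyword != "BEGIN" or not name:
            raise ValueError(f"expected 'BEGIN <name>', got {lines[i]!r}")
        if name in blocks:
            raise ValueError(f"block name {name!r} is not unique")
        try:
            end = lines.index(f"END {name}", i + 1)
        except ValueError:
            raise ValueError(f"block {name!r} has no matching END line") from None
        if end == i + 1:
            raise ValueError(f"block {name!r} has no descriptor line")
        # The first line after BEGIN is the descriptor, the rest is data.
        blocks[name] = (lines[i + 1], lines[i + 2:end])
        i = end + 1
    return blocks
```

Applied to the example file, the entry for block c holds the descriptor real,array,64,2,2 and the two data lines 5,5 and 5,5.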

In binary format, BEGIN MYNAME, END MYNAME and DESCRIPTOR are each stored in a 50-byte character string, which must not be null-terminated.
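
A fixed-width field of this kind could be produced in Python as sketched below. Only the 50-byte width and the absence of a null terminator come from the description above. Padding the unused bytes with blanks on the right and using ASCII encoding are assumptions, and pack_field and unpack_field are hypothetical helper names. The layout of the DATA part is not covered.

```python
FIELD_WIDTH = 50  # bytes per BEGIN, END and DESCRIPTOR field


def pack_field(text: str) -> bytes:
    """Encode text as a 50-byte field without a null terminator."""
    raw = text.encode("ascii")  # assumption: fields are plain ASCII
    if len(raw) > FIELD_WIDTH:
        raise ValueError(f"{text!r} does not fit into {FIELD_WIDTH} bytes")
    # Assumption: unused bytes are filled with blanks, not with null bytes.
    return raw.ljust(FIELD_WIDTH, b" ")


def unpack_field(raw: bytes) -> str:
    """Decode a 50-byte field and drop the trailing blank padding."""
    if len(raw) != FIELD_WIDTH:
        raise ValueError(f"expected {FIELD_WIDTH} bytes, got {len(raw)}")
    return raw.decode("ascii").rstrip(" ")
```

With these helpers, pack_field("BEGIN a") yields exactly 50 bytes, and unpack_field recovers the original string.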

lmt will assume block format for all files with the file name suffix .bin or .blkcsv.