Difference between revisions of "File formats"
Revision as of 06:23, 4 March 2022
File content determination via file name extensions
lmt automatically detects the format of an input file from its file name extension. Supported extensions are:
- ".csv" for ordinary comma separated values ascii text files
- ".blkcsv" for comma separated value ascii text files in block format
- ".bin" for binary files in block format
- ".coocsv" for comma separated value ascii text files storing sparse matrices in coordinate format
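The extension list above amounts to a simple lookup table. The following Python sketch illustrates that mapping only; it is not lmt's own code. The function name `detect_format` and the format labels are hypothetical, and exact, case-sensitive matching of the suffix is an assumption.

```python
from pathlib import Path

# Suffixes taken from the list above; the labels on the right are
# illustrative names chosen for this sketch, not lmt identifiers.
FORMAT_BY_SUFFIX = {
    ".csv": "csv",
    ".blkcsv": "block csv",
    ".bin": "block binary",
    ".coocsv": "coordinate csv",
}


def detect_format(filename):
    """Return the format label for a file name, or None if the suffix is unknown."""
    # Assumption: the suffix is matched exactly (case-sensitive).
    return FORMAT_BY_SUFFIX.get(Path(filename).suffix)
```

For example, `detect_format("grm.bin")` yields the block binary label, while a file name with any other suffix yields `None`.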
Note that this mechanism does not apply to files containing genotypes.
".csv" files
".csv" files may contain several comment lines, but only at the top of the file. The comment character is "#".
The type of the file content is determined by its intended use:
- a data file is expected to contain only real (floating-point) numbers, which are converted to integers if required,
- a file containing an ordinary pedigree is expected to contain only integer numbers,
- a file containing a missing value indicator matrix is expected to contain only character strings.
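The rules above (comment lines only at the top, content type fixed by the intended use) can be sketched in Python as follows. This illustrates the described behaviour and is not lmt's reader. The function name `read_plain_csv` and its `convert` parameter are hypothetical: the caller passes `float` for a data file, `int` for a pedigree and `str` for a missing value indicator matrix.

```python
import csv


def read_plain_csv(text, convert=float):
    """Parse the text of a ".csv" file into a list of rows.

    Leading lines starting with "#" are skipped as comments; every
    remaining field is passed through `convert`.
    """
    lines = text.splitlines()
    start = 0
    # Comment lines are only allowed at the top of the file.
    while start < len(lines) and lines[start].startswith("#"):
        start += 1
    reader = csv.reader(lines[start:])
    # Empty lines yield empty rows, which are dropped here.
    return [[convert(field) for field in row] for row in reader if row]
```

A "#" line that appears after the first data row is not treated as a comment in this sketch and will raise a conversion error for numeric content.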
Block format
Files in block format can contain multiple records of very different data. That is, a single file can contain several different matrices, vectors and scalars, or a mixture of those. Block files are important for supplying sparse matrices, large arrays (e.g. genomic relationship matrices), block-diagonal matrices (e.g. when residual variances are heterogeneous), etc. Further, lmt may write intermediate output to a block file to keep related information in the same file.
The general structure of a single block is:
    BEGIN MYNAME
    DESCRIPTOR
    DATA
    END MYNAME
where MYNAME is a user-defined block name which must be unique, DESCRIPTOR is a comma-separated sequence of data descriptors, and DATA are the actual data. DESCRIPTOR contains at least three values (type, kind and size); for arrays, dimension is added:
- type
- kind
- size
- dimension
where type is either int for integer, real for real or char for character. kind can be either scalar or array. size is the storage size in bits of a single value; for integer and real it should be 64, for character it should be the storage size of the longest character string in an array of strings. If kind is set to array, dimension is a vector of comma-separated values containing the dimensions of the array, where the length of the vector determines the number of dimensions. That is, for a one-dimensional array the vector has length one, for a two-dimensional array length two and for a three-dimensional array length three. Currently only one-, two- and three-dimensional arrays are supported.
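The descriptor rules above can be checked mechanically. The Python sketch below parses and validates one DESCRIPTOR line according to those rules. The names `Descriptor` and `parse_descriptor` are hypothetical, and rejecting a scalar that carries dimension values is an assumption rather than documented lmt behaviour.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    type: str    # "int", "real" or "char"
    kind: str    # "scalar" or "array"
    size: int    # storage size of a single value
    dims: tuple  # empty for scalars, 1 to 3 entries for arrays


def parse_descriptor(line):
    """Parse a comma-separated DESCRIPTOR line and validate it."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 3:
        raise ValueError("descriptor needs at least type, kind and size")
    dtype, kind, size = fields[0], fields[1], int(fields[2])
    if dtype not in ("int", "real", "char"):
        raise ValueError(f"unknown type: {dtype!r}")
    if kind not in ("scalar", "array"):
        raise ValueError(f"unknown kind: {kind!r}")
    dims = tuple(int(f) for f in fields[3:])
    # Assumption: a scalar must not carry dimension values.
    if kind == "scalar" and dims:
        raise ValueError("a scalar takes no dimension values")
    if kind == "array" and not 1 <= len(dims) <= 3:
        raise ValueError("arrays need one, two or three dimensions")
    return Descriptor(dtype, kind, size, dims)
```

With this sketch, `parse_descriptor("real,array,64,2,2")` describes a 2 by 2 real matrix, and a descriptor with four dimension values is rejected.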
A block file in csv format containing an integer scalar, an integer vector, a real matrix and a real scalar may look like this:
    BEGIN a
    int,scalar,64
    5
    END a
    BEGIN b
    int,array,64,2
    1
    2
    END b
    BEGIN c
    real,array,64,2,2
    5,5
    5,5
    END c
    BEGIN d
    real,scalar,64
    5
    END d
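To show how such a file could be consumed, here is a minimal Python sketch of a reader for block files in csv format. It is not lmt's parser. It assumes that each BEGIN line, descriptor, data row and END line sits on its own line, with the values of a row separated by commas; the name `read_blocks` and the `SAMPLE` text are hypothetical.

```python
def read_blocks(text):
    """Parse block-format csv text into a dict mapping block name to data."""
    converters = {"int": int, "real": float, "char": str}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    blocks = {}
    i = 0
    while i < len(lines):
        keyword, _, name = lines[i].partition(" ")
        name = name.strip()
        if keyword != "BEGIN" or not name:
            raise ValueError(f"expected 'BEGIN <name>', got {lines[i]!r}")
        if name in blocks:
            raise ValueError(f"block name {name!r} is not unique")
        if i + 1 >= len(lines):
            raise ValueError(f"block {name!r} has no descriptor")
        descriptor = [f.strip() for f in lines[i + 1].split(",")]
        dtype, kind = descriptor[0], descriptor[1]
        convert = converters[dtype]
        i += 2
        rows = []
        # Collect data rows until the matching END line.
        while i < len(lines) and lines[i].split() != ["END", name]:
            rows.append([convert(v.strip()) for v in lines[i].split(",")])
            i += 1
        if i == len(lines):
            raise ValueError(f"block {name!r} has no END line")
        i += 1  # skip the END line
        blocks[name] = rows[0][0] if kind == "scalar" else rows
    return blocks


SAMPLE = """\
BEGIN a
int,scalar,64
5
END a
BEGIN b
int,array,64,2
1
2
END b
BEGIN c
real,array,64,2,2
5,5
5,5
END c
BEGIN d
real,scalar,64
5
END d
"""

blocks = read_blocks(SAMPLE)
```

The sketch returns arrays as lists of rows exactly as they appear in the file and does not check them against the dimension values in the descriptor. How the rows of a three-dimensional array are laid out is not specified here, so the sketch makes no attempt to reshape them.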
In binary format, BEGIN MYNAME, END MYNAME and DESCRIPTOR are stored in 50 byte character strings which must not be null terminated.
lmt will assume block format for all files with file name suffix .bin and .blkcsv.
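A fixed-width, non-null-terminated field of this kind could be produced as in the Python sketch below. The padding character is not stated here; padding with blanks on the right is an assumption, as are the ASCII encoding and the helper name `encode_field`. The layout of the DATA part in binary files is not covered.

```python
FIELD_WIDTH = 50  # bytes per BEGIN, END and DESCRIPTOR field


def encode_field(text, pad=b" "):
    """Encode text as a fixed-width field without a null terminator.

    Assumption: the field is right-padded with blanks.
    """
    raw = text.encode("ascii")
    if len(raw) > FIELD_WIDTH:
        raise ValueError(f"field longer than {FIELD_WIDTH} bytes: {text!r}")
    return raw.ljust(FIELD_WIDTH, pad)


header = encode_field("BEGIN a")
```

Here `header` is exactly 50 bytes long and contains no null byte.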