File formats
Revision as of 06:33, 4 March 2022
File content determination via file name extensions
lmt automatically detects the format of an input file from its file name extension. Supported extensions are:
- ".csv" for ordinary comma-separated values ASCII text files
- ".blkcsv" for comma-separated values ASCII text files in block format
- ".bin" for binary files in block format
- ".coocsv" for CSV files storing sparse matrices in coordinate format
Note that this mechanism does not apply to files containing genotypes.
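The extension-based detection described above can be pictured with a small lookup table. This is only an illustrative sketch, not lmt's own code: the function name detect_format and the format labels are hypothetical, and the sketch assumes the extension is matched exactly as written in the list.

```python
from pathlib import Path

# Extensions come from the list above; the labels on the right are
# illustrative names chosen for this sketch, not lmt identifiers.
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".blkcsv": "block-csv",
    ".bin": "block-binary",
    ".coocsv": "coordinate-csv",
}


def detect_format(filename):
    """Return the format label implied by the file name extension."""
    suffix = Path(filename).suffix
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported file name extension: {suffix!r}")


print(detect_format("pedigree.csv"))
print(detect_format("grm.bin"))
```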
".csv" files
".csv" files must contain rectangular arrays, that is, every line must contain the same number of comma-separated records.
lmt does not support empty records. For example, the empty third record in
1,2,,4
must be filled with some value (e.g. zero).
".csv" files may contain several comment lines, but only at the top of the file. The comment character is "#".
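The three rules for ".csv" files (rectangular shape, no empty records, comments only at the top) can be checked before a file is handed to lmt. The following is a minimal sketch of such a pre-check; check_csv_lines is a hypothetical helper, and treating whitespace-only records as empty is an assumption of this sketch.

```python
def check_csv_lines(lines):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    width = None          # number of records on the first data line
    in_header = True      # True while only comment lines have been seen
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            if not in_header:
                problems.append(f"line {number}: comment after data")
            continue
        in_header = False
        records = line.split(",")
        # Assumption: a record holding only whitespace counts as empty.
        if any(record.strip() == "" for record in records):
            problems.append(f"line {number}: empty record")
        if width is None:
            width = len(records)
        elif len(records) != width:
            problems.append(
                f"line {number}: expected {width} records, found {len(records)}"
            )
    return problems


print(check_csv_lines(["# header comment", "1,2,3,4", "1,2,,4"]))
```

To check a file on disk, the open file object can be passed directly, since it iterates over lines.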
Block format
Files in block format can contain multiple records of very different data. That is, a single file can contain several different matrices, vectors and scalars, or a mixture of those. Block files are important for supplying sparse matrices, large arrays (e.g. genomic relationship matrices), block-diagonal matrices (e.g. when residual variances are heterogeneous) etc. Further, lmt may write intermediate output to a block file in order to keep connected information in the same file.
The general structure of a single block is:
BEGIN MYNAME
DESCRIPTOR
DATA
END MYNAME
where MYNAME is a user-defined block name which must be unique, DESCRIPTOR is a comma-separated sequence of data descriptors, and DATA are the actual data. DESCRIPTOR contains at least three values (type, kind and size), followed by dimension when the block holds an array:
- type
- kind
- size
- dimension
where type is either int for integer, real for real or char for character, and kind is either scalar or array. size is the storage size in bits of a single value: for integer and real it should be 64, for character it should be the storage size of the longest character string in an array of strings. If kind is set to array, dimension is a vector of comma-separated values containing the dimensions of the array, where the length of the vector determines the number of dimensions. That is, the vector has length one for a one-dimensional array, length two for a two-dimensional array and length three for a three-dimensional array. Currently only one-, two- and three-dimensional arrays are supported.
A block file in csv format containing an integer scalar, an integer vector, a real matrix and a real scalar may be:
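The descriptor rules above can be expressed as a small parser. This is a hedged sketch rather than lmt's implementation: parse_descriptor and the dictionary keys are hypothetical names, and rejecting a dimension on a scalar is an assumption made here for strictness.

```python
VALID_TYPES = {"int", "real", "char"}
VALID_KINDS = {"scalar", "array"}


def parse_descriptor(text):
    """Split a DESCRIPTOR line into type, kind, size and dimension."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) < 3:
        raise ValueError("descriptor needs at least type, kind and size")
    dtype, kind, size = parts[0], parts[1], int(parts[2])
    dimension = tuple(int(part) for part in parts[3:])
    if dtype not in VALID_TYPES:
        raise ValueError(f"unknown type: {dtype!r}")
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if kind == "array" and not 1 <= len(dimension) <= 3:
        raise ValueError("arrays need one, two or three dimensions")
    # Assumption of this sketch: a scalar carries no dimension values.
    if kind == "scalar" and dimension:
        raise ValueError("scalars take no dimension")
    return {"type": dtype, "kind": kind, "size": size, "dimension": dimension}


print(parse_descriptor("real,array,64,2,2"))
```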
BEGIN a
int,scalar,64
5
END a
BEGIN b
int,array,64,2
1
2
END b
BEGIN c
real,array,64,2,2
5,5
5,5
END c
BEGIN d
real,scalar,64
5
END d
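To illustrate how such a file decomposes into blocks, here is a minimal reader sketch. It assumes that each BEGIN line, descriptor, data row and END line sits on its own line, which is an interpretation of the example rather than something stated by lmt. The function read_blocks and the keys of its result are hypothetical, and values are kept as strings.

```python
def read_blocks(lines):
    """Group the lines of a block csv file by block name."""
    blocks = {}
    name = None           # name of the block currently being read
    descriptor = None
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if name is None:
            if not line.startswith("BEGIN "):
                raise ValueError(f"expected BEGIN, found: {line!r}")
            name = line[len("BEGIN "):].strip()
            if name in blocks:
                raise ValueError(f"block name is not unique: {name!r}")
            descriptor, rows = None, []
        elif line == f"END {name}":
            blocks[name] = {"descriptor": descriptor, "rows": rows}
            name = None
        elif descriptor is None:
            # First line after BEGIN is taken to be the DESCRIPTOR.
            descriptor = line.split(",")
        else:
            rows.append(line.split(","))
    if name is not None:
        raise ValueError(f"block {name!r} has no END line")
    return blocks


example = """BEGIN a
int,scalar,64
5
END a
BEGIN c
real,array,64,2,2
5,5
5,5
END c
""".splitlines()

for block_name, block in read_blocks(example).items():
    print(block_name, block["descriptor"], block["rows"])
```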
In binary format, BEGIN MYNAME, END MYNAME and DESCRIPTOR are stored in 50-byte character strings, which must not be null-terminated.
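A fixed-width, non-null-terminated string field can be produced as sketched below. The text above only fixes the width at 50 bytes and forbids null termination; padding with trailing spaces and using ASCII encoding are assumptions of this sketch, and pack_field and unpack_field are hypothetical helper names.

```python
FIELD_WIDTH = 50  # width in bytes, as stated above


def pack_field(text):
    """Encode text into exactly FIELD_WIDTH bytes without a null terminator."""
    raw = text.encode("ascii")
    if len(raw) > FIELD_WIDTH:
        raise ValueError(f"text longer than {FIELD_WIDTH} bytes: {text!r}")
    # Assumption: unused bytes are filled with spaces, not null bytes.
    return raw.ljust(FIELD_WIDTH, b" ")


def unpack_field(raw):
    """Decode a FIELD_WIDTH-byte field and drop the trailing padding."""
    if len(raw) != FIELD_WIDTH:
        raise ValueError(f"expected {FIELD_WIDTH} bytes, got {len(raw)}")
    return raw.decode("ascii").rstrip(" ")


field = pack_field("BEGIN MYNAME")
print(len(field), unpack_field(field))
```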
lmt assumes block format for all files with the file name suffix ".bin" or ".blkcsv".