Difference between revisions of "File formats"

From Linear Mixed Models Toolbox

Revision as of 06:19, 4 March 2022

File content determination via file name extensions

lmt automatically detects the format of an input file from its filename extension. Supported extensions are:

  • ".csv" for ordinary comma-separated value ASCII text files
  • ".blkcsv" for comma-separated value ASCII text files in block format
  • ".bin" for binary files in block format
  • ".coocsv" for csv files storing sparse matrices in coordinate format

Note that this mechanism does not apply to files containing genotypes.
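
The extension rule above can be sketched as a simple lookup. This is only an illustration and not lmt's own code: the function name `detect_format` and the format labels are hypothetical, and matching the extension case-sensitively is an assumption.

```python
from pathlib import Path

# Extension table taken from the list above. The labels on the right are
# illustrative names chosen for this sketch, not identifiers used by lmt.
FORMATS = {
    ".csv": "plain csv",
    ".blkcsv": "block csv",
    ".bin": "block binary",
    ".coocsv": "coordinate csv",
}


def detect_format(filename: str) -> str:
    """Return the format label for a file name, based on its extension only."""
    suffix = Path(filename).suffix  # e.g. ".blkcsv"; assumed to be case-sensitive
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension {suffix!r} in {filename!r}")
    return FORMATS[suffix]
```

For example, `detect_format("grm.blkcsv")` returns `"block csv"`, while a name such as `"notes.txt"` raises a `ValueError`.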

".csv" files

".csv" files may contain several commented lines at the top only where the comment character is "#".

The type of the file content is determined by its prospective use, that is

  • the data file is supposed to contain only real/float numbers, which are converted to integers if required,
  • a file containing an ordinary pedigree is supposed to contain only integer numbers,
  • a file containing a missing value indicator matrix is supposed to contain only character strings.
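
As an illustration of the two rules above (comment lines only at the top, content type determined by use), a reader might look like the following minimal sketch. The function `read_csv`, the `use` keywords and the error handling are assumptions made for this example and are not part of lmt.

```python
import io

# Value type per prospective use, following the list above.
# The keys are hypothetical labels for this sketch.
CONVERTERS = {"data": float, "pedigree": int, "missing": str}


def read_csv(stream, use):
    """Read a ".csv" stream, skipping leading "#" lines, and convert each field."""
    convert = CONVERTERS[use]
    rows = []
    in_header = True
    for line in stream:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("#"):
            if in_header:
                continue  # comment lines are allowed at the top only
            # Choice made for this sketch: reject comments after the first data line.
            raise ValueError("comment line found after the first data line")
        in_header = False
        rows.append([convert(field.strip()) for field in line.split(",")])
    return rows


pedigree = read_csv(io.StringIO("# id,sire,dam\n1,0,0\n2,1,0\n"), "pedigree")
```

Here `pedigree` is a list of integer rows, and passing `"data"` instead would parse every field as a float.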

Block format

Files in block format can contain multiple records of very different data. That is, a single file can contain several different matrices, vectors and scalars, or a mixture of those. Block files are important for supplying sparse matrices, large arrays (e.g. genomic relationship matrices), block-diagonal matrices (e.g. when residual variances are heterogeneous) etc. Further, lmt may write intermediate output to a block file to keep related information in the same file.

The general structure of a single block is:

BEGIN MYNAME
DESCRIPTOR
DATA
END MYNAME

where MYNAME is a user-defined block name which must be unique, DESCRIPTOR is a comma-separated sequence of data descriptors, and DATA is the actual data. DESCRIPTOR contains at least three values (type, kind and size), followed by the dimension values if the block holds an array:

  • type
  • kind
  • size
  • dimension

where type is either int for integer, real for real, or char for character, and kind is either scalar or array. size is the storage size in bits of a single value: for integer and real it should be 64, and for character it should be the storage size of the longest character string in an array of strings. If kind is set to array, dimension is a vector of comma-separated values containing the dimensions of the array, and the length of this vector determines the number of dimensions: one for a one-dimensional array, two for a two-dimensional array and three for a three-dimensional array. Currently only one-, two- and three-dimensional arrays are supported.
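
The descriptor rules above could be checked along the following lines. This is a hedged sketch only: the `Descriptor` class and `parse_descriptor` function are hypothetical names, and rejecting dimension values on a scalar is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Descriptor:
    type: str               # "int", "real" or "char"
    kind: str               # "scalar" or "array"
    size: int               # storage size in bits of a single value
    dims: Tuple[int, ...]   # empty for scalars, 1 to 3 entries for arrays


def parse_descriptor(line: str) -> Descriptor:
    """Parse a DESCRIPTOR line such as "real,array,64,2,2"."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 3:
        raise ValueError("a descriptor needs at least type, kind and size")
    type_, kind, size = fields[0], fields[1], int(fields[2])
    dims = tuple(int(field) for field in fields[3:])
    if type_ not in ("int", "real", "char"):
        raise ValueError(f"unknown type {type_!r}")
    if kind == "scalar":
        if dims:  # assumption: a scalar carries no dimension values
            raise ValueError("a scalar takes no dimension values")
    elif kind == "array":
        if not 1 <= len(dims) <= 3:
            raise ValueError("arrays must have one, two or three dimensions")
    else:
        raise ValueError(f"unknown kind {kind!r}")
    return Descriptor(type_, kind, size, dims)
```

With this sketch, `parse_descriptor("real,array,64,2,2")` yields a descriptor with `dims == (2, 2)`, and a line with four dimension values is rejected.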

A block file in csv format containing an integer scalar, an integer vector, a real matrix, and a real scalar may look like this:

BEGIN a                                            
int,scalar,64                                      
5
END a                                              
BEGIN b                                            
int,array,64,2                                     
1
2
END b                                              
BEGIN c                                            
real,array,64,2,2                                  
5,5
5,5
END c                                              
BEGIN d                                            
real,scalar,64                                     
5
END d
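
To show how such a file could be consumed, the following minimal sketch reads the example above into a dictionary keyed by block name. It is not lmt's reader: `read_blocks` is a hypothetical function, block names are assumed to contain no spaces, and the values are returned as a flat list because the source does not state how multi-dimensional data are laid out across lines. The sketch only checks that the number of values matches the product of the dimensions.

```python
import io
from math import prod

CAST = {"int": int, "real": float, "char": str}

SAMPLE = """\
BEGIN a
int,scalar,64
5
END a
BEGIN b
int,array,64,2
1
2
END b
BEGIN c
real,array,64,2,2
5,5
5,5
END c
BEGIN d
real,scalar,64
5
END d
"""


def read_blocks(stream):
    """Read a block csv stream into {name: {"type", "kind", "dims", "values"}}."""
    blocks = {}
    stripped = (line.strip() for line in stream)
    lines = (line for line in stripped if line)  # assumption: blank lines are ignored
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or parts[0] != "BEGIN":
            raise ValueError(f"expected 'BEGIN <name>', got {line!r}")
        name = parts[1]
        if name in blocks:
            raise ValueError(f"block name {name!r} is not unique")
        descriptor = next(lines, None)
        if descriptor is None:
            raise ValueError(f"block {name!r} has no descriptor line")
        fields = [field.strip() for field in descriptor.split(",")]
        if len(fields) < 3:
            raise ValueError(f"descriptor of block {name!r} is too short")
        type_, kind = fields[0], fields[1]
        dims = tuple(int(field) for field in fields[3:])
        cast = CAST[type_]
        values = []
        for data_line in lines:  # continues on the same iterator as the outer loop
            if data_line.split() == ["END", name]:
                break
            values.extend(cast(v.strip()) for v in data_line.split(","))
        else:
            raise ValueError(f"block {name!r} has no END line")
        # prod(()) is 1, so a scalar is expected to hold exactly one value.
        if len(values) != prod(dims):
            raise ValueError(f"block {name!r}: value count does not match dimensions")
        blocks[name] = {"type": type_, "kind": kind, "dims": dims, "values": values}
    return blocks


blocks = read_blocks(io.StringIO(SAMPLE))
```

After this call, `blocks["b"]["values"]` holds the two integers of the vector and `blocks["c"]["dims"]` is `(2, 2)`. Reshaping the flat values into a matrix is left out on purpose, since it depends on a storage order that the text above does not specify.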