Field | Header | Data | Mask |
---|---|---|---|
Description | The header from the FASTA file is copied, and appended to it is a colon (:) followed by the range represented by the data. I originally chose the Python indexing method of expressing the range. Following the header proper is a newline character (actually, one or more should be allowed, so that the format is not Unix-specific; however, the first version of the program probably will not work with DOS-format files), then a capital letter 'P'. Immediately following the 'P' is the packed data, followed by the packed "mask" information. This header was chosen for ease of use with standard GNU tools. | The packed data uses two bits per byte of source data, using the arbitrary mapping G=00, A=01, T=10, C=11. | The mask is exactly as long as the compressed data, and clarifies the meaning of the data proper. The data can be used without the mask; however, the repeat-masking would be lost, and unknown data would show up incorrectly as repeats of 'G'. Currently 3 values of mask are used: 00, 01, and 10. |
Example Input | >chrY\n | gatcGATCnN | (N/A) |
Example Output | >chrY:0-10\nP | (hex) 1B1B40 or (binary) 00011011 00011011 01000000 | (hex) 5500A0 or (binary) 01010101 00000000 10100000 |
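The packing in the table can be sketched in a few lines of Python. This is only an illustration, not the program's actual source: the function names `dibits`, `pack`, and `encode` are made up here, and the G/A/T/C mapping and the first-base-in-the-high-bits ordering are read off the example output (gatc packs to hex 1B). The handling of lowercase 'n' (data dibit 01 under mask 10) is also an assumption, inferred only from the third data byte (hex 40) of the example.

```python
# Hypothetical sketch of the packing step; names are illustrative only.
MAPPING = {"G": 0b00, "A": 0b01, "T": 0b10, "C": 0b11}  # read off the example


def dibits(base):
    """Return the (data, mask) dibits for one base character."""
    if base == "N":
        return 0b00, 0b10  # unknown base: data 00, mask 10
    if base == "n":
        return 0b01, 0b10  # assumption inferred from the example output
    data = MAPPING[base.upper()]
    mask = 0b01 if base.islower() else 0b00  # 01 marks repeat-masked (lowercase)
    return data, mask


def pack(values):
    """Pack dibits four to a byte, first dibit in the high bits, zero padded."""
    out = bytearray()
    for i in range(0, len(values), 4):
        chunk = values[i:i + 4]
        chunk = chunk + [0] * (4 - len(chunk))  # pad the last byte with 00
        byte = 0
        for value in chunk:
            byte = (byte << 2) | value
        out.append(byte)
    return bytes(out)


def encode(sequence):
    """Return (packed data, packed mask) for a string of bases."""
    pairs = [dibits(base) for base in sequence]
    data = pack([d for d, _ in pairs])
    mask = pack([m for _, m in pairs])
    return data, mask
```

Calling `encode("gatcGATCnN")` gives the data bytes 1B 1B 40 and the mask bytes 55 00 A0, matching the example row of the table.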
Another way of expressing it is that 00 (G) masked by the dibit 00 is still G, masked with 01 becomes g, and masked with 10 becomes N. A mask of 11 would render the data undefined at this point; I may find valuable use for that mask value later.
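The masking rule can be sketched as a small decoder. Again this is a hedged illustration with made-up names (`unpack`, `decode`), assuming the mapping G=00, A=01, T=10, C=11 and four dibits per byte with the first base in the high bits, as the example bytes suggest. It follows the text literally, so any dibit masked with 10 comes back as 'N', and the unused mask value 11 is treated as an error.

```python
# Hypothetical decoder sketch; names and error handling are assumptions.
BASES = "GATC"  # index is the data dibit: 00=G, 01=A, 10=T, 11=C


def unpack(packed, count):
    """Return the first `count` dibits of `packed`, high bits first."""
    values = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            values.append((byte >> shift) & 0b11)
    return values[:count]


def decode(data, mask, count):
    """Rebuild `count` bases from packed data and mask bytes."""
    result = []
    for d, m in zip(unpack(data, count), unpack(mask, count)):
        if m == 0b00:
            result.append(BASES[d])          # plain base
        elif m == 0b01:
            result.append(BASES[d].lower())  # repeat-masked base
        elif m == 0b10:
            result.append("N")               # unknown base, per the text
        else:
            raise ValueError("mask value 11 is undefined")
    return "".join(result)
```

With an all-zero mask the same data decodes to uppercase bases only, and unknown positions come back as 'G', which is the loss of information the mask exists to prevent.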
There is no final end-of-line or end-of-file character specified by the format; however, because of the structure design, any remaining bytes will be ignored (carriage return, linefeed, ^Z, whatever). That's all there is to it! The above example is repeated below in a different format.
    jcomeau@notebook ~ $ cat gatcGATCnN.txt
    >testfile
    gatcGATCnN
    jcomeau@notebook ~ $ dump gatcGATCnN.txt.2bit
    gatcGATCnN.txt.2bit:
      Addr     0 1  2 3  4 5  6 7  8 9  A B  C D  E F 0 2 4 6 8 A C E
    --------  ---- ---- ---- ---- ---- ---- ---- ---- ----------------
    00000000  3e74 6573 7466 696c 653a 302d 3130 0a50 >testfile:0-10.P
    00000010  1b1b 4055 00a0                          ..@U.
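Reading such a file back might look like the sketch below. It is not the author's code: the function name `parse` is hypothetical, and it assumes the range is written as `start-end` in Python slice style, so that the number of bases is `end - start` (the dump shows `0-10` for ten bases). It accepts one or more CR or LF characters after the header and ignores anything after the mask, as the description allows.

```python
# Hypothetical reader sketch; the range interpretation is an assumption.
def parse(blob):
    """Split a packed file into (header, start, end, data, mask)."""
    i = 0
    while i < len(blob) and blob[i] not in b"\r\n":
        i += 1                      # header runs up to the first newline
    header = blob[:i].decode("ascii")
    while i < len(blob) and blob[i] in b"\r\n":
        i += 1                      # allow one or more newline characters
    if blob[i:i + 1] != b"P":
        raise ValueError("missing 'P' marker after header")
    i += 1
    name, _, span = header.rpartition(":")
    start, end = (int(part) for part in span.split("-"))
    nbytes = (end - start + 3) // 4  # four bases per byte, rounded up
    data = blob[i:i + nbytes]
    mask = blob[i + nbytes:i + 2 * nbytes]
    if len(mask) != nbytes:
        raise ValueError("file is shorter than the header range implies")
    return name, start, end, data, mask  # trailing bytes are ignored
```

Feeding it the 22 bytes from the dump yields the header `>testfile`, the range 0 to 10, three data bytes, and three mask bytes; extra bytes appended after the mask do not change the result.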