If you would like to contribute code, you can do so through GitHub by forking the repository and sending us a pull request.
When submitting code, please make every effort to follow the conventions and style already used in the current code, so that the code stays as readable as possible.
Please also try to write unit tests wherever possible.
By contributing your code, you agree to license your contribution under the terms of the MIT License:
If you are adding a new file it should have a header like this:
/**
* Copyright (c) 2019 DKFZ - ODCF
*
* Distributed under the MIT License (license terms are at https://github.com/dkfz-odcf/FastqIndEx/blob/master/LICENSE.txt).
*/
- FastqIndEx currently runs only under Linux. For example, we access the /proc folder, which might not be available on other operating systems.
FQI files are binary files with the following general structure:
|HEADER|ENTRY 0|ENTRY 1|ENTRY 2|ENTRY ...|
The version 1 header is exactly 512 bytes wide and can be described as:
| (IndexHeader) |
| (u_int32_t) | (u_int32_t) | (u_int32_t) | (u_int32_t) | (int64_t) | (int64_t) | (bool) | (u_char) | (int64_t)[59] |
| indexWriterVersion | sizeOfIndexEntry | magicNumber | blockInterval | numberOfEntries | linesInIndexedFile | dictionariesAreCompressed | placeholder | reserved |
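The field table can be written down as a C++ struct to check the stated width. This is only a sketch: it uses the fixed-width types from `<cstdint>` in place of the `u_int32_t` and `u_char` spellings, and it assumes a typical 64-bit Linux ABI. The listed fields add up to 506 bytes; under that ABI the compiler inserts 6 padding bytes before `reserved`, which gives the 512 bytes mentioned above. Whether the real writer relies on exactly this padding is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the version 1 header, following the field table above.
// ASSUMPTION: typical 64-bit Linux ABI (int64_t is 8-byte aligned).
struct IndexHeader {
    std::uint32_t indexWriterVersion;
    std::uint32_t sizeOfIndexEntry;
    std::uint32_t magicNumber;
    std::uint32_t blockInterval;
    std::int64_t  numberOfEntries;
    std::int64_t  linesInIndexedFile;
    bool          dictionariesAreCompressed;
    unsigned char placeholder;
    // 6 bytes of compiler padding are expected here (offsets 34..39).
    std::int64_t  reserved[59];
};

// Fail at compile time if the layout does not match the documented width.
static_assert(sizeof(IndexHeader) == 512,
              "IndexHeader is expected to be 512 bytes wide");
static_assert(offsetof(IndexHeader, reserved) == 40,
              "reserved is expected to start at byte 40");
```

Checking the size at compile time is a cheap way to notice when a field is added or reordered without adjusting the `reserved` array.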
The version 1 index entry has an extracted width of 32800 bytes and can be described as:
| IndexEntry |
| (int64_t) | (int64_t) | (int64_t) | (u_int32_t) | (u_int32_t) | (u_char) | (u_int16_t) | (u_char)[32768] |
| blockID | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine | bits | reserved | compressedDictionarySize | dictionary |
If compression is enabled, an entry looks a bit different when it is stored in the FQI file. Note that the last field is then only compressedDictionarySize bytes long:
| IndexEntry |
| (int64_t) | (int64_t) | (int64_t) | (u_int32_t) | (u_int32_t) | (u_char) | (u_int16_t) | (u_char)[compressedDictionarySize] |
| blockID | blockOffsetInRawFile | startingLineInEntry | offsetOfFirstValidLine | bits | reserved | compressedDictionarySize | dictionary |
If compression is enabled, you need to read out the IndexEntry without the dictionary first, because the length of the dictionary is only known once compressedDictionarySize has been read.
If you use an IDE like CLion, make sure to activate the Conda environment before starting the IDE.
Also make sure to use the right compilers and tools. They are named a bit differently in Conda. CLion recognizes them if you start it from the activated environment.
There are several things we consider and do to make usage of FastqIndEx as safe as possible.
=> We will not get the application 100% safe, but we try to minimize the risk of data corruption.
In our code base, you will find hardly any assert or throw statements. Instead, we collect errors as early as possible. The class used for this is the ErrorAccumulator, which most classes inherit from. Before anything is done, we try to collect all errors and abort before index, extract or stats starts running.
In short:
- Do not use throw or assert unless it is necessary. Document it when you plan to diverge from this.
- Try to keep your error messages precise and helpful.
- Mark programming errors with the prefix "BUG: "
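The idea of collecting errors up front instead of throwing can be sketched as follows. `ErrorCollector`, `RunSettings` and `validate` are hypothetical stand-ins written for this illustration; they are not the actual ErrorAccumulator interface, which may look quite different.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the idea behind ErrorAccumulator.
// NOT the project's actual API.
class ErrorCollector {
public:
    void addError(std::string message) { errors.push_back(std::move(message)); }
    bool hasErrors() const { return !errors.empty(); }
    const std::vector<std::string> &getErrors() const { return errors; }

private:
    std::vector<std::string> errors;
};

// Hypothetical settings object, only used for this sketch.
struct RunSettings {
    std::string fastqPath;
    std::string indexPath;
    int blockInterval;
    std::string mode;  // "index", "extract" or "stats"
};

// Check everything first and report all problems at once, so the caller
// can abort before any file is touched.
ErrorCollector validate(const RunSettings &settings) {
    ErrorCollector collector;
    if (settings.fastqPath.empty())
        collector.addError("No FASTQ file was given.");
    if (settings.indexPath.empty())
        collector.addError("No index file was given.");
    if (settings.blockInterval <= 0)
        collector.addError("The block interval must be greater than 0, but was " +
                           std::to_string(settings.blockInterval) + ".");
    // An unknown mode at this point would be a programming error.
    if (settings.mode != "index" && settings.mode != "extract" &&
        settings.mode != "stats")
        collector.addError("BUG: Unknown mode '" + settings.mode +
                           "' reached validation.");
    return collector;
}
```

A caller would run `validate` once, print every collected message if `hasErrors()` is true, and only then start the requested mode.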
We use CLion to develop FastqIndEx. CLion has clang-tidy support built in, which we use to eliminate well known issues and problems.
We use Valgrind to check for memory leaks and memory errors, which can e.g. lead to SIGSEGV or SIGABRT.
When it comes to safely accessing the index file, we want the following:
- Writing to the index is exclusive. No other reader or writer is allowed at the same time.
- Several readers may read from an index file at the same time, as long as no writer is active.
As we work with network file systems, we need to deal with several problems:
- flock only works correctly with NFS on newer Linux kernels.
- However, we cannot guarantee that a file which was not locked will not be overwritten during our read.
To overcome these problems:
- We can use flock for our writer. So writing will be safe, as long as no other process starts writing to the file. => Implemented
- We can write out a md5sum file for the index after it was created. => Not implemented
- We can check for an existing lockfile before we read from a file and abort early. => Implemented
- We can read in the md5sum file and calculate our md5sum during our read. => Not implemented
- We can also constantly perform sanity checks for changes in timestamp, file size or file existence. => Not implemented
- We can store md5sums for IndexEntries, e.g. for the first valid line(s) in each compressed block. => Not implemented
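The exclusive-writer and shared-reader rules map onto the `LOCK_EX` and `LOCK_SH` modes of `flock(2)`. The following is a minimal sketch under that assumption; the helper names `openLocked`, `lockForWriting`, `lockForReading` and `unlockAndClose` are hypothetical and do not describe how FastqIndEx implements its locking. As noted above, the behaviour on NFS depends on the kernel version.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Open a file and try to take an flock on it without blocking.
// Returns the file descriptor, or -1 if opening or locking failed.
int openLocked(const char *path, int openFlags, int lockType) {
    int fd = open(path, openFlags, 0644);
    if (fd < 0) return -1;
    if (flock(fd, lockType | LOCK_NB) != 0) {
        close(fd);  // Somebody else holds a conflicting lock.
        return -1;
    }
    return fd;
}

// Writers need the file for themselves: no other reader or writer.
int lockForWriting(const char *path) {
    return openLocked(path, O_RDWR | O_CREAT, LOCK_EX);
}

// Readers share the file with other readers, but not with a writer.
int lockForReading(const char *path) {
    return openLocked(path, O_RDONLY, LOCK_SH);
}

// Release the lock and close the descriptor.
void unlockAndClose(int fd) {
    if (fd < 0) return;
    flock(fd, LOCK_UN);
    close(fd);
}
```

With `LOCK_NB`, a process that cannot get the lock fails immediately instead of waiting, which fits the idea of aborting early when the index is in use.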
- AWS
- CMake
- CMake Tutorial 1
- CMake Tutorial 2
- Miniconda
- UnitTest++
- Valgrind
- zindex
- zran.c random gz file access example.