Each sequence read by the sequencer has two main components, in the following order:
- 16-letter non-random ID
- approximately 250 letters of the genetic sequence itself

Additionally, the sequencer generates quality information for each character of each component. We call all of this information together a read.
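To make the definition of a read concrete, here is a minimal sketch of one possible in-memory representation. The class name `Read`, its field names, and the use of integer quality scores are assumptions made for illustration; the description above does not specify how quality is encoded.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Read:
    """One read: an ID, a genetic sequence, and per-character quality for both.

    All names here are hypothetical; quality is assumed to be one integer
    score per character, with higher meaning more reliable.
    """
    uid: str                    # the 16-letter non-random ID
    uid_qual: Tuple[int, ...]   # one quality score per ID character
    seq: str                    # roughly 250 letters of genetic sequence
    seq_qual: Tuple[int, ...]   # one quality score per sequence character

    def __post_init__(self) -> None:
        # Every character of every component must have a quality score.
        if len(self.uid) != len(self.uid_qual):
            raise ValueError("uid and uid_qual must have the same length")
        if len(self.seq) != len(self.seq_qual):
            raise ValueError("seq and seq_qual must have the same length")
```

The length checks only enforce the "quality for each character of each component" property; they do not check that the ID is exactly 16 letters, since the sequencer can introduce errors in any component.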
Our algorithm processes these reads (UIDs, genetic sequence, quality information) to remove errors generated by the sequencer. Errors can occur in any component of a read. We start by matching the IDs of the reads to form groups (clusters) that share the same ID. If two or more reads have the same ID, we form a consensus from their genetic sequences. We generate the consensus by comparing each new sequence with the group's current reference sequence and, at each position, keeping the character with the higher quality. We are then left with two kinds of reads: consensus reads, each summarizing several reads, and singleton reads, whose ID matched no other read. Finally, we compare the singletons to the consensus groups using a methodology similar to the one above.
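The grouping and consensus steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: reads are represented as hypothetical `(uid, sequence, quality)` tuples, IDs are matched exactly, quality is an integer per character, sequences in a group are assumed to be aligned position by position (the merge stops at the shorter one), and ties keep the reference character. The final comparison of singletons against consensus groups is not shown, because the matching rule is not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical read layout: (uid, sequence, per-character quality scores).
ReadTuple = Tuple[str, str, List[int]]
SeqQual = Tuple[str, List[int]]


def merge_pair(ref_seq: str, ref_qual: List[int],
               new_seq: str, new_qual: List[int]) -> SeqQual:
    """Keep, at each position, the character with the higher quality.

    Assumes both sequences are aligned; zip() stops at the shorter one.
    On a tie the reference character is kept.
    """
    chars: List[str] = []
    quals: List[int] = []
    for ref_c, ref_q, new_c, new_q in zip(ref_seq, ref_qual, new_seq, new_qual):
        if new_q > ref_q:
            chars.append(new_c)
            quals.append(new_q)
        else:
            chars.append(ref_c)
            quals.append(ref_q)
    return "".join(chars), quals


def build_consensus(reads: List[ReadTuple]) -> Tuple[Dict[str, SeqQual], Dict[str, SeqQual]]:
    """Group reads by exact ID and return (consensus reads, singleton reads)."""
    groups: Dict[str, List[SeqQual]] = defaultdict(list)
    for uid, seq, qual in reads:
        groups[uid].append((seq, qual))

    consensus: Dict[str, SeqQual] = {}
    singletons: Dict[str, SeqQual] = {}
    for uid, members in groups.items():
        # The first read in a group serves as the initial reference.
        ref_seq, ref_qual = members[0]
        for new_seq, new_qual in members[1:]:
            ref_seq, ref_qual = merge_pair(ref_seq, ref_qual, new_seq, new_qual)
        target = singletons if len(members) == 1 else consensus
        target[uid] = (ref_seq, list(ref_qual))
    return consensus, singletons
```

Merging each new read into a running reference keeps memory use proportional to the number of groups rather than the number of reads, at the cost of making the result depend on tie-breaking order when qualities are equal.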