load, analyze, manipulate and save DNA sequences
The system interacts with the user through a CLI (Command Line Interface) that uses standard I/O. Through that interface, the user can load DNA sequences from files, analyze them, manipulate them (e.g., by extracting sequence slices or by modifying the sequence), and store modified sequences and reports. The commands are detailed in the following sections.
The command line interface allows interaction with the user. Through that
interface, the user can enter input and see the application's output. The prompt
of the CLI is usually > cmd >>> ; it might change when a special type of input is
required.
There are several groups of commands, which are detailed in the next chapters.
Sequences
The application's most important elements are sequences. These are DNA sequences
(composed of the four characters A, C, T and G). Each sequence that is held in the app's
memory has a name and a sequence number (its ID). When referring to a sequence in
commands, unless otherwise defined, it is possible to refer to it either by its ID or by its
name.
Reference to the sequence number is done using the hash character #.
For example:
#1 means sequence number 1 .
#107 means sequence number 107 .
Reference to the sequence name is done using the at character @.
For example:
@short-seq refers to the sequence named short-seq.
@dolly-dna refers to the sequence named dolly-dna .
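To illustrate the two reference forms, here is a minimal Python sketch of how a command argument could be resolved. The function name parse_seq_ref and the returned tuples are assumptions made for this example; they are not part of the application's specification.

```python
def parse_seq_ref(token):
    """Split a reference such as '#107' or '@dolly-dna' into (kind, value).

    Hypothetical helper: the names and return shape are illustrative only.
    """
    if token.startswith("#") and token[1:].isdecimal():
        return ("id", int(token[1:]))   # reference by sequence number
    if token.startswith("@") and len(token) > 1:
        return ("name", token[1:])      # reference by sequence name
    raise ValueError(f"not a sequence reference: {token!r}")
```

Anything that starts with neither # nor @ is rejected here; a real parser might instead treat it as a literal sequence, depending on the command.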
The application stores and reads DNA sequences from files. It will use raw DNA data,
that is, files that contain the four letters and nothing else.
The default file extension for that app is .rawdna .
Common CLI markings
● [argument] - Words starting with "[" and ending with "]" represent optional
arguments.
● <argument> - Words starting with "<" and ending with ">" represent required
arguments.
● arg1|arg2 - A pipe sign ("|") between words means that any one of them can
be used.
Example using the bash command cp:
cp [-r|-f] <source> [<source2> [<source3> ...]] <destination>
Means that:
● cp can be used with the flag -r or with the flag -f; both are optional.
● After the flags (if any) come the source files: at least one, but there
can be many.
● The destination must come last.
Legal examples:
cp x.cpp dna.cpp
cp -r project dnaProject
cp -f y.cpp dna.cpp
cp dna.cpp dna.h dnaProject
The following commands are used to generate new sequences:
> cmd >>> new <sequence> [@<sequence_name>]
Creates a new sequence, as described by the sequence that follows the command. If @<sequence_name> is given, it becomes the name of the new sequence. Otherwise, a default name is provided: seq1 (or seq2, seq3 and so on, if the name is already taken). The new sequence, its name and its number (internal ID, starting with 1) are printed. For example:
> cmd >>> new ATACTGCCTGAATAC @short_seq
will create that sequence; if this is the first sequence, it will be numbered "1" and the following will be printed:
[ 1 ] short_seq: ATACTGCCTGAATAC
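The default-name rule (seq1, then seq2, and so on) can be sketched as follows. This is only an illustration; default_name and the taken set of existing names are hypothetical and not defined by the application.

```python
from itertools import count


def default_name(taken, prefix="seq"):
    """Return the first of seq1, seq2, ... that is not in `taken`.

    `taken` is any container of names already in use (an assumption of this sketch).
    """
    for n in count(1):
        candidate = f"{prefix}{n}"
        if candidate not in taken:
            return candidate
```

With no sequences in memory this yields seq1; if seq1 and seq2 are taken, it yields seq3.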
> cmd >>> load <file_name> [@<sequence_name>]
Loads the sequence from the file, assigns it a number (ID) and, if no name was provided, a default name (based on the file name, possibly postfixed with a number if the name already exists), and prints it. For example:
> cmd >>> load my_dna_seq.rawdna
will load the sequence from the file my_dna_seq.rawdna and will print its assigned ID, its name and the sequence (no more than 40 chars; if the sequence is longer, it prints the first 32, then an ellipsis, and then the last three). So, a typical output might be:
[ 14 ] my_dna_seq: AACGTTTTTGAACACCAGTCAACAACTAGCCA...TTG
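The printing rule above (whole sequence up to 40 chars, otherwise the first 32, an ellipsis and the last three) might be implemented along these lines. The function name format_entry and its parameters are assumptions for illustration.

```python
def format_entry(seq_id, name, seq, max_len=40, head=32, tail=3):
    """Format one '[ id ] name: sequence' line, shortening long sequences.

    Hypothetical helper; the limits mirror the numbers given in the text.
    """
    if len(seq) > max_len:
        # Keep the first `head` letters and the last `tail` letters.
        seq = f"{seq[:head]}...{seq[-tail:]}"
    return f"[ {seq_id} ] {name}: {seq}"
```

A 32 + 3 + 3 layout gives 38 printed characters, which stays within the 40-char limit.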
> cmd >>> dup <seq> [@<new_seq_name>]
Duplicates the sequence. If a new name is not provided, then it will be based on the name of <seq>, suffixed by _1 (or _2, _3, ... if the name is already taken). For example:
> cmd >>> dup # 22
will result in
[ 23 ] conseq_1_1: ATACTGCCTGAATACAGCATAGCATTGCCT
The following commands manipulate existing sequences :
Their default behavior is to modify the source sequence in-place (that is, the
original ID and name of the sequence are left the same, only its content is
modified).
● If a colon : appears after the command's argument, then the original
sequence is left untouched, and a new sequence is generated with the
manipulation results.
● If an argument of the form @<new_seq_name> is provided after the colon, then
this is the name of the new sequence.
● Otherwise, if @@ is provided instead, then the name of the new sequence is automatically
generated by the app.
Each command might generate a different default name.
When a sequence is required as a source, either its ID ( #<seq_id> ) or its name ( @<seq_name> )
is acceptable, unless otherwise defined.
> cmd >>> slice <seq> <from_ind> <to_ind> [: [@<new_seq_name>|@@]]
Slices the sequence, so that it starts at <from_ind> (0-based index) and ends at <to_ind>
(inclusive).
If @<new_seq_name> is provided, the result is stored in a new sequence with that name.
If @@ is provided, the result is stored in a new sequence with an auto-generated name,
based on the name of the original sequence, with the suffix _s1 (or, if that name is
already taken, with the suffix _s2, and so on).
For example:
Assuming that the former short_seq 's ID is 1, the following command:
> cmd >>> slice # 1 4 9
will change the sequence to TGCCT (Letters at indices 4,5,6,7,8) and print the output:
[ 1 ] short_seq: TGCCT
If @@ had been provided to the same command, then sequence 1 would not have changed, and a new sequence would have been generated instead. A typical call, then, might look like:
> cmd >>> slice # 1 4 8 : @@
[ 19 ] short_seq_s1: TGCCT
> cmd >>> replace <seq> <index> <new_letter> [: [@<new_seq_name>|@@]]
Replaces the letter at the (0-based) index <index> of <seq> with <new_letter>.
If @<new_seq_name> is provided, the original sequence is left untouched and the
result is put in a newly created sequence with that name.
If @@ is provided, the name is based on the original sequence, with the suffix _r1 (or,
if that name already exists, _r2 and so on).
The command can take more than a single replacement. In that case, <seq> is
followed by more than one pair of <index> and <new_letter>.
For example:
> cmd >>> replace @short_seq_s1 0 A 3 A : @repl_seq
will result in the following output:
[ 20 ] repl_seq: AGCAT
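The multi-pair form of replace can be sketched in Python as shown here. The function replace_letters and its argument layout (a flat list of index and letter tokens, as typed on the command line) are assumptions made for illustration.

```python
def replace_letters(seq, args):
    """Apply '<index> <new_letter>' pairs to `seq` and return the new string.

    `args` is a flat token list such as ["0", "A", "3", "A"] (hypothetical layout).
    """
    if len(args) % 2 != 0:
        raise ValueError("expected <index> <new_letter> pairs")
    letters = list(seq)
    for index_token, letter in zip(args[0::2], args[1::2]):
        index = int(index_token)
        if not 0 <= index < len(letters):
            raise IndexError(f"index {index} is outside the sequence")
        if letter not in {"A", "C", "G", "T"}:
            raise ValueError(f"not a DNA letter: {letter!r}")
        letters[index] = letter  # 0-based replacement
    return "".join(letters)
```

Applied to TGCCT with the pairs 0 A and 3 A, this produces AGCAT, matching the example.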
> cmd >>> concat <seq_1> <seq_2> [: [@<new_seq_name>|@@]]
Concatenates <seq_2> at the end of <seq_1>.
If the colon ":" syntax is used, then both <seq_1> and <seq_2> are left untouched, and the
result is put in a new sequence, named as provided by @<new_seq_name>.
If the @@ syntax is used, the name is automatically generated from the name of <seq_1>, an
underscore, the name of <seq_2> and finally the suffix _c1 (or _c2, _c3, ..., if the
name is already taken).
For example:
The command:
> cmd >>> concat # 19 @repl_seq : @@
will create a new sequence and will print:
[ 21 ] short_seq_s1_repl_seq_c1: TGCCTAGCAT
Concatenation is also possible for more than two sequences. In this case, the default name for the new sequence, if : @@ is provided, will be conseq_1 (or conseq_2 and so on, if that name is already taken). For example, the following command:
> cmd >>> concat # 1 # 20 # 20 # 19 : @@
will result in the following:
[ 22 ] conseq_1: ATACTGCCTGAATACAGCATAGCATTGCCT
If no "colon argument" is used, then the command modifies only the first sequence (e.g., in the last example - #1) and the other sequences are left untouched.
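The two default-naming rules for concat (two sources versus more than two) can be summarized in a short sketch. The function concat_default_name and the taken container are hypothetical names chosen for this example.

```python
def concat_default_name(source_names, taken):
    """Pick the automatic name for a concat result created with ': @@'.

    Two sources give '<name1>_<name2>_cN'; more than two give 'conseq_N'.
    `taken` holds the names already in use (an assumption of this sketch).
    """
    if len(source_names) == 2:
        base = f"{source_names[0]}_{source_names[1]}_c"
    else:
        base = "conseq_"
    n = 1
    while f"{base}{n}" in taken:
        n += 1  # try the next numeric suffix until a free name is found
    return f"{base}{n}"


def concat(sequences):
    """Join the raw letters of the source sequences in order."""
    return "".join(sequences)
```

For the four-sequence example, the joined letters are ATACTGCCTGAATAC, AGCAT, AGCAT and TGCCT, in that order.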
This is a list of commands that manage existing sequences (without manipulating them).
> cmd >>> del <seq>
Deletes that sequence. Before deleting it, the user is asked to confirm the deletion. Confirmation is done by entering y or Y; entering n or N cancels the deletion. Any other input results in a message that asks the user again to confirm the deletion. Once confirmed, the sequence is deleted and a message is printed. Otherwise, a cancellation message is printed. So, a deletion scenario might look like:
> cmd >>> del # 23
Do you really want to delete conseq_1_1: ATACTGCCTGAATACAGCATAGCATTGCCT?
Please confirm by 'y' or 'Y' , or cancel by 'n' or 'N' .
> confirm >>> x
You have typed an invalid response. Please either confirm by 'y' / 'Y' , or
cancel by 'n' / 'N' .
> confirm >>> Y
Deleted: [ 23 ] conseq_1_1: ATACTGCCTGAATACAGCATAGCATTGCCT
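The confirmation dialog shown above could be driven by a loop like the following. This is a sketch only: the confirm function and its read_line / write_line callbacks are assumptions, chosen so that the loop can be exercised without a real terminal.

```python
def confirm(read_line, write_line):
    """Ask until the reply is y/Y (returns True) or n/N (returns False).

    `read_line(prompt)` and `write_line(text)` are hypothetical I/O callbacks;
    in an interactive session they could be `input` and `print`.
    """
    write_line("Please confirm by 'y' or 'Y', or cancel by 'n' or 'N'.")
    while True:
        reply = read_line("> confirm >>> ").strip()
        if reply in ("y", "Y"):
            return True
        if reply in ("n", "N"):
            return False
        # Any other input repeats the request.
        write_line("You have typed an invalid response. Please either confirm "
                   "by 'y' / 'Y', or cancel by 'n' / 'N'.")
```

Calling confirm(input, print) would reproduce the interactive exchange; the caller then deletes the sequence or prints a cancellation message depending on the result.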
> cmd >>> save <seq> [<filename>]
Saves the sequence to a file.
If <filename> is not provided, the sequence name is used.
The filename is suffixed by .rawdna .
> cmd >>> len <seq_id>
Prints the length of the sequence. For example, if sequence #34 is AAATGTGATG, then it will look like this:
> cmd >>> len # 34
10
The find command finds a sub-sequence within a sequence. It has two flavors:
- Takes an expressed sub-sequence:
> cmd >>> find <seq> <expressed_sub_seq>
returns the (0-based) index of the first appearance of <expressed_sub_seq> in
the sequence .
Thus, for example:
If sequence #11 is AACCTTGGAATTCCGGAA and we are looking for the sub-sequence GG, it will look like:
> cmd >>> find # 11 GG
6
- Refers to an existing sub-sequence:
> cmd >>> find <seq_to_find_in> <seq_to_be_found>
Thus, for example: If seq #11 is as appears above, and sequence #25 is CTTGGA , it might look like:
> cmd >>> find # 11 # 25
3
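The two flavors of find differ only in how the second argument is turned into letters. A possible sketch, assuming 0-based indices as the description states and using hypothetical lookup tables by_id and by_name for the sequences held in memory:

```python
def resolve_sub_sequence(arg, by_id, by_name):
    """Return the letters meant by find's second argument.

    '#<id>' and '@<name>' refer to stored sequences; anything else is taken
    as an expressed sub-sequence. Both dictionaries are assumptions of this sketch.
    """
    if arg.startswith("#"):
        return by_id[int(arg[1:])]
    if arg.startswith("@"):
        return by_name[arg[1:]]
    return arg


def find(seq, arg, by_id, by_name):
    """0-based index of the first appearance, or -1 if there is none."""
    return seq.find(resolve_sub_sequence(arg, by_id, by_name))
```

Returning -1 for a missing sub-sequence is a choice made here (it is what Python's str.find does); the text does not say what the application prints in that case.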
count works in a similar way to find, except that it returns the number of instances of the
sub-sequence within the larger sequence.
Like find, it has two flavors: one that takes an expressed sub-sequence, and one
that refers to an existing sub-sequence:
> cmd >>> count <seq> <expressed_sub_seq>
> cmd >>> count <seq_to_find_in> <seq_to_be_found>
> cmd >>> findall <seq> <expressed_sub_seq>
> cmd >>> findall <seq_to_find_in> <seq_to_be_found>
These work very similarly to find, except that they return all the indices where the sub-sequence
appears.
Thus, for example:
Using the above sequence for sequence #11, it might look like:
> cmd >>> findall # 11 GA
7 15
> cmd >>> findall # 11 AA
0 8 16
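A minimal sketch of findall and count follows, assuming 0-based indices (as stated for find) and assuming that overlapping appearances are reported separately; the text does not settle the overlapping case, so that part is a guess.

```python
def findall(seq, sub):
    """Return every 0-based index at which `sub` appears in `seq`."""
    if not sub:
        raise ValueError("the sub-sequence must not be empty")
    indices = []
    start = seq.find(sub)
    while start != -1:
        indices.append(start)
        # Advance by one letter so that overlapping matches are kept (assumption).
        start = seq.find(sub, start + 1)
    return indices


def count(seq, sub):
    """Number of appearances, counted the same way as findall."""
    return len(findall(seq, sub))
```

If overlapping matches should not be counted, advancing by len(sub) instead of 1 would give the non-overlapping variant.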
> cmd >>> list
shows all the sequences in the system, by order.
For each sequence, it shows its number, its name and the sequence itself (up to 40
chars).
A sequence that is connected to a file (that is, it was either loaded from a file or
saved to a file at least once) and was not manipulated since the last save is prefixed by a
- sign.
A sequence that was manipulated after the last save is prefixed by a * sign.
A sequence that is not connected to any file yet is prefixed by an o sign.
For example, a typical list output might be:
> cmd >>> list
- [ 1 ] short_seq: ATACTGCCTGAATAC
- [ 14 ] my_dna_seq: AACGTTTTTGAACACCAGTCAACAACTAGCCA...TTG
- [ 21 ] short_seq_s1_repl_seq_c1: TGCCTAGCAT
- [ 22 ] conseq_1: ATACTGCCTGAATACAGCATAGCATTGCCT
o [ 24 ] short_seq_s1_repl_seq_c1_p1: ACGGATCGTA
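The three status prefixes can be modeled as a small enumeration. The sketch below is illustrative only: the names Status, after_manipulation and list_line are hypothetical, and the rule that a manipulated but never-saved sequence stays "new" is one reading of the text rather than a stated requirement.

```python
from enum import Enum


class Status(Enum):
    UP_TO_DATE = "-"   # connected to a file, unchanged since the last save
    MODIFIED = "*"     # connected to a file, changed since the last save
    NEW = "o"          # not connected to any file yet


def after_manipulation(status):
    """Status after an in-place change (assumed: NEW stays NEW)."""
    return Status.NEW if status is Status.NEW else Status.MODIFIED


def list_line(status, seq_id, name, shown_seq):
    """One line of the list output, prefixed by the status sign."""
    return f"{status.value} [ {seq_id} ] {name}: {shown_seq}"
```

Saving or loading would set the status to Status.UP_TO_DATE in this model.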
> cmd >>> show <seq> [<num_chars>]
Shows the sequence: its ID, its name, its status and the sequence itself. The status is either up to date (not changed since the last save), modified (changed since the last save) or new (not yet connected to a file). The ID, name and status are printed on the first line. The sequence itself is printed on the following lines, no more than 99 chars per line. If <num_chars> is provided, it is the number of chars to print (when the sequence is longer than that). Otherwise, <num_chars> defaults to 99. Thus, it might look like:
> cmd >>> show # 7 204
[ 7 ] lab_test_seq_feb_2015 modified
CCGTGCCTAGCATACGGATCGTATGCCTAGCATACTAGCATCCGTGCCTAGCATACGGATCGTATGCCTAGCATACTAGCATCCGTGCCTAGCATACGG
CGTATGCCTAGCATCCCGGATCGTATACGGATCGTAGTGCCTAGCATACGGAGCCTAGCATACTAGCATCCGTGCCTAGCATTACTAGCATCCGTGCCT
AGCATA
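The line-wrapping rule of show (at most <num_chars> letters in total, at most 99 per line) might look like this in Python. The function name show_body and its parameters are assumptions for this sketch.

```python
def show_body(seq, num_chars=99, width=99):
    """Return the sequence lines printed by show, below the header line.

    At most `num_chars` letters are shown, split into lines of at most
    `width` letters each (hypothetical helper mirroring the text).
    """
    shown = seq[:num_chars]
    return [shown[i:i + width] for i in range(0, len(shown), width)]
```

With num_chars set to 204, a long sequence is printed as two lines of 99 letters followed by one line of 6.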
> cmd >>> quit
Prints a goodbye message and exits the application.
If not all the sequences are up to date, it first asks for confirmation.
So, for example, it might look like:
> cmd >>> quit
There are 3 modified and 2 new sequences. Are you sure you want to quit?
Please confirm by 'y' or 'Y' , or cancel by 'n' or 'N' .
> confirm >>> qwerty
You have typed an invalid response. Please either confirm by 'y' / 'Y' , or
cancel by 'n' / 'N' .
> confirm >>> Y
Thank you for using Dnalanyzer.
Goodbye!
> cmd >>> batch <batch_name>
Batch mode allows the user to define a series of actions that will take place one after
another.
To define a batch, the user enters the command batch, followed by the name
of the new batch. The CLI then enters batch mode, where commands are not
executed immediately but are added to the batch instead.
The command end ends the batch mode.
For example:
> cmd >>> batch my_batch
> batch >>> load basil_dna.rawdna @basil
> batch >>> pair basil @basil_pair
> batch >>> find ## TGATTCTC : @start_slice
> batch >>> find ## TTTTAAAATTTTCCCC
> batch >>> calc __ + 4
> batch >>> slice @basil_pair @start_slice __ @basil_interesting_part
> batch >>> save ##
> batch >>> end
> cmd >>>
This batch (when run) will load a sequence from the file basil_dna.rawdna and will
keep it under the name basil.
Then, it will create its pair and keep it under the name basil_pair.
After that, it will slice it from the first index of the sub-sequence TGATTCTC to four
nucleotides after the first index of the sub-sequence TTTTAAAATTTTCCCC (that is, it
will include the first four T of that sequence). That slice will be kept under the name
basil_interesting_part and then will be saved to disk using that name (with the
.rawdna suffix).
When the batch mode ends, the batch is added to the list of active batches; nothing is
executed yet.
The command run <@batchname> runs a batch, that is, executes it as if the commands were entered manually.
The command batchlist shows a list of all the batch names.
The command batchshow <@batch_name> shows the content of that batch.
Saving a batch is done using the command batchsave, followed by the filename. If the filename is omitted, then the batch name is used as the filename, with the suffix .dnabatch. The batch is saved exactly as it is written in the CLI (without the prompt, of course). Thus, for example, the above script will be saved as:
load basil_dna.rawdna @basil
pair basil @basil_pair
find ## TGATTCTC : @start_slice
find ## TTTTAAAATTTTCCCC
calc __ + 4
slice @basil_pair @start_slice __ @basil_interesting_part
save ##
Loading a batch is done using the command batchload, followed by the filename to be loaded. The loaded batch is named after the file (without the .dnabatch suffix, if present). If the command is followed by : @<batch_name>, then it will be kept as batch_name.
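The file-naming rules for saving and loading batches can be sketched as below. All three function names are hypothetical. The sketch also assumes that an explicitly given filename is used as typed, without adding the suffix, which the text does not state.

```python
import os

SUFFIX = ".dnabatch"


def batch_save_target(batch_name, filename=None):
    """File to write: the given filename, else '<batch_name>.dnabatch'."""
    return filename if filename else batch_name + SUFFIX


def batch_file_text(commands):
    """File content: one command per line, without the CLI prompt."""
    return "".join(command + "\n" for command in commands)


def loaded_batch_name(filename, explicit_name=None):
    """Batch name after loading: the explicit name, else the file name
    without its directory and without the .dnabatch suffix (if present)."""
    if explicit_name:
        return explicit_name
    base = os.path.basename(filename)
    return base[:-len(SUFFIX)] if base.endswith(SUFFIX) else base
```

Under these assumptions, saving my_batch with no filename writes my_batch.dnabatch, and loading that file back yields a batch named my_batch.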