"""program "PAW_protein_grouper.py"
Produces quantitative protein and peptide results tables.
This is a protein grouping step in the PAW analysis pipeline for Comet
searches of large data sets where spectral count or TMT quantitation is being
performed. The program combines similar proteins into "families" (homology
grouping), redetermines shared and unique peptide status, and re-splits
shared peptide counts.
Written by Phil Wilmarth, 2011-2017, OHSU.
The MIT License (MIT)
Copyright (c) 2017 Phillip A. Wilmarth and OHSU
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
Direct questions to:
Technology & Research Collaborations, Oregon Health & Science University,
Ph: 503-494-8200, FAX: 503-494-4729, Email: techmgmt@ohsu.edu.
"""
# global imports statements
import os
import sys
#
VERSION = 'v1.0.4' # updated 8/30/2017
# updated for Python 3, 8/29/17 -PW
"""More extensive documentation added April, 2016 -PW
Protein and peptide summary files from the PAW pipeline are read in for one or more
biological samples in a proteomics experiment.
Grouping involves testing proteins for similarity. Comparisons are always done between pairs of
proteins. Peptide sequence sets are constructed for each protein from its associated peptides.
Peptide sequences are "mass spec" sequences where "I" and "L" residues are replaced with a "j".
When comparing two proteins' peptide sets, three sets are determined: the intersection (the
set of peptides the two proteins share), and the left and right difference sets (the peptides
that are unique to each protein, respectively). A quantitative value is computed for each set.
These values are the experiment-wide total spectral count of all the peptides contained within
each set.
Extensions to the basic parsimony concepts of redundant proteins and subset proteins are tested
first. Instead of exactly identical peptide sets for redundant proteins and true formal set
subsets for subsets, we define "nearly" identical and "almost" a subset using the "pseudo"
parameter. This is based on the long-standing minimum protein identification requirement of
two peptides per protein.
"""
# variable parameters to control grouping
VERBOSE = True # controls how much information is printed to console (global parameter)
PSEUDO = 2.0 # how much independent total count evidence is needed for pseudo-redundants, -subsets
LOW = 2.5 # true unique total count evidence threshold used in combination test
SHARE = 40.0 # absolute total shared count threshold used in combination test
MASS = 20.0 # relative fraction of shared-to-unique counts threshold used in combination test
#
IL = True # make I and L indistinguishable (True)
diffZ = True # treat different charge states as distinct (True)
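# A minimal sketch of the flag effects (mirrors get_peptide_sets_counts below;
# the example peptide is hypothetical):
# import re
# seq = 'ELVISLIVES' + '.2'         # diffZ=True appends the charge state
# seq = re.sub(r'[IL]', 'j', seq)   # IL=True -> 'EjVjSjjVES.2'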
def get_file(default_location, extension_list, title_string=""):
"""Dialog box to browse to a file. Returns full file name.
Usage: full_file_name = get_file(default_location, extension_list, [title]),
where "default_location" is a starting folder location,
extension_list is a list of (label, pattern) tuples,
e.g. extension_list = [('Text files', '*.txt')],
"title" is an optional message to list in the dialog box, and
"full_file_name" is the complete path name of the selected file.
Written by Phil Wilmarth, OHSU, 2008.
"""
import tkinter
from tkinter import filedialog
# set up GUI elements
root = tkinter.Tk()
root.withdraw()
# set default title string if not passed
if title_string == "":
title_string = 'Please select a single FILE'
# create dialog box and return file selection
root.update()
return filedialog.askopenfilename(parent=root, initialdir=default_location,
filetypes=extension_list, title=title_string)
# end
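# Example call (a sketch; the file-type list and title are illustrative):
# summary_file = get_file(os.getcwd(), [('Text files', '*.txt')],
#                         'Select a protein_summary file')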
def time_stamp_logfile(message, file_obj):
"""Prints "message" and time stamp to a log file.
Written by Phil Wilmarth, OHSU, 2009.
"""
import time
print('%s on %s' % (message, time.ctime()), file=file_obj)
return
class ProteinTuplet:
"""Object to hold protein information from the protein_summary
file rows. This is a simple data container object. No methods.
All data from each row are collected so results table can be
very similar to the input table.
Redundant proteins occupy multiple rows in the standard PAW
summary file so that accessions and descriptions are easier
to read. Here we need single protein objects that correspond
to their respective sets of peptides. We will not create separate
protein objects for redundant proteins. We will make an object for
the first member and add some information about any additional
redundant members to that object. This data structure will also
hold information about the larger protein families that result
from the protein grouping tests. Since "protein groups" already
have a defined meaning in proteomics, we will use the term
"families". It is really more specific than that since we are
really talking about very similar family members such as triplets.
That is why the object is called a ProteinTuplet instead of a
ProteinGroup.
written by Phil Wilmarth, 2011, OHSU.
"""
def __init__(self):
"""Basic constructor, no parameters. Attributes are
listed below along with descriptions.
"""
self.number = '0' # ProtGroup (group number string)
self.counter = '' # Counter column of ones
self.name = '' # Accession (or derived from)
self.redundants = [] # List of other group/redundant member accessions
self.filter = '' # Filter (flag column)
self.coverage = '' # Coverage (sequence coverage)
self.length = '' # SeqLength (sequence length)
self.mw = '' # MW (calculated MW)
self.description = '' # Description (protein description string)
self.tot_count = '' # CountsTot column
self.tot_unique = '' # UniqueTot column
self.uniqfrac = '' # UniqFrac column
self.totals = [] # total count list (strings)
self.uniques = [] # unique count list (strings)
self.correcteds = [] # corrected count list (strings)
self.otherloci = '' # OtherLoci (list of accessions)
self.all_loci = set() # expanded set of other loci
self.new_totcounts = [] # new total count list (list is number of samples long)
self.new_uniqcounts = [] # new unique count list
self.new_corrcounts = [] # new corrected count list
self.done = False # set to True if OtherLoci combining has been completed
self.keep = True # set to False if protein has been combined with another
self.combined = [] # constituent proteins for combinations
return
# end ProteinTuplet class
class ProteinSummary:
"""Container for proteins and protein tuplets (indistinguishable by peptides).
In addition to the traditional definition of redundant proteins, protein
tuplets may include pseudo-redundant protein groupings, pseudo-subset protein
groupings, and highly similar protein groupings.
Several summary mappings (dictionaries) are also created to facilitate
processing.
Methods:
load_results: reads in a protein results file (with redundant protein
rollup).
get_samples: parses the sample names from column headers.
parse_row: parses the fields from each row of the results file.
fix_otherloci: updates OtherLoci to include all proteins having any shared peptides.
print_summary: prints summary table to new results file
written by Phil Wilmarth, OHSU, 2011
"""
def __init__(self):
"""Basic constructor, no parameters. Attributes and definitions are listed.
"""
self.samples = [] # list of sample names (from Total count column headings)
self.tuplets = [] # list of protein results objects (ProteinTuplets)
self.col_map = {} # map of column headings to column position
self.header = '' # save a copy of the header line
self.acc_map = {} # key is accession (str), value is tuplet index (int)
self.acc_to_desc = {} # dictionary of accessions and associated descriptions (for reports)
self.tuplet_map = {} # key is group number (str), value is accession (str)
self.row_count = 0 # count of number of protein rows read in
self.protein_count = 0 # number of protein tuplets
self.distinguishable_count = 0 # number of proteins having some distinguishable peptides
self.indistinguishable_count = 0 # number of protein groups having indistinguishable peptides
return
def load_results(self, file_name):
"""Read protein results file and parse main table row fields
into ProteinTuplet objects. Protein_summary files have some lines
before the main table and some short lines after the main table.
Lines are skipped until the header line is found. The length of the
header line (after splitting) is used to determine what lines are
in the main table body (the last column is optional in the table so
we have to adjust the length). The header line is split into its
column headers and the column header text is used to index the column
position using a "col_map" dictionary (saved as an attribute).
"""
try:
open_file = open(file_name, 'r')
except IOError:
print('...WARNING: could not open protein results file.')
raise # cannot continue; "open_file" would be unbound below
#
columns = 10000 # so that lines before the header get skipped
for line in open_file:
line = '\t'.join([x.strip() if x.strip() else ' ' for x in line.split('\t')]) # remove EOL and clean up headers
col = line.split('\t') # split into column headers
if col[0] == 'ProtGroup' and col[1] == 'Counter': # look for header line
self.header = line # save a copy of the header line
columns = len(col) - 2 # minimum length of actual protein rows
for i, col_header in enumerate(col):
self.col_map[col_header] = i # map column headings to positions
self.get_samples(col) # get the sample names
elif len(col) < columns: # skip any short lines (before or after the protein table)
continue
else:
self.parse_row(col) # parse the protein row
#
open_file.close()
"""We "walk" the OtherLoci proteins to determine the full set of all
proteins that are linked via any shared peptides. Proteins removed as
subsets during previous PAW processing may still be present in OtherLoci
lists but missing from the results table. We remove any references to those.
"""
self.fix_otherloci() # create full sets of OtherLoci and remove any leftovers
self.protein_count = len(self.tuplets)
for prot_tuplet in self.tuplets:
if len(prot_tuplet.redundants) > 1:
self.indistinguishable_count += 1
else:
self.distinguishable_count += 1
return
def get_samples(self, col):
"""Parses sample names from header line.
Counts from the samples are in three repeating blocks of columns:
Total counts, unique counts, and corrected counts (fractional
counting of shared peptides, split based on other protein information).
The columns start after the "UniqFrac" column and end before the
"OtherLoci" column. The type of counts are labeled in the row above
the main header row. The column headers in the main header row are
just the sample names.
"""
# get the sample names
start = self.col_map['UniqFrac'] + 1
end = int(start + (self.col_map['OtherLoci'] - start) / 3)
for sample in col[start:end]:
self.samples.append(sample)
return
def parse_row(self, col):
"""Parses protein results rows with redundant protein rollup.
col is the list result after splitting the row line on tab character.
"col_map" is the dictionary to get index into "col" from column header
text string (more readable and independent of column order). As we combine
redundant proteins and/or create groups, we keep one protein description
and make lists of additional protein accessions. We will make a bigger
dictionary to remember all the protein descriptions so we can make reports
at the end. We add a new data object when we get a new group number,
otherwise we "rollup" additional redundant protein information into previous
data object.
"""
# add accession (key) and description (value) to acc_to_desc dictionary
self.acc_to_desc[col[self.col_map['Accession']]] = col[self.col_map['Description']]
# see if group number is new (not in dictionary)
group = col[self.col_map['ProtGroup']].split('.')[0] # NOTE: group is a string not an integer
if group not in self.tuplet_map:
# add dictionary entry and instantiate new ProteinTuplet object (for group)
self.tuplet_map[group] = col[self.col_map['Accession']]
self.acc_map[col[self.col_map['Accession']]] = len(self.tuplets) # add accession to dictionary
new = ProteinTuplet()
# set tuplet object attributes
new.number = group # NOTE: a string not an integer
new.counter = col[self.col_map['Counter']]
new.name = col[self.col_map['Accession']]
new.redundants.append(new.name) # add self to list of accessions
new.filter = col[self.col_map['Filter']]
new.coverage = col[self.col_map['Coverage']]
new.length = col[self.col_map['SeqLength']]
new.mw = col[self.col_map['MW']]
new.description = col[self.col_map['Description']]
new.tot_count = col[self.col_map['CountsTot']]
new.tot_unique = col[self.col_map['UniqueTot']]
new.uniqfrac = col[self.col_map['UniqFrac']]
# save spectral count numbers (no need to fix any non-grouped proteins)
start = self.col_map['UniqFrac'] + 1
samples = len(self.samples)
new.totals = col[start: start+samples] # saved as strings
new.uniques = col[start+samples: start+2*samples] # saved as strings
new.correcteds = col[start+2*samples: start+3*samples] # saved as strings
try:
new.otherloci = col[self.col_map['OtherLoci']] # rows w/o OtherLoci shorter so catch index error
if new.otherloci == ' ':
new.otherloci = ''
except IndexError:
pass
# add to list of tuplets (groups)
self.tuplets.append(new)
else:
# redundant row so update name and add to the redundant members list of last tuplet
tuplet = self.tuplets[-1]
i = len(tuplet.redundants) # number of additional redundant proteins so far
if i > 1:
last_suffix = '(+%s)' % (i-1,)
next_suffix = '(+%s)' % (i,)
tuplet.name = tuplet.name.replace(last_suffix, next_suffix) # save name with updated number
else:
tuplet.name = tuplet.name + ' (+1)'
tuplet.redundants.append(col[self.col_map['Accession']]) # add to accession list
self.row_count += 1
return
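# Example of the redundant rollup naming above (hypothetical accession): the
# first redundant row renames 'P12345' to 'P12345 (+1)', a second row renames
# it to 'P12345 (+2)', and every member accession accumulates in tuplet.redundants.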
def fix_otherloci(self):
"""Fixes accessions in OtherLoci that are not in results file.
Also builds more complete OtherLoci sets.
If A shares some peptides with B and B shares some peptides with both A and C,
then OtherLoci for A will be A, B. OtherLoci for B will be A, B, C. For group
testing we need to consider A, B, and C. So we want the OtherLoci for A to be
A, B, C. We visit all other proteins pointed to by OtherLoci and add loci. We
keep following OtherLoci until the set stops growing.
"""
# get rid of loci that did not make the protein report (because of Parsimony filter)
for i, prot in enumerate(self.tuplets):
if prot.otherloci == '': # skip unique proteins
continue
acc_list = [x.strip() for x in prot.otherloci.split(',')] # splits and removes whitespace
new_acc_list = []
for acc in acc_list:
acc = acc.replace('"','') # could have quotes enclosing cells (CSV-ish from Excel)
if acc in self.acc_map: # skip accessions that do not correspond to rows in results file
new_acc_list.append(acc)
if len(new_acc_list) > 1: # update OtherLoci strings with only relevant accessions
prot.otherloci = ', '.join(new_acc_list)
else:
prot.otherloci = ''
# build more comprehensive OtherLoci-like lists
for i, prot in enumerate(self.tuplets):
if prot.otherloci:
new = set([x.strip() for x in prot.otherloci.split(',') if x != '']) # starting set of other loci
previous = len(new) # starting total number of other loci
current = 0
while current != previous: # stop when set stops growing
old = set(new) # reset old for next iteration (set call makes copy)
previous = len(old)
for loci in old: # make union of all OtherLoci sets for loci in old
otherloci = set([x.strip() for x in
self.tuplets[self.acc_map[loci]].otherloci.split(',') if x != ''])
new.update(otherloci)
current = len(new) # length of (possibly) more complete set
prot.all_loci = set(new) # save the more complete set of other loci
return
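# The loop above computes a transitive closure over the shared-peptide links;
# a compact equivalent sketch (illustrative only, not called by this class,
# where "neighbors" maps each accession to its directly linked accessions):
# def transitive_loci(seed, neighbors):
#     new = set(seed)
#     while True:
#         grown = set(new)
#         for acc in new:
#             grown.update(neighbors.get(acc, set()))
#         if len(grown) == len(new):
#             return grown
#         new = grown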
def print_summary(self, fileobj=None, log_list=None):
"""Prints a new protein summary file.
We create a similar format to the original protein_summary file.
OtherLoci and grouping information gets moved to the beginning of the
rows instead of at the ends. Lists of accessions get concatenated into
strings.
"""
import time
# header stuff first
print('Program "PAW_protein_grouper.py" run on', time.asctime(), file=fileobj)
print(file=fileobj)
print(file=fileobj)
header = self.header.split('\t')
header.insert(self.col_map['Filter'], header[-1]) # add OtherLoci to before "Filter" col
header = header[:-1] # remove original OtherLoci header at end
header.insert(self.col_map['Filter'], 'Similar') # column for family groups
header.insert(self.col_map['Filter'], 'Identical') # column for fully-redundant proteins
pre_header = ['' for x in header] # row above main header
rows = len([x for x in self.tuplets if x.keep == True])
prefix = 5
pre_header[1] = '=SUBTOTAL(109,B%s:B%s)' % (prefix+1, rows+prefix) # make counter subtotal function
print('\t'.join(pre_header), file=fileobj)
print('\t'.join(header), file=fileobj)
# protein rows next (skip any rows not marked as keepers; these were added to family groups)
for p in self.tuplets:
if p.keep:
# most quantities from original protein_summary were saved as strings
row = [p.number, p.counter, p.name]
if len(p.redundants) > 1: # Identical (redundants) column
row.append('&'.join(p.redundants))
else:
row.append(' ')
if len(p.combined) > 0: # Similar (grouped families) column
row.append('&'.join(p.combined))
else:
row.append(' ')
if len(p.otherloci) > 0: # OtherLoci column (original not expanded?)
row.append(p.otherloci)
else:
row.append(' ')
row += [p.filter]
try:
row.append('%0.1f' % (float(p.coverage),)) # do some formatting if possible
except ValueError:
row.append(p.coverage)
row += [p.length, p.mw, p.description, p.tot_count, p.tot_unique]
row.append('%0.3f' % (float(p.uniqfrac),)) # do some formatting
row = row + p.totals + p.uniques + p.correcteds
print('\t'.join(row), file=fileobj)
# end ProteinSummary class
class PeptideSummary:
"""Container for Peptide Summary file contents.
Methods:
__init__: basic constructor, defined attributes
load_results: opens and parses peptide summary file
parse_row: parses peptide rows into IdentifiedPeptides objects
print_summary: prints a new peptide summary table
Each protein has one or more rows of peptide information in the
peptide_summary file. We will collect all rows for a given protein
accession as an IdentifiedPeptides object. We will store those
objects in a list, one item for each protein group/tuplet accession.
We will store the tuple indices in a dictionary keyed by accessions.
We will save a copy of the header line and make a dictionary to
access columns by column header text strings.
written by Phil Wilmarth, OHSU, 2011.
"""
def __init__(self, IL=True, diffZ=True):
"""Simple constructor, no parameters. Attributes and definitions:
"""
self.tuplets = [] # list of IdentifiedPeptides objects, one per protein tuplet
self.col_map = {} # maps column headings to column positions
self.acc_map = {} # key is accession (str), value is tuplets index (int)
self.line_count = 0 # counts number of lines read
self.header = '' # saves a copy of the header line
self.samples = [] # list of sample names
self.IL = IL # True makes I and L indistinguishable
self.diffZ = diffZ # True keeps different charge states as distinct sequences
return
def load_results(self, file_name):
"""Reads a peptide summary file to list and indexes accessions.
We allow for optional "short" lines before and after the main table,
whose width is determined by the length of the split header row.
"""
try:
open_file = open(file_name, 'r')
except IOError:
print('...WARNING: could not open peptide results file.')
raise # cannot continue; "open_file" would be unbound below
#
columns = 100000 # so that lines before the column headers get skipped
for line in open_file:
line = '\t'.join([x.strip() if x.strip() else ' ' for x in line.split('\t')]) # remove EOL and clean up empty cells
col = line.split('\t') # split into pieces
if col[0] == 'ProtGroup' and col[-1] == 'OtherLoci': # header line
self.header = line # so save
columns = len(col) - 1 # the length of data rows in the table (OtherLoci may be empty)
for i, col_header in enumerate(col): # map column headings to column positions
self.col_map[col_header] = i
beg = self.col_map['TotCount']+1
end = self.col_map['OtherLoci']
self.samples = col[beg:end] # get the sample names (between the above positions)
elif len(col) < columns: # skip any lines without data
continue
else:
self.parse_row(col, line)
return
def parse_row(self, col, line):
"""Saves peptide results rows into an IdentifiedPeptides Object (one object per protein).
"""
if col[self.col_map['Accession']] not in self.acc_map: # if new protein, add new object to tuplets
self.tuplets.append(IdentifiedPeptides(self.IL, self.diffZ))
self.tuplets[-1].group = col[self.col_map['ProtGroup']]
self.tuplets[-1].acc = col[self.col_map['Accession']]
self.acc_map[col[self.col_map['Accession']]] = len(self.tuplets)-1 # map accession to tuplet index
self.tuplets[-1].rows.append(line) # always append line to the last tuplet's row list
self.line_count += 1
return
def print_summary(self, fileobj=None, log_list=None):
"""Prints a new peptide summary file. Any proteins that have been combined are
flagged as non-keepers and skipped. Combined proteins add new tuplets with combined
peptide rows.
Write the header information, then the peptide rows, then the parameter settings.
"""
import time
print('Program "PAW_protein_grouper.py" run on', time.asctime(), file=fileobj)
print('\n\n', file=fileobj)
print(self.header, file=fileobj)
for prot in self.tuplets:
if prot.keep:
for row in prot.rows:
print(row, file=fileobj)
return
# end PeptideSummary class
class IdentifiedPeptides:
"""Raw container for each protein's identified peptides.
No significant parsing is done at initialization.
Methods:
get_peptide_sets_counts: makes peptide sequence sets and
dictionaries of counts associated with each peptide.
written by Phil Wilmarth, OHSU, 2011.
"""
def __init__(self, IL=True, diffZ=True):
"""Basic constructor, no parameters.
"""
self.group = '' # protein group number
self.acc = '' # protein accession string
self.keep = True # set to False if protein has been combined with another
self.rows = [] # rows of identified peptide information from the peptide summary file
self.IL = IL # flag to set I and L indistinguishable
self.diffZ = diffZ # flag to keep charge states distinguishable
def get_peptide_sets_counts(self, col_map, samples):
"""Returns a protein's master peptide set for the entire experiment.
IL flag replaces I/L with j, diffZ flag makes different peptide
charge states distinct.
Also returns a list of dictionaries of peptide:count for all
individual samples and the experiment-wide grand total.
"""
import re
pep_set = set()
pep_count = [{} for s in samples]
pep_count.append({}) # grand totals
sample_list = samples + ['TotCount']
for row in self.rows:
col = [x.strip() for x in row.split('\t')]
# get the sequence
seq = col[col_map['Sequence']].split('.')[1] # get peptide sequence without bounding residues
if self.diffZ:
seq = seq + '.' + col[col_map['Z']] # add charge state to end of peptide sequence
if self.IL:
seq = re.sub(r'[IL]', 'j', seq) # mask I and L with j (make peptide sequence mass spec-like)
pep_set.add(seq) # add to peptide set (only distinct sequences are needed)
# get the counts
for i, sample in enumerate(sample_list):
if col[col_map[sample]] == '':
count = 0
else:
try:
count = int(col[col_map[sample]])
except ValueError:
count = float(col[col_map[sample]])
pep_count[i][seq] = count # save count in dictionary
return pep_set, pep_count
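# Example of the returned values (hypothetical; two samples plus the
# 'TotCount' grand-total dictionary appended last, zero entries omitted):
# pep_set   -> {'EjVjSjjVES.2', 'PEPTjDE.3'}
# pep_count -> [{'EjVjSjjVES.2': 3}, {'EjVjSjjVES.2': 9, 'PEPTjDE.3': 1},
#               {'EjVjSjjVES.2': 12, 'PEPTjDE.3': 1}]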
def _snoop(self):
print('============ identified peptide contents ===============')
print('group number:', self.group)
print('accession:', self.acc)
print('keep flag:', self.keep)
print('number of rows:', len(self.rows))
# end IdentifiedPeptides class
class OtherLociCombiner:
"""Tests if any of the proteins having shared peptides are
similar enough that they should be combined.
Performs three tests: two are extensions of redundant protein
grouping and peptide subset removal, respectively. The third
looks for cases where the collective shared peptide counts between
two proteins are large compared to any unique peptide counts. The
test values are total spectral counts of peptides, not the smaller
peptide sequence counts. Noise peptide matches will tend to produce
small counts, while real peptide matches can be larger, particularly
for abundant proteins. Total counts should separate signal from noise
better than non-redundant peptide sequence counts.
Some other key points:
There is extensive use of sets and count dictionaries to keep track
of everything and calculate the quantities needed for the testing.
Only sets of proteins that share peptides with each other are tested.
There can be many such sets in typical experiments. The sets will
range in size but will typically be relatively small. Sizes will depend
on the proteins in the sample and the nature of the protein database.
Comparisons are all pair-wise, and shared and unique peptides are often
defined with respect to the pair being compared.
Comparisons are ordered from proteins with the greatest unique counts to
lowest. This helps prevent over-grouping due to lack of information.
Comparison tests are iterative. Test logic is driven by dynamic
definitions of shared and unique peptides, which change during
grouping. After any pair of proteins are grouped, everything is
recomputed and testing starts over. Testing continues until no
more proteins are grouped.
Protein grouping usually changes definitions of shared and
unique peptide status (we define things with respect to the set
of proteins identified in the sample rather than with respect
to the entire protein database). That changes how shared protein
spectral counts are split between proteins. Several things need to
be updated in the protein and peptide reports. There is quite a
bit of code to deal with these aspects.
Methods:
__init__: basic constructor
load_data: builds peptide sets and count dictionaries (for each protein, for each sample)
peptide_to_accession: makes acc lists for each peptide (all proteins that they map to)
make_unique_share_count_maps: makes peptide sequence to count dictionaries for subsets of peptides (unique or shared)
total_count_set: computes the grand total spectral count of an entire passed-in set
combine_proteins: performs the three tests and fixes up the results
test_redundants: tests for pseudo-redundant situations
test_subsets: tests for pseudo-subset situations
test_shares: tests if proteins are similar enough to group together
sort_accessions: orders proteins for testing from highest unique counts to lowest
protein_rollup: combines constituent proteins into new family groups
split_counts: recomputes corrected spectral counts
make_peptide_rows: makes new peptide_summary output for combined proteins
get_other_loci: recomputes the OtherLoci lists to reflect family grouping
trap_combined_accessions: handles compound accessions arising from the one-pass ("one and done") subsets test
sort_combine_list: splits apart compound accessions and sorts all accessions by counts
written by Phil Wilmarth, OHSU, 2011.
"""
def __init__(self, pseudo=2.0, low=2.5, share=40.0, mass=20.0, IL=True, diffZ=True):
"""Basic constructor: attributes and definitions listed here:
"""
self.pseudo = pseudo # min counts for pseudo-redundants and pseudo-subsets (experiment wide)
self.low = low # min ave unique threshold (per sample or per experiment?)
self.share = share # ave total count of shared peptides threshold (per sample or per experiment?)
self.mass = mass # ratio of shared to unique total counts threshold
self.IL = IL # True makes I and L indistinguishable
self.diffZ = diffZ # True makes different charge states distinct
self.number = 0 # passed-in numerical label for console printing (protein group number)
self.proteins = None # save pointer to proteins object list (from load_data method)
self.peptides = None # save pointer to peptides object list (from load_data method)
self.samples = [] # list of biological samples in the experiment
self.pep_to_acc = {} # map of peptide sequences to accession lists (unique peptides have lists with one acc)
self.acc_list = [] # list of accessions to be compared (proteins linked by any shared peptides)
self.sets = [] # list of peptide sets, one for each acc to be compared
# these are experiment-wide grand totals
self.tot_count = [] # list of peptide:total count dictionaries for each accession
self.tot_unique = [] # list of unique peptide:total count dictionaries for each accession
self.tot_share = [] # list of shared peptide:total count dictionaries for each accession
# these are counts per sample
self.tot_samples = [] # nested list peptide:sample count dicts for each acc, sample
self.uniq_samples = [] # nested list unique peptide:sample count dicts for each acc, sample
self.share_samples = [] # nested list shared peptide:sample count dicts for each acc, sample
self.history = '' # compound accession string capturing combining history
self.not_to_keep = [] # list of protein accessions that have been combined
self.log = [] # list of verbose log file lines
return
def load_data(self, acc_list, proteins, peptides):
"""Loads data structures for protein combining.
"""
# make sure there are proteins to compare
if len(acc_list) <= 1:
return False
# set some attributes
self.acc_list = acc_list
self.proteins = proteins
self.peptides = peptides
self.samples = peptides.samples
# get peptide set and count dictionary for each protein accession in acc_list
for acc in acc_list:
identified_peptides = peptides.tuplets[peptides.acc_map[acc]] # get IDed peptide info for "acc"
# NOTE: pep_set is created with defaults for IL (True) and diffZ (True)
"""Should test if diffZ does anything. It may well not make any difference..."""
pep_set, counts = identified_peptides.get_peptide_sets_counts(peptides.col_map, self.samples)
self.sets.append(pep_set) # save the peptide set for each accession
self.tot_count.append(counts[-1]) # last dictionary in list is the total counts
self.tot_samples.append(counts[:-1]) # other dictionaries are the per sample counts
# set flags to indicate that this protein has been tested
proteins.tuplets[proteins.acc_map[acc]].done = True
# build a master dictionary of all peptides and lists of accessions they map to
self.peptide_to_accession()
# make unique (to protein collection) and shared peptide count dictionaries
self.make_unique_share_count_maps()
return
def peptide_to_accession(self):
"""Builds a master list of peptide:acc_list for all proteins being compared.
Peptides with lists of length one are unique with respect to the set of proteins
being compared. Any peptides with longer lists are shared.
"""
for i, pep_set in enumerate(self.sets):
for pep in pep_set:
self.pep_to_acc.setdefault(pep, []).append(self.acc_list[i]) # create the list on first sight of pep
return
def make_unique_share_count_maps(self):
"""Makes unique peptides to counts and shared peptides to counts dictionaries.
Need grand total count (experiment-wide) and per sample count dictionaries.
"""
for i, acc in enumerate(self.acc_list):
# create the dictionaries and lists to hold the results, pack into respective lists
self.tot_unique.append({})
self.tot_share.append({})
self.uniq_samples.append([])
self.share_samples.append([])
for seq in self.sets[i]:
# add the peptide keys and counts to the correct dictionaries
if len(self.pep_to_acc[seq]) == 1: # unique peptides
self.tot_unique[i][seq] = self.tot_count[i][seq] # add total SpC for "unique" peptide
else: # shared peptides
self.tot_share[i][seq] = self.tot_count[i][seq] # add total SpC for "shared" peptide
# do nested per sample stuff next
for j in range(len(self.samples)):
# add the dictionaries to nested lists
self.uniq_samples[i].append({})
self.share_samples[i].append({})
# add the peptide keys and counts to the correct dictionaries
if len(self.pep_to_acc[seq]) == 1: # unique peptides (may not be in every sample)
self.uniq_samples[i][j][seq] = self.tot_samples[i][j].get(seq, 0)
else: # shared peptides (may not be present in every sample)
self.share_samples[i][j][seq] = self.tot_samples[i][j].get(seq, 0)
return
def total_count_set(self, pep_set, count_dict):
"""Returns total SpC for all peptides in pep_set using passed in count dictionary.
"""
total = 0
for seq in pep_set:
total += count_dict.get(seq, 0)
return total
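# e.g. total_count_set({'A.2', 'B.2'}, {'A.2': 5, 'C.3': 9}) returns 5
# (hypothetical keys; peptides missing from the dictionary contribute zero)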
def combine_proteins(self, number):
"""Tests for pseudo-redundants, pseudo-subsets, and finally siblings.
Each step iterates until the set of proteins is stable.
There is also considerable bookkeeping to create new collections
of information about new family groups for the subsequent reporting.
The first two tests are similar in concept. In large datasets, there can
be incorrect matches to correct proteins. It is possible that two fully
redundant proteins could have small numbers of matches that change
their identical peptide sets into "nearly" identical sets. We test for
pairs of proteins where there are mostly shared peptides with small
numbers of unique peptides. A similar situation can prevent a peptide
subset from being removed, where "nearly" all peptides are a subset of a
larger peptide set. The two situations require different test logic
but are conceptually more or less the same.
The final testing is more aggressive to find and combine proteins
with high homologies. Proteins in this category include many
housekeeping genes and well-established families (actins,
tubulins, heat shock, HLA, 14-3-3, keratins, etc.). These cases
become problematic with more complete protein databases but
are also present even in the most carefully curated databases.
"""
self.number = number # save the passed-in group number for console printing
self.log.append('\n%s. %s' % (number, ', '.join(self.acc_list)))
redo_split = False
"""The three methods are functionally similar to call. The protein
group number is passed in (for console printing). The other data
is kept in object attributes (accession lists and peptide sets).
What is returned are lists of accesssions to be combined by the
"protein_rollup" method. The attributes (eg. accession lists) are
updated during protein rollups. The first and third tests start
testing a set of accessions or compound accessions (intermediate
protein families), stop when a grouping has been determined, do the
grouping, which shortens the accession list attribute. Then the testing
is started over. Testing stops when all remaining accessions make it
through the tests and there is nothing to group together.
"""
# test for pseudo-redundants first
to_combine = self.test_redundants(number) # need to get the loop started
while len(to_combine) != 0:
for acc_list in to_combine: # we only get here if there are acc in to_combine
self.protein_rollup(acc_list, 'PsRedun')
redo_split = True
to_combine = self.test_redundants(number)
"""Subsets is probably the one to modifiy for Scaffold testing.
"""
# test for pseudo-subsets next
to_combine = self.test_subsets(number) # this is a one and done test
for acc_list in to_combine:
self.trap_combined_accessions(acc_list) # need this check...
self.protein_rollup(acc_list, 'PsSubset')
redo_split = True
# test for similar pairs with significant share overlap (do last)
to_combine = self.test_shares(number) # need to get the loop started
while len(to_combine) != 0:
for acc_list in to_combine: # we only get here if there are acc in to_combine
self.protein_rollup(acc_list, 'Share')
redo_split = True
to_combine = self.test_shares(number)
self.log.append('.....acc: %s' % (self.acc_list,))
self.log.append('.....history: %s' % (self.history,))
# any grouping changes shared and unique definitions and shared peptide splitting
if redo_split: # this is only True if something got combined or grouped
get_counts = self.total_count_set # make short nickname for the function call
for i, acc in enumerate(self.acc_list):
if '&' in acc:
# need to make new protein entries for the combined groups
new = ProteinTuplet() # create a new data container
primary = acc.split('&')[0] # use the first accession as a key
new.number = str(len(self.proteins.tuplets)+1) # it will go at the end of the tuplets list
new.counter = '1' # set counter column value
new.name = primary + '_family' # make a proper family accession
# new redundant protein list is concatenation of respective lists (if any)
redundant_list = []
for subacc in acc.split('&'):
redundant_list += self.proteins.tuplets[self.proteins.acc_map[subacc]].redundants
# several fields get values from the primary protein's data
new.filter = self.proteins.tuplets[self.proteins.acc_map[primary]].filter # filter
# coverage, length, MW, and description are taken from the primary protein's row
new.coverage = float(self.proteins.tuplets[self.proteins.acc_map[primary]].coverage) # coverage
new.length = self.proteins.tuplets[self.proteins.acc_map[primary]].length # length
new.mw = self.proteins.tuplets[self.proteins.acc_map[primary]].mw # molecular weight
new.description = self.proteins.tuplets[self.proteins.acc_map[primary]].description # description
# these are recomputed because the grouping changed definitions of shared and unique
new.tot_count = str(get_counts(self.sets[i], self.tot_count[i])) # Total counts
new.tot_unique = str(get_counts(self.sets[i], self.tot_unique[i])) # total unique counts
new.uniqfrac = '%0.3f' % (float(new.tot_unique)/float(new.tot_count)) # unique fraction
# recompute the counts per sample and add to their respective lists
for j in range(len(self.samples)):
new.totals.append(str(get_counts(self.sets[i], self.tot_samples[i][j])))
new.uniques.append(str(get_counts(self.sets[i], self.uniq_samples[i][j])))
new.correcteds.append(self.split_counts(i, j))
new.combined = acc.split('&') # make combined accession list
new.otherloci = ', '.join(self.get_other_loci(i)) # make new OtherLoci string
self.proteins.tuplets.append(new) # add data object to end of the protein tuplets list
# peptides, too
newpep = IdentifiedPeptides() # new data object
newpep.group = new.number # new group number
newpep.acc = new.name # new family accession
self.make_peptide_rows(acc, newpep) # recompute the rows (need to update shared/unique)
self.peptides.tuplets.append(newpep) # add to end of peptide tuplet list
else:
# and recompute corrected counts (per sample) for non-combined proteins
new_corr_counts = []
tuplet_index = self.proteins.acc_map[acc] # this will be among the original protein tuplets
for j in range(len(self.samples)):
new_corr_counts.append(self.split_counts(i, j)) # shared splitting done with internal data
self.proteins.tuplets[tuplet_index].correcteds = new_corr_counts # replace old corrected counts with new ones
return
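# Example of the family bookkeeping above (hypothetical accessions): if the
# tests merged 'P11111' and 'P22222', the compound accession 'P11111&P22222'
# yields a new tuplet named 'P11111_family' with combined = ['P11111', 'P22222'].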
def test_redundants(self, number):
"""Tests for pseudo-redundant proteins. Breaks after a redundant pair is found.
This will cause a combination of the pair. The pair's accessions and peptides
will be merged. This will change shared and unique peptide definitions of peptides
among the combined pair and the rest of the proteins in the test set. After recomputing
everything, we start the testing over again.
"""
to_combine = []
self.sort_accessions('descending') # order from most unique to fewest unique
# upper diagonal loop over decreasing count-sorted proteins
for i in range(len(self.acc_list)):
skip = False # use a flag so we can break out of loop after grouping
for j in range(i+1, len(self.acc_list)):
# compare pair-unique to pair-shared total counts ("left" protein is "i" and "right" protein is "j")
share_set = self.sets[i].intersection(self.sets[j]) # set of pair-wise shared peptides
share_count = self.total_count_set(share_set, self.tot_count[j]) # total count of shared peptides (can use either i or j)
unique_i_set = self.sets[i].difference(self.sets[j]) # set of "left" unique peptides
unique_i_count = self.total_count_set(unique_i_set, self.tot_count[i]) # total count of "left" uniques
unique_j_set = self.sets[j].difference(self.sets[i]) # set of "right" unique peptides
unique_j_count = self.total_count_set(unique_j_set, self.tot_count[j]) # total count of "right" uniques
"""left and right unique count totals have to both be less than or equal to
the PSEUDO parameter AND the total shared count has to be ten times bigger
than the largest unique total.
"""
if (unique_i_count <= self.pseudo and
share_count >= 10*max(unique_i_count, unique_j_count) and
unique_j_count <= self.pseudo):
#
to_combine.append((self.acc_list[i], self.acc_list[j])) # add the pair to the to_combine list
self.log.append('...%s pseudo-redundant with %s [%s (%s) %s]' %
(self.acc_list[i], self.acc_list[j], unique_i_count, share_count, unique_j_count))
skip = True
break # breaks out of the inner loop
if skip:
break # breaks out of the outer loop (we need to get out of both after grouping)
return to_combine
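# Worked example of the criterion above (hypothetical counts, PSEUDO = 2.0):
# unique totals (1, 2) with a shared total of 100 pass, since 1 <= 2.0,
# 2 <= 2.0, and 100 >= 10*max(1, 2); unique totals (1, 3) fail the PSEUDO test.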
def test_subsets(self, number):
"""Tests for psuedo-subset proteins. Subsets cascade when proteins are sorted
by counts so we can do them all in one pass and it is more efficient than
breaking after each pair to combine.
"""
to_combine = []
supersets = {} # this accumulates all test results (values are accession lists)
self.sort_accessions() # sort from smallest set to largest
# upper diagonal loop over increasing count-sorted proteins
for i in range(len(self.acc_list)):
subsets = {} # this accumulates one "row" (outer loop) of test results
for j in range(i+1, len(self.acc_list)):
# compare pair-unique to pair-shared total counts ("left" protein is "i" and "right" protein is "j")
share_set = self.sets[i].intersection(self.sets[j]) # set of shared peptides between pair
share_count = self.total_count_set(share_set, self.tot_count[j]) # total shared counts (could use either i or j)
unique_i_set = self.sets[i].difference(self.sets[j]) # "left" unique peptide set
unique_i_count = self.total_count_set(unique_i_set, self.tot_count[i]) # total counts of "left" uniques
unique_j_set = self.sets[j].difference(self.sets[i]) # "right" unique peptide set
unique_j_count = self.total_count_set(unique_j_set, self.tot_count[j]) # total counts of "right" uniques
"""Pseudo subset if unique total less than or equalt to PSEUDO and shared total
is ten times larger than unique (or 10 if unique is zero). The first test
eliminates (maybe) anything that would have been handled with the redundants test
above.
"""
## # should we test "j" size or not? Very small unique i or j may fail redundant test due to 10*max(i,j)
## # but have one pass the subset test here and get removed. I think no test of "j" more true to subset definition
## if (unique_j_count > self.pseudo and
## share_count >= 10*max(unique_i_count, 1) and
## unique_i_count <= self.pseudo):
if (share_count >= 10*max(unique_i_count, 1)) and (unique_i_count <= self.pseudo): # don't care about "right" uniques
subsets[self.acc_list[i]] = self.acc_list[j] # add to the test results dictionary
# the dict value will be the largest set in the "row" that contains "i"
self.log.append('...%s pseudo-subset of %s [%s (%s) %s]' %
(self.acc_list[i], self.acc_list[j], unique_i_count, share_count, unique_j_count))