PCTSEA guide for developers

This is a detailed description of how the PCTSEA analysis is implemented in Java.

This guide is intended to help future developers continue improving this software.

Flow chart:

See a flow chart here

Both the command-line and web versions depend on the pctsea-core module, where the PCTSEA.java class is defined and the analysis logic is implemented. The core of the analysis is the run() method, which is commented step by step so that it can be followed easily.

Beyond the comments in the code itself, the following notes may be useful:

To change the scoring methods, the developer should focus on the following method:

private int calculateScoresToRankSingleCells(
   List<SingleCell> singleCellList,
   GeneExpressionsRetriever interactorExpressions, 
   ScoringSchema scoringSchema, 
   boolean writeScoresFile,
   boolean outputToLog, 
   boolean getExpressionsUsedForScore, 
   boolean takeZerosForCorrelation,
   double minCorrelation) throws IOException

Depending on the ScoringMethod of the ScoringSchema, this method calculates a different score for each SingleCell. The single cells are then sorted by that score into the ranked list used by the Kolmogorov-Smirnov test that calculates the enrichment score.

Similarity score calculation per single cell:

Inside this method there is a switch clause that calls the appropriate method depending on the ScoringMethod:

switch (scoringMethod) {
   case PEARSONS_CORRELATION:
      singleCell.calculateCorrelation(interactorExpressions, getExpressionsUsedForScore, minCorrelation);
      break;
   case SIMPLE_SCORE:
      singleCell.calculateSimpleScore(interactorExpressions, getExpressionsUsedForScore, minCorrelation);
      break;
   case DOT_PRODUCT:
      singleCell.calculateDotProductScore(interactorExpressions, takeZerosForCorrelation, getExpressionsUsedForScore); 
      break;
   case REGRESSION:
      singleCell.calculateRegressionCoefficient(interactorExpressions, getExpressionsUsedForScore);
      break;
   default:
      throw new IllegalArgumentException("Method " + scoringMethod.getScoreName() + " still not supported.");
}

As you can see, the scores themselves are calculated inside each singleCell object.
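To illustrate the general idea of scoring and ranking, here is a minimal, self-contained sketch of a Pearson's correlation score followed by a ranking step. It does not reproduce the PCTSEA implementation: the class ScoreRankingSketch, the ScoredCell type, and the methods pearson and rank are hypothetical names introduced only for this example, and the real calculateCorrelation method takes additional parameters (such as minCorrelation) that are not modeled here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ScoreRankingSketch {

    /** Hypothetical stand-in for a scored single cell (not the PCTSEA SingleCell class). */
    static final class ScoredCell {
        final String name;
        final double score;

        ScoredCell(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    /** Pearson's correlation coefficient between two equally sized vectors. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Vectors must have the same length (>= 2)");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            final double dx = x[i] - meanX;
            final double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) {
            return Double.NaN; // correlation is undefined for a constant vector
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Returns the cell names sorted by descending score (best match first). */
    static List<String> rank(List<ScoredCell> cells) {
        final List<ScoredCell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparingDouble((ScoredCell c) -> c.score).reversed());
        final List<String> names = new ArrayList<>();
        for (final ScoredCell c : sorted) {
            names.add(c.name);
        }
        return names;
    }
}
```

In this sketch a cell whose expression vector rises and falls with the input protein abundances gets a score close to 1 and ends up at the top of the ranked list.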

Enrichment score calculation per cell type:

Once every single cell has a similarity score against the input protein list, we use the ranked list of single cells in a Kolmogorov-Smirnov test, following an approach similar to Gene Set Enrichment Analysis (GSEA). This is implemented in the method calculateEnrichmentScore, and the enrichment scores are stored in the CellTypeClassification objects.
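The running-sum idea behind a Kolmogorov-Smirnov-like enrichment score can be sketched as follows. This is a simplified, unweighted version written for illustration: it assumes every cell contributes equally, which may differ from what calculateEnrichmentScore actually does, and the class EnrichmentScoreSketch and its method enrichmentScore are hypothetical names.

```java
import java.util.List;

class EnrichmentScoreSketch {

    /**
     * Walks down the ranked list (best-scoring cell first). Each cell of the
     * given type raises the running sum by 1/hits, every other cell lowers it
     * by 1/misses. The result is the signed deviation from zero with the
     * largest absolute value.
     */
    static double enrichmentScore(List<String> rankedCellTypes, String cellType) {
        final long hits = rankedCellTypes.stream().filter(cellType::equals).count();
        final long misses = rankedCellTypes.size() - hits;
        if (hits == 0 || misses == 0) {
            return 0.0; // no contrast between the cell type and the rest
        }
        double running = 0.0;
        double best = 0.0;
        for (final String type : rankedCellTypes) {
            running += cellType.equals(type) ? 1.0 / hits : -1.0 / misses;
            if (Math.abs(running) > Math.abs(best)) {
                best = running;
            }
        }
        return best;
    }
}
```

With this sketch, a cell type whose cells are concentrated at the top of the ranking gets a score near 1, and one whose cells are spread evenly gets a score near 0.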

Enrichment score significance calculation per cell type:

Then, following the same principles described in the GSEA article, we calculate the significance of the enrichment scores by randomly permuting the cell types of the single cells and recalculating the enrichment scores until we have a distribution from which to calculate a p-value. This is implemented in the method calculateSignificanceByCellTypesPermutations, which, after permuting the cell types, calls the method calculateEnrichmentScore with the flag permutatedData=true. The p-value associated with the real enrichment score x of each cell type is then the number of random enrichment scores x' greater than or equal to x, divided by the total number of random enrichment scores obtained for that cell type.
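The two ingredients of this step, shuffling the cell type labels and turning the resulting null distribution into a p-value, can be sketched in a few lines. The class PermutationPValueSketch and its methods are hypothetical names used only for this illustration; the number of permutations and the random seed handling in PCTSEA are not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class PermutationPValueSketch {

    /** Fraction of random enrichment scores greater than or equal to the real one. */
    static double pValue(double realScore, double[] randomScores) {
        if (randomScores.length == 0) {
            throw new IllegalArgumentException("At least one random score is required");
        }
        int count = 0;
        for (final double random : randomScores) {
            if (random >= realScore) {
                count++;
            }
        }
        return (double) count / randomScores.length;
    }

    /**
     * Returns a shuffled copy of the cell type labels. The input list, and
     * therefore the ranking of the cells, is left untouched.
     */
    static List<String> permuteCellTypes(List<String> cellTypes, Random random) {
        final List<String> copy = new ArrayList<>(cellTypes);
        Collections.shuffle(copy, random);
        return copy;
    }
}
```

For example, a real score of 0.8 compared against the random scores 0.1, 0.5, 0.8 and 0.9 gives a p-value of 2/4 = 0.5, because two of the four random scores are greater than or equal to 0.8.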

Enrichment score False Discovery Rate per cell type:

Once we have a p-value per cell type, we calculate an FDR for each cell type using the real enrichment scores x_t of all cell types t and all the random enrichment scores x'_t of all cell types t. The FDR for a given cell type t is the number of random enrichment scores greater than or equal to x_t (snull) divided by the number of real enrichment scores greater than or equal to x_t (sobs). This ratio is then multiplied by a normalization factor, the total number of real scores divided by the total number of random scores, as in the following lines of code:

// nobs is the total number of real scores
// nnull is the total number of random scores
final int nobs = totalRealNormalizedScores.size();
final int nnull = totalRandomNormalizedScores.size();
fdr = (1.0 * snull / sobs) * (1.0 * nobs / nnull);

This is implemented at the end of the method calculateSignificanceByCellTypesPermutations.
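As a worked example of the formula above, the following sketch computes the FDR from two plain arrays of scores. The class FdrSketch and its methods are hypothetical; only the formula itself and the names sobs, snull, nobs and nnull come from the code shown above.

```java
class FdrSketch {

    /** Number of scores greater than or equal to the threshold. */
    static long countGreaterOrEqual(double[] scores, double threshold) {
        long count = 0;
        for (final double score : scores) {
            if (score >= threshold) {
                count++;
            }
        }
        return count;
    }

    /** FDR of one real score, given all real scores and all random scores. */
    static double fdr(double realScore, double[] allRealScores, double[] allRandomScores) {
        final long sobs = countGreaterOrEqual(allRealScores, realScore);
        final long snull = countGreaterOrEqual(allRandomScores, realScore);
        final int nobs = allRealScores.length;
        final int nnull = allRandomScores.length;
        if (sobs == 0 || nnull == 0) {
            throw new IllegalArgumentException(
                    "realScore must be one of the real scores and random scores must not be empty");
        }
        return (1.0 * snull / sobs) * (1.0 * nobs / nnull);
    }
}
```

With real scores {0.9, 0.7, 0.3}, random scores {0.95, 0.8, 0.6, 0.5, 0.2, 0.1} and x_t = 0.7, we get sobs = 2, snull = 2, nobs = 3 and nnull = 6, so the FDR is (2 / 2) * (3 / 6) = 0.5.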

How to insert a new dataset into the single-cell RNA-seq database:

This can be done in a separate script that reads the new dataset, creates the appropriate objects, and saves them into the MongoDB database.

Set up your code to access the MongoDB database:

To access the database, use the utility class MongoBaseService.java from the pctsea-core module.
In addition, you will need Spring dependency injection, through Spring annotations, to get an instance of MongoBaseService. See the code snippet below as an example:

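import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.data.mongo.AutoConfigureDataMongo;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.context.SpringBootTest.WebEnvironment;
import org.springframework.test.context.junit4.SpringRunner;
// plus the imports of the PCTSEA classes used below
// (MongoBaseService, DatasetMongoRepository, SingleCellMongoRepository)

@RunWith(SpringRunner.class)
@AutoConfigureDataMongo
@SpringBootTest(
        // we don't want a web environment for this test
        webEnvironment = WebEnvironment.NONE,
        properties = {
                "headles=false",
                // if necessary:
                "spring.config.location=classpath:/application-remoteTunnel.properties",
                "spring.jpa.hibernate.ddl-auto=create"
        }
)
public class NewDatasetCreation {
    @Autowired
    MongoBaseService mongoBaseService;

    // To access the database directly, without the methods in MongoBaseService,
    // you can inject any of the *MongoRepository classes in the package
    // 'edu.scripps.yates.pctsea.db' with @Autowired:
    @Autowired
    DatasetMongoRepository projectMongoRepo;
    @Autowired
    SingleCellMongoRepository singleCellMongoRepository;

    @Test
    public void datasetCreation() {
        // here will be the code explained below
    }
}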

Insert the new dataset object:

final Dataset dataset = new Dataset();
dataset.setTag("HCL"); // the dataset tag is the unique key used to refer to that dataset
dataset.setName("Construction of a human cell landscape at single-cell level");
dataset.setReference("https://doi.org/10.1038/s41586-020-2157-4");
// only save the dataset if no dataset with the same name exists yet
if (projectMongoRepo.findByName(dataset.getName()).isEmpty()) {
    projectMongoRepo.save(dataset);
}

Read the single cell expressions from the new dataset and create the singleCell objects:

// where we keep the SingleCell objects
List<SingleCell> singleCellList = new ArrayList<SingleCell>();

List<Expression> readSingleCellExpressions() {
   List<Expression> sces = new ArrayList<Expression>();
   // in practice this would be inside a loop that reads the files of the new dataset:
   String singleCellName = "single_cell_identifier_1234";
   String cellType = "neuron";
   String biomaterial = "brain";
   String datasetTag = dataset.getTag(); // we need to associate the single cell with the dataset

   // create the singleCell object (be careful: we don't want duplicated singleCells in the DB,
   // so check whether the singleCell was already created by querying by its unique name)
   SingleCell singleCelldb = new SingleCell(singleCellName, cellType, biomaterial, datasetTag);
   // store it in a list
   singleCellList.add(singleCelldb);
   // create single cell expression of a Gene
   String gene = "ALDOA";
   double expressionValue = 4.3;

   // Create Expression object
   final Expression sce = new Expression();
   sce.setCell(singleCelldb); // associate Expression with SingleCell
   sce.setGene(gene);
   sce.setExpression(expressionValue);
   sce.setProjectTag(datasetTag);
   
   // add Expression to a list of Expression objects
   sces.add(sce);

   return sces;
}

Save the Expression objects returned by the previous method:

final List<Expression> sces = readSingleCellExpressions();
// number of entities per insert query
final int BATCH_SIZE = 1000;

// statusListener is assumed to be declared elsewhere in your code
final List<Expression> batch = new ArrayList<Expression>();
for (final Expression sce : sces) {
    batch.add(sce);
    if (batch.size() == BATCH_SIZE) {
        mongoBaseService.saveExpressions(batch, statusListener);
        System.out.println(batch.size() + " entities saved in database");
        batch.clear();
    }
}
// save the remaining ones
if (!batch.isEmpty()) {
    mongoBaseService.saveExpressions(batch, statusListener);
}

Save the SingleCells that are in the list singleCellList:

final List<SingleCell> batch = new ArrayList<SingleCell>();
// names of the single cells already queued or saved, to avoid storing duplicates:
final Set<String> singleCellsInDB = new HashSet<String>();
for (final SingleCell sc : singleCellList) {
    // Set.add returns false if the name was already seen,
    // so duplicates are skipped even within the same batch
    if (singleCellsInDB.add(sc.getName())) {
        batch.add(sc);
    }
    if (batch.size() == BATCH_SIZE) {
        mongoBaseService.saveSingleCells(batch, statusListener);
        batch.clear();
    }
}
// store the remaining ones:
if (!batch.isEmpty()) {
    mongoBaseService.saveSingleCells(batch, statusListener);
}
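Both saving loops above follow the same batching pattern, so it could be factored into a small generic helper. The following is only a sketch: the class BatchSaveSketch and the method saveInBatches are hypothetical and not part of pctsea-core, and the sketch assumes the save operation can be expressed as a java.util.function.Consumer, which would not hold directly if the save methods throw checked exceptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchSaveSketch {

    /**
     * Splits the items into consecutive batches of at most batchSize elements
     * and passes each batch to the saver. Returns the number of batches saved.
     */
    static <T> int saveInBatches(List<T> items, int batchSize, Consumer<List<T>> saver) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        for (int start = 0; start < items.size(); start += batchSize) {
            final int end = Math.min(start + batchSize, items.size());
            // copy the slice so that the saver receives an independent list
            saver.accept(new ArrayList<>(items.subList(start, end)));
            calls++;
        }
        return calls;
    }
}
```

Under those assumptions, the saver passed to the helper would be a lambda wrapping a call such as mongoBaseService.saveExpressions(batch, statusListener), and the last, possibly smaller, batch is handled by the same loop instead of a separate block.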

Shiny apps for visualization of results:

PCTSEA contains two Shiny apps for visualizing results: one for visualizing a single result, here, and another for generating a heatmap from multiple results, here.

Proteomics Yates Laboratory
Salvador Martínez-Bartolomé (salvador at scripps.edu)
Research Associate
The Scripps Research Institute
10550 North Torrey Pines Road
La Jolla, CA 92037
Git-Hub profile
