Issue with FindMarkers() methodology #795

danielcgingerich · 2021-09-15T13:23:17Z

The two parts of the find markers function: 1) fold change calculation and 2) p value calculation. Fold change is calculated using raw counts, while the p value is calculated using normalized data. Why is this? I do not believe raw counts are comparable to normalized data in this scenario, and here is why:

Raw counts vs normalized counts result in a change of directionality of many peaks. Example: peak A might have a positive logFC value with raw counts, but it becomes negative when using normalized data. This means that the input data for the p value and the input data for the fold change are contradicting each other.

I calculated fold change on the raw counts and normalized counts and looked at how this changes direction of peak fold changes.

# raw counts
u.load <- mtx[, load.cells]
u.load <- rowMeans(u.load)+1
u.normal <- mtx[, norm.cells]
u.normal <- rowMeans(u.normal)+1
fc <- u.load/u.normal

table(fc > 1)
FALSE  TRUE 
55299 47509

a <- fc > 1

# normalized counts
mtx2 <- atac[['peaks']]@data
u.load <- mtx2[, load.cells]
u.load <- rowMeans(u.load)+1
u.normal <- mtx2[, norm.cells]
u.normal <- rowMeans(u.normal)+1
fc <- u.load/u.normal

table(fc > 1)
FALSE  TRUE 
19302 83506

b <- fc > 1

table(a == b)
FALSE  TRUE 
36399 66409

In this case, out of 102808 peaks, 36399 of them change direction depending on whether raw vs normalized data is used as input. For these peaks, the input for the p value and the input for the fold change calculation are directly contradicting each other.

The text was updated successfully, but these errors were encountered:

timoast · 2021-09-15T20:48:31Z

Hi @danielcgingerich, thanks for raising this issue. Having now thought about this more, I think you're right. In particular, the situation you've highlighted where there can be a conflict between FC and p-val is a good example of why we need to use consistent values for both. I think the change from normalized to raw counts was motivated by wanting to avoid any assumptions about how the values in the "data" slot were processed. In Seurat, there was an assumption when computing fold-changes that the values had been log-normalized, which caused issues when that assumption was not correct (if running on scATAC-seq data for example). However, the main issue with that was the choice of mean function for computing fold-change, rather than the data slot used. Since we've added a FoldChange() method for ChromatinAssay objects that sets a sensible mean function, I don't see any issues with using the normalized data here and so I'll update the FoldChange() function now to use normalized data.

danielcgingerich · 2021-09-16T13:59:45Z

@timoast Should normalized counts also be used for logFC calculation with scRNA data in Seurat?

timoast · 2021-09-16T14:34:40Z

By default they are

danielcgingerich · 2021-09-16T15:29:28Z

@timoast Another question: For Seurat, assuming method = 'LogNormalize' it looks like the default mean function uses expm1 to reverse the log transformation. The resulting data is counts per scale.factor. However, p value is still calculated on log transformed data, which could still lead to some problems. Should the mean function leave out expm1? Or, should the expm1 also be applied to the data used as input for the p value calculation?

Edit:
Additional info - we are using MAST for the DE test on our scRNA data

Edit2:
My guess is that nothing needs to be changed for the Seurat method. Since MAST uses a gaussian linear model for the continuous variable, the data needs to be log transformed. but for fold change calculation it should not be log transformed.

timoast · 2021-09-16T15:39:57Z

I think this is a question for the Seurat repo, please raise there

timoast added a commit that referenced this issue Sep 15, 2021

Use normalized counts in fold-change calculation; #795

2b18258

timoast closed this as completed Sep 15, 2021

timoast added this to Signac roadmap Nov 8, 2022

timoast removed this from Signac roadmap Nov 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with FindMarkers() methodology #795

Issue with FindMarkers() methodology #795

danielcgingerich commented Sep 15, 2021 •

edited

Loading

timoast commented Sep 15, 2021 •

edited

Loading

danielcgingerich commented Sep 16, 2021

timoast commented Sep 16, 2021

danielcgingerich commented Sep 16, 2021 •

edited

Loading

timoast commented Sep 16, 2021

Issue with FindMarkers() methodology #795

Issue with FindMarkers() methodology #795

Comments

danielcgingerich commented Sep 15, 2021 • edited Loading

timoast commented Sep 15, 2021 • edited Loading

danielcgingerich commented Sep 16, 2021

timoast commented Sep 16, 2021

danielcgingerich commented Sep 16, 2021 • edited Loading

timoast commented Sep 16, 2021

danielcgingerich commented Sep 15, 2021 •

edited

Loading

timoast commented Sep 15, 2021 •

edited

Loading

danielcgingerich commented Sep 16, 2021 •

edited

Loading