Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with FindMarkers() methodology #795

Closed
danielcgingerich opened this issue Sep 15, 2021 · 5 comments
Closed

Issue with FindMarkers() methodology #795

danielcgingerich opened this issue Sep 15, 2021 · 5 comments

Comments

@danielcgingerich
Copy link

danielcgingerich commented Sep 15, 2021

The two parts of the find markers function: 1) fold change calculation and 2) p value calculation. Fold change is calculated using raw counts, while the p value is calculated using normalized data. Why is this? I do not believe raw counts are comparable to normalized data in this scenario, and here is why:

Raw counts vs normalized counts result in a change of directionality of many peaks. Example: peak A might have a positive logFC value with raw counts, but it becomes negative when using normalized data. This means that the input data for the p value and the input data for the fold change are contradicting each other.

I calculated fold change on the raw counts and normalized counts and looked at how this changes direction of peak fold changes.

# raw counts
u.load <- mtx[, load.cells]
u.load <- rowMeans(u.load)+1
u.normal <- mtx[, norm.cells]
u.normal <- rowMeans(u.normal)+1
fc <- u.load/u.normal

table(fc > 1)
FALSE  TRUE 
55299 47509

a <- fc > 1

# normalized counts
mtx2 <- atac[['peaks']]@data
u.load <- mtx2[, load.cells]
u.load <- rowMeans(u.load)+1
u.normal <- mtx2[, norm.cells]
u.normal <- rowMeans(u.normal)+1
fc <- u.load/u.normal

table(fc > 1)
FALSE  TRUE 
19302 83506

b <- fc > 1

table(a == b)
FALSE  TRUE 
36399 66409

In this case, out of 102808 peaks, 36399 of them change direction depending on whether raw vs normalized data is used as input. For these peaks, the input for the p value and the input for the fold change calculation are directly contradicting each other.

@timoast
Copy link
Collaborator

timoast commented Sep 15, 2021

Hi @danielcgingerich, thanks for raising this issue. Having now thought about this more, I think you're right. In particular, the situation you've highlighted where there can be a conflict between FC and p-val is a good example of why we need to use consistent values for both. I think the change from normalized to raw counts was motivated by wanting to avoid any assumptions about how the values in the "data" slot were processed. In Seurat, there was an assumption when computing fold-changes that the values had been log-normalized, which caused issues when that assumption was not correct (if running on scATAC-seq data for example). However, the main issue with that was the choice of mean function for computing fold-change, rather than the data slot used. Since we've added a FoldChange() method for ChromatinAssay objects that sets a sensible mean function, I don't see any issues with using the normalized data here and so I'll update the FoldChange() function now to use normalized data.

@danielcgingerich
Copy link
Author

@timoast Should normalized counts also be used for logFC calculation with scRNA data in Seurat?

@timoast
Copy link
Collaborator

timoast commented Sep 16, 2021

By default they are

@danielcgingerich
Copy link
Author

danielcgingerich commented Sep 16, 2021

@timoast Another question: For Seurat, assuming method = 'LogNormalize' it looks like the default mean function uses expm1 to reverse the log transformation. The resulting data is counts per scale.factor. However, p value is still calculated on log transformed data, which could still lead to some problems. Should the mean function leave out expm1? Or, should the expm1 also be applied to the data used as input for the p value calculation?

Edit:
Additional info - we are using MAST for the DE test on our scRNA data

Edit2:
My guess is that nothing needs to be changed for the Seurat method. Since MAST uses a gaussian linear model for the continuous variable, the data needs to be log transformed. but for fold change calculation it should not be log transformed.

@timoast
Copy link
Collaborator

timoast commented Sep 16, 2021

I think this is a question for the Seurat repo, please raise there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants