-
Notifications
You must be signed in to change notification settings - Fork 100
Additional analyses and advanced settings
Many eQTL datasets are created by measuring gene expression tissues that consist of many different cell types. As about 40% of the trait-associated SNPs show a cis-eQTL effect in these compound tissues, it is often unclear what the causal cell-type is for the disease. The method described here is able to use an a priori defined list of cell-type specific genes to determine whether a cis-eQTL is specific for a cell type.
This method is part of a manuscript titled: Cell-type specific eQTL analysis without the need to sort cells
Provided a list of genes that show a high correlation with the cell-count for the cell-type in question, we can determine a proxy phenotype in another gene expression dataset that does not have cell counts. Our method first creates a correlation matrix for the list of cell-type specific genes using the raw gene expression data (quantile normalized, log2 transformed, corrected for MDS components). Using principal component analysis on the correlation matrix, we obtain principal components (PC) that describe the variation among these genes. The first PC (PC1) attributes the largest amount of variation, and can thus be seen as a proxy-phenotype for the cell-type. PC1 can then be used as an independent variable in the linear model for the cis-eQTL. This means that apart from the genotype effect, we can now also determine the effect of PC1 on gene expression. Apart from these two effects, we also determine the interaction term between genotype and PC1. This interaction term describes the dependence of the cis-eQTL on PC1, and thus effectively the interaction between the cis-eQTL and the cell type (see the figure on the next page). We choose to only use PC1 here, because PC1 describes the majority of the variation amongst the cell-type specific genes and also to limit the number of independent variables in our model. A typical cis-eQTL linear model is the following:
y ~ β*g + e
where y is the gene expression of the gene, and β is the slope of the linear model, g is the genotype and e is the intersect with the y-axis. Including PC1 as an independent variable, the model is as follows:
y ~ β0*g + β1*p + β2*p*g + e
where p is the PC1 covariate and p*g is the interaction term between PC1 and the genotype. Note that this model fits three linear models at once (three different slopes). Apart from the cell count proxy, we also determine the interaction term for all genes that show a cis-eQTL effect.
The method is a two-step process: One step performs initial normalization of the gene expression data and adds an extra quality control step (by correlating the first PC over the sample correlation matrix, samples with poor RNA quality can be determined. We use a correlation threshold of 0.9 with PC1 to remove poor quality samples). Then, the program calculates the first PC using the cell-type specific probe correlation matrix to use as a proxy phenotype in the second step of the program. The second step performs the actual eQTL mapping using the Ordinary Least Squares model with two independent variables and an interaction term.
Locate and/or download the following files (to avoid confusion, use full paths when supplying these files to the software):
- ExpressionData.txt.gz: the raw gene expression data (not normalized, not corrected for MDS components, not corrected for any cis-eQTL effects). We will call this file
traitfile
. - Find a location on your hard drive to store the output. We will refer to this directory as
outputdir
. - CellTypeSpecificProbeList.txt: the list of probes that correlate with the cell type of interest. One probe per line. Make sure the probe identifiers (or a subset thereof) are consistent with those in your
traitfile
We will call this fileprobelist
. - MDSComponents.txt: this (optional) file contains the four MDS components that were previously used for trait data normalization (which were calculated using pruned genotypes). We will call this file
mdscomponents
. This format of this file is described here Phenotype file, covariate file. - GenotypeToExpressionCoupling.txt: the file linking genotypes to gene expression sample identifiers. This file is also optional (for example if your sample identifiers in your gene expression data correspond to the genotype sample identifiers). We will call this file
genotypetotraitcoupling
. The format is described here: Genotype - phenotype coupling. - CellCounts.txt: in order to determine whether PC1 actually reflects cell-type differences, we may ask you to supply this optional file. The program will determine and output the correlation between the proxy phenotype and the actual phenotype if this file is supplied. The format is identical to the
mdscomponents
file. We will call this filecellcounts
####Normalization - Step 2 - Run the normalization and QC
java –jar eqtl-mapping-pipeline.jar --mode celltypespecific --step normalize --inexpraw traitfile --out outputdir --celltypespecificprobes probelist --mdscomponents mdscomponents --gte genotypetotraitcoupling --cellcounts cellcounts
Please note that the --gte, --mdscomponents and --cellcount switches are optional
After completion, the normalization step will have generated a number of files in your outputdir
:
File | Description |
---|---|
CellTypeProxyFile.txt | This file contains the first principal component, calculated over the cell type specific probe correlation matrix. This file is required for step 2. |
CellTypeSpecificProbeCorrelationMatrix.txt.gz | This file contains the correlation matrix for the cell type specific probes |
CellTypeSpecificProbePCA.PCAoverSamplesEigenvalues.txt.gz | This file contains the eigenvalues for the PCA analysis over the cell type specific probes |
CellTypeSpecificProbePCA.PCAoverSamplesEigenvectors.txt.gz | This file contains the eigenvectors for the PCA analysis over the cell type specific probes |
CellTypeSpecificProbePCA.PCAoverSamplesEigenvectorsTransposed.txt.gz | This is the transposed version of the eigenvector file above. |
CellTypeSpecificProbePCA.PCAoverSamplesPrincipalComponents.txt.gz | This file contains the actual principal components for the cell type specific probe PCA |
SampleCorrelationMatrix.txt | The correlation matrix over the samples. |
SamplePC1Correlations.txt | A PCA analysis has been performed using the sample correlation matrix. This file contains the correlations of all samples with the first principal component, as a quality control measure. Samples with a correlation < 0.9 with this first PC have been excluded to calculate the PCA that was used for CellTypeProxyFile.txt |
ComparisonToCellCount.txt | If you have used the --cellcounts option, this file contains the correlation of PC1 with your actual cell count. |
plot.pdf | If you have used the --cellcounts option, this file plots the cell-counts on the y-axis and the PC1 values on the x-axis. |
Additionally, the normalization step will have created a subdirectory in your outputdir
containing the following files:
File | Description |
---|---|
CellTypeSpecificProbeExpression.txt.gz | This file contains the (Quantile normalized, log2 transformed) gene expression data for the cell type specific probes. |
ExpressionData-QNormLog2Transformed.CovariatesRemoved.txt.gz OR | |
ExpressionData-QNormLog2Transformed.txt.gz | This file contains the gene expression data you used as inexp, although now it is quantile normalized and log2 transformed. Please note that if you did not correct for MDS components, this file will not end with ‘CovariatesRemoved’. |
ExpressionDataPCQC-QNormLog2Transformed.CovariatesRemoved.txt.gz OR | |
ExpressionDataPCQC-QNormLog2Transformed.txt.gz | As described before, PC1 of the sample correlation matrix is used as a quality control measure. Samples with a correlation of < 0.9 with PC1 will be excluded from further analysis. This file contains only those samples that pass this threshold. This file is required for step 2. |
PCAResults.PCAoverSamplesEigenvalues.txt.gz | PCA eigenvalues for the quality control PCA over the sample correlation matrix |
PCAResults.PCAoverSamplesEigenvectors.txt.gz | PCA eigenvectors for the quality control PCA over the sample correlation matrix |
PCAResults.PCAoverSamplesEigenvectorsTransposed.txt.gz | Transposed PCA eigenvectors for the quality control PCA over the sample correlation matrix |
PCAResults.PCAoverSamplesPrincipalComponents.txt.gz | Principal components for the quality control PCA over the sample correlation matrix |
Locate and/or download the following files (to avoid confusion, use full paths when supplying these files to the software):
- The directory containing your TriTyper genotype data. We will call this directory
genotypedir
. - ExpressionDataPCQC-QNormLog2Transformed.CovariatesRemoved.txt.gz OR ExpressionDataPCQC-QNormLog2Transformed.txt.gz: the raw gene expression data as generated in the normalization step (described above). We will call this file
covariates
- ExpressionData.txt.40PCAsOverSamplesRemoved-GeneticVectorsNotRemoved.txt.gz: trait data generated by Step 3 - Phenotype data normalization of the eQTL mapping pipeline, corrected for 40PCs (please make sure this file was not corrected for cis-eQTL effects but was corrected for MDS components). We will call this file
traitfile
- Find a location on your hard drive to store the output. We will refer to this directory as
outputdir
. - SNPProbe.txt: the file containing the cis-eQTL SNP probe combinations to test. One combination per line, with the first column showing the genotype variant identifier. Please use a file that is suitable for your platform. We will call this file
snpprobefile
. - GenotypeToExpressionCoupling.txt: the file linking genotypes to gene expression sample identifiers. This file is also optional (for example if your sample identifiers in your gene expression data correspond to the genotype sample identifiers). We will call this file
genotypetotraitcoupling
. The format is described here: Genotype - phenotype coupling. - CellCounts.txt: the file containing the proxy-phenotype generated by the normalization step described above. The format is that of a Phenotype file, covariate file. We will call this file
cellcountproxy
- Determine the number of available cores/processors in your machine (optional). We will call this
nrThreads
java –jar eqtl-mapping-pipeline.jar --mode celltypespecific --step mapeqtls --inexp traitfile --covariates covariates --cellcounts cellcountproxy --in genotypedir --out outputdir --snpprobe snpprobefile --threads nrThreads
Please note that the --gte and --threads switches are optional
After finalizing, the association analysis will have generated a couple of files in your outputdir
:
File | Description |
---|---|
CellTypeSpecificEQTLEffects.txt | This file contains the Z-scores for both the cis-eQTL as well as the interaction term with the (proxy) cell count |
CellTypeSpecificityMatrix.binary.columns.ser | This binary file contains the column description for the CellTypeSpecificityMatrix |
CellTypeSpecificityMatrix.binary.dat | This binary matrix contains Z-scores: each column is a cis-eQTL, each row is a probe interaction Z-score, for each probe that was used as a covariate in the model. Also, this matrix contains two summary statistics (beta + Z-score) for the (proxy) cell count. This file can be used for eventual meta-analysis of cell-type specific effects. You can convert this file to a text-based matrix using the utility described in Step 3. |
CellTypeSpecificityMatrix.binary.rows.ser | This binary file contains the row description of the CellTypeSpecificityMatrix |
eQTLsNotPassingQC.txt | The set of eQTLs that did not pass QC |
SNPSummaryStatistics.txt | This file contains summary statistics on the SNPs: MAF, HWE, Call-rate, Alleleles, minor allele, etc. |
If you want to investigate the data stored within the CellTypeSpecificityMatrix.binary.dat file, you can use the following command to convert the binary file to a text-based, tab-separated matrix:
java –jar eqtl-mapping-pipeline.jar --mode util --convertbinarymatrix --in /path/to/ CellTypeSpecificityMatrix.binary.dat --out /path/to/textoutput.txt
The command line interface of this software allows for basic QTL analyses. However, our software has many more capabilities that are not accesible via the command line. In these cases, an XML file is required that describes the different settings (full path referred to as settingsfile
). An example settingsfile
is provided in the repository. Using a settings file allows you to quickly rerun certain analyses and to perform on-the-fly meta-analyses. A copy of the settingsfile
will always be copied to your outdir
.
Currently, settingsfile
can only be used in the --mode metaqtl
mode. You should note that a settingsfile
overrides all command line switches. The settingsfile
can be used as follows:
java –jar eqtl-mapping-pipeline.jar --mode metaqtl --settings settingsfile
XML is, like HTML, a hierarchical markup language, which works with so-called markup tags. An example of such tags (describing two QC settings) is below:
<defaults>
<qc>
<maf>0.05</maf>
<hwep>0.001</hwep>
</qc>
</defaults>
Please note that if you open a tag "<maf>
" you also need to close it: "</maf>
". Also note that the <maf>
tag is part of both <qc>
and <defaults>
, as an example of the hierarchy of XML files. If you have issues with the settings file, check whether all tags are opened and closed properly, and whether the hierarchy is correct. Also note that the tags are case-sensitive. The available settings are described as comments in the example xml file.
- QTL mapping pipeline
- Genotype Harmonizer
- Genotype IO
- ASE
- GADO Command line
- Downstreamer
- GeneNetwork Analysis
Analysis plans
Other