-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDemultQuestions.Rmd
219 lines (141 loc) · 9.61 KB
/
DemultQuestions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
---
title: "Demultiplex Report"
author: "Trevor Enright"
date: "August 2, 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
##Part 1
1.
```{r echo=FALSE}
library(knitr)
Reads <- c("read1","index1","index2","read2")
File <- c("1294_S1_L008_R1_001.fastq.gz","1294_S1_L008_R2_001.fastq.gz","1294_S1_L008_R3_001.fastq.gz","1294_S1_L008_R4_001.fastq.gz")
Meaning <- c("Forward sequence read","Forwrad index","Reverse index","Reverse sequence read")
tabl <- data.frame(File,Reads,Meaning)
kable(tabl, caption="File associations")
```
```{r echo=F}
target = 0.999
pf = -10*log10(1-target)
```
2a.
##Average Quality by Position Plots:
```{r echo=F}
R1 <- read.delim("~/School/BI622/HW/R1out2", header=FALSE, comment.char="#")
R2 <- read.delim("~/School/BI622/HW/R2out2", header=FALSE, comment.char="#")
R3 <- read.delim("~/School/BI622/HW/R3out2", header=FALSE, comment.char="#")
R4 <- read.delim("~/School/BI622/HW/R4out2", header=FALSE, comment.char="#")
plot(R1,type="l",xlab = "Position", ylab = "Phred Score", main = "Forward Sequence (Read 1)",col="orchid")
plot(R2,type="l",xlab = "Position", ylab = "Phred Score", main = "Forward Index (Read 2)",col="cornflower blue")
plot(R3,type="l",xlab = "Position", ylab = "Phred Score", main = "Reverse Index (Read 3)",col="cornflower blue")
plot(R4,type="l",xlab = "Position", ylab = "Phred Score", main = "Reverse Sequence (Read 4)",col="orchid")
```
2b.
Lets specify a cut off mean score of whatever score is $`r target*100`$% likely to be correct or $`r (1-target)*100`$% likely to have an error.
That is $-10 \times log_{10}(1-`r target`) = `r round(pf,2)`$ Phred Score.
2c.
To get at the number of index reads that have one or more "N"
$zcat 1294_S1_L008_R2_001.fastq.gz | sed -n 2~4p | grep -c N 3976613
$zcat 1294_S1_L008_R3_001.fastq.gz | sed -n 2~4p | grep -c N 3328051
Thus, 7,304,664 index reads have one or more "N" in them.
##Part 2
Problem: Indexes can be swapped during sequencing causing reads to be misassociated with their index. For ease of catching these errors for paired-end reads there is the same index (reverse compliment) placed on each side of the insert for both forward and reverse reads.
Solution: If swapping occurs then we should know by matching the forward index read with the reverse compliment index read. For the easiest solution, we'll only accept 100% matches that are one of the indexes provided. Ideally, we'd accept for mismatches that could only be one of the indexes provided using error correction methods including probability with set theory. After selecting for index hopping, or lack thereof, we'll select for average quality score of a read that is above a threshold.
The output will be 48 FASTQ files of the forward reads for each of the indexes forward and reverse. There will be an additional 2 text files for the reads that were unaccepted for index hopping and reads that were unaccepted due to not meeting the average quality score requirement.
### General Algorithm:
1) Make bins for *keep* and *throw out* for testing index hopping
2) Make bins for *keep* and *throw out* for testing average quality score threshold
3) Consolidate bins and create FASTQ files for each index / bin *keep* and write FASTQ files
4) Create text files of the reads that do not meet quality score cutoff and those that are putative index swapped or have "Ns"
### Functions:
Reverse Compliment - given a DNA sequence (string), return the reverse compliment (string)
sample input: "GTAGCGTA"
returns: "TACGCTAC"
IndexSwapping - given a R2 and R3 FASTQ index files and a set of indexes, compare R2 and R3 (reverse compliment) sequence lines and either pass or fail for that read. If both indexes are in the set of given indexes then bin them into "undetermined". return a success dictionary with the value being the index and the key being the read number, also return lists of reads (the number of the read in the file) that *failed* either by having "N" or otherwise, and return an undetermined list.
sample input: R3sample, R2sample, {"GTAGCGTA", "CGATCGAT" , "GATCAAGG" , "AACAGCGA" , "TAGCCATG" , "CGGTAATC" , "CTCTGGAT" , "TACCGGAT"}
sample output: {1,"GTAGCGTA"}, [2,3]
QualityCutoff - given either R1 or the R4 FASTQ read file and a cutoff phred score, calculate the mean quality score, x, of that read. If x is above the cutoff phred score: bin into a list of *keep*; else bin into list of *throw out*. return *keep* list and *throw out* list.
sample input: R1sample, 20
sample output: [1,2,3] , []
WriteFASTQfiles - given output dictionary and list from IndexSwapping and QualityCutoff keep list: Create 48 strings for each of the 24 indexes, 1 forward and 1 reverse; go through the R1 and R4 files and if the key in the dictionary then append the read from R1 and R4 to the string of the dictionary value index. Write to the output files. return None.
WriteDismissed - given output *failed* lists from IndexSwapping and QualityCutoff *throw out* list: loop through lists and append to either a string of IndexSwapping or string of QualityCutoff. Write each string to a separate file. return None.
### Advanced Method:
Instead of dismissing indexes with "N"s we can make a guess at what it was supposed to be. An algorithm to accomplish this would be to:
1) When an N is encountered in an index, make a set of putative indexes by removing the N and replacing with either A,G,T,C. Thus if an index has one N there would be four different possibilities in the set. If there was two N's there would be 4^2 = 16 possibilities in the set.
2) Find the intersection of this set with the set of known indexes. If the intersection is just one index then we keep the intersection as the index.
Useful Functions:
BuildSet - given an index (string) with at least one "N", return a set with the possibilities of what that index could be.
sample input: "NCATGACG"
returns: {"ACATGACG","TCATGACG","CCATGACG","GCATGACG"}
Intersect - given two sets, If the intersection of the sets is one element then return that element, else return False
sample inputs: {"ACATGACG","GGACTGCT"} {"ACATGACG","TCATGACG","CCATGACG","GCATGACG"}
returns: {"ACATGACG"}
## Input Files:
Sample FASTQ read file read (R1sample):
@K00337:83:HJKJNBBXX:8:1101:1265:1191 1:N:0:1
GNCTGGCATTCCCAGAGACATCAGTACCCAGTTGGTTCAGACAGTTCCTCTATTGGTTGACAAGGTCTTCATTTCTAGTGATATCAACACGGTGTCTACAA
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
@K00337:83:HJKJNBBXX:8:1101:1289:1191 1:N:0:1
ACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAG
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
@K00337:83:HJKJNBBXX:8:1101:1299:1191 1:N:0:1
GNCTGGCATTCCCAGAGACATCAGTGNCTGGCATTCCCAGAGACATCAGTGNCTGGCATTCCCAGAGACATCAGTGNCTGGCATTCCCAGAGACATCAGT
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
Sample FASTQ read file read (R4sample):
@K00337:83:HJKJNBBXX:8:1101:1265:1191 1:N:0:1
GNCTGGCATTCCCAGAGACATCAGTACCCAGTTGGTTCAGACAGTTCCTCTATTGGTTGACAAGGTCTTCATTTCTAGTGATATCAACACGGTGTCTACAA
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
@K00337:83:HJKJNBBXX:8:1101:1289:1191 1:N:0:1
ACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAGACGGTGCTAG
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
@K00337:83:HJKJNBBXX:8:1101:1299:1191 1:N:0:1
GNCTGGCATTCCCAGAGACATCAGTGNCTGGCATTCCCAGAGACATCAGTGNCTGGCATTCCCAGAGACATCAGTGNCTGGCATTCCCAGAGACATCAGT
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
Sample FASTQ index file read (R3sample):
@K00337:83:HJKJNBBXX:8:1101:1265:1191 2:N:0:1
TACGCTAC
+
#AA<FJJJ
@K00337:83:HJKJNBBXX:8:1101:1289:1191 2:N:0:1
NACAGCGA
+
#AAAFJJJ
@K00337:83:HJKJNBBXX:8:1101:1299:1191 2:N:0:1
NTCCTAAG
+
#AAFFJAJ
Sample FASTQ index file read (R2sample):
@K00337:83:HJKJNBBXX:8:1101:1265:1191 2:N:0:1
GTAGCGTA
+
#AA<FJJJ
@K00337:83:HJKJNBBXX:8:1101:1289:1191 2:N:0:1
TCGCTGTA
+
#AAAFJJJ
@K00337:83:HJKJNBBXX:8:1101:1299:1191 2:N:0:1
CTTAGGAT
+
#AAFFJAJ
## Output Files:
Given cutoff Phred score of 30 and an Indexes of GTAGCGTA and TCGCTGA, not using Advanced N correction...
Sample FASTQ given to Index: GTAGCGTA
@K00337:83:HJKJNBBXX:8:1101:1265:1191 1:N:0:1
GNCTGGCATTCCCAGAGACATCAGTACCCAGTTGGTTCAGACAGTTCCTCTATTGGTTGACAAGGTCTTCATTTCTAGTGATATCAACACGGTGTCTACAA
+
A#A-<FJJJ<JJJJJJJJJJJJJJJJJFJJJJFFJJFJJJAJJJJ-AJJJJJJJFFJJJJJJFFA-7<AJJJFFAJJJJJF<F--JJJJJJF-A-F7JJJJ
Sample mismatched and Undertermined Index text file
"Mismatched Indexes" "Undetermined"
2
3
Sample low quality text file
"[blank]"