1
00:00:00,070 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,330
To make a donation or
view additional materials

6
00:00:13,330 --> 00:00:17,215
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,215 --> 00:00:17,840
at ocw.mit.edu.

8
00:00:27,790 --> 00:00:33,820
PROFESSOR: Well, welcome back to
computational systems biology.

9
00:00:33,820 --> 00:00:36,980
We're back here today talking
about genome assembly.

10
00:00:36,980 --> 00:00:43,150
How many people have ever
assembled a genome before?

11
00:00:43,150 --> 00:00:44,050
In your spare time?

12
00:00:44,050 --> 00:00:46,740
Anybody done any
genome assembly here?

13
00:00:46,740 --> 00:00:48,330
One person?

14
00:00:48,330 --> 00:00:50,590
I think genome assembly
is a fascinating topic.

15
00:00:50,590 --> 00:00:54,830
And as you know, it's at the
bedrock of all modern biology.

16
00:00:54,830 --> 00:00:59,750
We rely upon genome references
for almost everything in terms

17
00:00:59,750 --> 00:01:03,690
of studying evolution, looking
at the structure of genes,

18
00:01:03,690 --> 00:01:07,350
regulation of genes,
differences between individuals.

19
00:01:07,350 --> 00:01:11,970
So it's really a very
fundamental concept.

20
00:01:11,970 --> 00:01:14,220
And we're going to talk today
about two different ways

21
00:01:14,220 --> 00:01:15,450
of assembling genomes.

22
00:01:15,450 --> 00:01:18,290
And I think one of the takeaway
messages from today's lecture

23
00:01:18,290 --> 00:01:21,200
is going to be that
genome assembly is more

24
00:01:21,200 --> 00:01:23,740
of an art, in some
sense, than a science.

25
00:01:23,740 --> 00:01:25,320
And one has to always
be a little bit

26
00:01:25,320 --> 00:01:28,310
suspicious of a
genome assembly given

27
00:01:28,310 --> 00:01:30,300
what you're about
to learn today.

28
00:01:30,300 --> 00:01:33,920
And, of course, genome assembly
is becoming even more complex

29
00:01:33,920 --> 00:01:37,700
because it used to be that
assembling the human genome

30
00:01:37,700 --> 00:01:41,740
was the big task scientifically
in front of the community.

31
00:01:41,740 --> 00:01:44,230
But now there are billions
of genomes waiting

32
00:01:44,230 --> 00:01:48,020
to be sequenced-- all the
individuals in the world

33
00:01:48,020 --> 00:01:49,330
and to try and interpret them.

34
00:01:49,330 --> 00:01:50,996
And now you can get
your genome sequence

35
00:01:50,996 --> 00:01:52,290
for between $5,000 and $10,000.

36
00:01:52,290 --> 00:01:57,220
How many people here are tempted
to get their genome sequenced?

37
00:01:57,220 --> 00:02:00,461
OK, I see about five
hands-- six hands.

38
00:02:00,461 --> 00:02:00,960
Great.

39
00:02:00,960 --> 00:02:07,950
So let's look at the science
behind genome assembly.

40
00:02:07,950 --> 00:02:10,289
The basic concept
is that we're going

41
00:02:10,289 --> 00:02:15,089
to collect some sequence
reads from the genome.

42
00:02:15,089 --> 00:02:16,630
And we're going to
assemble them into

43
00:02:16,630 --> 00:02:20,870
what are called contigs
for contiguous segments.

44
00:02:20,870 --> 00:02:22,827
And these represent
uninterrupted portions

45
00:02:22,827 --> 00:02:24,660
of the genome that are
completely covered by

46
00:02:24,660 --> 00:02:26,485
reads that we believe
are contiguous.

47
00:02:29,290 --> 00:02:33,730
These contigs then will be
paired together in scaffolds.

48
00:02:33,730 --> 00:02:36,360
And scaffolds are like contigs
except that there are missing

49
00:02:36,360 --> 00:02:39,560
parts between the
contigs in a scaffold.

50
00:02:39,560 --> 00:02:42,360
We don't know what
those parts are.

51
00:02:42,360 --> 00:02:44,710
But we're able to actually
glue them together

52
00:02:44,710 --> 00:02:47,750
by using read pairs that allow
us to jump over the missing

53
00:02:47,750 --> 00:02:50,870
parts because we have read
both ends of a molecule.

54
00:02:50,870 --> 00:02:54,190
But we don't know
what's in the middle.

55
00:02:54,190 --> 00:02:56,730
And then oftentimes we had
physical mapping technologies

56
00:02:56,730 --> 00:03:00,240
where we actually can go back
and assign scaffolds

57
00:03:00,240 --> 00:03:03,040
to physical locations
on chromosomes

58
00:03:03,040 --> 00:03:08,260
by using PCR sequences
like sequence tag sites

59
00:03:08,260 --> 00:03:12,980
that physically locate
a particular sequence

60
00:03:12,980 --> 00:03:16,655
identity to a physical location
on a particular chromosome.

61
00:03:16,655 --> 00:03:19,804
And that provides us
with a total genome map.

62
00:03:19,804 --> 00:03:21,220
So today we're
going to be talking

63
00:03:21,220 --> 00:03:25,440
about how to go from a
hard drive full of sequence

64
00:03:25,440 --> 00:03:28,505
reads all the way down
to a set of scaffolds

65
00:03:28,505 --> 00:03:31,820
that include assembled contigs.

66
00:03:31,820 --> 00:03:35,560
And the way to think
about this once again

67
00:03:35,560 --> 00:03:38,470
is that we start with
conceptually a single copy

68
00:03:38,470 --> 00:03:39,230
of the genome.

69
00:03:39,230 --> 00:03:42,250
We amplify this.

70
00:03:42,250 --> 00:03:47,020
And in order to sequence it
on contemporary instruments,

71
00:03:47,020 --> 00:03:48,470
we have to fragment it.

72
00:03:48,470 --> 00:03:52,020
Now for those of you who were
in last Friday's recitation,

73
00:03:52,020 --> 00:03:54,729
you heard Heng Li talking about
the idea that sequence reads

74
00:03:54,729 --> 00:03:55,520
are getting longer.

75
00:03:55,520 --> 00:03:57,490
In fact, sequence
reads up to 10 to 15

76
00:03:57,490 --> 00:03:59,630
kilobases are now possible.

77
00:03:59,630 --> 00:04:01,687
And sequence reads
even longer than that

78
00:04:01,687 --> 00:04:03,520
are going to be possible,
which will greatly

79
00:04:03,520 --> 00:04:05,636
simplify the assembly process.

80
00:04:05,636 --> 00:04:07,510
But for now we're talking
about the challenge

81
00:04:07,510 --> 00:04:11,245
of assembling short reads--
say 100 base pair reads off

82
00:04:11,245 --> 00:04:14,400
of contemporary
sequencing instruments.

83
00:04:14,400 --> 00:04:18,529
So we take the fragmented
reads and the notion

84
00:04:18,529 --> 00:04:20,250
is that we know
that they're going

85
00:04:20,250 --> 00:04:22,696
to line up like a puzzle.

86
00:04:22,696 --> 00:04:24,070
And all we have
to do is line the

87
00:04:24,070 --> 00:04:27,180
reads up to recover the red
sequence at the bottom--

88
00:04:27,180 --> 00:04:31,424
the original genome sequence.

89
00:04:31,424 --> 00:04:33,840
And I should add that many of
the illustrations in today's

90
00:04:33,840 --> 00:04:34,964
lecture are from Ben Langmead.

91
00:04:34,964 --> 00:04:39,840
He was kind enough to allow me
to use them for today's talk.

92
00:04:39,840 --> 00:04:44,860
So the goal is to come up with
that red sequence at the bottom

93
00:04:44,860 --> 00:04:48,400
from the original set
of reads but, of course,

94
00:04:48,400 --> 00:04:50,690
the read set that
we're talking about

95
00:04:50,690 --> 00:04:53,680
is perhaps 200 million
reads or even a billion

96
00:04:53,680 --> 00:04:55,970
reads as we'll see.

97
00:04:55,970 --> 00:04:59,420
And so it's quite a tough task
to put pieces together given

98
00:04:59,420 --> 00:05:02,170
that we really don't know
where they came from.

99
00:05:02,170 --> 00:05:03,750
And we don't know
where they align

100
00:05:03,750 --> 00:05:08,512
because we don't have
the red part to guide us.

101
00:05:08,512 --> 00:05:09,970
Now today we're
going to be talking

102
00:05:09,970 --> 00:05:12,280
about what's called
de novo assembly.

103
00:05:12,280 --> 00:05:14,490
That means starting
from scratch.

104
00:05:14,490 --> 00:05:18,030
You hand me your set of reads
for your favorite organism.

105
00:05:18,030 --> 00:05:20,572
And we're going to
assemble it today.

106
00:05:20,572 --> 00:05:22,030
That's different
than what's called

107
00:05:22,030 --> 00:05:24,670
reference-guided
assembly because,

108
00:05:24,670 --> 00:05:27,490
for example, if you're going
to re-sequence me or you,

109
00:05:27,490 --> 00:05:29,660
there is a reference
human genome.

110
00:05:29,660 --> 00:05:33,900
And it would be a simple matter
to take the reads from you or I

111
00:05:33,900 --> 00:05:35,930
and map them back onto
the reference genome

112
00:05:35,930 --> 00:05:40,050
as a guide to trying to
reassemble our genomes.

113
00:05:40,050 --> 00:05:42,070
However, as you can
tell, if there's

114
00:05:42,070 --> 00:05:44,280
a large structural variation
between the reference

115
00:05:44,280 --> 00:05:48,500
genome and our genomes,
that process can fail.

116
00:05:48,500 --> 00:05:53,230
So we're going to be talking
today about de novo assembly.

117
00:05:53,230 --> 00:05:57,430
And in the process
of de novo assembly,

118
00:05:57,430 --> 00:05:59,840
oftentimes we talk
about coverage,

119
00:05:59,840 --> 00:06:03,570
which is on average how
many sequencing bases do

120
00:06:03,570 --> 00:06:06,390
we have for every
base of the genome.

121
00:06:06,390 --> 00:06:10,540
Here we have for this
little illustrative example

122
00:06:10,540 --> 00:06:14,050
coverage of about 7x.

123
00:06:14,050 --> 00:06:18,220
Now, at the origin of
the Human Genome Project,

124
00:06:18,220 --> 00:06:20,710
some calculations were done
about how much coverage

125
00:06:20,710 --> 00:06:23,670
was required to cover
the human genome.

126
00:06:23,670 --> 00:06:28,980
And we talked last time
about library complexity.

127
00:06:28,980 --> 00:06:30,670
This is a slightly
different idea,

128
00:06:30,670 --> 00:06:33,020
which is we want to estimate
the probability the base is

129
00:06:33,020 --> 00:06:34,830
uncovered.

130
00:06:34,830 --> 00:06:37,810
So if we have the genome size
as G and the number of reads

131
00:06:37,810 --> 00:06:40,090
as N and L is the
length of a read,

132
00:06:40,090 --> 00:06:44,030
then N times L is the total
number of bases that we have.

133
00:06:44,030 --> 00:06:47,120
And that divided by the
genome is the average coverage

134
00:06:47,120 --> 00:06:49,020
of a base, which we call lambda.

135
00:06:49,020 --> 00:06:52,240
And the probability
that a base is not covered

136
00:06:52,240 --> 00:06:55,250
is the probability
we're going to observe

137
00:06:55,250 --> 00:06:59,090
zero reads to that base,
which is e to the minus

138
00:06:59,090 --> 00:07:04,330
lambda, roughly speaking, if
we use a Poisson approximation.

139
00:07:04,330 --> 00:07:07,670
And therefore, the number of
uncovered bases we will have

140
00:07:07,670 --> 00:07:12,630
is going to be roughly G
times e to the minus lambda.

141
00:07:12,630 --> 00:07:15,640
The next calculation can
be thought of intuitively

142
00:07:15,640 --> 00:07:19,900
in the following way, which is
if we have N reads, if there's

143
00:07:19,900 --> 00:07:21,540
going to be a gap
after a read, there

144
00:07:21,540 --> 00:07:23,669
has to be an uncovered
base after it.

145
00:07:23,669 --> 00:07:26,210
And so the number of gaps we're
going to have in our assembly

146
00:07:26,210 --> 00:07:30,360
is roughly N times e
to the minus lambda.
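
A minimal sketch of the back-of-the-envelope calculation above, assuming the Poisson approximation the lecture describes. The function name `lander_waterman` and the example numbers (a 3 Gb genome, 100-base reads, 210 million reads) are illustrative assumptions, not values from the slides.

```python
import math

def lander_waterman(genome_size, num_reads, read_length):
    """Back-of-the-envelope coverage estimates under a Poisson model."""
    coverage = num_reads * read_length / genome_size   # lambda = N * L / G
    p_uncovered = math.exp(-coverage)                  # P(zero reads cover a base)
    return {
        "coverage": coverage,
        "p_uncovered": p_uncovered,
        "uncovered_bases": genome_size * p_uncovered,  # G * e^(-lambda)
        "gaps": num_reads * p_uncovered,               # N * e^(-lambda)
    }

# Assumed, illustrative inputs: 3 Gb genome, 100-base reads, 210 million reads.
est = lander_waterman(genome_size=3_000_000_000,
                      num_reads=210_000_000,
                      read_length=100)
print(est)
```

With these assumed inputs the average coverage comes out to 7x. Real libraries are skewed, so the Poisson figures should be read as optimistic estimates.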

147
00:07:30,360 --> 00:07:33,280
So this is a back of
the envelope calculation.

148
00:07:33,280 --> 00:07:38,290
And now if we take some of
our 1000 Genomes data, which

149
00:07:38,290 --> 00:07:42,820
we previously used and asked how
well this approximation works,

150
00:07:42,820 --> 00:07:47,400
we see something like
this where the x-axis is

151
00:07:47,400 --> 00:07:50,340
the total number of reads and
the genome coverage in bases

152
00:07:50,340 --> 00:07:52,141
is shown on the y-axis.

153
00:07:52,141 --> 00:07:54,265
And these are all different
sequencing experiments.

154
00:07:56,840 --> 00:08:00,150
So you can see there the
roughly green outline,

155
00:08:00,150 --> 00:08:04,630
which follows approximately
what we saw before

156
00:08:04,630 --> 00:08:06,137
in this Lander-Waterman rule.

157
00:08:06,137 --> 00:08:07,720
Could somebody tell
me what they think

158
00:08:07,720 --> 00:08:10,290
is going on with the red lines
that actually don't match up

159
00:08:10,290 --> 00:08:11,200
with that green line?

160
00:08:15,210 --> 00:08:17,970
Anybody have any
ideas about why we

161
00:08:17,970 --> 00:08:20,430
need more reads out
of those libraries

162
00:08:20,430 --> 00:08:22,005
to get better coverage?

163
00:08:24,700 --> 00:08:25,503
Yes?

164
00:08:25,503 --> 00:08:27,044
AUDIENCE: There is
probably some bias

165
00:08:27,044 --> 00:08:28,211
when you're amplifying them?

166
00:08:28,211 --> 00:08:29,585
PROFESSOR: Yeah,
there's probably

167
00:08:29,585 --> 00:08:32,179
skew in the original libraries
we talked about last time.

168
00:08:32,179 --> 00:08:34,289
In fact, we talked
about last time

169
00:08:34,289 --> 00:08:37,530
why the Poisson was not
a great approximation

170
00:08:37,530 --> 00:08:39,120
for looking at libraries.

171
00:08:39,120 --> 00:08:41,530
And in fact, we might
want to fit something

172
00:08:41,530 --> 00:08:44,850
like a negative binomial
in this particular case.

173
00:08:47,520 --> 00:08:50,110
So we've got our read set.

174
00:08:50,110 --> 00:08:52,270
And we can also
talk about coverage

175
00:08:52,270 --> 00:08:55,940
at a particular base, which is
different than average coverage

176
00:08:55,940 --> 00:08:58,600
just to be clear that there are
two different kinds of coverage

177
00:08:58,600 --> 00:09:00,510
that one can think about.

178
00:09:00,510 --> 00:09:05,590
Here we see coverage
at T of level six.

179
00:09:08,140 --> 00:09:12,310
And the other thing that
we need to be cognizant of

180
00:09:12,310 --> 00:09:16,700
is that there are two
reasons that we might--

181
00:09:16,700 --> 00:09:19,130
two common reasons why
we might actually see

182
00:09:19,130 --> 00:09:23,310
reads that overlap but don't
agree at all positions.

183
00:09:23,310 --> 00:09:24,790
The obvious reason
is that there's

184
00:09:24,790 --> 00:09:26,330
an error in one of the reads.

185
00:09:26,330 --> 00:09:27,890
We get quality
scores and so forth.

186
00:09:27,890 --> 00:09:30,610
And that can help us
decide which is the truth.

187
00:09:30,610 --> 00:09:34,330
But the other possibility
is that as you know,

188
00:09:34,330 --> 00:09:36,800
you have one of each
of your chromosomes

189
00:09:36,800 --> 00:09:39,077
from your mom, one from your dad.

190
00:09:39,077 --> 00:09:40,660
And there could be
allelic differences

191
00:09:40,660 --> 00:09:41,743
between these chromosomes.

192
00:09:41,743 --> 00:09:44,810
So when we're doing
assembly, oftentimes we'll

193
00:09:44,810 --> 00:09:47,530
find that these
allelic differences are

194
00:09:47,530 --> 00:09:53,230
going to pop up in terms of
non-concordance of our reads.

195
00:09:53,230 --> 00:09:55,140
And we'll have to
ultimately decide

196
00:09:55,140 --> 00:09:59,510
if we want to make a single
haploid approximation

197
00:09:59,510 --> 00:10:03,130
of a human genome or
we want to attempt

198
00:10:03,130 --> 00:10:09,240
to assemble a diploid genome.

199
00:10:09,240 --> 00:10:13,030
And if we're going to
do a diploid genome,

200
00:10:13,030 --> 00:10:15,250
then we have to be
quite careful and use

201
00:10:15,250 --> 00:10:18,560
somewhat different
assembly techniques.

202
00:10:18,560 --> 00:10:20,760
But the common reference
genome is haploid.

203
00:10:20,760 --> 00:10:24,500
It's only considering
one chromosomal sequence.

204
00:10:24,500 --> 00:10:27,500
Is that clear to everybody?

205
00:10:27,500 --> 00:10:29,900
OK, great.

206
00:10:29,900 --> 00:10:33,980
So we're going to talk about two
general approaches to assembly

207
00:10:33,980 --> 00:10:34,480
today.

208
00:10:34,480 --> 00:10:38,990
We're going to talk about
overlap layout consensus

209
00:10:38,990 --> 00:10:43,520
assemblers as exemplified
by a string graph assembler.

210
00:10:43,520 --> 00:10:45,820
And we're also going to
talk about De Bruijn graph

211
00:10:45,820 --> 00:10:48,020
assemblers today.

212
00:10:48,020 --> 00:10:52,850
Now, overlap layout
consensus assemblers

213
00:10:52,850 --> 00:10:55,760
were the first ones that
were used in the Human Genome

214
00:10:55,760 --> 00:10:58,940
Project because reads
were longer back then.

215
00:10:58,940 --> 00:11:02,330
However, as the number
of reads has increased,

216
00:11:02,330 --> 00:11:05,720
those assemblers are
more difficult to utilize

217
00:11:05,720 --> 00:11:09,525
in part because of the need to
find overlaps between reads,

218
00:11:09,525 --> 00:11:12,239
as we'll see in a moment.

219
00:11:12,239 --> 00:11:13,780
Whereas De Bruijn
graph assemblers

220
00:11:13,780 --> 00:11:15,910
are somewhat more efficient.

221
00:11:15,910 --> 00:11:18,750
But they lose certain
kinds of information.

222
00:11:18,750 --> 00:11:22,550
So let's begin with these
overlap layout consensus

223
00:11:22,550 --> 00:11:24,830
assemblers.

224
00:11:24,830 --> 00:11:30,720
And we're going to talk about
three steps to build contigs

225
00:11:30,720 --> 00:11:33,000
and the scaffolding
step can be thought

226
00:11:33,000 --> 00:11:36,490
of as similar between either
the overlap layout consensus

227
00:11:36,490 --> 00:11:38,735
assemblers or De Bruijn
graph-based assemblers.

228
00:11:42,770 --> 00:11:45,220
So we're going to first
build an overlap graph.

229
00:11:45,220 --> 00:11:47,420
What's an overlap graph?

230
00:11:47,420 --> 00:11:49,870
The essential idea
is that when we

231
00:11:49,870 --> 00:11:52,280
take our collection
of reads, we look

232
00:11:52,280 --> 00:11:55,060
for overlaps between
the suffix of one read

233
00:11:55,060 --> 00:11:58,130
and the prefix of another read.

234
00:11:58,130 --> 00:12:00,040
And if we think of
all of our reads,

235
00:12:00,040 --> 00:12:04,130
we want to build a graph that
describes all of such overlaps.

236
00:12:04,130 --> 00:12:06,090
And just to be
clear, I'm not going

237
00:12:06,090 --> 00:12:09,750
to be talking today about
the reverse complement

238
00:12:09,750 --> 00:12:11,580
of these reads.

239
00:12:11,580 --> 00:12:14,400
Actual assemblers have
to represent that.

240
00:12:14,400 --> 00:12:16,700
But it just duplicates
all the nodes and edges.

241
00:12:16,700 --> 00:12:18,241
So we're going to
try and keep things

242
00:12:18,241 --> 00:12:21,330
uncluttered by-- that's OK.

243
00:12:21,330 --> 00:12:23,030
Thank you.

244
00:12:23,030 --> 00:12:26,150
We're going to try and
keep things uncluttered

245
00:12:26,150 --> 00:12:28,520
by not considering those today.

246
00:12:32,250 --> 00:12:36,480
Now, one of the
challenges is how

247
00:12:36,480 --> 00:12:38,929
to construct those overlaps.

248
00:12:38,929 --> 00:12:40,970
And we're going to be
talking about graphs a lot.

249
00:12:40,970 --> 00:12:44,212
So I thought it was worthwhile
just to review terminology.

250
00:12:44,212 --> 00:12:46,670
We're going to represent overlap
graphs as directed graphs,

251
00:12:46,670 --> 00:12:48,980
which consists of a
set of vertices, which

252
00:12:48,980 --> 00:12:52,020
are the objects represented by
the circles, and the edges, which

253
00:12:52,020 --> 00:12:55,500
are the lines and a directed
edge goes from one vertex

254
00:12:55,500 --> 00:12:56,000
to another.

255
00:12:58,670 --> 00:13:02,460
And there's also an
equivalent representation

256
00:13:02,460 --> 00:13:06,540
in notational form on the lower
part of the right of the slide

257
00:13:06,540 --> 00:13:08,790
as well as a graphical
representation.

258
00:13:08,790 --> 00:13:11,330
We're going to be using the
graphical representations

259
00:13:11,330 --> 00:13:13,710
of these directed graphs today.

260
00:13:18,660 --> 00:13:23,070
So the overlap graph is
simply a representation

261
00:13:23,070 --> 00:13:25,520
of the overlap between reads.

262
00:13:25,520 --> 00:13:30,340
And we pick a minimum
length of overlap at times.

263
00:13:30,340 --> 00:13:34,640
But for the next few
slides, I'm simply

264
00:13:34,640 --> 00:13:39,780
going to represent each
node as an individual read.

265
00:13:39,780 --> 00:13:42,310
And the edges will be
annotated with the amount

266
00:13:42,310 --> 00:13:45,190
of overlap between the reads.

267
00:13:45,190 --> 00:13:48,660
So if I hand you a set of
reads, all we need to do

268
00:13:48,660 --> 00:13:51,130
is to compute this
overlap graph.

269
00:13:51,130 --> 00:13:53,930
We'll talk about how
to do that in a moment.

270
00:13:53,930 --> 00:13:58,500
And you'll see graphically
then what comes out

271
00:13:58,500 --> 00:14:00,520
of the process of computing
the overlap graph.

272
00:14:03,500 --> 00:14:08,700
Now, it's possible
that overlap graphs

273
00:14:08,700 --> 00:14:14,250
are cyclic because there
are circular chromosomes.

274
00:14:14,250 --> 00:14:18,020
And as we'll see, it's also
possible to get a cyclic graph

275
00:14:18,020 --> 00:14:21,600
out of a linear chromosome
if in fact there

276
00:14:21,600 --> 00:14:25,450
are repetitive structures
in the chromosome that

277
00:14:25,450 --> 00:14:28,130
cause a graph to
cycle back on itself.

278
00:14:30,940 --> 00:14:37,770
So how to find overlaps
in efficient time

279
00:14:37,770 --> 00:14:39,230
is a key problem.

280
00:14:39,230 --> 00:14:41,990
And that's one of the reasons
that people have shied away

281
00:14:41,990 --> 00:14:44,430
from using these
types of assemblers

282
00:14:44,430 --> 00:14:48,130
is because the cost
of computing overlaps

283
00:14:48,130 --> 00:14:50,630
has been thought to be N-squared
where N is the number of reads

284
00:14:50,630 --> 00:14:54,040
because you have to compare
all the reads to one another.

285
00:14:54,040 --> 00:14:58,490
However, a really
clever algorithm

286
00:14:58,490 --> 00:15:01,630
was devised that
used the technology

287
00:15:01,630 --> 00:15:04,060
we talked about last time.

288
00:15:04,060 --> 00:15:09,330
You recall the idea of the FM
index and Burrows-Wheeler

289
00:15:09,330 --> 00:15:16,240
transforms allowed us to index
a genome and then to look up

290
00:15:16,240 --> 00:15:21,110
reads in time proportional
to the length of the read.

291
00:15:21,110 --> 00:15:22,730
So here's the essential idea.

292
00:15:22,730 --> 00:15:26,540
What we're going to do is we're
going to take all of the reads

293
00:15:26,540 --> 00:15:28,480
that we collect.

294
00:15:28,480 --> 00:15:29,730
And we're going to index them.

295
00:15:32,890 --> 00:15:36,480
And we can do that
in roughly N log N time.

296
00:15:36,480 --> 00:15:39,640
And after we've indexed
all of the reads,

297
00:15:39,640 --> 00:15:42,497
then we can use that same
index to find overlaps very,

298
00:15:42,497 --> 00:15:43,205
very efficiently.

299
00:15:45,940 --> 00:15:49,580
And you can conceptualize this
as simply looking at a read

300
00:15:49,580 --> 00:15:52,930
that you have in your hand and
looking it up in the index.

301
00:15:52,930 --> 00:15:54,960
And you'll find all the
places that the suffix

302
00:15:54,960 --> 00:15:57,450
or prefix of that read matches.

303
00:15:57,450 --> 00:16:00,690
And you can trace back till you
find all the places it matches

304
00:16:00,690 --> 00:16:03,450
where they hit an end of a read.

305
00:16:03,450 --> 00:16:06,030
And those all correspond
to edges in the graph.

306
00:16:06,030 --> 00:16:08,640
And it turns out that
this is so clever

307
00:16:08,640 --> 00:16:12,450
that it eliminates
redundant edges.

308
00:16:12,450 --> 00:16:16,190
So, for example, if
I have reads that

309
00:16:16,190 --> 00:16:21,850
look like this where I have
read one overlaps with read

310
00:16:21,850 --> 00:16:27,050
two which overlaps
with read three.

311
00:16:27,050 --> 00:16:29,055
And read one and read
three also overlap.

312
00:16:32,003 --> 00:16:39,910
An unreduced graph would have
a representation like this.

313
00:16:39,910 --> 00:16:44,370
But it turns out
that we don't have

314
00:16:44,370 --> 00:16:50,630
to do that because we can
simply reduce our graph to this

315
00:16:50,630 --> 00:16:54,640
because we know that
read one and read three.

316
00:16:54,640 --> 00:16:56,700
Actually, this is
the graph that we

317
00:16:56,700 --> 00:16:59,300
would have that
would be unreduced.

318
00:16:59,300 --> 00:17:03,650
We can reduce the graph to
eliminate this transitive edge

319
00:17:03,650 --> 00:17:09,040
and simply represent
it in this fashion.

320
00:17:09,040 --> 00:17:11,960
So when we use these
indices, we eliminate

321
00:17:11,960 --> 00:17:14,595
these transitive edges
as we'll see momentarily.
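
A small sketch of the transitive-edge reduction just described, for the three-read case. The dictionary representation and the function name `reduce_transitive` are assumptions for illustration. This is a naive version that ignores overlap lengths, not the index-based method the lecture refers to.

```python
def reduce_transitive(edges):
    """Drop every edge a -> c for which some b gives a -> b and b -> c.

    `edges` maps each read name to the set of reads it overlaps into.
    """
    reduced = {a: set(targets) for a, targets in edges.items()}
    for a, targets in edges.items():
        for b in targets:
            for c in edges.get(b, ()):
                # a -> c is implied by a -> b -> c, so it is redundant.
                if c != b and c in targets:
                    reduced[a].discard(c)
    return reduced

# Read one overlaps read two, read two overlaps read three,
# and read one also overlaps read three (the transitive edge).
unreduced = {"read1": {"read2", "read3"}, "read2": {"read3"}, "read3": set()}
reduced = reduce_transitive(unreduced)
print(reduced)
```

After reduction, only the edges read one to read two and read two to read three remain.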

322
00:17:19,160 --> 00:17:22,260
So here's an example graph.

323
00:17:22,260 --> 00:17:25,599
The sequence is
shown on the bottom.

324
00:17:25,599 --> 00:17:30,660
The read lengths are
of length seven bases.

325
00:17:30,660 --> 00:17:36,810
And we're going to consider all
overlaps of minimum size three.

326
00:17:36,810 --> 00:17:39,440
And the edge label
is the actual length

327
00:17:39,440 --> 00:17:42,870
of the overlap
between the reads.
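
A minimal sketch of building such an overlap graph by brute force, assuming reads of length seven and a minimum overlap of three as in the example. The reads below are hypothetical (taken from a made-up sequence, not the one on the slide). The all-pairs loop is the N-squared approach mentioned earlier, not the indexed one.

```python
from itertools import permutations

def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no such overlap of at least `min_length` exists."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_length=3):
    """Naive all-pairs construction: edges[(a, b)] = overlap length."""
    edges = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_length)
        if olen > 0:
            edges[(a, b)] = olen
    return edges

# Hypothetical 7-base reads from the made-up sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = overlap_graph(reads, min_length=3)
for (a, b), olen in graph.items():
    print(a, "->", b, "overlap", olen)
```

Each node is a read and each directed edge is labeled with the overlap length. The edge from the first read to the third (overlap three) is a transitive edge of the kind discussed above.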

328
00:17:42,870 --> 00:17:46,660
And you can see
that at the outset

329
00:17:46,660 --> 00:17:50,310
that these overlap graphs
are not necessarily simple.

330
00:17:50,310 --> 00:17:52,430
That tracing a path
of the graph that

331
00:17:52,430 --> 00:17:57,030
represents the original
string is not completely

332
00:17:57,030 --> 00:17:58,690
and totally straightforward.

333
00:17:58,690 --> 00:18:03,970
So we need to come up with a
way to articulate our metrics

334
00:18:03,970 --> 00:18:08,060
for how to trace a path through the
graph to reconstruct a genome.

335
00:18:10,940 --> 00:18:15,310
And that comes to the
question of layout,

336
00:18:15,310 --> 00:18:19,760
which is how do we formulate
the problem of tracing

337
00:18:19,760 --> 00:18:25,240
a path through an overlap graph?

338
00:18:25,240 --> 00:18:29,050
So we'll first start with
the idea of the shortest

339
00:18:29,050 --> 00:18:32,170
common superstring.

340
00:18:32,170 --> 00:18:38,240
The shortest common
superstring of a set of strings S

341
00:18:38,240 --> 00:18:42,600
is the shortest string that
contains all the strings in S

342
00:18:42,600 --> 00:18:48,710
as substrings for a particular
length of substring.

343
00:18:48,710 --> 00:18:53,540
So, for example,
if we didn't have

344
00:18:53,540 --> 00:18:56,014
the constraint of
shortest, then just

345
00:18:56,014 --> 00:18:58,430
finding a string that contains
all the substrings is easy.

346
00:18:58,430 --> 00:19:01,540
You just put them all together.

347
00:19:01,540 --> 00:19:04,300
But if we want the
shortest, then we

348
00:19:04,300 --> 00:19:10,450
need to be more thoughtful
in terms of the way

349
00:19:10,450 --> 00:19:14,210
that we compute this
shortest common superstring.

350
00:19:14,210 --> 00:19:16,530
And here is an example of
the shortest common superstring

351
00:19:16,530 --> 00:19:22,700
for the substrings that I
have shown you up there.

352
00:19:22,700 --> 00:19:25,060
So one way to think about
the assembly problem

353
00:19:25,060 --> 00:19:28,480
is that we're trying to compute
the shortest common superstring

354
00:19:28,480 --> 00:19:31,950
of all the reads that we have.

355
00:19:31,950 --> 00:19:35,420
And that will be the most
efficient representation

356
00:19:35,420 --> 00:19:37,015
of those reads in
a linear sequence.

357
00:19:40,020 --> 00:19:47,350
Now, we can describe
this problem

358
00:19:47,350 --> 00:19:49,807
in terms of an overlap graph.

359
00:19:49,807 --> 00:19:51,390
And if you think
about the way that we

360
00:19:51,390 --> 00:19:55,390
would solve this in an overlap
graph, for the shortest string,

361
00:19:55,390 --> 00:19:58,680
we want the maximum
amount of overlap.

362
00:19:58,680 --> 00:20:02,800
So we want to trace a path
through the overlap graph that

363
00:20:02,800 --> 00:20:06,960
gives us the largest
amount of overlap,

364
00:20:06,960 --> 00:20:08,920
which gives us the
shortest string.

365
00:20:08,920 --> 00:20:10,000
Right?

366
00:20:10,000 --> 00:20:14,680
So if we simply
negate the overlaps,

367
00:20:14,680 --> 00:20:20,340
we want to minimize the
total cost of the graph.

368
00:20:20,340 --> 00:20:22,130
Now, it turns out
that this problem

369
00:20:22,130 --> 00:20:24,690
is known to be a very hard
computational problem.

370
00:20:24,690 --> 00:20:27,600
It's in the class of
something called NP-hard

371
00:20:27,600 --> 00:20:30,296
because it's known as the
traveling salesman problem.

372
00:20:30,296 --> 00:20:31,670
And when you think
about the fact

373
00:20:31,670 --> 00:20:34,180
that we're going to have
hundreds of millions of reads,

374
00:20:34,180 --> 00:20:36,870
this is not really
going to be tractable.

375
00:20:36,870 --> 00:20:39,740
If we got rid of the
weights, and we simply

376
00:20:39,740 --> 00:20:42,309
wanted to find a path
through the graph,

377
00:20:42,309 --> 00:20:44,100
that's called the
Hamiltonian Path problem.

378
00:20:44,100 --> 00:20:46,880
That's also NP-complete.

379
00:20:46,880 --> 00:20:50,020
So the shortest common
superstring is a way

380
00:20:50,020 --> 00:20:53,390
to think about assembling.

381
00:20:53,390 --> 00:20:57,070
But we can't really
necessarily optimize metrics

382
00:20:57,070 --> 00:20:59,640
because it's going
to be intractable.

383
00:20:59,640 --> 00:21:05,165
So think about ways of doing
this that are greedier.

384
00:21:05,165 --> 00:21:07,540
So here's an example of how
we would compute the shortest

385
00:21:07,540 --> 00:21:11,460
common superstring starting
with the first string.

386
00:21:11,460 --> 00:21:14,970
And each step along the
way, is a concatenation

387
00:21:14,970 --> 00:21:19,150
of strings or a
collapsing of strings that

388
00:21:19,150 --> 00:21:23,470
works towards building the
shortest common superstring.

389
00:21:23,470 --> 00:21:29,490
And we get the input string
and the output string.

390
00:21:29,490 --> 00:21:32,620
So we could articulate
our assembly problem

391
00:21:32,620 --> 00:21:37,380
as a greedy SCS algorithm
to try and put all the

392
00:21:37,380 --> 00:21:40,590
reads together to come
up with a superstring.
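
A minimal sketch of a greedy shortest-common-superstring heuristic of the kind just described: repeatedly merge the pair of strings with the largest overlap. The function names, the tie-breaking (first pair found), and the example reads are assumptions for illustration. The result is a common superstring but not necessarily the shortest one.

```python
from itertools import permutations

def overlap(a, b, min_length):
    """Longest suffix of `a` matching a prefix of `b`, or 0 if shorter
    than `min_length`."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_scs(reads, min_length=1):
    """Greedily merge the best-overlapping pair until no pair overlaps."""
    strings = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_pair = 0, None
        for a, b in permutations(strings, 2):
            olen = overlap(a, b, min_length)
            if olen > best_len:
                best_len, best_pair = olen, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[best_len:])  # collapse the shared bases
    # Any strings left without sufficient overlap are simply concatenated.
    return "".join(strings)

# Hypothetical reads; minimum overlap of three bases.
result = greedy_scs(["AACCT", "CCTGG", "TGGAA"], min_length=3)
print(result)  # AACCTGGAA
```

Every pass rescans all pairs, so this version is only suitable for toy inputs.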

393
00:21:40,590 --> 00:21:49,800
And let me just describe to you
this will give us an intuition

394
00:21:49,800 --> 00:21:52,960
into what goes wrong with
assembly in a moment.

395
00:21:52,960 --> 00:21:55,580
But we do know there are
some bounds on this--

396
00:21:55,580 --> 00:21:58,330
that if we actually did
the greedy algorithm, then

397
00:21:58,330 --> 00:22:01,940
the assembly that we got would
be only two and a half times

398
00:22:01,940 --> 00:22:05,970
longer than the true
shortest common superstring.

399
00:22:05,970 --> 00:22:08,340
That isn't really very
much comfort to us.

400
00:22:08,340 --> 00:22:10,590
So we're going to have to
come up with different, more

401
00:22:10,590 --> 00:22:12,714
heuristic ways of approaching
the assembly problem.

402
00:22:15,870 --> 00:22:17,710
Here is another example.

403
00:22:17,710 --> 00:22:20,480
Now, this is the one
that I want to show you

404
00:22:20,480 --> 00:22:23,790
where we start with
a string at the top

405
00:22:23,790 --> 00:22:27,670
where we're going to be looking
for minimum overlaps of three

406
00:22:27,670 --> 00:22:32,530
and these are reads of length six.

407
00:22:32,530 --> 00:22:36,360
And when we do this
greedy algorithm,

408
00:22:36,360 --> 00:22:40,620
we come up with a
string, which is shorter

409
00:22:40,620 --> 00:22:44,581
than the original beginning
string we started with.

410
00:22:44,581 --> 00:22:46,080
Can somebody see
what happened here?

411
00:22:46,080 --> 00:22:49,760
Why are we missing part
of the original string?

412
00:22:53,156 --> 00:22:53,656
Yes?

413
00:22:53,656 --> 00:22:55,239
AUDIENCE: The reads
were short enough.

414
00:22:55,239 --> 00:22:59,410
And they repeated enough
that we never found out

415
00:22:59,410 --> 00:23:02,510
that it was of the length
that it actually was.

416
00:23:02,510 --> 00:23:06,530
And so we just kind of
[INAUDIBLE] did it [INAUDIBLE].

417
00:23:06,530 --> 00:23:09,660
PROFESSOR: So the point
was that the reads were

418
00:23:09,660 --> 00:23:14,120
too short to be able to
unambiguously identify

419
00:23:14,120 --> 00:23:15,740
the number of repeats
of "long" that we

420
00:23:15,740 --> 00:23:18,860
had in the original sequence.

421
00:23:18,860 --> 00:23:20,090
That's absolutely correct.

422
00:23:20,090 --> 00:23:24,650
So we're not able to
disambiguate what was going on.
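
A minimal sketch of the greedy SCS idea on this example. It assumes the original string is "a_long_long_long_time" (the transcript does not spell it out, so that string is an assumption), with reads of length 6 and a minimum overlap of 3. The function names are illustrative, not from the lecture.

```python
from itertools import permutations


def overlap(a, b, min_length):
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_length)."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def pick_max_overlap(reads, min_length):
    """Return the first ordered pair of reads with the largest overlap."""
    best_a, best_b, best_len = None, None, 0
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_length)
        if olen > best_len:
            best_a, best_b, best_len = a, b, olen
    return best_a, best_b, best_len


def greedy_scs(reads, min_length):
    """Repeatedly merge the pair with the largest overlap until none remains."""
    reads = list(reads)
    a, b, olen = pick_max_overlap(reads, min_length)
    while olen > 0:
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[olen:])
        a, b, olen = pick_max_overlap(reads, min_length)
    return "".join(reads)


genome = "a_long_long_long_time"  # assumed example string
# All distinct substrings of length 6, in order of first appearance.
reads = list(dict.fromkeys(genome[i:i + 6] for i in range(len(genome) - 5)))
assembly = greedy_scs(reads, 3)
print(assembly)
```

With this tie-breaking order the sketch returns "a_long_long_time": every read is contained in it, but one copy of the repeat has been collapsed.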

423
00:23:24,650 --> 00:23:29,832
And perhaps if we went
back to our graph formalism

424
00:23:29,832 --> 00:23:31,290
we could solve this
problem, right?

425
00:23:31,290 --> 00:23:34,640
Because here we have our
graph and the overlaps

426
00:23:34,640 --> 00:23:38,410
are written in on the
edges as the number of

427
00:23:38,410 --> 00:23:40,844
bases that each one of
these reads overlaps.

428
00:23:40,844 --> 00:23:43,010
And all we need to do is
to trace through this graph

429
00:23:43,010 --> 00:23:45,450
to find the original string.

430
00:23:45,450 --> 00:23:50,360
So here is one
tracing, which gives

431
00:23:50,360 --> 00:23:53,980
a total overlap of 39, which
actually faithfully reproduces

432
00:23:53,980 --> 00:23:57,680
the original string, right?

433
00:23:57,680 --> 00:24:02,150
However, that's not
the best tracing.

434
00:24:02,150 --> 00:24:05,810
A better tracing through this
graph or path through the graph

435
00:24:05,810 --> 00:24:09,900
would be this, which
gives us more overlap

436
00:24:09,900 --> 00:24:11,580
and gives us a shorter string.

437
00:24:11,580 --> 00:24:13,480
But as we know, even
though it's better

438
00:24:13,480 --> 00:24:16,700
according to this metric,
it isn't really optimum

439
00:24:16,700 --> 00:24:19,560
because it gives us
the wrong answer.

440
00:24:19,560 --> 00:24:23,070
It's better but wrong.

441
00:24:23,070 --> 00:24:25,940
So we're going to have to take
into account other things when

442
00:24:25,940 --> 00:24:29,620
we do our assembly and
our tracing of this graph

443
00:24:29,620 --> 00:24:33,190
to be able to come up with
the best possible assembly.

444
00:24:35,730 --> 00:24:40,800
So if we increase
the read length

445
00:24:40,800 --> 00:24:44,450
as was pointed out to
span appropriately,

446
00:24:44,450 --> 00:24:49,820
we will be able to reconstruct
the original sequence.

447
00:24:49,820 --> 00:24:53,414
And the point of this
example is that we

448
00:24:53,414 --> 00:24:55,830
need to consider this when
we're thinking about recovering

449
00:24:55,830 --> 00:24:58,570
repeat structures in genomes.

450
00:24:58,570 --> 00:25:04,850
So if we don't have
long enough reads,

451
00:25:04,850 --> 00:25:09,620
in this case reads
of length 8, we're

452
00:25:09,620 --> 00:25:12,600
not going to be able to recover
the original repeat structure.

453
00:25:15,390 --> 00:25:20,270
And if we look at this,
repeats are really

454
00:25:20,270 --> 00:25:22,570
the bane of assemblers
in some sense.

455
00:25:22,570 --> 00:25:26,740
And as you know, roughly
50% of the human genome

456
00:25:26,740 --> 00:25:30,110
is repetitive content.

457
00:25:30,110 --> 00:25:34,620
So we need to be very, very
careful in terms of the way

458
00:25:34,620 --> 00:25:39,079
that we utilize reads to
be able to recover the best

459
00:25:39,079 --> 00:25:40,620
approximation of
our genome sequence.

460
00:25:44,140 --> 00:25:47,680
So here's another
example where we

461
00:25:47,680 --> 00:25:52,880
look at l is minimum
overlap length and k

462
00:25:52,880 --> 00:25:55,360
is the length of the reads.

463
00:25:55,360 --> 00:25:57,500
And you can see the
sequence that we're

464
00:25:57,500 --> 00:25:59,333
trying to recover-- It
was the best of times

465
00:25:59,333 --> 00:26:03,800
it was the worst of
times and the output

466
00:26:03,800 --> 00:26:07,680
from our greedy SCS assembler.

467
00:26:07,680 --> 00:26:10,620
And as you can see,
we need to get up

468
00:26:10,620 --> 00:26:15,080
to a read length of
13 characters for us

469
00:26:15,080 --> 00:26:18,475
to be able to properly assemble
that original sentence.

470
00:26:21,980 --> 00:26:25,390
So the essential message
here is that unless you

471
00:26:25,390 --> 00:26:29,860
have reads that are long
enough to span repeats,

472
00:26:29,860 --> 00:26:31,370
you're not going
to be able to recover

473
00:26:31,370 --> 00:26:35,480
the original sequence exactly.

474
00:26:35,480 --> 00:26:41,410
And this can be also thought
of in the following example.

475
00:26:41,410 --> 00:26:45,650
Imagine you have repeats
that are tandem repeats out

476
00:26:45,650 --> 00:26:46,931
at the end of a sequence.

477
00:26:46,931 --> 00:26:48,430
And we're using the
English language

478
00:26:48,430 --> 00:26:51,604
here because it's easier
to see than if I put up

479
00:26:51,604 --> 00:26:52,770
a bunch of genomic sequence.

480
00:26:52,770 --> 00:26:56,330
But, of course, the
principles are the same.

481
00:26:56,330 --> 00:26:58,110
You can see that
unless we have reads

482
00:26:58,110 --> 00:27:01,680
that actually are anchored
in unique sequence

483
00:27:01,680 --> 00:27:04,560
and span out towards
a repetitive sequence,

484
00:27:04,560 --> 00:27:08,535
we can't really tell
how many times the word

485
00:27:08,535 --> 00:27:09,285
bells is repeated.

486
00:27:12,000 --> 00:27:14,010
Another possibility
is that we can

487
00:27:14,010 --> 00:27:16,500
actually come in from both sides.

488
00:27:16,500 --> 00:27:20,450
And if we can anchor our reads
in unique sequence on both

489
00:27:20,450 --> 00:27:24,000
the left and the right side
of a repetitive element,

490
00:27:24,000 --> 00:27:26,940
then we can figure out how many
copies of something like bells

491
00:27:26,940 --> 00:27:28,740
is present.

492
00:27:28,740 --> 00:27:30,860
But in the absence of that,
we really can't do it.

493
00:27:30,860 --> 00:27:34,020
In fact, we wind up with a
structure that looks like this.

494
00:27:34,020 --> 00:27:39,900
We wind up with-- there
it is-- a structure where

495
00:27:39,900 --> 00:27:41,820
we have-- let's
just say that there

496
00:27:41,820 --> 00:27:45,300
are four different
stretches of genome

497
00:27:45,300 --> 00:27:47,430
in disparate parts
of chromosomes

498
00:27:47,430 --> 00:27:49,910
and a repeat sequence
in the middle.

499
00:27:49,910 --> 00:27:53,340
The blue parts of
the chromosomes

500
00:27:53,340 --> 00:27:54,700
are unique sequence.

501
00:27:54,700 --> 00:27:58,450
And the red parts are
repetitive sequences.

502
00:27:58,450 --> 00:28:02,770
What will happen is that if
the reads aren't long enough,

503
00:28:02,770 --> 00:28:06,580
we'll be able to find out in
each one of the four locations

504
00:28:06,580 --> 00:28:10,310
that we've gone from unique
sequence to repeat sequence.

505
00:28:10,310 --> 00:28:12,000
And then we will get
lost in the middle

506
00:28:12,000 --> 00:28:16,030
of this identical
repeated sequence.

507
00:28:16,030 --> 00:28:18,059
And then on the
right-hand side we'll

508
00:28:18,059 --> 00:28:20,100
once again transition back
from repeated sequence

509
00:28:20,100 --> 00:28:21,470
to unique sequence.

510
00:28:21,470 --> 00:28:24,250
But we won't know how to put
things together in the middle.

511
00:28:24,250 --> 00:28:24,750
Right?

512
00:28:24,750 --> 00:28:26,208
We won't be able
to figure out what

513
00:28:26,208 --> 00:28:31,150
the path is through these
repetitive elements.

514
00:28:31,150 --> 00:28:36,500
So that's the essential point
I'd like to make about repeats.

515
00:28:36,500 --> 00:28:39,330
And we can now turn to
the question of layout

516
00:28:39,330 --> 00:28:44,760
and how to process an overlap
graph towards making contigs.

517
00:28:44,760 --> 00:28:48,650
This is the actual layout graph.

518
00:28:48,650 --> 00:28:51,940
When we think about
that sentence up there.

519
00:28:51,940 --> 00:28:55,270
And we say the minimum
overlap length is four characters.

520
00:28:55,270 --> 00:28:58,750
And we have seven-character
reads out of the sequence.

521
00:28:58,750 --> 00:29:04,470
You can see it's a
pretty messy graph.

522
00:29:04,470 --> 00:29:09,220
If we clean up the graph by
removing the redundant edges,

523
00:29:09,220 --> 00:29:14,430
the edges like
this that span over

524
00:29:14,430 --> 00:29:17,450
reads and are implied
by other reads,

525
00:29:17,450 --> 00:29:20,690
we can remove edges that
are transitive over one

526
00:29:20,690 --> 00:29:22,210
read or two reads.

527
00:29:22,210 --> 00:29:28,440
Now, my presentation
is going to talk

528
00:29:28,440 --> 00:29:29,980
about how to remove these edges.

529
00:29:29,980 --> 00:29:31,960
However, as I said
at the outset,

530
00:29:31,960 --> 00:29:37,110
if you use the algorithm
by Simpson et al.,

531
00:29:37,110 --> 00:29:39,400
you actually don't generate
these transitive edges

532
00:29:39,400 --> 00:29:41,440
in the first place.

533
00:29:41,440 --> 00:29:43,760
But assuming that you
didn't use that algorithm

534
00:29:43,760 --> 00:29:45,410
and you did generate
them, you want

535
00:29:45,410 --> 00:29:49,480
to get rid of these
transitive edges like so.

536
00:29:49,480 --> 00:29:53,190
And it starts getting
somewhat simpler

537
00:29:53,190 --> 00:29:54,870
as you begin
simplifying the graph,

538
00:29:54,870 --> 00:29:58,030
removing these transitive edges.
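
A small sketch of transitive edge removal, assuming the overlap graph is given as a set of directed (source, target) pairs and contains no cycles. The function name and the toy node labels are made up for illustration. This is not the Simpson et al. method, which avoids creating these edges in the first place.

```python
def remove_transitive_edges(edges):
    """Drop every edge (u, w) that is implied by a two-step path u -> v -> w."""
    edges = set(edges)
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    implied = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops as the first step
        for w in successors.get(v, ()):
            if w != u and w != v and (u, w) in edges:
                implied.add((u, w))
    return edges - implied


# Toy graph: reads A, B, C, D in order, with edges that skip one or two reads.
toy_edges = {
    ("A", "B"), ("B", "C"), ("C", "D"),  # adjacent reads
    ("A", "C"), ("B", "D"),              # skip one read
    ("A", "D"),                          # skip two reads
}
reduced = remove_transitive_edges(toy_edges)
print(sorted(reduced))
```

On this toy graph a single pass leaves only the chain A, B, C, D, because each skipping edge is implied by two edges that are still present in the original set.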

539
00:29:58,030 --> 00:30:02,370
And then we can remove
edges that skip two nodes.

540
00:30:02,370 --> 00:30:05,210
So here's what happens after
you remove the single transitive

541
00:30:05,210 --> 00:30:06,140
edges in this graph.

542
00:30:06,140 --> 00:30:07,055
Yes?

543
00:30:07,055 --> 00:30:10,464
AUDIENCE: So it seems that the
transitive and verbal edges

544
00:30:10,464 --> 00:30:12,755
gave us a little bit more
information about the genome.

545
00:30:12,755 --> 00:30:18,124
Do we lose some useful
ordering principles by--

546
00:30:18,124 --> 00:30:20,040
PROFESSOR: They provide
redundant information.

547
00:30:20,040 --> 00:30:22,666
They don't really provide
any additional information.

548
00:30:22,666 --> 00:30:25,165
It's the same linear sequence
that's implied by those edges.

549
00:30:28,012 --> 00:30:28,845
Any other questions?

550
00:30:32,640 --> 00:30:37,820
So we can then remove
edges that span two nodes.

551
00:30:37,820 --> 00:30:41,450
And we get an even
simpler graph like this.

552
00:30:41,450 --> 00:30:43,900
Now this is beginning to look
more tractable because we

553
00:30:43,900 --> 00:30:48,270
can look at this and we
can output contigs that

554
00:30:48,270 --> 00:30:51,150
correspond to linear
portions of the graph, which

555
00:30:51,150 --> 00:30:53,370
should be linear sequence.

556
00:30:53,370 --> 00:31:00,590
And when we do that what we
wind up with are two contigs.

557
00:31:00,590 --> 00:31:03,400
And there's just a bit of a
problem in the middle, which

558
00:31:03,400 --> 00:31:06,980
is that we're unable to
resolve the bit in the middle

559
00:31:06,980 --> 00:31:10,350
and as a consequence,
we don't know what

560
00:31:10,350 --> 00:31:15,490
is the number of terms that
are in that original sentence

561
00:31:15,490 --> 00:31:17,550
because we didn't
have a read long

562
00:31:17,550 --> 00:31:21,670
enough to be able
to resolve that.

563
00:31:21,670 --> 00:31:24,390
The other problem
that we can have

564
00:31:24,390 --> 00:31:26,430
in doing this kind of
layout is that when

565
00:31:26,430 --> 00:31:30,120
there are portions of
the genome that occur

566
00:31:30,120 --> 00:31:34,200
or sequences in the genome that
occur multiple times, when we

567
00:31:34,200 --> 00:31:37,308
actually do this
layout, we may find

568
00:31:37,308 --> 00:31:38,807
that the portions
of the genome that

569
00:31:38,807 --> 00:31:42,990
occur in two disparate locations
line up with one another.

570
00:31:42,990 --> 00:31:45,470
And it may be that as you
exit the portion that's

571
00:31:45,470 --> 00:31:48,540
shared you get a
mismatched base.

572
00:31:48,540 --> 00:31:51,110
So that mismatch
could be because you

573
00:31:51,110 --> 00:31:53,400
have disparate parts of
the genome that actually

574
00:31:53,400 --> 00:31:54,944
have very similar sequence.

575
00:31:54,944 --> 00:31:56,360
Or it could be
that you had a read

576
00:31:56,360 --> 00:31:59,400
error at the end of your read.

577
00:31:59,400 --> 00:32:03,010
And it's difficult to tell the
two apart except by the amount

578
00:32:03,010 --> 00:32:05,190
of coverage that you have.

579
00:32:05,190 --> 00:32:06,690
We'll talk about
how to prune graphs

580
00:32:06,690 --> 00:32:09,930
like this in a few moments.

581
00:32:09,930 --> 00:32:14,420
But in any event, assuming
that we have pruned the graph,

582
00:32:14,420 --> 00:32:16,154
we have done our overlap.

583
00:32:16,154 --> 00:32:17,070
We've done our layout.

584
00:32:17,070 --> 00:32:19,950
We've found our paths through
the graph for our contigs.

585
00:32:19,950 --> 00:32:22,500
And then what we find
is that for each contig,

586
00:32:22,500 --> 00:32:24,810
we have many reads.

587
00:32:24,810 --> 00:32:26,990
And we're going to
take those reads.

588
00:32:26,990 --> 00:32:28,910
And we're going to look at them.

589
00:32:28,910 --> 00:32:30,950
And as you recall,
we could either

590
00:32:30,950 --> 00:32:34,100
have errors causing
disagreement among the reads.

591
00:32:34,100 --> 00:32:37,870
We could have allelic
differences between mom and dad

592
00:32:37,870 --> 00:32:41,270
causing those errors, well, not
really errors-- differences.

593
00:32:41,270 --> 00:32:43,810
And then we can take
a consensus to come up

594
00:32:43,810 --> 00:32:48,130
with what the haploid genome is.

595
00:32:48,130 --> 00:32:54,610
So that's the essential idea
of an overlap-layout-consensus

596
00:32:54,610 --> 00:32:55,480
assembler.

597
00:32:55,480 --> 00:32:58,590
We compute the overlap graph.

598
00:32:58,590 --> 00:33:01,170
During the layout phase we
actually simplify the graph.

599
00:33:01,170 --> 00:33:02,710
And we find paths through it.

600
00:33:02,710 --> 00:33:05,230
And during the consensus
phase, we take our reads,

601
00:33:05,230 --> 00:33:11,280
and we build a consensus
sequence of the genome.

602
00:33:11,280 --> 00:33:17,890
And as I said, this graph
building can be slow.

603
00:33:17,890 --> 00:33:19,746
Although, we'll
talk about how slow

604
00:33:19,746 --> 00:33:21,830
it is here in just a moment.

605
00:33:21,830 --> 00:33:25,440
And the challenge is that
modern sequencing data sets

606
00:33:25,440 --> 00:33:28,820
are hundreds of
millions of reads.

607
00:33:28,820 --> 00:33:33,730
So let's talk about a
contemporary overlap-based

608
00:33:33,730 --> 00:33:36,680
assembler-- something called
the String Graph Assembler (SGA),

609
00:33:36,680 --> 00:33:39,905
which is done over at
the Sanger in the UK.

610
00:33:39,905 --> 00:33:42,030
And there are three separate
steps it goes through.

611
00:33:42,030 --> 00:33:45,110
The first step is it
tries to correct reads.

612
00:33:45,110 --> 00:33:47,070
And the way it does
this is it actually

613
00:33:47,070 --> 00:33:50,340
looks at all the
k-mers that occur in

614
00:33:50,340 --> 00:33:53,470
reads-- it tries to find
sequences that are very, very

615
00:33:53,470 --> 00:33:57,090
rare and find sequences
that are nearby

616
00:33:57,090 --> 00:33:59,780
in sequence base
that aren't as rare.

617
00:33:59,780 --> 00:34:04,520
And it can correct bases that it
believes are sequencing errors.

618
00:34:04,520 --> 00:34:06,090
The next step is
assembly once it

619
00:34:06,090 --> 00:34:09,030
has taken all these
reads and corrected them.

620
00:34:09,030 --> 00:34:10,850
It indexes all the
reads as I suggested

621
00:34:10,850 --> 00:34:14,330
earlier using an FM index.

622
00:34:14,330 --> 00:34:18,790
And then it can find the overlap
from that FM index directly.

623
00:34:18,790 --> 00:34:22,170
And part of the assembly
process is throwing away

624
00:34:22,170 --> 00:34:23,639
duplicate reads
and throwing away

625
00:34:23,639 --> 00:34:26,250
reads that have
low quality scores.

626
00:34:26,250 --> 00:34:28,449
So that's the filtering step.

627
00:34:28,449 --> 00:34:34,860
It then has the set of
contigs that it has generated.

628
00:34:34,860 --> 00:34:36,500
And it does something
quite interesting

629
00:34:36,500 --> 00:34:39,429
to find the scaffolds, which is that it takes the contigs it's
it takes the contigs it's

630
00:34:39,429 --> 00:34:42,860
assembled in terms
of linear sequence.

631
00:34:42,860 --> 00:34:45,790
And it completely re-indexes
them once again using

632
00:34:45,790 --> 00:34:47,394
an FM index.

633
00:34:47,394 --> 00:34:49,643
And then it takes all the
reads that you started with.

634
00:34:49,643 --> 00:34:53,770
And it maps them back
onto the contigs.

635
00:34:53,770 --> 00:34:57,100
And by mapping the paired
reads back on to the contigs,

636
00:34:57,100 --> 00:35:00,640
it can actually figure
out what contigs

637
00:35:00,640 --> 00:35:03,890
should be formed into scaffolds
where there are holes that

638
00:35:03,890 --> 00:35:07,960
are bridged by
these longer reads.

639
00:35:07,960 --> 00:35:11,420
So it's using the FM index both for correction to find
both for correction to find out

640
00:35:11,420 --> 00:35:14,720
nearby k-mers for
assembly to find overlaps

641
00:35:14,720 --> 00:35:17,290
and for scaffolding to
put things together.

642
00:35:17,290 --> 00:35:22,119
And it does its indexing
three different times.

643
00:35:22,119 --> 00:35:24,160
And just to give you an
idea of how long it takes

644
00:35:24,160 --> 00:35:30,300
for a human-sized
genome, it's actually

645
00:35:30,300 --> 00:35:33,300
quite expensive in
terms of CPU time.

646
00:35:33,300 --> 00:35:36,150
It takes many days
of elapsed time

647
00:35:36,150 --> 00:35:41,610
to assemble an entire
human genome right now.

648
00:35:41,610 --> 00:35:45,860
And it's thousands of
CPU hours to actually put

649
00:35:45,860 --> 00:35:47,930
a genome together
starting from scratch.

650
00:35:50,590 --> 00:35:55,440
OK, so that's the essential idea
of an overlap-based assembler.

651
00:35:55,440 --> 00:35:58,780
Are there any questions at all
about overlap-based assemblers?

652
00:35:58,780 --> 00:35:59,643
Yeah?

653
00:35:59,643 --> 00:36:01,958
AUDIENCE: So in the
case of an error,

654
00:36:01,958 --> 00:36:05,210
it's obvious how
you would call that.

655
00:36:05,210 --> 00:36:08,284
But in an allelic difference,
hypothetically, there

656
00:36:08,284 --> 00:36:11,067
would be 50% of the reads would
have one and 50% of the reads

657
00:36:11,067 --> 00:36:11,430
would have another.

658
00:36:11,430 --> 00:36:11,914
PROFESSOR: That's correct.

659
00:36:11,914 --> 00:36:14,590
AUDIENCE: So in that case
does it assemble-- do you

660
00:36:14,590 --> 00:36:17,400
just bias towards whichever
ones weren't easily amplified?

661
00:36:17,400 --> 00:36:21,960
Or do you assemble
two sequences?

662
00:36:21,960 --> 00:36:24,540
PROFESSOR: Most assemblers
produce a single sequence.

663
00:36:24,540 --> 00:36:29,750
And I don't know how SGA decides
between the different alleles

664
00:36:29,750 --> 00:36:32,710
because I don't recall what
the paper said they did.

665
00:36:32,710 --> 00:36:34,240
But they have to
essentially flip

666
00:36:34,240 --> 00:36:37,860
a coin to come up with
a haploid sequence.

667
00:36:37,860 --> 00:36:38,735
Yes?

668
00:36:38,735 --> 00:36:40,675
AUDIENCE: You said there was
three different times that you

669
00:36:40,675 --> 00:36:41,175
index.

670
00:36:41,175 --> 00:36:44,046
What are the three?

671
00:36:44,046 --> 00:36:45,420
PROFESSOR: Yeah,
the question was

672
00:36:45,420 --> 00:36:47,580
I said there are three
different times they indexed.

673
00:36:47,580 --> 00:36:55,620
They indexed at the
outset to find errors.

674
00:36:55,620 --> 00:37:02,900
They indexed the second time
to do the overlap computation.

675
00:37:02,900 --> 00:37:06,070
And they indexed the
third time to realign

676
00:37:06,070 --> 00:37:09,740
all the original reads to the
contigs they have to figure out

677
00:37:09,740 --> 00:37:11,695
which contigs to put
together into scaffolds.

678
00:37:15,328 --> 00:37:16,260
Right?

679
00:37:16,260 --> 00:37:19,430
But they have this essential
foundational platform,

680
00:37:19,430 --> 00:37:21,490
which is the FM index.

681
00:37:21,490 --> 00:37:23,450
And so they use that
over and over again

682
00:37:23,450 --> 00:37:24,700
to be able to do the assembly.

683
00:37:28,045 --> 00:37:29,295
These are all great questions.

684
00:37:32,340 --> 00:37:35,295
All right, any other questions
about overlap-based assemblers?

685
00:37:39,080 --> 00:37:41,550
And you can see that if you
think about how much coverage

686
00:37:41,550 --> 00:37:43,800
they get out of an assembler
like this, it's actually,

687
00:37:43,800 --> 00:37:45,841
we'll compare all the
assemblers at the very end.

688
00:37:45,841 --> 00:37:50,730
But if you look at the
number of bases of autosomes

689
00:37:50,730 --> 00:37:57,000
and the X chromosome
covered by an assembly,

690
00:37:57,000 --> 00:37:59,950
you can consider
that as a function

691
00:37:59,950 --> 00:38:04,180
of the minimum alignment
length to a reference genome.

692
00:38:04,180 --> 00:38:07,830
And as the minimum
alignment length goes up,

693
00:38:07,830 --> 00:38:10,270
that means you have to match
longer and longer portions

694
00:38:10,270 --> 00:38:16,590
of the reference genome for
your assembly contig to count.

695
00:38:16,590 --> 00:38:19,126
You can see that the number
of bases drops somewhat.

696
00:38:19,126 --> 00:38:20,500
In here they're
showing that they

697
00:38:20,500 --> 00:38:24,630
do better than another
assembler called SOAPdenovo.

698
00:38:24,630 --> 00:38:27,930
But they do get a
fairly good coverage.

699
00:38:27,930 --> 00:38:31,250
On the other hand,
they don't get coverage

700
00:38:31,250 --> 00:38:34,950
anywhere near as good as
Lander-Waterman might suggest

701
00:38:34,950 --> 00:38:36,637
because the coverage
should suggest

702
00:38:36,637 --> 00:38:38,470
that the probability
of uncovered base using

703
00:38:38,470 --> 00:38:42,700
Lander-Waterman would be roughly
e to the minus 40th-- something

704
00:38:42,700 --> 00:38:43,200
like that.

705
00:38:43,200 --> 00:38:46,600
And e to the minus 40th is like
4 times 10 to the minus 18.

706
00:38:46,600 --> 00:38:48,754
So they're not
anywhere near what

707
00:38:48,754 --> 00:38:50,670
we would think the
Lander-Waterman bound would

708
00:38:50,670 --> 00:38:52,350
be for assembly.
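
The arithmetic quoted here can be checked directly. This assumes the simple Poisson model in which the probability that a base is left uncovered at coverage c is e to the minus c, and it assumes c = 40, as the quoted figure implies.

```python
import math

coverage = 40  # assumed fold coverage
p_uncovered = math.exp(-coverage)  # probability a given base has no read over it
print(f"{p_uncovered:.2e}")  # prints 4.25e-18
```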

709
00:38:55,690 --> 00:38:59,810
So we've talked about these
overlap-based assemblers.

710
00:38:59,810 --> 00:39:02,950
Now I'm going to turn to
De Bruijn graph assemblers.

711
00:39:02,950 --> 00:39:05,490
How many people have heard
of De Bruijn graphs before?

712
00:39:05,490 --> 00:39:07,560
Anybody?

713
00:39:07,560 --> 00:39:11,210
One person?

714
00:39:11,210 --> 00:39:15,860
So before we talk about De
Bruijn graphs themselves,

715
00:39:15,860 --> 00:39:17,640
let's just talk terminology.

716
00:39:17,640 --> 00:39:23,415
So that when I'm using
terms we're all

717
00:39:23,415 --> 00:39:27,550
on the same page, we're
talking about k-mers, where

718
00:39:27,550 --> 00:39:32,080
the word mer is from
the Greek "part."

719
00:39:32,080 --> 00:39:34,380
And we talk about 4-mers
of an original sequence

720
00:39:34,380 --> 00:39:37,900
as a sequence that's
four bases long.

721
00:39:37,900 --> 00:39:40,570
And we can think about
all of the 3-mers

722
00:39:40,570 --> 00:39:43,080
of an original sequence.

723
00:39:43,080 --> 00:39:45,790
So we talk a lot about k-mers.

724
00:39:45,790 --> 00:39:51,620
And a k minus 1-mer is a
substring of length k minus 1

725
00:39:51,620 --> 00:39:52,855
obviously from a k-mer.

726
00:39:55,560 --> 00:39:58,670
So if we think about the
collection of reads--

727
00:39:58,670 --> 00:40:03,790
here these are our
super-simple economy sequencers

728
00:40:03,790 --> 00:40:06,360
producing reads of only
length three, which

729
00:40:06,360 --> 00:40:07,300
is pretty desperate.

730
00:40:07,300 --> 00:40:09,670
But at any rate we'll go
with that for the time being.

731
00:40:09,670 --> 00:40:12,750
And we think about
each one of these reads

732
00:40:12,750 --> 00:40:19,040
as having a left k minus 1-mer
and a right k minus 1-mer.

733
00:40:19,040 --> 00:40:22,960
We split them into
two halves that way.

734
00:40:22,960 --> 00:40:33,320
And we're going to build a
graph that is as follows.

735
00:40:33,320 --> 00:40:36,420
We're going to take all of the
k minus 1-mers-- in this case

736
00:40:36,420 --> 00:40:38,030
the 2-mers.

737
00:40:38,030 --> 00:40:40,850
And for each read,
we're going to draw

738
00:40:40,850 --> 00:40:46,340
an edge between its left
2-mer and its right 2-mer.

739
00:40:46,340 --> 00:40:49,690
OK, once again, for
each read, these sort

740
00:40:49,690 --> 00:40:51,890
of anemic,
three-base-pair reads,

741
00:40:51,890 --> 00:40:54,140
we're going to draw an
edge between its left 2-mer

742
00:40:54,140 --> 00:40:55,000
and its right 2-mer.

743
00:40:55,000 --> 00:40:58,020
And they overlap in one base.

744
00:40:58,020 --> 00:41:02,100
So all of the graphs that are
De Bruijn graphs, the edges

745
00:41:02,100 --> 00:41:05,860
represent an
overlap of one base.

746
00:41:05,860 --> 00:41:07,330
OK?

747
00:41:07,330 --> 00:41:10,390
So if you look at the
graph at the bottom,

748
00:41:10,390 --> 00:41:15,140
that represents the
overlaps present

749
00:41:15,140 --> 00:41:17,650
in the original sequence.

750
00:41:17,650 --> 00:41:21,260
You note that we have
AA as one of the 2-mers.

751
00:41:21,260 --> 00:41:26,310
And its left half and right half
obviously overlap by one base.

752
00:41:26,310 --> 00:41:30,170
The triple-A read has
AA as its left 2-mer

753
00:41:30,170 --> 00:41:33,910
and AA as its right 2-mer--
they overlap at one base.

754
00:41:33,910 --> 00:41:38,330
And that's why we have that
circular edge from AA to itself.

755
00:41:38,330 --> 00:41:42,830
And the next edge
from AA to AB comes

756
00:41:42,830 --> 00:41:47,200
from the next
read-- the AAB read.

757
00:41:50,060 --> 00:41:56,640
So each edge then represents
an overlap of one base.

758
00:41:56,640 --> 00:41:59,330
And therefore, each
edge represents

759
00:41:59,330 --> 00:42:01,950
a unique k-mer sequence.

760
00:42:01,950 --> 00:42:04,690
So the way to think
about this graph

761
00:42:04,690 --> 00:42:08,970
is that all of the edges
represent the original reads.

762
00:42:08,970 --> 00:42:13,340
And we have represented the
k minus 1 words as the nodes.

763
00:42:13,340 --> 00:42:16,150
OK?
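
A minimal sketch of this construction, assuming reads of length three taken from the toy sequence AAABBBA. The function name is illustrative. Each k-mer contributes one edge from its left k minus 1-mer to its right k minus 1-mer. For a general k, the two k minus 1-mers of a k-mer overlap by k minus 2 bases, which is one base when k is 3.

```python
from collections import Counter


def de_bruijn_from_reads(reads, k):
    """Return a multigraph as a Counter mapping (left, right) node pairs to edge counts."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]  # left and right (k-1)-mers
            edges[(left, right)] += 1
    return edges


# The five 3-base reads of the assumed toy sequence AAABBBA.
reads = ["AAA", "AAB", "ABB", "BBB", "BBA"]
graph = de_bruijn_from_reads(reads, 3)
nodes = {node for edge in graph for node in edge}
for (left, right), count in sorted(graph.items()):
    print(left, "->", right, "x", count)

# Adding one more BBB read reuses the existing BB node and adds a second BB -> BB edge.
graph_extra = de_bruijn_from_reads(reads + ["BBB"], 3)
```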

764
00:42:16,150 --> 00:42:21,550
So we can take this graph
then and generalize this idea.

765
00:42:21,550 --> 00:42:27,440
And if we look at
how the graph changes

766
00:42:27,440 --> 00:42:29,850
as we add more
structure, here you

767
00:42:29,850 --> 00:42:31,840
see that we've added an extra B.

768
00:42:31,840 --> 00:42:35,530
And we get another edge in the
graph back to the same node.

769
00:42:35,530 --> 00:42:37,030
So when we're
building these graphs,

770
00:42:37,030 --> 00:42:40,245
if possible, we reuse a
node that already exists.

771
00:42:42,900 --> 00:42:46,320
Now the way to think
about coming back

772
00:42:46,320 --> 00:42:48,360
to the original sequence
is finding a path

773
00:42:48,360 --> 00:42:52,890
through this graph and emitting
sequence as we trace the path.

774
00:42:52,890 --> 00:42:54,680
And we would like
to have a path that

775
00:42:54,680 --> 00:42:58,130
traverses all of the nodes.

776
00:42:58,130 --> 00:43:01,180
And so we have some
definitions here,

777
00:43:01,180 --> 00:43:07,390
which is that a node is
balanced if its indegree equals

778
00:43:07,390 --> 00:43:09,400
its outdegree.

779
00:43:09,400 --> 00:43:13,960
And you can see that not all
the nodes are balanced down

780
00:43:13,960 --> 00:43:16,660
the graph of the lower,
right-hand corner.

781
00:43:16,660 --> 00:43:18,720
And it's connected
if all the components

782
00:43:18,720 --> 00:43:20,970
or nodes can be reached.

783
00:43:20,970 --> 00:43:25,650
And an Eulerian walk visits
each edge exactly once,

784
00:43:25,650 --> 00:43:30,690
which is what we would like to
actually take a De Bruijn graph

785
00:43:30,690 --> 00:43:34,750
and emit a genome sequence.

786
00:43:34,750 --> 00:43:37,130
Now, not all graphs
have these walks.

787
00:43:40,010 --> 00:43:42,290
And graphs that do are Eulerian.

788
00:43:42,290 --> 00:43:47,520
And we won't distinguish
different types

789
00:43:47,520 --> 00:43:50,430
of these graphs.

790
00:43:50,430 --> 00:43:54,379
And if a graph has two
semi-balanced nodes

791
00:43:54,379 --> 00:43:56,170
and all the rest of
the nodes are balanced,

792
00:43:56,170 --> 00:43:59,650
then it will have
a walk through it.

793
00:43:59,650 --> 00:44:04,990
So if we think about
our original graph,

794
00:44:04,990 --> 00:44:09,360
there are two arguments
for it having such a walk.

795
00:44:09,360 --> 00:44:14,120
The first argument is
that we show the walk.

796
00:44:14,120 --> 00:44:18,730
And the second is that we
have two semi-balanced nodes

797
00:44:18,730 --> 00:44:20,355
and the rest of the
nodes are balanced.
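
A sketch of both arguments in code, assuming the toy sequence AAABBBBA with k = 3. A node is taken to be semi-balanced when its indegree and outdegree differ by exactly one. The walk is found with Hierholzer's algorithm, which is a standard choice here and not something the lecture names. All function names are illustrative.

```python
from collections import defaultdict


def de_bruijn_edges(sequence, k):
    """One edge per k-mer: left (k-1)-mer -> right (k-1)-mer."""
    return [(sequence[i:i + k - 1], sequence[i + 1:i + k])
            for i in range(len(sequence) - k + 1)]


def eulerian_walk(edges):
    """Return a list of nodes that uses every edge exactly once, or None if no such walk exists."""
    out_edges = defaultdict(list)
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_edges[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(in_deg) | set(out_deg)

    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    unbalanced = [n for n in nodes if in_deg[n] != out_deg[n]]
    # Either every node is balanced, or exactly two nodes are semi-balanced.
    if unbalanced and not (len(starts) == 1 and len(ends) == 1 and len(unbalanced) == 2):
        return None

    start = starts[0] if starts else edges[0][0]
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    # If some edges were never reached, the graph was not connected.
    return walk if len(walk) == len(edges) + 1 else None


def spell(walk):
    """Emit sequence along the walk: first node, then the last base of each later node."""
    return walk[0] + "".join(node[-1] for node in walk[1:])


edges = de_bruijn_edges("AAABBBBA", 3)
walk = eulerian_walk(edges)
print(spell(walk))
```

Here AA and BA are the two semi-balanced nodes, so the walk starts at AA and ends at BA, and spelling it out gives back AAABBBBA.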

798
00:44:23,620 --> 00:44:26,340
So the reason that
we care about this

799
00:44:26,340 --> 00:44:30,100
is that we want to study
cases where this goes wrong.

800
00:44:33,760 --> 00:44:37,800
So to build a De Bruijn
graph of a genome,

801
00:44:37,800 --> 00:44:42,120
we're going to take our
original sequence reads.

802
00:44:42,120 --> 00:44:45,570
And we're going to take
all the k-mers that

803
00:44:45,570 --> 00:44:48,570
occur in those reads.

804
00:44:48,570 --> 00:44:53,230
And we're going to add
edges to a De Bruijn graph

805
00:44:53,230 --> 00:44:54,310
based upon those k-mers.

806
00:44:54,310 --> 00:45:04,530
So if we have a read like
this, and we consider

807
00:45:04,530 --> 00:45:07,660
a k-mer in the read,
we're going to add an edge

808
00:45:07,660 --> 00:45:11,070
in the graph between
the left k minus 1-mer

809
00:45:11,070 --> 00:45:14,250
and the right k minus 1-mer.

810
00:45:14,250 --> 00:45:18,440
And we'll do that for every
single k-mer in the read.

811
00:45:18,440 --> 00:45:22,230
Now note what this does is
it destroys some information.

812
00:45:22,230 --> 00:45:26,230
It destroys information
about the ordering

813
00:45:26,230 --> 00:45:30,230
of certain of the k-mers in this
read, destroying their read

814
00:45:30,230 --> 00:45:34,400
contiguity in order to make
some simplifying assumptions

815
00:45:34,400 --> 00:45:39,360
to represent the
sequence ordering

816
00:45:39,360 --> 00:45:43,940
of these k minus
1-mers in the graph.

817
00:45:43,940 --> 00:45:48,910
So we build the
graph in this way

818
00:45:48,910 --> 00:45:56,120
and if I were to build
the graph like this,

819
00:45:56,120 --> 00:45:59,370
what is the minimum
sequence overlap for two

820
00:45:59,370 --> 00:46:02,420
reads to actually share an
edge in the resulting graph?

821
00:46:05,200 --> 00:46:08,720
Can anybody see how
long the sequence

822
00:46:08,720 --> 00:46:10,210
must be in the
second read for it

823
00:46:10,210 --> 00:46:13,270
to actually share an
edge with the first read?

824
00:46:20,570 --> 00:46:23,000
Well, if this second read
also has a k-mer, right?

825
00:46:25,684 --> 00:46:28,100
It's going to produce another
structure just like this one

826
00:46:28,100 --> 00:46:30,480
if these two do overlap.

827
00:46:30,480 --> 00:46:34,180
And thus the edge produced
by this read and the edge

828
00:46:34,180 --> 00:46:39,520
by this read will
overlap like this.

829
00:46:39,520 --> 00:46:47,750
And thus all of the nodes that
came from this part of read one

830
00:46:47,750 --> 00:46:49,470
will feed into this graph.

831
00:46:49,470 --> 00:46:51,110
And then all the
nodes that come out

832
00:46:51,110 --> 00:46:54,640
of this k-mer from the
purple read will come out

833
00:46:54,640 --> 00:46:57,030
of it like so, right?

834
00:46:57,030 --> 00:46:59,320
And thus when we're
tracing the graph,

835
00:46:59,320 --> 00:47:01,410
the idea is that the
graph will be connected.

836
00:47:01,410 --> 00:47:03,500
And we'll be able to
go between these reads

837
00:47:03,500 --> 00:47:05,912
and reconstruct
the sequence that

838
00:47:05,912 --> 00:47:07,120
was suggested by the overlap.

839
00:47:10,360 --> 00:47:15,120
The thing, however, you should
note in this-- yes, question?

840
00:47:15,120 --> 00:47:18,960
AUDIENCE: So you're
picking two k minus 1

841
00:47:18,960 --> 00:47:22,970
reads there-- are those
from different reads?

842
00:47:22,970 --> 00:47:24,464
Or from the white read?

843
00:47:24,464 --> 00:47:26,130
PROFESSOR: No, it's
from the white read.

844
00:47:26,130 --> 00:47:30,550
These are the two k minus 1-mers
that came out of this read.

845
00:47:30,550 --> 00:47:32,185
So they actually overlap.

846
00:47:32,185 --> 00:47:34,560
AUDIENCE: Yeah,
but then you were

847
00:47:34,560 --> 00:47:38,401
talking about how the one
was purple in that case.

848
00:47:38,401 --> 00:47:40,900
PROFESSOR: Right, well, this
is the same sequence let's say.

849
00:47:40,900 --> 00:47:44,420
This is the same, exact
sequence down here.

850
00:47:44,420 --> 00:47:47,990
So if it's the same,
exact sequence,

851
00:47:47,990 --> 00:47:50,930
it will have the
same k minus 1-mers.

852
00:47:50,930 --> 00:47:54,230
And when we build the graph
if a node already exists,

853
00:47:54,230 --> 00:47:55,100
we reuse it.

854
00:47:58,430 --> 00:48:01,280
And thus if we
reuse the nodes that

855
00:48:01,280 --> 00:48:04,850
were created when we
built the graph nodes

856
00:48:04,850 --> 00:48:09,450
and edges for the white read,
then when the purple read comes

857
00:48:09,450 --> 00:48:11,700
along, we're going to
put another edge here

858
00:48:11,700 --> 00:48:14,370
between these two k minus 1-mers
because they are contained here

859
00:48:14,370 --> 00:48:16,594
as well.

860
00:48:16,594 --> 00:48:18,260
So these are identical
sequences to this

861
00:48:18,260 --> 00:48:21,430
because these two reads overlap.

862
00:48:21,430 --> 00:48:23,800
And this part is the same
sequence as that part.

863
00:48:23,800 --> 00:48:26,280
AUDIENCE: Yeah, so why do
you need k minus 1-mers

864
00:48:26,280 --> 00:48:30,250
if you have overlapped k?

865
00:48:30,250 --> 00:48:31,990
PROFESSOR: Because
the way we're finding

866
00:48:31,990 --> 00:48:35,260
these overlaps is
through the graph.

867
00:48:35,260 --> 00:48:39,654
And we're not indexing
things of size k, right?

868
00:48:39,654 --> 00:48:41,320
We're indexing things
of size k minus 1.

869
00:48:47,130 --> 00:48:55,502
And each edge represents
a sequence of length k

870
00:48:55,502 --> 00:48:57,460
because we know this
sequence and this sequence

871
00:48:57,460 --> 00:48:59,320
are overlapped by one base.

872
00:49:02,150 --> 00:49:04,940
So when we find an edge that's
the same between the white

873
00:49:04,940 --> 00:49:07,150
and the purple read,
we know that they're

874
00:49:07,150 --> 00:49:08,370
overlapping by k bases.

875
00:49:11,040 --> 00:49:13,019
Is that making sense to you?

876
00:49:13,019 --> 00:49:13,560
AUDIENCE: No.

877
00:49:13,560 --> 00:49:17,928
PROFESSOR: No, OK, so
let's try it again.

878
00:49:17,928 --> 00:49:19,136
AUDIENCE: You can keep going.

879
00:49:19,136 --> 00:49:20,094
PROFESSOR: No, it's OK.

880
00:49:24,140 --> 00:49:27,580
Let's just start with the purple
read for a moment

881
00:49:27,580 --> 00:49:29,570
because I think if
you have a question,

882
00:49:29,570 --> 00:49:31,500
other people may
have a question.

883
00:49:31,500 --> 00:49:38,220
So we have this sequence, which
is this sequence right here,

884
00:49:38,220 --> 00:49:39,050
right?

885
00:49:39,050 --> 00:49:41,600
And then we have
this sequence, which

886
00:49:41,600 --> 00:49:44,070
is the sequence right here.

887
00:49:44,070 --> 00:49:47,170
They overlap by one base.

888
00:49:47,170 --> 00:49:50,100
And so we put an edge between
them like this in the graph.

889
00:49:50,100 --> 00:49:50,600
OK?

890
00:49:53,380 --> 00:49:56,235
AUDIENCE: Don't they overlap
by more than one base?

891
00:49:56,235 --> 00:50:00,805
They can only contain
one base from each k-mer.

892
00:50:00,805 --> 00:50:01,680
PROFESSOR: I'm sorry.

893
00:50:01,680 --> 00:50:02,665
That's what I meant.

894
00:50:02,665 --> 00:50:03,165
Yeah.

895
00:50:06,210 --> 00:50:12,070
And then the same thing
is true down here.

896
00:50:15,730 --> 00:50:20,969
And so we will find this k minus
1-mer and this k minus 1-mer.

897
00:50:20,969 --> 00:50:21,885
And then they overlap.

898
00:50:39,350 --> 00:50:46,920
For genome assembly, we
record the forward and reverse

899
00:50:46,920 --> 00:50:48,980
complement reads in twin nodes.

900
00:50:48,980 --> 00:50:51,920
And we're not going to
show those because it just

901
00:50:51,920 --> 00:50:53,540
complicates our
graphs without really

902
00:50:53,540 --> 00:50:56,690
adding any illustrative power.

903
00:50:56,690 --> 00:50:59,854
And we always choose k to
be odd so that a node can't

904
00:50:59,854 --> 00:51:01,145
be its own reverse complement.

905
00:51:07,210 --> 00:51:19,200
And here is the graph growing
if we think about k equals 5.

906
00:51:19,200 --> 00:51:21,360
So we have reads of length five.

907
00:51:21,360 --> 00:51:25,910
And we are adding
sequences to the graph.

908
00:51:25,910 --> 00:51:28,290
And you note that
the graph is acyclic

909
00:51:28,290 --> 00:51:30,390
until we get to the
repeated sequence.

910
00:51:30,390 --> 00:51:34,000
And when we get to the second "long,"
the sequence comes back around

911
00:51:34,000 --> 00:51:37,060
and begins looping back on itself.

912
00:51:37,060 --> 00:51:42,660
And if we consider the last
part of this De Bruijn graph

913
00:51:42,660 --> 00:51:47,170
construction, then we wind
up with the finished graph

914
00:51:47,170 --> 00:51:48,860
on the right-hand side.

915
00:51:48,860 --> 00:51:51,930
And you can see the
multiplicity of the edges

916
00:51:51,930 --> 00:51:53,640
correspond to the
number of times

917
00:51:53,640 --> 00:51:55,665
"long" is repeated
in this sequence.

918
00:51:58,550 --> 00:52:03,370
So once again, repeats are
causing the circular structure,

919
00:52:03,370 --> 00:52:06,460
which only could be resolved if
we had sufficiently long reads,

920
00:52:06,460 --> 00:52:08,465
which we don't have in
this particular case.

921
00:52:12,830 --> 00:52:15,920
However, if we consider
perfect sequencing

922
00:52:15,920 --> 00:52:18,800
we always have a
path through the graph.

923
00:52:18,800 --> 00:52:25,750
And the reason is
that the leftmost part

924
00:52:25,750 --> 00:52:31,040
of the genome, so to speak,
is going to be semi-balanced.

925
00:52:31,040 --> 00:52:33,370
And the rightmost part is
going to be semi-balanced.

926
00:52:33,370 --> 00:52:38,350
And all the parts in between
are going to be balanced.

927
00:52:38,350 --> 00:52:42,190
So the k minus 1-mer on the
very left end is semi-balanced

928
00:52:42,190 --> 00:52:45,650
and the k minus 1-mer on
the right is semi-balanced.

929
00:52:45,650 --> 00:52:50,220
And all the nodes in
between are balanced.

930
00:52:50,220 --> 00:52:55,470
Now, this does not allow
for errors of course.

931
00:52:55,470 --> 00:53:04,320
And we talk about following
this Eulerian walk

932
00:53:04,320 --> 00:53:06,349
to find the original sequence.
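
A minimal sketch of the balance check and the Eulerian walk, assuming error-free data and a single linear genome. The toy sequence `a_long_long_long_time` with k equals 5 is an assumed example chosen to match the repeated "long" discussed above, and the helper names are hypothetical. The walk uses Hierholzer's algorithm, which is one standard way to find an Eulerian walk and is not necessarily what any particular assembler does.

```python
from collections import defaultdict


def edges_from_sequence(seq, k):
    """One (left, right) pair of (k-1)-mers per k-mer in seq."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k])
            for i in range(len(seq) - k + 1)]


def semi_balanced_nodes(edges):
    """Nodes whose indegree and outdegree differ by exactly one."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for left, right in edges:
        outdeg[left] += 1
        indeg[right] += 1
    nodes = set(indeg) | set(outdeg)
    return {n for n in nodes if abs(indeg[n] - outdeg[n]) == 1}


def eulerian_walk(edges):
    """Hierholzer's algorithm: visit every edge exactly once."""
    graph = defaultdict(list)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for left, right in edges:
        graph[left].append(right)
        outdeg[left] += 1
        indeg[right] += 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n in list(outdeg) if outdeg[n] - indeg[n] == 1),
                 edges[0][0])
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())         # dead end: emit the node
    walk.reverse()
    return walk


def spell(walk):
    """Concatenate the first node with the last base of each later node."""
    return walk[0] + "".join(node[-1] for node in walk[1:])


genome = "a_long_long_long_time"   # assumed toy sequence
edges = edges_from_sequence(genome, k=5)
ends = semi_balanced_nodes(edges)
rebuilt = spell(eulerian_walk(edges))
```

Here `ends` holds only the leftmost and rightmost k minus 1-mers, and `rebuilt` equals the toy sequence because every Eulerian walk through this particular graph spells the same string.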

933
00:53:06,349 --> 00:53:08,640
But the question we can ask
ourselves is whether or not

934
00:53:08,640 --> 00:53:10,430
this walk always
really corresponds

935
00:53:10,430 --> 00:53:13,250
to the original genome sequence.

936
00:53:13,250 --> 00:53:16,010
It turns out I can show
you this example, which

937
00:53:16,010 --> 00:53:20,200
is we have this graph
for this sequence.

938
00:53:20,200 --> 00:53:25,040
And there are two different
walks through this graph.

939
00:53:25,040 --> 00:53:31,070
And the two different
walks produce

940
00:53:31,070 --> 00:53:32,630
two different sequences.

941
00:53:32,630 --> 00:53:35,080
And they depend
upon which way you

942
00:53:35,080 --> 00:53:39,580
start walking from the node AB.

943
00:53:39,580 --> 00:53:43,620
So once again, here we
have seen that even when

944
00:53:43,620 --> 00:53:48,480
we have a path through the graph,
the path may not be unique.

945
00:53:48,480 --> 00:53:51,290
It may not be able to
generate the original sequence

946
00:53:51,290 --> 00:53:52,170
that we started with.
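
A small check of this ambiguity, assuming k equals 3 and a toy sequence in which the node AB occurs three times. The two strings below are assumed examples, not necessarily the ones on the slide. They differ only in the order of the two loops that leave AB, so they produce exactly the same multiset of edges.

```python
from collections import Counter


def edge_multiset(seq, k):
    """Count each (left, right) pair of (k-1)-mers seen in seq."""
    return Counter((seq[i:i + k - 1], seq[i + 1:i + k])
                   for i in range(len(seq) - k + 1))


first = "ZABCDABEFABY"    # walks the CD loop first, then the EF loop
second = "ZABEFABCDABY"   # walks the EF loop first, then the CD loop

same_graph = edge_multiset(first, 3) == edge_multiset(second, 3)
```

Because both sequences give an identical graph, no tracing rule that looks only at the graph can tell which one was the original.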

947
00:53:55,750 --> 00:54:01,670
So the other problem
we can have when

948
00:54:01,670 --> 00:54:04,960
we are building a graph like
this is that gaps in coverage

949
00:54:04,960 --> 00:54:07,810
can create holes in the graph.

950
00:54:07,810 --> 00:54:11,970
So if we omit
certain of our reads,

951
00:54:11,970 --> 00:54:16,640
we'll come up with a graph
that is broken into two parts.

952
00:54:16,640 --> 00:54:18,660
And this corresponds
to the idea that we're

953
00:54:18,660 --> 00:54:22,660
going to create two different
contigs that are contiguous

954
00:54:22,660 --> 00:54:25,720
sequence but will be unable
to fill in the middle part.

955
00:54:25,720 --> 00:54:26,220
OK?

956
00:54:29,370 --> 00:54:39,140
So we also can have differences
in coverage of a graph

957
00:54:39,140 --> 00:54:44,480
when we have extra reads
at particular locations

958
00:54:44,480 --> 00:54:45,710
in the genome.

959
00:54:45,710 --> 00:54:51,030
And that causes the degrees on
the individual nodes to vary

960
00:54:51,030 --> 00:54:56,560
and causes us to not be able
to rely upon the indegree

961
00:54:56,560 --> 00:55:00,629
and outdegree as an
absolute metric for how

962
00:55:00,629 --> 00:55:02,045
to trace a path
through the graph.

963
00:55:07,470 --> 00:55:12,544
And the other thing is that
if you have differences

964
00:55:12,544 --> 00:55:14,460
between the chromosomes,
which we talked about

965
00:55:14,460 --> 00:55:20,450
last time in our overlap
layout consensus assembler,

966
00:55:20,450 --> 00:55:25,150
it also can cause
graphs to split apart

967
00:55:25,150 --> 00:55:27,120
and to have subgraphs
that correspond

968
00:55:27,120 --> 00:55:31,020
to one allele versus the
other allele, which is present

969
00:55:31,020 --> 00:55:33,760
perhaps in the main graph.

970
00:55:33,760 --> 00:55:40,000
All right, so it's
actually the case

971
00:55:40,000 --> 00:55:45,710
that these graphs are attractive
for a very important reason,

972
00:55:45,710 --> 00:55:48,180
which is they're extraordinarily
efficient to build.

973
00:55:48,180 --> 00:55:51,600
That is in order to
build a graph like this,

974
00:55:51,600 --> 00:55:54,420
you need to take each one
of these k minus 1-mers

975
00:55:54,420 --> 00:55:57,517
and actually find the node,
which you can do by hashing

976
00:55:57,517 --> 00:55:59,100
and then put the
edges into the graph.

977
00:55:59,100 --> 00:56:02,870
And so you find that you need
to put in an edge and two nodes

978
00:56:02,870 --> 00:56:05,210
for each k-mer.

979
00:56:05,210 --> 00:56:08,810
And if you have a hash map that
encoded these nodes and edges,

980
00:56:08,810 --> 00:56:11,460
it's constant time work.

981
00:56:11,460 --> 00:56:14,660
So you wind up
with a graph which

982
00:56:14,660 --> 00:56:18,330
costs order of the
number of reads to build.

983
00:56:18,330 --> 00:56:21,620
So it's a linear time
graph construction problem.

984
00:56:21,620 --> 00:56:26,510
Recall that our last
overlap construction,

985
00:56:26,510 --> 00:56:30,130
we thought we could
get down to N log N.

986
00:56:30,130 --> 00:56:34,430
And here is an example of
sub-setting part of the lambda

987
00:56:34,430 --> 00:56:38,840
phage genome using a De
Bruijn graph assembler.

988
00:56:38,840 --> 00:56:40,850
And you can see that
roughly the time required

989
00:56:40,850 --> 00:56:43,085
to assemble parts
of the genome is

990
00:56:43,085 --> 00:56:45,460
linear in the amount of genome
sequence that you give it.

991
00:56:50,460 --> 00:56:59,290
So these assemblers were
favored early on in the days

992
00:56:59,290 --> 00:57:03,040
of short-read assembly in part
because they were so efficient.

993
00:57:03,040 --> 00:57:05,395
And typically in
some of the projects,

994
00:57:05,395 --> 00:57:06,747
you have very high coverage.

995
00:57:06,747 --> 00:57:08,580
And so you wind up with
graphs that actually

996
00:57:08,580 --> 00:57:11,980
have a huge number of
edges between nodes.

997
00:57:11,980 --> 00:57:15,300
And this can be
summarized in terms

998
00:57:15,300 --> 00:57:17,530
of a graph that simply
annotates the edges

999
00:57:17,530 --> 00:57:21,750
with the number of instances.

1000
00:57:21,750 --> 00:57:25,920
And so you have a weighted graph
on the right-hand side, which

1001
00:57:25,920 --> 00:57:31,660
is easier in some sense to
trace because we can now

1002
00:57:31,660 --> 00:57:36,745
begin to eliminate low-coverage
edges as potential anomalies.
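
A minimal sketch of the weighted representation, assuming each edge is keyed by its pair of k minus 1-mers and annotated with the number of k-mer occurrences that produced it. The threshold `min_count` and the toy reads are hypothetical. Real assemblers choose their cutoffs with their own heuristics.

```python
from collections import Counter


def weighted_edges(reads, k):
    """Map (left, right) (k-1)-mer pairs to their occurrence counts."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[(kmer[:-1], kmer[1:])] += 1  # constant-time hash update
    return counts


def drop_low_coverage(counts, min_count):
    """Keep only edges supported by at least min_count k-mers."""
    return {edge: n for edge, n in counts.items() if n >= min_count}


# Three copies of a correct read plus one read with an assumed error.
reads = ["ACGTA", "ACGTA", "ACGTA", "ACGAA"]
counts = weighted_edges(reads, k=4)
kept = drop_low_coverage(counts, min_count=2)
```

The work is one hash update per k-mer, so construction time grows linearly with the amount of input sequence, and the edges created only by the erroneous read are the ones that fall below the cutoff.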

1003
00:57:40,740 --> 00:57:43,780
But the essential idea is to
trace these graphs to produce

1004
00:57:43,780 --> 00:57:45,840
the ultimate genome sequence.

1005
00:57:45,840 --> 00:57:48,480
And in order to
do so, we may need

1006
00:57:48,480 --> 00:57:51,550
to do some error correction.

1007
00:57:51,550 --> 00:57:56,570
So we talked earlier about the
idea that if we have an error,

1008
00:57:56,570 --> 00:58:00,060
we're going to actually produce
a portion of the graph that

1009
00:58:00,060 --> 00:58:03,540
hangs off into outer space.

1010
00:58:03,540 --> 00:58:07,910
And we can cut these
dead-end tips of the graph

1011
00:58:07,910 --> 00:58:13,060
off if they are low coverage
because they presumably

1012
00:58:13,060 --> 00:58:15,740
correspond to errors.

1013
00:58:15,740 --> 00:58:19,540
If we get an error in
the middle of a read,

1014
00:58:19,540 --> 00:58:22,860
we can wind up with a so-called
bubble in the graph, which

1015
00:58:22,860 --> 00:58:25,300
once again is low coverage.

1016
00:58:25,300 --> 00:58:31,800
And we can get rid of these
bubbles in a similar fashion.

1017
00:58:31,800 --> 00:58:36,650
And it's also possible to get
chimeric edges of the graph.

1018
00:58:36,650 --> 00:58:39,730
And those can be caused
by errors as well.

1019
00:58:39,730 --> 00:58:43,591
And we can clip those edges.

1020
00:58:43,591 --> 00:58:45,590
So there are different
kinds of error correction

1021
00:58:45,590 --> 00:58:46,548
we can do in the graph.

1022
00:58:46,548 --> 00:58:48,280
These are all quite heuristic.

1023
00:58:48,280 --> 00:58:51,030
Each assembler has its
own set of heuristics

1024
00:58:51,030 --> 00:58:54,370
for how to deal
with graph anomalies

1025
00:58:54,370 --> 00:59:01,890
and how to eliminate edges in
the graph to permit assembly.

1026
00:59:01,890 --> 00:59:05,880
But getting
rid of dead-end tips

1027
00:59:05,880 --> 00:59:08,630
and popping bubbles and
getting rid of chimeric edges

1028
00:59:08,630 --> 00:59:11,240
are important things to
consider for any assembler.

1029
00:59:14,800 --> 00:59:18,160
So the limitations
of these graphs

1030
00:59:18,160 --> 00:59:20,770
are the idea that
we're immediately

1031
00:59:20,770 --> 00:59:26,640
splitting these reads into this
k-mer representation, which

1032
00:59:26,640 --> 00:59:29,210
is destroying information.

1033
00:59:29,210 --> 00:59:33,460
And in order to
overcome this, one

1034
00:59:33,460 --> 00:59:36,170
of the things that people have
done in these De Bruijn graph

1035
00:59:36,170 --> 00:59:42,520
assemblers is to take
the original reads

1036
00:59:42,520 --> 00:59:45,380
and to map them back
on to the graph.

1037
00:59:45,380 --> 00:59:46,955
So when you're
attempting to trace

1038
00:59:46,955 --> 00:59:48,580
the path through the
graph, what you do

1039
00:59:48,580 --> 00:59:49,871
is you take the original reads.

1040
00:59:49,871 --> 00:59:52,102
You thread them
through the graph.

1041
00:59:52,102 --> 00:59:53,560
And you know that
the original read

1042
00:59:53,560 --> 00:59:56,390
represents contiguous
genome sequence.

1043
00:59:56,390 --> 00:59:58,470
So it provides you with
a path through the graph

1044
00:59:58,470 --> 00:59:59,520
that you know is good.

1045
01:00:02,430 --> 01:00:05,100
People have been
doing this in part

1046
01:00:05,100 --> 01:00:08,380
because they didn't want to
go to the full overlap graph

1047
01:00:08,380 --> 01:00:10,890
implementation
because of the cost.

1048
01:00:10,890 --> 01:00:15,140
But I think that these overlap
graph implementations now

1049
01:00:15,140 --> 01:00:18,230
are sufficiently sophisticated
that I personally

1050
01:00:18,230 --> 01:00:20,710
would use them instead of a
De Bruijn graph assembler.

1051
01:00:23,610 --> 01:00:29,580
And so the trade off
really centers around

1052
01:00:29,580 --> 01:00:36,070
speed and space versus accuracy.

1053
01:00:36,070 --> 01:00:42,400
So we can look at some
example assemblers

1054
01:00:42,400 --> 01:00:45,210
and look at their performance.

1055
01:00:45,210 --> 01:00:48,444
But before I do that and
we leave De Bruijn graphs,

1056
01:00:48,444 --> 01:00:50,610
are there any other questions
about De Bruijn graph

1057
01:00:50,610 --> 01:00:51,110
assemblers?

1058
01:00:53,570 --> 01:00:54,480
AUDIENCE: I have one.

1059
01:00:54,480 --> 01:00:55,563
PROFESSOR: Yeah, question.

1060
01:00:55,563 --> 01:00:57,340
AUDIENCE: How long
is k typically?

1061
01:00:57,340 --> 01:00:59,090
PROFESSOR: We're going
to talk about that.

1062
01:00:59,090 --> 01:01:04,260
The k typically is somewhere
around 60-- something

1063
01:01:04,260 --> 01:01:08,834
like that-- somewhere
in that neighborhood.

1064
01:01:08,834 --> 01:01:10,500
It's actually-- it
has to be odd, right?

1065
01:01:10,500 --> 01:01:14,340
So 61, 57-- something like that.

1066
01:01:14,340 --> 01:01:15,050
Good question.

1067
01:01:15,050 --> 01:01:17,585
Any other questions about
De Bruijn graph assemblers?

1068
01:01:25,150 --> 01:01:32,670
So once again returning
to our overall architecture,

1069
01:01:32,670 --> 01:01:34,230
we have these reads.

1070
01:01:34,230 --> 01:01:37,310
We need to produce contigs.

1071
01:01:37,310 --> 01:01:40,430
In the case of
overlap graphs, we're

1072
01:01:40,430 --> 01:01:42,130
going to trace the
overlap graphs.

1073
01:01:42,130 --> 01:01:43,825
In the case of De
Bruijn graphs, we're

1074
01:01:43,825 --> 01:01:45,283
going to trace the
De Bruijn graph.

1075
01:01:45,283 --> 01:01:49,980
For scaffolding, we can use
the read pairs to put scaffolds

1076
01:01:49,980 --> 01:01:52,340
back together again.

1077
01:01:52,340 --> 01:01:57,540
And here is some comparison
of the performance

1078
01:01:57,540 --> 01:01:59,290
of these various assemblers.

1079
01:01:59,290 --> 01:02:05,640
So the first assembler--
SGA-- is an overlap layout

1080
01:02:05,640 --> 01:02:08,280
consensus-style assembler.

1081
01:02:08,280 --> 01:02:11,160
Velvet, ABySS, and SOAPdenovo
are all De Bruijn

1082
01:02:11,160 --> 01:02:12,274
graph-based assemblers.

1083
01:02:12,274 --> 01:02:13,940
So these are all
contemporary assemblers

1084
01:02:13,940 --> 01:02:17,190
that people use for
assembling genomes.

1085
01:02:17,190 --> 01:02:18,870
An important metric
for assemblers

1086
01:02:18,870 --> 01:02:22,560
is something called N50,
which is the size of a contig

1087
01:02:22,560 --> 01:02:28,740
or scaffold where at that length
or larger 50% of the bases

1088
01:02:28,740 --> 01:02:31,600
are present in scaffolds
of that length.

1089
01:02:31,600 --> 01:02:36,340
So, for example, for SGA, they
say that scaffold N50 size

1090
01:02:36,340 --> 01:02:41,390
is 26.3 kilobases, which
means that in scaffolds

1091
01:02:41,390 --> 01:02:44,790
of length 26.3
kilobases or larger,

1092
01:02:44,790 --> 01:02:47,760
half of the bases
of the assembly lie.
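
A minimal sketch of the N50 calculation as defined here, assuming the input is simply a list of contig or scaffold lengths. The function name `n50` and the example lengths are illustrative, not data from the comparison table.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold half the bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths summing to 300.
example = [80, 70, 50, 40, 30, 20, 10]
result = n50(example)
```

For these lengths the two longest pieces, 80 and 70, already hold 150 of the 300 bases, so the N50 is 70.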

1093
01:02:47,760 --> 01:02:52,700
So the larger the N50 is,
the larger the scaffolds

1094
01:02:52,700 --> 01:02:54,670
are that cover things.

1095
01:02:54,670 --> 01:02:58,090
And you want larger and
larger scaffolds or contigs

1096
01:02:58,090 --> 01:03:01,290
so that you have fewer
gaps in your assembly.

1097
01:03:01,290 --> 01:03:07,350
So the N50 number is a
principal comparison metric

1098
01:03:07,350 --> 01:03:10,510
when one is thinking
about assemblers.

1099
01:03:10,510 --> 01:03:16,540
So in this particular case,
for SGA the overlap metric

1100
01:03:16,540 --> 01:03:21,480
was that the reads had to
overlap by at least 75 bases

1101
01:03:21,480 --> 01:03:23,300
or more.

1102
01:03:23,300 --> 01:03:25,100
And these were
100-base pair reads.

1103
01:03:25,100 --> 01:03:26,690
You can see the
details on the read

1104
01:03:26,690 --> 01:03:29,260
data on the bottom line there.

1105
01:03:29,260 --> 01:03:32,790
So as long as the reads
overlap by 75 bases,

1106
01:03:32,790 --> 01:03:35,840
they were put
together in the graph.

1107
01:03:35,840 --> 01:03:37,650
And the De Bruijn
graph assemblers

1108
01:03:37,650 --> 01:03:43,590
each had their own
optimum number for k.

1109
01:03:43,590 --> 01:03:46,050
And the way that you tune
these parameters is you

1110
01:03:46,050 --> 01:03:50,040
run the assembler on
a range of k values.

1111
01:03:50,040 --> 01:03:56,290
And you see which k value
produced the assembly

1112
01:03:56,290 --> 01:03:59,280
with the highest N50.

1113
01:03:59,280 --> 01:04:02,060
And you pick that k.

1114
01:04:02,060 --> 01:04:04,120
Can anybody think
of a reason why

1115
01:04:04,120 --> 01:04:06,290
it is that although
these are all

1116
01:04:06,290 --> 01:04:09,360
roughly in the same ballpark,
different assemblers might have

1117
01:04:09,360 --> 01:04:15,245
different k values given that
the underlying technology is

1118
01:04:15,245 --> 01:04:15,828
quite similar?

1119
01:04:22,400 --> 01:04:24,630
Any guesses about
what is going on here?

1120
01:04:32,990 --> 01:04:35,480
Well, we know that the
differences in the assemblers

1121
01:04:35,480 --> 01:04:38,560
really are rooted in the
way that they are processing

1122
01:04:38,560 --> 01:04:41,500
the graphs and the way that
they are simplifying them.

1123
01:04:41,500 --> 01:04:43,970
And therefore,
one has to imagine

1124
01:04:43,970 --> 01:04:47,590
that the differences lie in the
post-processing of the graph

1125
01:04:47,590 --> 01:04:52,780
once it's built and that
certain assemblers like larger k

1126
01:04:52,780 --> 01:04:53,330
values.

1127
01:04:53,330 --> 01:04:57,590
Whereas other ones can
tolerate smaller k values.

1128
01:04:57,590 --> 01:05:00,750
And you can see if we look
at the running statistics

1129
01:05:00,750 --> 01:05:06,820
for these, that the
performance of SGA

1130
01:05:06,820 --> 01:05:09,120
if you look at the
reference bases covered

1131
01:05:09,120 --> 01:05:11,790
by contigs greater
than one kilobase

1132
01:05:11,790 --> 01:05:14,470
is roughly comparable to
all the other assemblers.

1133
01:05:14,470 --> 01:05:18,270
But its mismatch
performance is much better.

1134
01:05:18,270 --> 01:05:22,550
That is the other assemblers
are producing-- well,

1135
01:05:22,550 --> 01:05:24,920
I take it back except
for SOAPdenovo.

1136
01:05:24,920 --> 01:05:27,830
But it does quite a
good job at correcting

1137
01:05:27,830 --> 01:05:29,830
reads and coming up with
the correct sequence.

1138
01:05:32,730 --> 01:05:36,030
The last lines however tell the
story about running time, which

1139
01:05:36,030 --> 01:05:38,940
is that the overlap
consensus assembler is taking

1140
01:05:38,940 --> 01:05:43,340
41 hours of CPU time for
C. elegans genome assembly.

1141
01:05:43,340 --> 01:05:47,760
Whereas the other assemblers,
the De Bruijn assemblers,

1142
01:05:47,760 --> 01:05:48,840
are running much faster.

1143
01:05:52,270 --> 01:05:58,790
So the thing that I
wanted to emphasize today

1144
01:05:58,790 --> 01:06:03,750
was that once you have
the final graph whether it

1145
01:06:03,750 --> 01:06:07,130
be an overlap graph
or a De Bruijn graph,

1146
01:06:07,130 --> 01:06:12,110
which represents possible
ways of putting back together

1147
01:06:12,110 --> 01:06:14,910
again the jigsaw
puzzle, it still

1148
01:06:14,910 --> 01:06:18,530
is an art to be able to
build an assembler that

1149
01:06:18,530 --> 01:06:21,530
uses appropriate heuristics
to trace the graph

1150
01:06:21,530 --> 01:06:24,850
to come up with a
genome sequence.

1151
01:06:24,850 --> 01:06:27,560
And I think another
lesson is that repeats

1152
01:06:27,560 --> 01:06:29,740
are very problematic.

1153
01:06:29,740 --> 01:06:34,300
With short reads, we really
cannot resolve repeats exactly.

1154
01:06:34,300 --> 01:06:39,020
As a consequence, when we think
about any reference genome

1155
01:06:39,020 --> 01:06:42,840
that we're dealing
with, if we consider

1156
01:06:42,840 --> 01:06:45,844
the size of the reads that were
used to assemble that genome,

1157
01:06:45,844 --> 01:06:47,260
then we need to
be mindful of what

1158
01:06:47,260 --> 01:06:49,070
that tells us about
whether or not

1159
01:06:49,070 --> 01:06:51,540
the repeat structure that
we're observing in the genome

1160
01:06:51,540 --> 01:06:53,970
is really an accurate
rendition of what's

1161
01:06:53,970 --> 01:06:57,170
going on in the genome itself.

1162
01:06:57,170 --> 01:06:59,840
And finally, I think
that we've talked today

1163
01:06:59,840 --> 01:07:03,470
about the problem of
assembling genomes

1164
01:07:03,470 --> 01:07:07,070
from a set of reads
that represent

1165
01:07:07,070 --> 01:07:13,150
a uniform, single individual
albeit with possibilities

1166
01:07:13,150 --> 01:07:15,250
of differences of
alleles between mom

1167
01:07:15,250 --> 01:07:18,810
and dad in a diploid organism.

1168
01:07:18,810 --> 01:07:21,200
However, environmental
sequencing

1169
01:07:21,200 --> 01:07:25,410
where one takes up sea
water or other samples

1170
01:07:25,410 --> 01:07:27,330
and sequences all
the organisms in it

1171
01:07:27,330 --> 01:07:31,650
and then attempts to assemble
those organisms de novo

1172
01:07:31,650 --> 01:07:33,260
admits the
possibility that there

1173
01:07:33,260 --> 01:07:37,290
are many different genomes
that you're considering.

1174
01:07:37,290 --> 01:07:39,340
And that, of course,
creates a whole new set

1175
01:07:39,340 --> 01:07:40,920
of research problems,
which I think

1176
01:07:40,920 --> 01:07:45,130
are unsolved in part
because of the read links

1177
01:07:45,130 --> 01:07:47,800
that we're currently
dealing with.

1178
01:07:47,800 --> 01:07:50,460
Are there any final
questions about assembly?

1179
01:07:54,360 --> 01:07:54,940
OK, great.

1180
01:07:54,940 --> 01:07:57,310
Well, we will see
you then on Thursday

1181
01:07:57,310 --> 01:08:01,040
where we will talk about
ChIP-seq and IDR analysis.

1182
01:08:01,040 --> 01:08:02,550
Until then, have
a great Wednesday.

1183
01:08:02,550 --> 01:08:04,470
Thank you very much.