1
00:00:01,000 --> 00:00:04,000
Today is my last class with you.
Aw, I'm sorry, too. You guys are

2
00:00:04,000 --> 00:00:08,000
a lot of fun. This has actually
been the most interactive 7.01

3
00:00:08,000 --> 00:00:12,000
I've ever had. Usually there are
a couple of people who perk up and

4
00:00:12,000 --> 00:00:16,000
say things, but you guys are great
because all sorts of people are

5
00:00:16,000 --> 00:00:20,000
willing to contribute. So,
I've had a wonderful time and

6
00:00:20,000 --> 00:00:24,000
it certainly seems like
you guys have learned a lot.

7
00:00:24,000 --> 00:00:28,000
What I'd like to do for my last
lecture is pick up again a little

8
00:00:28,000 --> 00:00:32,000
bit like I did with genomics and
try to give you a sense of where

9
00:00:32,000 --> 00:00:36,000
things are going. I always
like doing this because I

10
00:00:36,000 --> 00:00:40,000
get to talk about things that
are in none of the textbooks that,

11
00:00:40,000 --> 00:00:44,000
well, I mean, it's just stuff that
many people working in the field

12
00:00:44,000 --> 00:00:48,000
don't necessarily know. And
that's what's so much fun about

13
00:00:48,000 --> 00:00:52,000
teaching introductory biology:
it only takes a semester for

14
00:00:52,000 --> 00:00:56,000
you guys to get up to the point of
at least being able to understand

15
00:00:56,000 --> 00:01:01,000
what's getting done
on the cutting-edge.

16
00:01:01,000 --> 00:01:05,000
Even if you might not yet be
able to go off and practice it,

17
00:01:05,000 --> 00:01:09,000
you might need a little
more experience for that,

18
00:01:09,000 --> 00:01:13,000
but you'd be surprised,
it's not that much more.

19
00:01:13,000 --> 00:01:17,000
Take maybe Project Lab and you'll
be able to start doing it already.

20
00:01:17,000 --> 00:01:21,000
It's really wonderful that it's
possible to grasp what's going on.

21
00:01:21,000 --> 00:01:25,000
And, in many ways, you guys may
have an advantage in grasping what's

22
00:01:25,000 --> 00:01:29,000
going on because, as
I've already hinted,

23
00:01:29,000 --> 00:01:33,000
biology's undergoing this remarkable
transformation from being a purely

24
00:01:33,000 --> 00:01:37,000
laboratory-based science where each
individual works on his or her own

25
00:01:37,000 --> 00:01:41,000
project to being an
information-based science that

26
00:01:41,000 --> 00:01:45,000
involves an integration of vast
amounts of data across the whole

27
00:01:45,000 --> 00:01:50,000
world and trying to learn things
from this tremendous dataset.

28
00:01:50,000 --> 00:01:52,000
And, in that sense, I think
the new students coming into

29
00:01:52,000 --> 00:01:55,000
the field have a distinct advantage
over those who have been in it.

30
00:01:55,000 --> 00:01:58,000
And certainly the students who
know mathematical and physical and

31
00:01:58,000 --> 00:02:01,000
chemical and other sorts of things,
and aren't scared to write computer

32
00:02:01,000 --> 00:02:04,000
code when they need to write
computer code have a really

33
00:02:04,000 --> 00:02:07,000
great advantage. So,
anyway, all that by way of

34
00:02:07,000 --> 00:02:11,000
introduction. I want to talk about
two subjects today of great interest

35
00:02:11,000 --> 00:02:14,000
to me. One is DNA variation
and one is RNA variation.

36
00:02:14,000 --> 00:02:18,000
The variation of DNA sequence
between individuals within a

37
00:02:18,000 --> 00:02:21,000
population, and in particular our
population, and the other is RNA

38
00:02:21,000 --> 00:02:25,000
variation, the variation in RNA
expression between different cell

39
00:02:25,000 --> 00:02:28,000
types, different tissues.
And the work I'm going to talk

40
00:02:28,000 --> 00:02:32,000
about today is work that I,
and my colleagues, have all been

41
00:02:32,000 --> 00:02:36,000
involved in. And it's
stuff I know and love.

42
00:02:36,000 --> 00:02:40,000
So, feel free to ask questions
about it. I may know the answers,

43
00:02:40,000 --> 00:02:44,000
but what's reasonably fun about
these lectures is if I don't know

44
00:02:44,000 --> 00:02:48,000
the answers it's probably the
case that the answers aren't known.

45
00:02:48,000 --> 00:02:52,000
So, that's good fun because
it's stuff I really do know well,

46
00:02:52,000 --> 00:02:56,000
and I love. So, anyway, here's some
DNA sequence. It's pretty boring.

47
00:02:56,000 --> 00:03:00,000
This is a chunk of sequence
from, let's say, the human genome.

48
00:03:00,000 --> 00:03:04,000
How much does this differ
between any two individuals?

49
00:03:04,000 --> 00:03:09,000
If I were to sequence any two
chromosomes, any two copies of the

50
00:03:09,000 --> 00:03:14,000
chromosome from an individual in
this class or two individuals on

51
00:03:14,000 --> 00:03:19,000
this planet, how much would they
differ? The answer is that much.

52
00:03:19,000 --> 00:03:25,000
That's the average amount of
difference between any two people on

53
00:03:25,000 --> 00:03:30,000
this planet. Not a lot. If you
counted up, it is on average

54
00:03:30,000 --> 00:03:35,000
one nucleotide difference out of
1,000 nucleotides on average, somewhat

55
00:03:35,000 --> 00:03:41,000
less than one part in 1,000
or better than 99.9% identity

56
00:03:41,000 --> 00:03:46,000
between any two individuals.
Now, that is a very small amount,

57
00:03:46,000 --> 00:03:51,000
not just in absolute terms,
99.9% identity is a lot,

58
00:03:51,000 --> 00:03:57,000
but in comparative terms with other
species. If I take two chimpanzees

59
00:03:57,000 --> 00:04:02,000
in Africa, on average they will
differ by about twice as much as any

60
00:04:02,000 --> 00:04:07,000
two random humans. And if
I take two orangutans in

61
00:04:07,000 --> 00:04:12,000
Southeast Asia, they will
on average differ by about

62
00:04:12,000 --> 00:04:17,000
eight times as much as any
two humans on this planet.
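
[Editor's sketch] The "one difference in 1,000 nucleotides" figure is a per-site difference rate between two aligned copies of a sequence. A minimal Python sketch of that calculation follows. The function name and the toy sequences are made up for illustration and are not real genome data.

```python
def pairwise_difference_rate(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Hypothetical toy sequences: 1,000 bases that differ at a single position.
copy_1 = "ACGT" * 250
copy_2 = "ACGT" * 249 + "ACGA"

rate = pairwise_difference_rate(copy_1, copy_2)
identity_percent = 100 * (1 - rate)
print(f"difference rate: {rate}, identity: {identity_percent:.1f}%")
```

With one mismatch in 1,000 positions the rate is 0.001, which is the 99.9% identity quoted for two random humans. Under the lecture's comparison, two chimpanzees would show roughly twice that rate and two orangutans roughly eight times.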

63
00:04:17,000 --> 00:04:21,000
You guys think the
orangutans all look the same.

64
00:04:21,000 --> 00:04:26,000
They think you all look the same,
and they're right. So, why is this?

65
00:04:26,000 --> 00:04:31,000
Why are humans
amongst mammalian

66
00:04:31,000 --> 00:04:36,000
species relatively limited
in the amount of variation?

67
00:04:36,000 --> 00:04:40,000
Well, it's a direct result
of our population history.

68
00:04:40,000 --> 00:04:45,000
It turns out that the amount of
variation that can be sustained in a

69
00:04:45,000 --> 00:04:50,000
population depends on two things.
At equilibrium, if population has

70
00:04:50,000 --> 00:04:55,000
constant size N for a very long
time and a certain mutation rate,

71
00:04:55,000 --> 00:04:59,000
Mu, you can just write a
piece of arithmetic that says,

72
00:04:59,000 --> 00:05:04,000
well, mutations are always
arising due to new mutations in the

73
00:05:04,000 --> 00:05:09,000
population and mutations are
being lost by genetic drift,

74
00:05:09,000 --> 00:05:14,000
just by random sampling from
generation to generation.

75
00:05:14,000 --> 00:05:17,000
And those two processes, the
creation of new mutations and

76
00:05:17,000 --> 00:05:21,000
the loss of mutations just due to
random sampling in each generation,

77
00:05:21,000 --> 00:05:25,000
set up an equilibrium, and the
equilibrium defines an equation

78
00:05:25,000 --> 00:05:29,000
there, Pi equals one over one
plus the reciprocal of four N Mu, which

79
00:05:29,000 --> 00:05:33,000
equation you have no need to
memorize whatsoever and possibly

80
00:05:33,000 --> 00:05:36,000
even no need to write down. The
important point is the concept,

81
00:05:36,000 --> 00:05:40,000
that if you know the number of
organisms in the population and you

82
00:05:40,000 --> 00:05:43,000
know the mutation rate, those
set up the balance of mutation

83
00:05:43,000 --> 00:05:47,000
and drift, and you can
write down how polymorphic,

84
00:05:47,000 --> 00:05:51,000
how heterozygous random individuals
should be at equilibrium.

85
00:05:51,000 --> 00:05:54,000
That is if the population has been
at size N for a very long time.

86
00:05:54,000 --> 00:05:58,000
Well, the expected amount
of heterozygosity for the

87
00:05:58,000 --> 00:06:02,000
human population -- Sorry.
For a population of size 10,000

88
00:06:02,000 --> 00:06:06,000
would be about one nucleotide in
1300. We have exactly the amount of

89
00:06:06,000 --> 00:06:11,000
heterozygosity you would expect
for a population of about 10,000

90
00:06:11,000 --> 00:06:15,000
individuals. Yeah, but wait,
we're not a population of 10,000

91
00:06:15,000 --> 00:06:20,000
individuals. Why do we have the
heterozygosity you would expect from

92
00:06:20,000 --> 00:06:25,000
a population of 10,000
individuals? We're six billion.

93
00:06:25,000 --> 00:06:31,000
It's a reflection
of our history.
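
[Editor's sketch] The equilibrium relation described above, Pi equal to one over one plus the reciprocal of four N Mu, can be checked numerically. The Python sketch below assumes the mutation rate the lecture quotes a little later (two times ten to the minus eighth per nucleotide per generation). The function name is hypothetical, and the formula is the simple equilibrium model from the lecture, not a full population-genetic simulation.

```python
def equilibrium_heterozygosity(population_size: float, mutation_rate: float) -> float:
    """pi = 1 / (1 + 1/(4*N*mu)), written equivalently as 4*N*mu / (1 + 4*N*mu)."""
    theta = 4 * population_size * mutation_rate
    return theta / (1 + theta)


MU = 2e-8  # per-nucleotide mutation rate per generation, as quoted in the lecture

pi_ancestral = equilibrium_heterozygosity(10_000, MU)
print(f"N = 10,000: about 1 difference per {1 / pi_ancestral:,.0f} nucleotides")
```

For N = 10,000 this gives roughly one difference in 1,250 nucleotides, the same ballpark as the "one nucleotide in 1300" figure in the lecture. The small gap presumably reflects rounding in the quoted numbers.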

94
00:06:31,000 --> 00:06:35,000
Because remember I said that
was the statement about what the

95
00:06:35,000 --> 00:06:38,000
population heterozygosity
should be at equilibrium?

96
00:06:38,000 --> 00:06:42,000
We haven't been six billion
people except very recently.

97
00:06:42,000 --> 00:06:45,000
The human population has
undergone an exponential expansion.

98
00:06:45,000 --> 00:06:49,000
It used to be a relatively small
size, and then it very recently

99
00:06:49,000 --> 00:06:52,000
underwent this huge exponential
expansion. If you actually write

100
00:06:52,000 --> 00:06:56,000
down the equations, the
amount of variation in our

101
00:06:56,000 --> 00:07:00,000
population was determined by that
constant size for a very long time.

102
00:07:00,000 --> 00:07:03,000
And then a rapid exponential
expansion that's basically taken

103
00:07:03,000 --> 00:07:07,000
place in a mere 3,000
generations, it's much too rapid

104
00:07:07,000 --> 00:07:11,000
to have any effect on the real
variation in our population.

105
00:07:11,000 --> 00:07:15,000
What do I mean by that? What's the
mutation rate per nucleotide in the

106
00:07:15,000 --> 00:07:18,000
human genome? It's on the order of
two times ten to the minus eighth

107
00:07:18,000 --> 00:07:22,000
per generation. In a
mere 3,000 generations,

108
00:07:22,000 --> 00:07:26,000
a tiny mutation rate like two times
ten to the minus eighth is not going

109
00:07:26,000 --> 00:07:30,000
to be able to build
up much more variation.
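
[Editor's sketch] A back-of-the-envelope check of the argument above, in Python. It uses only numbers quoted in the lecture (mutation rate, 3,000 generations, one difference in 1,300 nucleotides). The factor of two is an assumption of this sketch: two chromosomes each accumulate mutations independently. It ignores drift, so treat it as a rough upper-end estimate.

```python
MU = 2e-8                 # mutations per nucleotide per generation (from the lecture)
GENERATIONS = 3_000       # rough duration of the expansion (from the lecture)
ANCESTRAL_PI = 1 / 1300   # heterozygosity quoted for a population of 10,000

# New differences between two chromosomes arise at about 2 * mu per
# generation, because either lineage can pick up the mutation.
new_differences = 2 * MU * GENERATIONS
fraction_of_existing = new_differences / ANCESTRAL_PI

print(f"new differences per nucleotide: {new_differences:.2e}")
print(f"as a fraction of ancestral variation: {fraction_of_existing:.0%}")
```

Even on this generous estimate, mutations arising during the expansion add only a small fraction to the variation already present, which is the sense in which the recent expansion can be ignored.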

110
00:07:30,000 --> 00:07:32,000
So you might as well ignore
the last 100,000 years or so.

111
00:07:32,000 --> 00:07:34,000
They're irrelevant to how
much variation we have.

112
00:07:34,000 --> 00:07:36,000
The variation we have was set
by our ancestral population size.

113
00:07:36,000 --> 00:07:38,000
Now, don't get me wrong.
Eventually it will equilibrate.

114
00:07:38,000 --> 00:07:40,000
A couple million years from now we
will have a much higher variation in

115
00:07:40,000 --> 00:07:42,000
the human population as a function
of our size, but the population

116
00:07:42,000 --> 00:07:44,000
variation we have today is set by
the fact that humans derive from a

117
00:07:44,000 --> 00:07:47,000
founding population of about
10,000 individuals or so.

118
00:07:47,000 --> 00:07:52,000
And that means that the variation
that you see in the human population

119
00:07:52,000 --> 00:07:57,000
is mostly ancestral variation,
the variation that we all walked

120
00:07:57,000 --> 00:08:03,000
around with in Africa.
And, in fact, that makes a

121
00:08:03,000 --> 00:08:08,000
prediction. That would say that if
most of the variation in the human

122
00:08:08,000 --> 00:08:13,000
population is from the ancestral
African founding population then if

123
00:08:13,000 --> 00:08:19,000
I go to any two villages around this
world, in Japan or in Sweden or in

124
00:08:19,000 --> 00:08:24,000
Nigeria, the variants that I
see will largely be identical.

125
00:08:24,000 --> 00:08:30,000
And that prediction
has been well satisfied.

126
00:08:30,000 --> 00:08:34,000
Because when you go and look and you
collect variation in Japan or Sweden

127
00:08:34,000 --> 00:08:38,000
or Africa and you compare it,
90% of the variants are common

128
00:08:38,000 --> 00:08:42,000
across the entire world. Most
variation is common ancestral

129
00:08:42,000 --> 00:08:46,000
variation around the world, and
only a minority of the variants

130
00:08:46,000 --> 00:08:50,000
are new local mutations restricted
to individual populations.

131
00:08:50,000 --> 00:08:54,000
This is so contrary to what people
think because there's a natural

132
00:08:54,000 --> 00:08:58,000
tendency to kind of xenophobia,
to imagine that world populations

133
00:08:58,000 --> 00:09:02,000
are very different in
their genetic background.

134
00:09:02,000 --> 00:09:05,000
But, in point of fact,
they're extremely similar.

135
00:09:05,000 --> 00:09:09,000
So, anyway, there's a
limited amount of variation.

136
00:09:09,000 --> 00:09:13,000
That's why we have such little
variation in the human population.

137
00:09:13,000 --> 00:09:17,000
Now, that variation, humans have
a low rate of genetic variation.

138
00:09:17,000 --> 00:09:20,000
Most of the variation that is out
there is due to common genetic

139
00:09:20,000 --> 00:09:24,000
variants, not rare variants. If
I take your genome and I find a

140
00:09:24,000 --> 00:09:28,000
site of genetic variation at the
point of heterozygosity in your

141
00:09:28,000 --> 00:09:32,000
genome, what's the probability that
somebody else in this class also is

142
00:09:32,000 --> 00:09:36,000
heterozygous for that spot? It
turns out that the odds are about

143
00:09:36,000 --> 00:09:40,000
95% that someone else in this
class will also share that variant.

144
00:09:40,000 --> 00:09:44,000
So the variants are not
mostly rare, they're mostly common.

145
00:09:44,000 --> 00:09:48,000
And it turns out that some
of this common variation,

146
00:09:48,000 --> 00:09:52,000
that is most of this variation is
likely to be important in the risk

147
00:09:52,000 --> 00:09:56,000
of human genetic diseases. So
human geneticists have gotten

148
00:09:56,000 --> 00:10:00,000
very excited about
the following paradigm.

149
00:10:00,000 --> 00:10:03,000
If there's only a limited amount
of genetic variation in the human

150
00:10:03,000 --> 00:10:06,000
population, actually,
if you do the arithmetic,

151
00:10:06,000 --> 00:10:09,000
there are only about ten million
sites of common variation in the

152
00:10:09,000 --> 00:10:12,000
human population, where
common might be defined as

153
00:10:12,000 --> 00:10:15,000
more than about 1% in the population.
There are only ten million sites.

154
00:10:15,000 --> 00:10:18,000
Folks are saying, well,
why not enumerate them all?

155
00:10:18,000 --> 00:10:22,000
Let's just know them all, and
then let's test each one for its

156
00:10:22,000 --> 00:10:25,000
risk of, say, conferring
susceptibility to diabetes or heart

157
00:10:25,000 --> 00:10:28,000
disease or whatever? After
all, ten million is not as

158
00:10:28,000 --> 00:10:32,000
big a number as it used to be.
We now have the whole sequence of

159
00:10:32,000 --> 00:10:36,000
the human genome. Why not
layer on the sequence of

160
00:10:36,000 --> 00:10:40,000
the human genome all common
human genetic polymorphism?

161
00:10:40,000 --> 00:10:44,000
Now, that's a fairly outrageous
idea but could be a very useful one.

162
00:10:44,000 --> 00:10:48,000
Some of these variants
are important, by the way.

163
00:10:48,000 --> 00:10:52,000
We know that there are two
nucleotides that vary in the gene

164
00:10:52,000 --> 00:10:56,000
apolipoprotein E on chromosome
number 19. Apolipoprotein E is also

165
00:10:56,000 --> 00:11:00,000
an apolipoprotein like we
talked about before with familial

166
00:11:00,000 --> 00:11:04,000
hypercholesterolemia. But,
in fact, it turns out that

167
00:11:04,000 --> 00:11:08,000
apolipoprotein E is expressed
in the brain. And it turns out,

168
00:11:08,000 --> 00:11:13,000
amongst other tissues, that
it comes in three variants,

169
00:11:13,000 --> 00:11:18,000
the spelling T-T, T-C and C-C
at those two particular spots.

170
00:11:18,000 --> 00:11:22,000
And if you happen to be
homozygous for the E4 variant,

171
00:11:22,000 --> 00:11:27,000
homozygous for the E4 variant, you
have about a 60% to 70% lifetime

172
00:11:27,000 --> 00:11:32,000
risk of Alzheimer's disease.
In this class 13 of you are

173
00:11:32,000 --> 00:11:37,000
homozygous for E4 and have a
high lifetime risk of Alzheimer's.

174
00:11:37,000 --> 00:11:42,000
And it would be fairly trivial to
go across the street to anybody's

175
00:11:42,000 --> 00:11:47,000
lab and test that. Now, I
don't particularly recommend

176
00:11:47,000 --> 00:11:52,000
it, and I haven't tested myself for
this variant because there happens

177
00:11:52,000 --> 00:11:57,000
to be no particular therapy
available today to delay the onset

178
00:11:57,000 --> 00:12:01,000
of Alzheimer's
disease. And, therefore,

179
00:12:01,000 --> 00:12:05,000
I don't recommend finding out
about that. But a number of

180
00:12:05,000 --> 00:12:08,000
pharmaceutical companies,
knowing that this is a very

181
00:12:08,000 --> 00:12:11,000
important gene in the pathogenesis
of Alzheimer's disease,

182
00:12:11,000 --> 00:12:15,000
are working on drugs to try to
delay the pathogenesis using this

183
00:12:15,000 --> 00:12:18,000
information. And it may be the
case that five or ten years from now

184
00:12:18,000 --> 00:12:21,000
people will begin to offer drugs
that will delay the onset of

185
00:12:21,000 --> 00:12:25,000
Alzheimer's disease by delaying the
interaction of apolipoprotein E with

186
00:12:25,000 --> 00:12:29,000
a target protein called tau, etc.
So, this is an example of where a

187
00:12:29,000 --> 00:12:33,000
common variant in the population
points us to the basis of a common

188
00:12:33,000 --> 00:12:37,000
disease and has important
therapeutic implications.

189
00:12:37,000 --> 00:12:41,000
There are some other ones,
for example. 5% of you carry a

190
00:12:41,000 --> 00:12:45,000
particular variant in your factor V
gene, which is in the clotting cascade.

191
00:12:45,000 --> 00:12:49,000
It's called the Leiden variant.
Those 5% of you are going to account

192
00:12:49,000 --> 00:12:53,000
for 50% of the admissions to
emergency rooms for deep venous

193
00:12:53,000 --> 00:12:57,000
clots, for example. The much
higher risk of deep venous

194
00:12:57,000 --> 00:13:02,000
clots. And,
in particular,

195
00:13:02,000 --> 00:13:06,000
there are significant issues if
you have that variant and you are a

196
00:13:06,000 --> 00:13:11,000
woman with taking birth control
pills. Some of you are at higher

197
00:13:11,000 --> 00:13:16,000
risk for diabetes, type
2 adult onset diabetes.

198
00:13:16,000 --> 00:13:20,000
There's a particular variant in the
population that increases your risk

199
00:13:20,000 --> 00:13:25,000
for type 2 diabetes by about
30%. 85% of you have the high-risk

200
00:13:25,000 --> 00:13:30,000
factor, so you might
as well figure you do.

201
00:13:30,000 --> 00:13:35,000
15% of you have a lower risk, et
cetera. And one I'm particularly

202
00:13:35,000 --> 00:13:40,000
interested in here, it
turns out that HIV virus gets

203
00:13:40,000 --> 00:13:46,000
into cells with a co-receptor
encoded by a gene called CCR5.

204
00:13:46,000 --> 00:13:51,000
Well, it turns out that if we go
across the European population,

205
00:13:51,000 --> 00:13:57,000
10% of all chromosomes of
European ancestry have a deletion

206
00:13:57,000 --> 00:14:02,000
within the CCR5 gene. If 10%
of all chromosomes have that

207
00:14:02,000 --> 00:14:06,000
deletion then 10% times 10%, 1%
of all individuals are homozygous

208
00:14:06,000 --> 00:14:10,000
for that deletion. Those
individuals are essentially

209
00:14:10,000 --> 00:14:15,000
immune to infection from HIV.
They are not susceptible. It's not

210
00:14:15,000 --> 00:14:19,000
through immunity, it's
through lack of a receptor.
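
[Editor's sketch] The "10% times 10%" step above is the Hardy-Weinberg calculation for a deletion carried on 10% of chromosomes. A minimal Python sketch follows, assuming random mating, which is what the lecture's multiplication implies. The function name and dictionary keys are hypothetical.

```python
def genotype_frequencies(allele_frequency: float) -> dict:
    """Expected genotype frequencies for one allele under Hardy-Weinberg proportions."""
    q = allele_frequency      # frequency of chromosomes carrying the deletion
    p = 1 - q                 # frequency of chromosomes without it
    return {
        "no_copies": p * p,
        "one_copy": 2 * p * q,
        "two_copies": q * q,
    }


ccr5 = genotype_frequencies(0.10)  # 10% of chromosomes, as quoted in the lecture
for genotype, frequency in ccr5.items():
    print(f"{genotype}: {frequency:.0%}")
```

Under these assumptions about 1% of individuals carry two copies of the deletion, matching the lecture, and about 18% carry one copy.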

211
00:14:19,000 --> 00:14:23,000
Yes? You certainly can. It's not
hard. It's a specific known variant.

212
00:14:23,000 --> 00:14:28,000
You could test
for it. Absolutely.

213
00:14:28,000 --> 00:14:31,000
Now, of course, that only
helps the 1% of people who

214
00:14:31,000 --> 00:14:34,000
have that variant. But what
it did do was point to the

215
00:14:34,000 --> 00:14:37,000
pharmaceutical industry that the
interaction between the virus and

216
00:14:37,000 --> 00:14:41,000
that variant is essential. And
now companies are developing

217
00:14:41,000 --> 00:14:44,000
drugs to block the interaction
with that particular protein.

218
00:14:44,000 --> 00:14:48,000
And that tells you that it's
an important protein. Yes?

219
00:14:48,000 --> 00:14:56,000
Over the
whole world?

220
00:14:56,000 --> 00:15:00,000
I just specified European
population for that one.

221
00:15:00,000 --> 00:15:03,000
That one, interestingly, is
not found at as high a frequency

222
00:15:03,000 --> 00:15:06,000
outside of Europe,
and no one knows why,

223
00:15:06,000 --> 00:15:09,000
whether that might have been due
to an ancient selective event or to

224
00:15:09,000 --> 00:15:13,000
genetic drift. By contrast,
the apolipoprotein E

225
00:15:13,000 --> 00:15:16,000
variant, at that frequency of about
3% of people being homozygous and

226
00:15:16,000 --> 00:15:19,000
being at risk for Alzheimer's,
is about the same frequency

227
00:15:19,000 --> 00:15:23,000
everywhere in the world.
So, there's a little bit of

228
00:15:23,000 --> 00:15:26,000
population variation in frequency.
Now, the HIV variant is found

229
00:15:26,000 --> 00:15:30,000
elsewhere but at considerably
lower frequencies there.

230
00:15:30,000 --> 00:15:33,000
And that's an interesting question
as to what causes that variation.

231
00:15:33,000 --> 00:15:36,000
So the notion would be, I've given
you a couple of interesting examples,

232
00:15:36,000 --> 00:15:40,000
but, look, there's only ten million
variants. Just write them all down.

233
00:15:40,000 --> 00:15:43,000
Make one big Excel spreadsheet with
ten million variants along the top

234
00:15:43,000 --> 00:15:47,000
and all the diseases along the rows,
and let's just fill in the matrix

235
00:15:47,000 --> 00:15:50,000
and then we'll really, you
know, this is the way people

236
00:15:50,000 --> 00:15:54,000
think in a post-genomic era.
Now, could you do something like

237
00:15:54,000 --> 00:15:57,000
that? You would have to enumerate
all of the single nucleotide

238
00:15:57,000 --> 00:16:01,000
polymorphisms, or
SNPs we call them,

239
00:16:01,000 --> 00:16:05,000
single nucleotide polymorphisms.
Now, to give you an idea of the

240
00:16:05,000 --> 00:16:09,000
magnitude of this problem, as
recently as 1998, the number of

241
00:16:09,000 --> 00:16:13,000
SNPs that were known in the
human genome was a couple hundred.

242
00:16:13,000 --> 00:16:17,000
But then a project has taken off.
In 1998 an initial SNP map of the

243
00:16:17,000 --> 00:16:21,000
human genome was built here
at MIT that had about 4,000

244
00:16:21,000 --> 00:16:25,000
of these variants. Then
within the next year or so an

245
00:16:25,000 --> 00:16:29,000
international consortium was
organized here and elsewhere to

246
00:16:29,000 --> 00:16:34,000
begin to collect more of
these genetic variants.

247
00:16:34,000 --> 00:16:38,000
The goal was going to be to find
300,000 of them within a period of two

248
00:16:38,000 --> 00:16:42,000
years. In fact, that goal
was blown away and within

249
00:16:42,000 --> 00:16:46,000
three years two million of the SNPs
in the human population were found.

250
00:16:46,000 --> 00:16:51,000
And as of today, if you go on the
Web, you'll find the database with

251
00:16:51,000 --> 00:16:55,000
about 7.8 million of the roughly ten
million SNPs in the human population

252
00:16:55,000 --> 00:17:00,000
already known. Now, that
isn't all ten million.

253
00:17:00,000 --> 00:17:03,000
And it takes a while to
collect the last ones, you know,

254
00:17:03,000 --> 00:17:07,000
collecting the last ones is
hard, but we're already over the hump of

255
00:17:07,000 --> 00:17:10,000
knowing the majority of common
variation in the human population.

256
00:17:10,000 --> 00:17:14,000
Not just a sequence of the genome,
but a database that already contains

257
00:17:14,000 --> 00:17:17,000
more than half of all common
variation in the population.

258
00:17:17,000 --> 00:17:21,000
So, we could start building
that Excel spreadsheet.

259
00:17:21,000 --> 00:17:24,000
Now, it turns out that it's even a
little bit better than that because

260
00:17:24,000 --> 00:17:28,000
if we look at many
chromosomes in the population,

261
00:17:28,000 --> 00:17:31,000
here are chromosomes in the
population, it turns out that the

262
00:17:31,000 --> 00:17:35,000
common variants on each of those
chromosomes tend to be correlated

263
00:17:35,000 --> 00:17:38,000
with each other. If I
know your genotype at one

264
00:17:38,000 --> 00:17:41,000
variant, like over at this locus,
I know your genotype at the next

265
00:17:41,000 --> 00:17:45,000
locus with reasonably high
probability. There's a lot of local

266
00:17:45,000 --> 00:17:48,000
correlation. So, instead
of looking like a scattered

267
00:17:48,000 --> 00:17:51,000
picture like that,
it's more like this.

268
00:17:51,000 --> 00:17:55,000
If I know that you're red,
red, red you're probably red,

269
00:17:55,000 --> 00:17:58,000
red, red over here. In other words,
these variations occur in blocks

270
00:17:58,000 --> 00:18:01,000
that we call haplotypes.
Here's real data.

271
00:18:01,000 --> 00:18:04,000
Across 111 kilobases of DNA
there's a bunch of variants,

272
00:18:04,000 --> 00:18:08,000
but it turns out that the
variants come in two basic flavors.

273
00:18:08,000 --> 00:18:11,000
98% of all chromosomes are
either this, this, this,

274
00:18:11,000 --> 00:18:14,000
this, this or this,
this, this, this, this.
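
[Editor's sketch] One way to see that a block comes in "two basic flavors" is simply to count how often each haplotype appears among the sampled chromosomes. The Python sketch below uses made-up five-SNP haplotypes chosen so that two of them cover 98% of chromosomes, mirroring the figure in the lecture. It is not the real data on the slide.

```python
from collections import Counter

# Hypothetical sample: 100 chromosomes typed at five SNPs within one block.
chromosomes = ["AGTCA"] * 55 + ["GACTG"] * 43 + ["AGTTG", "GACCA"]

counts = Counter(chromosomes)
total = len(chromosomes)

# The two most frequent haplotypes and the share of chromosomes they cover.
top_two = counts.most_common(2)
coverage = sum(n for _, n in top_two) / total

for haplotype, n in top_two:
    print(f"{haplotype}: {n / total:.0%}")
print(f"top two haplotypes cover {coverage:.0%} of chromosomes")
```

The remaining rare haplotypes in the toy sample stand in for the occasional recombinant or newly mutated chromosome.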

275
00:18:14,000 --> 00:18:18,000
Then there tend to be sites of
recombination that are actually

276
00:18:18,000 --> 00:18:21,000
hotspots of recombination where
most of the recombination of the

277
00:18:21,000 --> 00:18:24,000
population is concentrated.
And you get a couple of

278
00:18:24,000 --> 00:18:28,000
possibilities here. So, the
human genome can kind of be

279
00:18:28,000 --> 00:18:31,000
broken up into these haplotypes.
Blocks that might be 20,

280
00:18:31,000 --> 00:18:35,000
30, 40, sometimes 100 kilobases long
in which within the block you tend

281
00:18:35,000 --> 00:18:39,000
to have a small number of haplotypes,
or flavors as you might think of

282
00:18:39,000 --> 00:18:43,000
them, that define most of the
chromosomes in the population.

283
00:18:43,000 --> 00:18:46,000
So, in fact, I don't actually
need to know all the variants.

284
00:18:46,000 --> 00:18:50,000
If they're so well
correlated within a block,

285
00:18:50,000 --> 00:18:54,000
if I knew this block structure I
would be able to pick a small number

286
00:18:54,000 --> 00:18:58,000
of SNPs that would serve as a proxy
for that entire block of inheritance

287
00:18:58,000 --> 00:19:01,000
in the population. So,
what you might want to do is

288
00:19:01,000 --> 00:19:04,000
determine that entire haplotype
block structure of how they're

289
00:19:04,000 --> 00:19:08,000
related to each other,
and pick out tag SNPs.
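
[Editor's sketch] The idea of picking a few SNPs as proxies for a whole block can be illustrated with a simple greedy selection. This Python sketch is not the method used by any particular project. It assumes haplotypes coded as strings of 0 and 1, uses the squared correlation r² between SNPs as the proxy measure, and uses a threshold of 0.8, which is an assumption of the sketch. All names and data are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between SNP i and SNP j across the haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:            # a SNP that does not vary carries no information
        return 0.0
    return (p_ij - p_i * p_j) ** 2 / denom


def pick_tag_snps(haplotypes, threshold=0.8):
    """Greedily choose SNPs until every SNP is well correlated with some tag."""
    n_snps = len(haplotypes[0])
    # For each SNP, the set of SNPs it can stand in for (always including itself).
    covers = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Take the untagged SNP that covers the most still-untagged SNPs.
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy data: SNPs 0-2 form one block, SNPs 3-5 another, and the blocks are independent.
haplotypes = ["000000"] * 5 + ["000111"] * 5 + ["111000"] * 5 + ["111111"] * 5
tags = pick_tag_snps(haplotypes)
print(tags)
```

In the toy data one SNP per block is enough, so six SNPs reduce to two tags. The same logic, applied genome-wide, is what lets a few hundred thousand SNPs stand in for millions.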

290
00:19:08,000 --> 00:19:11,000
And it turns out that in theory,
a mere 300,000 or so of them would

291
00:19:11,000 --> 00:19:14,000
suffice to proxy for most of
the genome. So, you might want to

292
00:19:14,000 --> 00:19:18,000
declare an international project,
an international haplotype map

293
00:19:18,000 --> 00:19:21,000
project to create a haplotype
map of the human genome.

294
00:19:21,000 --> 00:19:24,000
And indeed, such a project was
declared about a year and a half ago

295
00:19:24,000 --> 00:19:28,000
through some instigation of
scientists at a number of places,

296
00:19:28,000 --> 00:19:31,000
including here. And this
is $100 million project

297
00:19:31,000 --> 00:19:35,000
involving six different countries.
And, it is already more than

298
00:19:35,000 --> 00:19:39,000
halfway done with the task,
and it's very likely that by the

299
00:19:39,000 --> 00:19:42,000
middle of next year, we will
have a pretty good haplotype

300
00:19:42,000 --> 00:19:46,000
map, not just knowing all
the variation, but knowing the

301
00:19:46,000 --> 00:19:50,000
correlation between that variation,
being able to break up the genome

302
00:19:50,000 --> 00:19:53,000
into these blocks. By
the next time I teach 7.01,

303
00:19:53,000 --> 00:19:57,000
I should be able to show a haplotype
map of the whole human genome

304
00:19:57,000 --> 00:20:01,000
already. That will allow you to
start undertaking systematic studies

305
00:20:01,000 --> 00:20:05,000
of inheritance for different
diseases across populations.

306
00:20:05,000 --> 00:20:08,000
And in fact, people are
already doing things like that.

307
00:20:08,000 --> 00:20:12,000
Here's an example of a study
done here at MIT like this,

308
00:20:12,000 --> 00:20:15,000
where to study inflammatory bowel
disease, there was evidence that

309
00:20:15,000 --> 00:20:19,000
there might be a particular region
of the genome that contained it,

310
00:20:19,000 --> 00:20:22,000
and haplotypes were determined
across this, and blah,

311
00:20:22,000 --> 00:20:26,000
blah, blah, blah, blah, blah,
blah. And this red haplotype

312
00:20:26,000 --> 00:20:29,000
here turns out to confer high risk,
about a two-and-a-half-fold or higher

313
00:20:29,000 --> 00:20:33,000
risk of inflammatory
bowel disease.

314
00:20:33,000 --> 00:20:36,000
And it sits over some genes
involved in immune responses,

315
00:20:36,000 --> 00:20:40,000
certain cytokine genes and all
that. And, things like this have been

316
00:20:40,000 --> 00:20:44,000
done for type 2 diabetes,
schizophrenia, cardiovascular

317
00:20:44,000 --> 00:20:47,000
disease, just right now at the
moment, a dozen or two examples.

318
00:20:47,000 --> 00:20:51,000
But I think we're set for an
explosion in this kind of work.

319
00:20:51,000 --> 00:20:55,000
In addition, you can use this
information to do things beyond

320
00:20:55,000 --> 00:20:59,000
medical genetics. You
can use it for history and

321
00:20:59,000 --> 00:21:03,000
anthropology as well. It
turns out rather interestingly,

322
00:21:03,000 --> 00:21:07,000
that since the human population
originated in Africa and spread out

323
00:21:07,000 --> 00:21:12,000
from Africa all the way around the
world arriving at different places

324
00:21:12,000 --> 00:21:17,000
in different times, you can
trace those migrations by

325
00:21:17,000 --> 00:21:21,000
virtue of rare genetic variants
that arose along the way,

326
00:21:21,000 --> 00:21:26,000
and let you, like a trail of
breadcrumbs, see the migrations.

327
00:21:26,000 --> 00:21:30,000
So, for example, there are certain
rare genetic variants that we can

328
00:21:30,000 --> 00:21:35,000
see in a South American Indian tribe,
and we can actually see that they

329
00:21:35,000 --> 00:21:40,000
came along this route because
we can see the residue of that.

330
00:21:40,000 --> 00:21:45,000
In fact, we can do things with this
like take a look at Native American

331
00:21:45,000 --> 00:21:50,000
individuals and determine that they
cluster into three distinct genetic

332
00:21:50,000 --> 00:21:55,000
groups that represent three distinct
migrations over the land bridge.

333
00:21:55,000 --> 00:22:00,000
And, you can assign them to
these different migrations.

334
00:22:00,000 --> 00:22:03,000
You can do this on the basis
of mitochondrial genotype,

335
00:22:03,000 --> 00:22:06,000
etc. You can also, for example,
determine when people talk about the

336
00:22:06,000 --> 00:22:09,000
out of Africa migration, there's
now increasing evidence that

337
00:22:09,000 --> 00:22:13,000
there really were two, one that
went this way over the land,

338
00:22:13,000 --> 00:22:16,000
and one that went this way following
along the coast into southeast Asia.

339
00:22:16,000 --> 00:22:19,000
And, it looks like we're now
beginning to get enough evidence of

340
00:22:19,000 --> 00:22:22,000
these two separate migrations by
virtue of the genetic breadcrumbs

341
00:22:22,000 --> 00:22:26,000
that they have
left along the way.

342
00:22:26,000 --> 00:22:30,000
So, it's really a very fascinating
thing of how much you can

343
00:22:30,000 --> 00:22:34,000
reconstruct from looking at genetic
variation, both the common variation

344
00:22:34,000 --> 00:22:38,000
that allows us to recognize medical
risk, and the rare genetic variation

345
00:22:38,000 --> 00:22:43,000
that provides much more
individual trails of things.

346
00:22:43,000 --> 00:22:47,000
None of this is perfect yet.
There's lots to learn. But I think

347
00:22:47,000 --> 00:22:51,000
anthropologists are finding that
the existing human population has a

348
00:22:51,000 --> 00:22:55,000
tremendous amount of its own history
embedded in the pattern of genetic

349
00:22:55,000 --> 00:23:00,000
variation across the world.
You can do other things.

350
00:23:00,000 --> 00:23:04,000
I won't spend much time on this.
Well, I'll take a moment on this,

351
00:23:04,000 --> 00:23:09,000
right? There's some very
interesting work of a post-doctoral

352
00:23:09,000 --> 00:23:13,000
fellow here at MIT named Pardis
Sabeti who has been trying to ask,

353
00:23:13,000 --> 00:23:18,000
can we see in the genetic
variation in the population,

354
00:23:18,000 --> 00:23:22,000
signatures, patterns of ancient
selection, or even recent selection

355
00:23:22,000 --> 00:23:27,000
in the human population?
Now, hang onto your seats,

356
00:23:27,000 --> 00:23:32,000
because this will get
just slightly tricky.

357
00:23:32,000 --> 00:23:35,000
But, hang on. It's only a couple
of slides. Here was her idea.

358
00:23:35,000 --> 00:23:39,000
You see, when a mutation
arises in the population,

359
00:23:39,000 --> 00:23:43,000
it usually dies out,
right? Any new mutation just

360
00:23:43,000 --> 00:23:47,000
typically dies out. But,
sometimes by chance it drifts

361
00:23:47,000 --> 00:23:50,000
up to a high frequency.
Random events happen. But it

362
00:23:50,000 --> 00:23:54,000
usually takes a long time to do
that. If some random mutation happens,

363
00:23:54,000 --> 00:23:58,000
and it happens to drift up to high
frequency with no selection on it,

364
00:23:58,000 --> 00:24:02,000
then on average it takes
a long time to do so.

365
00:24:02,000 --> 00:24:05,000
If you want, I could write a
stochastic differential equation

366
00:24:05,000 --> 00:24:09,000
that would say that, but just
take your gut feeling that

367
00:24:09,000 --> 00:24:12,000
if something has no selection on it
and it's a rare event that'll drift

368
00:24:12,000 --> 00:24:16,000
up, when it drifts up it's kind of a
slow process. It was a slow process.

369
00:24:16,000 --> 00:24:20,000
Then over the course of time that
it took to drift to high frequency,

370
00:24:20,000 --> 00:24:23,000
a lot of genetic recombination
would have occurred over many

371
00:24:23,000 --> 00:24:27,000
generations. And the correlation
between the genotype at that spot

372
00:24:27,000 --> 00:24:31,000
and genotypes at other
loci would break down.

373
00:24:31,000 --> 00:24:34,000
And there would only be
short-range correlation. So,

374
00:24:34,000 --> 00:24:38,000
in other words, the amount of
correlation between knowing the

375
00:24:38,000 --> 00:24:41,000
genotype here and the genotype here,
maybe allele A here and a C here.

376
00:24:41,000 --> 00:24:45,000
That is an indication of time.
It's a clock almost. It's like

377
00:24:45,000 --> 00:24:49,000
radioactive decay, right,
that genetic recombination

378
00:24:49,000 --> 00:24:52,000
scrambles up the correlations.
And, if something's old, the

379
00:24:52,000 --> 00:24:56,000
correlations go over short distances.
But suppose that something happened.

380
00:24:56,000 --> 00:25:00,000
Some mutation happened
that was very advantageous.

381
00:25:00,000 --> 00:25:03,000
Then, it would have risen to high
frequency quickly because it was

382
00:25:03,000 --> 00:25:07,000
under selection. If
it did so quickly,

383
00:25:07,000 --> 00:25:11,000
then the long-range correlations
would not have had time to break

384
00:25:11,000 --> 00:25:15,000
down, and we'd have a smoking gun.
A smoking gun would be that there

385
00:25:15,000 --> 00:25:18,000
would be a long-range
correlation around that locus,

386
00:25:18,000 --> 00:25:22,000
much longer than you would
expect across the genome.

387
00:25:22,000 --> 00:25:26,000
Things even out at this distance
would show correlation with that,

388
00:25:26,000 --> 00:25:30,000
indicating that this
was a recent event.
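
The "clock" argument above can be sketched numerically. This is a minimal illustration, not the analysis described in the lecture: it uses the standard population-genetics result that the association between two sites shrinks by a factor of (1 - r) per generation, where r is the chance that recombination separates them. The recombination fraction and generation counts below are made-up values chosen only to contrast a slow drift with a fast sweep.

```python
def remaining_correlation(recomb_fraction: float, generations: int) -> float:
    """Fraction of the original association between two sites that
    survives after the given number of generations."""
    return (1.0 - recomb_fraction) ** generations


# Hypothetical values, for illustration only.
RECOMB_FRACTION = 0.001        # chance per generation of separating the two sites
SLOW_DRIFT_GENERATIONS = 5000  # a neutral variant drifting slowly to high frequency
FAST_SWEEP_GENERATIONS = 400   # a selected variant rising quickly

old_neutral = remaining_correlation(RECOMB_FRACTION, SLOW_DRIFT_GENERATIONS)
recent_selected = remaining_correlation(RECOMB_FRACTION, FAST_SWEEP_GENERATIONS)

print(f"old neutral variant:     {old_neutral:.3f}")
print(f"recent selected variant: {recent_selected:.3f}")
```

With these assumed numbers, almost none of the association survives the slow drift, while most of it survives the fast sweep, which is the long-range "smoking gun".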

389
00:25:30,000 --> 00:25:34,000
So, we just measure across the
genome, and look for this telltale

390
00:25:34,000 --> 00:25:39,000
sign of common variants that have
very long range correlation that

391
00:25:39,000 --> 00:25:44,000
indicate that they're very recent.
So, a plot of the allele frequency,

392
00:25:44,000 --> 00:25:49,000
common variants, sorry, if something
has a common high frequency and

393
00:25:49,000 --> 00:25:54,000
long-range correlation, you
wouldn't expect that by chance.

394
00:25:54,000 --> 00:25:58,000
So, something that
was common in its

395
00:25:58,000 --> 00:26:02,000
frequency and had long-range
correlation would be a signature of

396
00:26:02,000 --> 00:26:06,000
positive selection. So
anyway, Pardis had this idea,

397
00:26:06,000 --> 00:26:09,000
and she tried it out with
some interesting mutations,

398
00:26:09,000 --> 00:26:13,000
some mutations that confer
resistance to malaria,

399
00:26:13,000 --> 00:26:17,000
one well-known mutation causing
resistance to malaria called G6PD

400
00:26:17,000 --> 00:26:21,000
and another one that she herself
had proposed as a mutation causing

401
00:26:21,000 --> 00:26:24,000
resistance to malaria,
variants in the CD40 ligand gene.

402
00:26:24,000 --> 00:26:28,000
And to make a long story short,
both the known and her newly

403
00:26:28,000 --> 00:26:32,000
predicted variant showed this
telltale property of having a high

404
00:26:32,000 --> 00:26:36,000
frequency and very
long range correlation.

405
00:26:36,000 --> 00:26:40,000
Well that's very interesting because
she was able to show that each of

406
00:26:40,000 --> 00:26:44,000
these mutations probably were
the result of positive selection.

407
00:26:44,000 --> 00:26:49,000
But what you could do in principle
is test every variant in the human

408
00:26:49,000 --> 00:26:53,000
genome this way: take any variant,
look at its frequency, and compare

409
00:26:53,000 --> 00:26:58,000
it to the long range correlation
around it, and test every single

410
00:26:58,000 --> 00:27:02,000
variant in the human population to
see which ones might be the result

411
00:27:02,000 --> 00:27:06,000
of positive selection.
Now, when she proposed this,

412
00:27:06,000 --> 00:27:09,000
this was about a year and
a half ago or two years ago,

413
00:27:09,000 --> 00:27:12,000
this was a pretty nutty idea because
you would need all the variants in

414
00:27:12,000 --> 00:27:15,000
the human population, and
you would need all this

415
00:27:15,000 --> 00:27:18,000
correlation information.
But in fact, as I say, that

416
00:27:18,000 --> 00:27:21,000
information's almost upon us, and
I believe that this experiment,

417
00:27:21,000 --> 00:27:24,000
this analysis to look for all strong
positive selection in the human

418
00:27:24,000 --> 00:27:27,000
genome will in fact be done in
the course of the next 12 months.

419
00:27:27,000 --> 00:27:30,000
So, I'm hoping by next year I can
actually report on a genome-wide

420
00:27:30,000 --> 00:27:33,000
search for all the signatures
of positive selection.

421
00:27:33,000 --> 00:27:36,000
Now, this doesn't detect
all positive selection.

422
00:27:36,000 --> 00:27:39,000
It will detect sufficiently strong
positive selection going back pretty

423
00:27:39,000 --> 00:27:42,000
much only over the last 10,000
years. When you do the

424
00:27:42,000 --> 00:27:45,000
arithmetic, that's how much
power you have. Of course,

425
00:27:45,000 --> 00:27:48,000
10,000 years has been a pretty
interesting time for the human

426
00:27:48,000 --> 00:27:52,000
population, right? The
time of civilization and

427
00:27:52,000 --> 00:27:55,000
population density,
and infectious diseases,

428
00:27:55,000 --> 00:27:58,000
and all that, and I think we'll
have an interesting window into

429
00:27:58,000 --> 00:28:02,000
the change in diet. All
of that should come out of

430
00:28:02,000 --> 00:28:06,000
something like this. So,
there's a lot of really cool

431
00:28:06,000 --> 00:28:10,000
information in DNA variation
to be had. All right,

432
00:28:10,000 --> 00:28:14,000
that's one half. The other half of
what I would like to talk about is

433
00:28:14,000 --> 00:28:18,000
totally different. It's
not about inherited DNA

434
00:28:18,000 --> 00:28:22,000
variation. It's about somatic
differences between tissues in RNA

435
00:28:22,000 --> 00:28:26,000
variation. So,
let's shift gears.

436
00:28:26,000 --> 00:28:30,000
RNA variation: let me start
by giving you an example here.

437
00:28:30,000 --> 00:28:36,000
These are cells from two different
patients with acute leukemia.

438
00:28:36,000 --> 00:28:43,000
Can you spot the difference between
these? Yep? More like bunches of

439
00:28:43,000 --> 00:28:49,000
grapes and all that. Yeah,
it turns out that's just a

440
00:28:49,000 --> 00:28:56,000
reflection of the field of
view you have if you move over

441
00:28:56,000 --> 00:29:02,000
to look like that. But
I mean, that's good.

442
00:29:02,000 --> 00:29:07,000
It's just that it turns out that
that isn't actually a distinction

443
00:29:07,000 --> 00:29:12,000
when you look at more fields.
Anything else? Yep? White blood

444
00:29:12,000 --> 00:29:16,000
cells look different. They
look broken. There's more of

445
00:29:16,000 --> 00:29:21,000
them in this field of view. But
you look at 100 fields of view

446
00:29:21,000 --> 00:29:26,000
and it turns out that's not either.
Well, the reason you're having

447
00:29:26,000 --> 00:29:31,000
trouble spotting any difference
is that highly trained pathologists

448
00:29:31,000 --> 00:29:35,000
can't find any difference either.
They generally agree there's no

449
00:29:35,000 --> 00:29:39,000
difference between these two if
you look at enough fields of view.

450
00:29:39,000 --> 00:29:43,000
But you can convince yourself if
you look that you see things there.

451
00:29:43,000 --> 00:29:46,000
But these actually are two very
different kinds of leukemia.

452
00:29:46,000 --> 00:29:50,000
And, these patients have to
be treated very differently.

453
00:29:50,000 --> 00:29:54,000
But, pathologists cannot determine
which leukemia it is just by looking

454
00:29:54,000 --> 00:29:57,000
at the microscope, it turns out.
This is the work of this man,

455
00:29:57,000 --> 00:30:01,000
Sidney Farber, namesake of the
Dana-Farber Cancer Institute here in

456
00:30:01,000 --> 00:30:05,000
Boston, who in the 1950s began
noticing that patients with

457
00:30:05,000 --> 00:30:08,000
leukemias, some of them seemed
different in the way they responded

458
00:30:08,000 --> 00:30:12,000
to a certain treatment, and
he said, look, I think there's

459
00:30:12,000 --> 00:30:16,000
some underlying classification
of these leukemias,

460
00:30:16,000 --> 00:30:19,000
but I can't get any reliable
way to tell it in the microscope.

461
00:30:19,000 --> 00:30:23,000
And he put many years into working
this out, first by noticing certain

462
00:30:23,000 --> 00:30:27,000
differences in enzymes in the cells,
and then people noticed certain

463
00:30:27,000 --> 00:30:31,000
things in cell surface markers,
and some chromosomal rearrangements.

464
00:30:31,000 --> 00:30:34,000
And nowadays, there are a bunch
of tests that can be done by a

465
00:30:34,000 --> 00:30:38,000
pathologist when a patient comes
in with acute leukemia to determine

466
00:30:38,000 --> 00:30:42,000
whether they have AML or ALL.
But it turns out that you can't do

467
00:30:42,000 --> 00:30:46,000
it by looking. You
have to do some kind of

468
00:30:46,000 --> 00:30:50,000
immunohistochemical test of
some sort in order to do that.

469
00:30:50,000 --> 00:30:54,000
So this is a triumph of diagnosis.
After 40 years of work, we can now

470
00:30:54,000 --> 00:30:58,000
correctly classify patients
as AML or ALL. And they get the

471
00:30:58,000 --> 00:31:02,000
appropriate treatment. And
if they don't get the right

472
00:31:02,000 --> 00:31:06,000
treatment, they have a
much higher chance of dying.

473
00:31:06,000 --> 00:31:10,000
And if they do get the right
treatment, they have a much higher

474
00:31:10,000 --> 00:31:14,000
chance of living.
So, this is great.

475
00:31:14,000 --> 00:31:18,000
There's only one problem with
the story. It took 40 years,

476
00:31:18,000 --> 00:31:22,000
40 years to sort this out.
That's a long time. Couldn't we do

477
00:31:22,000 --> 00:31:26,000
better? Surely these
cells know what they are.

478
00:31:26,000 --> 00:31:30,000
Surely we could just ask them what
they are. Well, here's the idea.

479
00:31:30,000 --> 00:31:33,000
Suppose we could ask each cell,
please tell us every gene that you

480
00:31:33,000 --> 00:31:37,000
have turned on, and the
level to which you have that

481
00:31:37,000 --> 00:31:40,000
gene expressed.
In other words,

482
00:31:40,000 --> 00:31:44,000
let us summarize each cell, each
tumor by a description of its

483
00:31:44,000 --> 00:31:47,000
complete pattern of gene expression
across the 22,000 genes in the human genome.

484
00:31:47,000 --> 00:31:51,000
Let's write down the level
of expression, X1 up to X22,000,

485
00:31:51,000 --> 00:31:54,000
for each of the 22,000
genes of the genome. So,

486
00:31:54,000 --> 00:31:58,000
every tumor becomes a point in
22,000 dimensional space, right?

487
00:31:58,000 --> 00:32:01,000
Now clearly, if we had every
tumor described as a point in

488
00:32:01,000 --> 00:32:05,000
22,000 dimensional space, we ought to
be able to sort out which tumors are

489
00:32:05,000 --> 00:32:09,000
similar to each other, right?
Well, it turns out you can

490
00:32:09,000 --> 00:32:13,000
do that now. These are gene chips,
one of several technologies by which

491
00:32:13,000 --> 00:32:17,000
on a piece of glass are put little
spots, each of which contains a

492
00:32:17,000 --> 00:32:21,000
piece of DNA, a unique DNA sequence.
Actually, many copies of that DNA

493
00:32:21,000 --> 00:32:25,000
sequence are there. Each of
these is a 25 base long DNA

494
00:32:25,000 --> 00:32:29,000
sequence, and I can design this
so whatever DNA sequence you

495
00:32:29,000 --> 00:32:32,000
want is in each spot. The way
that's done is with the same

496
00:32:32,000 --> 00:32:36,000
photolithographic techniques that
are used to make microprocessors.

497
00:32:36,000 --> 00:32:40,000
People have worked out a
chemistry where through a mask,

498
00:32:40,000 --> 00:32:44,000
you shine a light, photodeprotect
certain pixels; the pixels that are

499
00:32:44,000 --> 00:32:48,000
photodeprotected you can chemically
attach an A, then re-protect the

500
00:32:48,000 --> 00:32:52,000
surface. Use a light.
Chemically photodeprotect certain

501
00:32:52,000 --> 00:32:56,000
spots. Wash on a C.
And in this fashion,

502
00:32:56,000 --> 00:33:00,000
since you can randomly
address the spots by light,

503
00:33:00,000 --> 00:33:04,000
and then chemically add bases to
whatever spots are deprotected,

504
00:33:04,000 --> 00:33:08,000
you can simultaneously construct
hundreds of thousands of spots each

505
00:33:08,000 --> 00:33:12,000
containing its own unique
specified oligonucleotide sequence.

506
00:33:12,000 --> 00:33:16,000
And you can get them
in little plastic chips.

507
00:33:16,000 --> 00:33:20,000
And then if you want, all
you do is you take a tumor.

508
00:33:20,000 --> 00:33:24,000
You grind it up. You prepare RNA.
You fluorescently label the RNA

509
00:33:24,000 --> 00:33:28,000
with some appropriate fluorescent
dye. You squirt it into the chip.

510
00:33:28,000 --> 00:33:31,000
You wash it back and forth.
You rock it back and forth,

511
00:33:31,000 --> 00:33:35,000
wash it out, and stick it in a
laser scanner. And it'll see how much

512
00:33:35,000 --> 00:33:38,000
fluorescence is stuck to each spot.
And bingo: you get a readout of the

513
00:33:38,000 --> 00:33:42,000
level of gene expression. I
guess each spot, you should

514
00:33:42,000 --> 00:33:45,000
design it so that this spot has
an oligonucleotide complementary to

515
00:33:45,000 --> 00:33:49,000
gene number one.
And the next one,

516
00:33:49,000 --> 00:33:53,000
an oligonucleotide matching
by Watson-Crick base pairing

517
00:33:53,000 --> 00:33:56,000
complementary to gene number
two and gene number three.
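
The notion of a spot "complementary to gene number one" can be sketched as a reverse-complement calculation. This is only an illustration under simplifying assumptions: the target sequence is invented, it is written in the DNA alphabet, and real probe selection involves considerations the lecture does not go into. The names `reverse_complement` and `probe_for` are hypothetical helpers.

```python
# Watson-Crick pairing partners for each base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the strand that base-pairs with `seq`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def probe_for(target: str, length: int = 25) -> str:
    """Hypothetical helper: a probe that pairs with the first `length`
    bases of the target sequence."""
    return reverse_complement(target[:length])


# Invented 30-base target, for illustration only.
target = "ATGGCGTACCTGAAGTTCGACGGATCCTAA"
probe = probe_for(target)
print(probe, len(probe))
```

The default length of 25 mirrors the 25-base spots mentioned earlier in the lecture.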

518
00:33:56,000 --> 00:34:00,000
So, if I knew all the genes in the
genome, I could make a detector spot

519
00:34:00,000 --> 00:34:03,000
for each gene in the genome.
And of course we know essentially

520
00:34:03,000 --> 00:34:07,000
all the genes in the genome.
So you can make those detector

521
00:34:07,000 --> 00:34:10,000
spots and you can buy them.
So, you can now get a readout of

522
00:34:10,000 --> 00:34:13,000
all the, I mean, this is
like so cool because when I

523
00:34:13,000 --> 00:34:17,000
started teaching 7.01, which
wasn't that long ago because I

524
00:34:17,000 --> 00:34:20,000
ain't (sic) that old still, the
way people did an analysis of

525
00:34:20,000 --> 00:34:23,000
gene expression is they used
primitive technologies where they

526
00:34:23,000 --> 00:34:27,000
would analyze one gene at a time,
certain things called northern blots

527
00:34:27,000 --> 00:34:30,000
and things like that,
right? And, you know,

528
00:34:30,000 --> 00:34:34,000
you'd put in a lot of work and you
get the expression level of a gene,

529
00:34:34,000 --> 00:34:37,000
whereas now you can get the
expression of all the genes

530
00:34:37,000 --> 00:34:41,000
simultaneously, and it's
pretty mind boggling that

531
00:34:41,000 --> 00:34:44,000
you can do that. How do
you analyze data like that?

532
00:34:44,000 --> 00:34:48,000
So, we still use northern
blots. It's true. So,

533
00:34:48,000 --> 00:34:51,000
every tumor becomes a vector, and
we get a vector corresponding to

534
00:34:51,000 --> 00:34:55,000
each tumor. So, this line
here is the first tumor,

535
00:34:55,000 --> 00:34:59,000
the second tumor, the third
tumor, the fourth tumor.

536
00:34:59,000 --> 00:35:02,000
The columns here correspond
to genes. There are

537
00:35:02,000 --> 00:35:06,000
22,000 columns in this matrix, and
I've shown a certain subset of

538
00:35:06,000 --> 00:35:10,000
the columns because these genes here
have the interesting property that

539
00:35:10,000 --> 00:35:14,000
they tend to be high red in the ALL
tumors, and they tend to be low blue

540
00:35:14,000 --> 00:35:18,000
in the AML tumors, whereas
these genes here have the

541
00:35:18,000 --> 00:35:22,000
opposite property. They tend
to be low blue in the ALL

542
00:35:22,000 --> 00:35:26,000
tumors and high red in the AML
tumors. These genes do a pretty

543
00:35:26,000 --> 00:35:30,000
good job of telling
apart these tumors.
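
Classifying a new tumor from a handful of informative genes can be sketched as follows. This is a nearest-centroid toy, not the method used in the published study; the expression values are made up, with the first two genes standing in for the "high in ALL" set and the last two for the "high in AML" set.

```python
import math


def centroid(vectors):
    """Average expression vector of a group of tumors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(sample, centroids):
    """Label of the class whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: distance(sample, centroids[label]))


# Invented expression levels for four informative genes.
known = {
    "ALL": [[9.1, 8.7, 1.2, 0.8], [8.5, 9.3, 1.9, 1.1], [9.8, 8.1, 0.7, 1.5]],
    "AML": [[1.3, 0.9, 8.8, 9.4], [2.0, 1.4, 9.1, 8.2], [0.8, 1.7, 9.6, 8.9]],
}
centroids = {label: centroid(vectors) for label, vectors in known.items()}

new_tumor = [1.5, 1.1, 9.0, 8.5]
print(classify(new_tumor, centroids))  # prints "AML"
```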

544
00:35:30,000 --> 00:35:35,000
So, here's a new tumor.
Patient came in. We analyzed the

545
00:35:35,000 --> 00:35:40,000
RNA, squirted it on the chip. Can
somebody classify that? Louder?

546
00:35:40,000 --> 00:35:45,000
AML. Next? Next?
Congratulations, you're

547
00:35:45,000 --> 00:35:50,000
pathologists. Very good.
That's right, you can do that.

548
00:35:50,000 --> 00:35:56,000
It works. And in fact, in the
study that was done that was

549
00:35:56,000 --> 00:36:01,000
published about this, the
computer was able to get it

550
00:36:01,000 --> 00:36:05,000
right 100% of the time.
Not bad. So now you say,

551
00:36:05,000 --> 00:36:09,000
wait, wait, wait,
but you're cheating.

552
00:36:09,000 --> 00:36:12,000
You're giving it a whole bunch of
knowns. Once I have a whole bunch

553
00:36:12,000 --> 00:36:15,000
of knowns it's not so hard
to classify a new tumor.

554
00:36:15,000 --> 00:36:19,000
What Sidney Farber did was he
discovered in the first place that

555
00:36:19,000 --> 00:36:22,000
there existed two subtypes.
Surely that's harder than

556
00:36:22,000 --> 00:36:26,000
classifying when you're
given a bunch of knowns. And

557
00:36:26,000 --> 00:36:29,000
that's true. So,
suppose instead,

558
00:36:29,000 --> 00:36:33,000
I didn't tell you in advance which
were AML's and which were ALL's,

559
00:36:33,000 --> 00:36:37,000
and I just gave you vectors
corresponding to a large number of

560
00:36:37,000 --> 00:36:41,000
tumors, do you think you would be
able to sort out that they actually

561
00:36:41,000 --> 00:36:49,000
fell into
two clusters?

562
00:36:49,000 --> 00:36:53,000
Could you by computer tell that
there's one class and the other

563
00:36:53,000 --> 00:36:57,000
class? Turns out that you can.
Now, I've made it a little easier

564
00:36:57,000 --> 00:37:02,000
by not listing most of
the 22,000 columns here.

565
00:37:02,000 --> 00:37:06,000
But think about it. Every
tumor is a point in

566
00:37:06,000 --> 00:37:10,000
22,000 dimensional space. If some
of the tumors are similar,

567
00:37:10,000 --> 00:37:14,000
what can you say about those
points in 22,000 dimensional space?

568
00:37:14,000 --> 00:37:18,000
They're going to be clumped
together. They're near each other.

569
00:37:18,000 --> 00:37:22,000
So, just plot every tumor as a
point in 22,000 dimensional space,

570
00:37:22,000 --> 00:37:26,000
and your question is, do the points
tend to lie in two clumps up in

571
00:37:26,000 --> 00:37:30,000
22,000 dimensional space? And
there's simple arithmetic you

572
00:37:30,000 --> 00:37:34,000
can learn using linear algebra to
get some separating hyperplane and

573
00:37:34,000 --> 00:37:38,000
ask, do tumors lie on one
side or the other? And,

574
00:37:38,000 --> 00:37:42,000
it turns out that procedures like
that will quickly tell you that

575
00:37:42,000 --> 00:37:46,000
these tumors clump into two very
clear clumps. They're not randomly

576
00:37:46,000 --> 00:37:50,000
distributed. And so,
if you get these tumors,

577
00:37:50,000 --> 00:37:54,000
and you do gene expression on them
and put the data into a computer,

578
00:37:54,000 --> 00:37:58,000
the amount of time it takes the
computer to discover that there were

579
00:37:58,000 --> 00:38:02,000
actually two types of acute leukemia
is about three seconds marked down

580
00:38:02,000 --> 00:38:06,000
from 40 years. That's good. So,
you can reproduce the discovery

581
00:38:06,000 --> 00:38:10,000
of AML and ALL in three seconds.
Now you know what the pathologists

582
00:38:10,000 --> 00:38:14,000
say about this. They
say, oh, give me a break.

583
00:38:14,000 --> 00:38:18,000
It's shooting fish in a barrel.
We knew there was a distinction.

584
00:38:18,000 --> 00:38:22,000
Big deal that the computer
can find the distinction.

585
00:38:22,000 --> 00:38:26,000
We knew that there was a distinction
there. I know the computer didn't

586
00:38:26,000 --> 00:38:30,000
know it and all that. Tell
us something we don't know.
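
The unsupervised step described above, where the computer gets unlabeled expression vectors and finds two clumps, can be sketched with a small two-cluster k-means. This is a swapped-in technique for illustration, not the separating-hyperplane procedure the lecture alludes to, and the six "tumors" are invented four-gene vectors rather than real 22,000-gene profiles.

```python
import math


def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Coordinate-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_means(points, iterations=10):
    """Split points into two clusters; returns one 0/1 label per point."""
    # Deterministic start: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: distance(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearer center.
        labels = [
            0 if distance(p, centers[0]) <= distance(p, centers[1]) else 1
            for p in points
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = mean(members)
    return labels


# Invented, unlabeled expression vectors (two hidden groups, interleaved).
tumors = [
    [9.1, 8.7, 1.2, 0.8],
    [1.3, 0.9, 8.8, 9.4],
    [8.5, 9.3, 1.9, 1.1],
    [2.0, 1.4, 9.1, 8.2],
    [9.8, 8.1, 0.7, 1.5],
    [0.8, 1.7, 9.6, 8.9],
]
labels = two_means(tumors)
print(labels)  # prints [0, 1, 0, 1, 0, 1]
```

No class labels are supplied anywhere; the two groups fall out of the distances alone.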

587
00:38:30,000 --> 00:38:35,000
That's a fair question. So
it turns out that you can ask

588
00:38:35,000 --> 00:38:40,000
some more questions. You
can say, suppose I take now

589
00:38:40,000 --> 00:38:45,000
just the ALL's. Are
they a homogeneous class,

590
00:38:45,000 --> 00:38:50,000
or do they fall into two classes?
It turns out that extending this

591
00:38:50,000 --> 00:38:55,000
work, folks here were able to show
that we can further split that ALL

592
00:38:55,000 --> 00:39:00,000
class. There was a hint that you
might be able to do so because

593
00:39:00,000 --> 00:39:06,000
there's some ALL patients who have
disruptions of a gene called MLL.

594
00:39:06,000 --> 00:39:09,000
And this tends to be a
little more common in infants,

595
00:39:09,000 --> 00:39:13,000
and tends to be associated
with a poor prognosis.

596
00:39:13,000 --> 00:39:16,000
But it was really very unclear
whether this was simply one of a

597
00:39:16,000 --> 00:39:20,000
zillion factoids about some
leukemia patients, whether this was a

598
00:39:20,000 --> 00:39:24,000
fundamental distinction. So,
what happened was folks took a

599
00:39:24,000 --> 00:39:27,000
lot of ALL patients, got
their expression profiles,

600
00:39:27,000 --> 00:39:31,000
and lo and behold it turned out
that ALL itself broke into two very

601
00:39:31,000 --> 00:39:34,000
different clusters. This is
an artist's rendition of a

602
00:39:34,000 --> 00:39:38,000
22,000 dimensional space.
We can't afford a 22,000

603
00:39:38,000 --> 00:39:42,000
dimensional projector here, so
we're just using two dimensions.

604
00:39:42,000 --> 00:39:46,000
But, the two forms of ALL were
quite distinct from each other,

605
00:39:46,000 --> 00:39:50,000
and so actually ALL itself should
be split up into two classes,

606
00:39:50,000 --> 00:39:54,000
ALL plus and minus, or ALL
one and two, or MLL and ALL.

607
00:39:54,000 --> 00:39:58,000
And it turns out that these
forms are quite different.

608
00:39:58,000 --> 00:40:02,000
They have different outcomes and
should be treated differently.

609
00:40:02,000 --> 00:40:07,000
It also turns out that a
particularly good distinction

610
00:40:07,000 --> 00:40:12,000
between these two subtypes of ALL is
found by looking at this particular

611
00:40:12,000 --> 00:40:17,000
gene called the FLT3 kinase.
The FLT3 kinase gene, whatever

612
00:40:17,000 --> 00:40:23,000
that is, was of great interest
because people know that they can

613
00:40:23,000 --> 00:40:28,000
make inhibitors against
certain kinases. And so,

614
00:40:28,000 --> 00:40:33,000
it turned out that an inhibitor
against FLT3 kinases,

615
00:40:33,000 --> 00:40:39,000
against this FLT3
kinase gene product.

616
00:40:39,000 --> 00:40:44,000
If you treat cells with that
inhibitor, cells of this type die,

617
00:40:44,000 --> 00:40:49,000
and cells of this type are
not affected. So in fact,

618
00:40:49,000 --> 00:40:54,000
there's a potential drug use of
FLT3 kinase inhibitors in the MLL class of

619
00:40:54,000 --> 00:41:00,000
these leukemias, and folks
are trying some clinical

620
00:41:00,000 --> 00:41:05,000
trials now. So, not only
did the analysis of the

621
00:41:05,000 --> 00:41:09,000
gene expression point to two
important sub-types of leukemias,

622
00:41:09,000 --> 00:41:14,000
but the analysis of the gene
expression even suggested potential

623
00:41:14,000 --> 00:41:19,000
targets for therapy. So,
I'll give you a bunch more

624
00:41:19,000 --> 00:41:23,000
examples. I have a bunch
more examples like that there.

625
00:41:23,000 --> 00:41:28,000
They are examples of taking
lymphomas and showing that they can

626
00:41:28,000 --> 00:41:33,000
be split into two different
categories, examples of taking

627
00:41:33,000 --> 00:41:38,000
breast cancers and splitting them into several
categories, colon cancers.

628
00:41:38,000 --> 00:41:42,000
Basically what's going on right now
is an attempt to reclassify cancers

629
00:41:42,000 --> 00:41:47,000
based not on what they
look like in the microscope,

630
00:41:47,000 --> 00:41:51,000
and based not on what organ
in the body they affect,

631
00:41:51,000 --> 00:41:56,000
but based on, molecularly, what
their description is, because

632
00:41:56,000 --> 00:42:01,000
the molecular description, as
Bob talked to you about with CML

633
00:42:01,000 --> 00:42:05,000
and with Gleevec, turns
out to be a tremendously

634
00:42:05,000 --> 00:42:10,000
powerful way of classifying cancers
because you're able to see what is

635
00:42:10,000 --> 00:42:15,000
the molecular defect and can
make a molecular targeted therapy.

636
00:42:15,000 --> 00:42:20,000
So, these sorts of tools are
quite cool, and I've got to say,

637
00:42:20,000 --> 00:42:25,000
in the last year we've begun using
these expression tools not just to

638
00:42:25,000 --> 00:42:30,000
classify cancers,
but to classify drugs.

639
00:42:30,000 --> 00:42:34,000
We've begun an interesting and
somewhat crazy project to take all

640
00:42:34,000 --> 00:42:38,000
the FDA approved drugs,
put them onto cell types,

641
00:42:38,000 --> 00:42:42,000
and see what they do, that is,
get a signature, a fingerprint,

642
00:42:42,000 --> 00:42:46,000
a gene expression description
of the action of a drug.

643
00:42:46,000 --> 00:42:50,000
And then we hope,
here's the nutty idea,

644
00:42:50,000 --> 00:42:54,000
that we can look up in the computer
which drugs do which things and

645
00:42:54,000 --> 00:42:58,000
might be useful for which diseases,
because we'd put the diseases and

646
00:42:58,000 --> 00:43:02,000
the drugs on an equal footing.
All of them would be described in

647
00:43:02,000 --> 00:43:06,000
terms of their gene
expression patterns. So,

648
00:43:06,000 --> 00:43:10,000
I'll tell you one interesting
example, OK? This is an interesting

649
00:43:10,000 --> 00:43:14,000
enough example. I don't
even have slides for it yet.

650
00:43:14,000 --> 00:43:18,000
It turns out that these patients
with ALL that I've been talking

651
00:43:18,000 --> 00:43:23,000
about, some of the patients
with ALL will respond to the drug

652
00:43:23,000 --> 00:43:27,000
dexamethasone. Some
won't. If you take patients

653
00:43:27,000 --> 00:43:31,000
who respond to dexamethasone,
and patients who are resistant to

654
00:43:31,000 --> 00:43:35,000
dexamethasone, and you
get their gene expression

655
00:43:35,000 --> 00:43:40,000
patterns, you can ask are there some
genes that explain the difference?

656
00:43:40,000 --> 00:43:44,000
And you can get a certain
gene signature, a list of,

657
00:43:44,000 --> 00:43:48,000
say, a dozen or so genes that do a
pretty good job of classifying who's

658
00:43:48,000 --> 00:43:53,000
sensitive and who's resistant.
Then you can go to this database I

659
00:43:53,000 --> 00:43:57,000
was telling you about of the
action of many drugs and say,

660
00:43:57,000 --> 00:44:01,000
do we see any drugs whose effect
would be to produce a signature

661
00:44:01,000 --> 00:44:06,000
of sensitivity? If
we found a drug X,

662
00:44:06,000 --> 00:44:10,000
which when we put it on cells turned
on those genes that correlate with

663
00:44:10,000 --> 00:44:14,000
being sensitive to dexamethasone,
you could hallucinate the following

664
00:44:14,000 --> 00:44:18,000
really happy possibility that when
you added that drug together with

665
00:44:18,000 --> 00:44:22,000
dexamethasone, you might
be able to treat resistant

666
00:44:22,000 --> 00:44:26,000
patients because that drug could
make them sensitive to dexamethasone,

667
00:44:26,000 --> 00:44:30,000
and that you could find that
drug just by looking it up in

668
00:44:30,000 --> 00:44:35,000
a computer database. So, we
tried it and we hit a drug.

669
00:44:35,000 --> 00:44:40,000
There was a certain drug
that came up on the screen,

670
00:44:40,000 --> 00:44:45,000
yes? That's very much in the idea
too. We found a drug that produced

671
00:44:45,000 --> 00:44:49,000
the signature of sensitivity, and
tested it in vitro. In vitro,

672
00:44:49,000 --> 00:44:54,000
if you take cells that are
resistant and you add dexamethasone,

673
00:44:54,000 --> 00:44:59,000
nothing happens because they're
resistant. If you add drug X,

674
00:44:59,000 --> 00:45:04,000
nothing happens. But if you add
both drug X plus dexamethasone,

675
00:45:04,000 --> 00:45:08,000
the cells drop dead. It's
now going into clinical trials

676
00:45:08,000 --> 00:45:12,000
in human patients. It turns
out drug X is already a

677
00:45:12,000 --> 00:45:15,000
well, FDA-approved drug, so
it can be tested in human

678
00:45:15,000 --> 00:45:19,000
patients right away, so
it's going to be tested.

679
00:45:19,000 --> 00:45:22,000
So, the gene expression pattern was
able to tell us to use a drug which

680
00:45:22,000 --> 00:45:26,000
actually had nothing to do with
cancer in a cancer setting

681
00:45:26,000 --> 00:45:30,000
because it might do
something helpful.

682
00:45:30,000 --> 00:45:33,000
Now, what's the point of all this?
We can turn up the lights because I

683
00:45:33,000 --> 00:45:37,000
think I'm going to stop the slides
there. The point of all of this,

684
00:45:37,000 --> 00:45:41,000
which I've made
before, and I will make again,

685
00:45:41,000 --> 00:45:45,000
because you are the generation
that's going to really live this,

686
00:45:45,000 --> 00:45:48,000
is that biology is
becoming information. Now,

687
00:45:48,000 --> 00:45:52,000
don't get me wrong.
It's not stopping being

688
00:45:52,000 --> 00:45:56,000
biochemistry. It's going to be
biochemistry. It's not stopping

689
00:45:56,000 --> 00:46:00,000
being molecular biology. It's
not stopping any of the things

690
00:46:00,000 --> 00:46:03,000
it was before.
But it is also becoming

691
00:46:03,000 --> 00:46:07,000
information, that for the first time
we're entering a world where we can

692
00:46:07,000 --> 00:46:11,000
collect vast amounts of information:
all the genetic variants in a

693
00:46:11,000 --> 00:46:15,000
patient, all of the gene
expression pattern in a cell,

694
00:46:15,000 --> 00:46:18,000
or all of the gene expression
pattern induced by a drug,

695
00:46:18,000 --> 00:46:22,000
and that whatever question you're
asking will be informed by being

696
00:46:22,000 --> 00:46:26,000
able to access that whole database.
In no way does it decrease the role

697
00:46:26,000 --> 00:46:30,000
of the individual smart scientist
working on his or her problem.

698
00:46:30,000 --> 00:46:32,000
To the contrary, the
goal is to empower the

699
00:46:32,000 --> 00:46:35,000
individual smart scientist so that
you have all of that information at

700
00:46:35,000 --> 00:46:38,000
your fingertips. There
are databases scattered

701
00:46:38,000 --> 00:46:41,000
around the web that have
sequences from different species,

702
00:46:41,000 --> 00:46:44,000
variations from the human population,
all of these drug databases,

703
00:46:44,000 --> 00:46:47,000
etc., etc., etc., etc. It's
a time of tremendous ferment,

704
00:46:47,000 --> 00:46:50,000
a little bit of chaos. You
talk to people in the field,

705
00:46:50,000 --> 00:46:53,000
they say, we're getting deluged by
data. We're getting crushed by the

706
00:46:53,000 --> 00:46:56,000
amount of data. I don't
know what to do with all

707
00:46:56,000 --> 00:46:59,000
the data. There's
only one solution for a

708
00:46:59,000 --> 00:47:02,000
field in that condition, and
that is young scientists because

709
00:47:02,000 --> 00:47:05,000
the young scientists who come into
the field are the ones who take for

710
00:47:05,000 --> 00:47:08,000
granted, of course we're
going to have all these data.

711
00:47:08,000 --> 00:47:11,000
We love having all these data.
This is just great, couldn't be

712
00:47:11,000 --> 00:47:14,000
happier to have all these data.
We're not put off by it in the

713
00:47:14,000 --> 00:47:17,000
least. That's what's going on.
That's what's so important about

714
00:47:17,000 --> 00:47:20,000
your generation, and that's
why I think it's really

715
00:47:20,000 --> 00:47:23,000
important that even though it's 7.01
and we're supposed to be teaching

716
00:47:23,000 --> 00:47:26,000
you the basics, it's
important that you see this

717
00:47:26,000 --> 00:47:29,000
stuff because this is the
change that's going on,

718
00:47:29,000 --> 00:47:32,000
and we're counting on this very
much to drive a revolution in health,

719
00:47:32,000 --> 00:47:35,000
a revolution in biomedical research,
and we're counting on you guys very

720
00:47:35,000 --> 00:47:39,000
much to drive that revolution. It
has been a pleasure to teach you

721
00:47:39,000 --> 00:47:43,000
this term. I hope many
of you will stay in touch,

722
00:47:43,000 --> 00:47:48,000
and some of you will go into biology,
and even those of you who don't will

723
00:47:48,000 --> 00:47:53,000
know lots about it and enjoy it.
Thank you very much. [APPLAUSE]