1
00:00:01,040 --> 00:00:03,460
The following content is
provided under a Creative

2
00:00:03,460 --> 00:00:04,870
Commons license.

3
00:00:04,870 --> 00:00:07,910
Your support will help MIT
OpenCourseWare continue to

4
00:00:07,910 --> 00:00:11,560
offer high quality educational
resources for free.

5
00:00:11,560 --> 00:00:14,460
To make a donation or view
additional materials from

6
00:00:14,460 --> 00:00:20,290
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:20,290 --> 00:00:21,540
ocw.mit.edu.

8
00:00:24,220 --> 00:00:27,450
PROFESSOR: I want to pick up
with a little bit of overlap

9
00:00:27,450 --> 00:00:31,230
just to remind people
where we were.

10
00:00:31,230 --> 00:00:35,490
We had been looking at
clustering, and we looked at a

11
00:00:35,490 --> 00:00:40,690
fairly simple example of using
agglomerative hierarchical

12
00:00:40,690 --> 00:00:46,130
clustering to cluster cities,
based upon how far apart they

13
00:00:46,130 --> 00:00:48,160
were from each other.

14
00:00:48,160 --> 00:00:51,570
So, essentially, using this
distance matrix, we could do a

15
00:00:51,570 --> 00:00:54,470
clustering that would
reflect how close

16
00:00:54,470 --> 00:00:57,030
cities were to one another.

17
00:00:57,030 --> 00:01:00,640
And we went through an
agglomerative clustering, and

18
00:01:00,640 --> 00:01:03,730
we saw that we would get a
different answer, depending

19
00:01:03,730 --> 00:01:07,790
upon which linkage criterion
we used.

20
00:01:07,790 --> 00:01:14,060
This is an important issue
because as one is using

21
00:01:14,060 --> 00:01:18,930
clustering, one has to be
aware that it is related to

22
00:01:18,930 --> 00:01:22,320
these things, and if you choose the
wrong linkage criterion,

23
00:01:22,320 --> 00:01:25,440
you might get an answer other
than the most useful.

24
00:01:28,160 --> 00:01:28,232
All right.

25
00:01:28,232 --> 00:01:33,580
I next went on and said, well,
this is pretty easy, because

26
00:01:33,580 --> 00:01:37,430
when we're comparing the
distance between two cities or

27
00:01:37,430 --> 00:01:42,840
the two features, we just
subtract one distance from the

28
00:01:42,840 --> 00:01:44,440
other and we get a number.

29
00:01:44,440 --> 00:01:46,670
It's very straightforward.

30
00:01:46,670 --> 00:01:49,960
I then raised the question,
suppose when we looked at

31
00:01:49,960 --> 00:01:54,330
cities, we looked at a more
complicated way of looking at

32
00:01:54,330 --> 00:01:59,160
them than airline distance.

33
00:01:59,160 --> 00:02:02,710
So the first question, I said,
well, suppose in addition to

34
00:02:02,710 --> 00:02:09,949
the distance by air, we add the
distance by road, or the

35
00:02:09,949 --> 00:02:12,150
average temperature.

36
00:02:12,150 --> 00:02:13,970
Pick what you will.

37
00:02:13,970 --> 00:02:16,310
What do we do?

38
00:02:16,310 --> 00:02:23,390
Well, the answer was we start by
generalizing from a feature

39
00:02:23,390 --> 00:02:31,860
being a single number to the
notion of a feature vector,

40
00:02:31,860 --> 00:02:38,320
where the features used to
describe the city are now

41
00:02:38,320 --> 00:02:42,855
represented by a vector,
typically of numbers.

42
00:02:47,280 --> 00:02:53,280
If the vectors are all in the
same physical units, we could

43
00:02:53,280 --> 00:02:58,410
easily imagine how we might
compare two vectors.

44
00:02:58,410 --> 00:03:02,310
So we might, for example,
look at the Euclidean distance

45
00:03:02,310 --> 00:03:06,480
between the two just by,
say, subtracting one

46
00:03:06,480 --> 00:03:09,880
vector from the other.

47
00:03:09,880 --> 00:03:13,700
However, if we think about
that, it can be pretty

48
00:03:13,700 --> 00:03:19,190
misleading because, for example,
when we look at a

49
00:03:19,190 --> 00:03:25,480
city, one element of the vector
represents the distance

50
00:03:25,480 --> 00:03:29,930
in miles from another city,
or in fact in this case, the

51
00:03:29,930 --> 00:03:32,610
distance in miles
to each city.

52
00:03:32,610 --> 00:03:36,940
And another represents
temperatures.

53
00:03:36,940 --> 00:03:39,630
Well, it's kind of funny to
compare distance, which might

54
00:03:39,630 --> 00:03:42,310
be thousands of miles,
with the temperature

55
00:03:42,310 --> 00:03:45,640
which might be 5 degrees.

56
00:03:45,640 --> 00:03:47,960
A 5 degree difference in average
temperature could be

57
00:03:47,960 --> 00:03:49,380
significant.

58
00:03:49,380 --> 00:03:52,270
Certainly a 20 degree difference
in temperature is

59
00:03:52,270 --> 00:03:57,070
very significant, but a 20 mile
difference in location

60
00:03:57,070 --> 00:03:59,840
might not be very significant.

61
00:03:59,840 --> 00:04:03,850
And so to equally weight a 20
degree temperature difference

62
00:04:03,850 --> 00:04:08,170
and a 20 mile distance
difference might give us a

63
00:04:08,170 --> 00:04:10,610
very peculiar answer.

64
00:04:10,610 --> 00:04:14,980
And so we have to think about
the question of, how are we

65
00:04:14,980 --> 00:04:18,799
going to scale the elements
of the vectors?
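
As a rough sketch (not the lecture's code; the cities and the numbers are invented), here is how an unscaled Euclidean distance treats a 20 mile difference and a 20 degree difference as exactly the same thing:

    from math import sqrt

    def euclidean(v1, v2):
        # Plain Euclidean distance between two equal-length feature vectors.
        return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

    # Hypothetical feature vectors: (distance in miles, average temperature in degrees).
    city_a = (0.0, 50.0)
    city_b = (20.0, 50.0)   # 20 miles away, same temperature
    city_c = (0.0, 70.0)    # same location, 20 degrees warmer

    print(euclidean(city_a, city_b))  # 20.0
    print(euclidean(city_a, city_c))  # 20.0 -- the metric calls these equally different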

66
00:04:34,310 --> 00:04:40,670
Even if we're in the same
units, say inches,

67
00:04:40,670 --> 00:04:42,660
it can be an issue.

68
00:04:42,660 --> 00:04:45,810
So let's look at this example.

69
00:04:45,810 --> 00:04:51,410
Here I've got on the left,
before scaling, something

70
00:04:51,410 --> 00:04:56,140
which we can say is in inches,
height and width.

71
00:04:56,140 --> 00:04:59,420
This is not from a person, but
you could imagine if you were

72
00:04:59,420 --> 00:05:03,120
trying to cluster people, and
you measured their height in

73
00:05:03,120 --> 00:05:07,380
inches and their width in
inches, maybe you don't want

74
00:05:07,380 --> 00:05:09,020
to treat them equally.

75
00:05:09,020 --> 00:05:09,340
Right?

76
00:05:09,340 --> 00:05:11,900
But there's a lot more variance
in height than in

77
00:05:11,900 --> 00:05:15,770
width, or maybe there is
and maybe there isn't.

78
00:05:15,770 --> 00:05:19,580
So here on the left we don't
have any scaling, and we see a

79
00:05:19,580 --> 00:05:22,830
very natural clustering.

80
00:05:22,830 --> 00:05:27,670
On the other hand, we notice on
the y-axis the values range

81
00:05:27,670 --> 00:05:34,390
from not too far from 0
to not too far from 1.

82
00:05:34,390 --> 00:05:39,750
Whereas on the x-axis, the
dynamic range is much less,

83
00:05:39,750 --> 00:05:44,520
not too far from 0 to not
too far from 1/2.

84
00:05:44,520 --> 00:05:50,940
So we have twice the dynamic
range here than we have here.

85
00:05:50,940 --> 00:05:54,520
Therefore, not surprisingly,
when we end up doing the

86
00:05:54,520 --> 00:06:02,180
clustering, width plays
a very important role.

87
00:06:02,180 --> 00:06:04,390
And we end up clustering
it this way,

88
00:06:04,390 --> 00:06:07,110
dividing it along here.

89
00:06:07,110 --> 00:06:11,190
On the other hand, if I take
exactly the same data and

90
00:06:11,190 --> 00:06:18,040
scale it, and now the x-axis
runs from 0 to 1/2 and the

91
00:06:18,040 --> 00:06:24,600
y-axis, roughly again, from 0
to 1, we see that suddenly

92
00:06:24,600 --> 00:06:26,870
when we look at it
geometrically, we end up

93
00:06:26,870 --> 00:06:30,540
getting a very different
looking clustering.

94
00:06:30,540 --> 00:06:32,530
What's the moral?

95
00:06:32,530 --> 00:06:37,850
The moral is you have to think
hard about how to choose your

96
00:06:37,850 --> 00:06:41,750
features, about how to scale
your features, because it can

97
00:06:41,750 --> 00:06:45,580
have a dramatic influence
on your answer.

98
00:06:45,580 --> 00:06:50,100
We'll see some real life
examples of this shortly.

99
00:06:50,100 --> 00:06:52,300
But these are all the important
things to think

100
00:06:52,300 --> 00:07:00,100
about, and they all, in some
sense, tie up into the same

101
00:07:00,100 --> 00:07:01,640
major point.

102
00:07:01,640 --> 00:07:05,210
Whenever you're doing any kind
of learning, including

103
00:07:05,210 --> 00:07:15,820
clustering, feature selection

104
00:07:15,820 --> 00:07:21,705
and scaling are critical.

105
00:07:25,740 --> 00:07:31,420
It is where most of the thinking
ends up going.

106
00:07:31,420 --> 00:07:34,280
And then the rest gets to
be fairly mechanical.

107
00:07:37,530 --> 00:07:42,450
How do we decide what features
to use and how to scale them?

108
00:07:42,450 --> 00:07:45,630
We do that using domain
knowledge.

109
00:07:48,800 --> 00:07:54,800
So we actually have to think
about the objects that we're

110
00:07:54,800 --> 00:07:58,940
trying to learn about and what
the objective of the learning

111
00:07:58,940 --> 00:08:00,190
process is.

112
00:08:03,200 --> 00:08:09,960
So continuing, how do
we do the scaling?

113
00:08:09,960 --> 00:08:13,850
Most of the time, it's done
using some variant of what's

114
00:08:13,850 --> 00:08:15,350
called the Minkowski metric.

115
00:08:18,950 --> 00:08:22,520
It's not nearly as imposing
as it looks.

116
00:08:22,520 --> 00:08:29,040
So the distance between two
vectors, X1 and X2, and then

117
00:08:29,040 --> 00:08:34,270
we use p to talk about,
essentially, the degree we're

118
00:08:34,270 --> 00:08:36,330
going to be using.

119
00:08:36,330 --> 00:08:39,460
So we take the absolute
difference between each

120
00:08:39,460 --> 00:08:47,460
element of X1 and X2, raise it
to the p-th power, sum them

121
00:08:47,460 --> 00:08:52,460
and then raise that sum to the 1 over p power.
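
Written out, the distance just described, for feature vectors X1 and X2 of length n, is:

    \mathrm{dist}(X_1, X_2, p) = \Big( \sum_{i=1}^{n} \lvert X_{1,i} - X_{2,i} \rvert^{p} \Big)^{1/p}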

122
00:08:52,460 --> 00:08:56,150
Not very complicated,
so let's say p is 2.

123
00:08:56,150 --> 00:08:59,560
That's the one you people
are most familiar with.

124
00:08:59,560 --> 00:09:01,400
Effectively, all we're
doing is getting

125
00:09:01,400 --> 00:09:03,650
the Euclidean distance.

126
00:09:03,650 --> 00:09:06,060
What we looked at when we looked
at the mean squared

127
00:09:06,060 --> 00:09:10,970
distance between two things,
between our errors and our

128
00:09:10,970 --> 00:09:13,250
measured data, between our
measured data and our

129
00:09:13,250 --> 00:09:14,770
predicted data.

130
00:09:14,770 --> 00:09:17,640
We used the mean square error.

131
00:09:17,640 --> 00:09:20,260
That's essentially the Minkowski
distance with p

132
00:09:20,260 --> 00:09:23,060
equal to 2.

133
00:09:23,060 --> 00:09:27,870
That's probably the most
commonly used, but an almost

134
00:09:27,870 --> 00:09:32,900
equally common one sets
p equal to 1, and that's

135
00:09:32,900 --> 00:09:34,610
something called the
Manhattan distance.

136
00:09:39,380 --> 00:09:42,480
I suspect at least some of you
have spent time walking around

137
00:09:42,480 --> 00:09:49,860
Manhattan, a small but densely
populated island in New York.

138
00:09:49,860 --> 00:09:53,440
And midtown Manhattan has
the feature that it's

139
00:09:53,440 --> 00:09:54,690
laid out in a grid.

140
00:09:57,120 --> 00:10:05,740
So what you have is a grid,
and you have the avenues

141
00:10:05,740 --> 00:10:11,730
running north-south and the
streets running east-west.

142
00:10:16,520 --> 00:10:21,180
And if you want to walk from,
say, here to here or drive

143
00:10:21,180 --> 00:10:24,460
from here to here, you cannot
take the diagonal because

144
00:10:24,460 --> 00:10:27,200
there are a bunch of buildings
in the way.

145
00:10:27,200 --> 00:10:30,800
And so you have to move either
left or right, or up, or down.

146
00:10:33,480 --> 00:10:38,440
That's the Manhattan distance
between two points.

147
00:10:38,440 --> 00:10:42,950
This is used, in fact, for a
lot of problems, typically

148
00:10:42,950 --> 00:10:46,240
when somebody is comparing the
distance between two genes,

149
00:10:46,240 --> 00:10:51,390
for example, they use a
Manhattan metric rather than a

150
00:10:51,390 --> 00:10:55,700
Euclidean metric to say how
similar two things are.
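
A minimal sketch of this metric in Python (illustrative, not the lecture's code); p equal to 2 gives the Euclidean distance and p equal to 1 the Manhattan distance:

    def minkowski_dist(v1, v2, p):
        # Minkowski distance between two equal-length sequences of numbers.
        return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)

    # On the Manhattan grid: from (0 avenues, 0 streets) to (3 avenues, 4 streets).
    print(minkowski_dist((0, 0), (3, 4), 2))  # 5.0, Euclidean: the diagonal through the buildings
    print(minkowski_dist((0, 0), (3, 4), 1))  # 7.0, Manhattan: the blocks you actually walk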

151
00:10:59,910 --> 00:11:04,550
Just wanted to show that because
it is something that

152
00:11:04,550 --> 00:11:07,170
you will run across in the
literature when you read about

153
00:11:07,170 --> 00:11:08,420
these kinds of things.

154
00:11:19,700 --> 00:11:19,946
All right.

155
00:11:19,946 --> 00:11:25,430
So far, we've talked about
issues where things are

156
00:11:25,430 --> 00:11:27,410
comparable.

157
00:11:27,410 --> 00:11:31,600
And we've been doing that by
representing each element of

158
00:11:31,600 --> 00:11:36,700
the feature vector as a
floating point number.

159
00:11:36,700 --> 00:11:39,650
So we can run a formula like
that by subtracting

160
00:11:39,650 --> 00:11:40,900
one from the other.

161
00:11:43,620 --> 00:11:49,760
But we often, in fact, have to
deal with nominal categories,

162
00:11:49,760 --> 00:11:51,925
things that have names
rather than numbers.

163
00:11:58,370 --> 00:12:04,100
So for clustering people, maybe
we care about eye color,

164
00:12:04,100 --> 00:12:06,650
blue, brown, gray, green.

165
00:12:06,650 --> 00:12:07,940
Hair color.

166
00:12:07,940 --> 00:12:12,594
Well, how do you compare
blue to green?

167
00:12:12,594 --> 00:12:16,160
Do you subtract one
from the other?

168
00:12:16,160 --> 00:12:17,000
Kind of hard to do.

169
00:12:17,000 --> 00:12:19,490
What does it mean to subtract
green from blue?

170
00:12:19,490 --> 00:12:21,700
Well, I guess we could talk
about it in the frequency

171
00:12:21,700 --> 00:12:25,220
domain, in terms of light.

172
00:12:25,220 --> 00:12:30,070
Typically, what we have to do
in that case is, we convert

173
00:12:30,070 --> 00:12:40,030
them to a number and
then have some ways

174
00:12:40,030 --> 00:12:42,380
to relate the numbers.

175
00:12:42,380 --> 00:12:46,960
Again, this is a place where
domain knowledge is critical.

176
00:12:46,960 --> 00:12:51,340
So, for example, we might
convert blue to 0, green to

177
00:12:51,340 --> 00:12:57,210
0.5, and brown to 1, thus
indicating that we think blue

178
00:12:57,210 --> 00:13:02,430
eyes are closer to green eyes
than they are to brown eyes.

179
00:13:02,430 --> 00:13:06,730
I don't know why we think that
but maybe we think that.

180
00:13:06,730 --> 00:13:09,870
Red hair is closer to blonde
hair than it is to black hair.

181
00:13:09,870 --> 00:13:12,530
I don't know.

182
00:13:12,530 --> 00:13:15,120
These are the sorts of things
that are not mathematical

183
00:13:15,120 --> 00:13:17,670
questions, typically,
but judgments that

184
00:13:17,670 --> 00:13:20,980
people have to make.
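
A tiny sketch of that conversion (not the lecture's code; the numbers are just the example values mentioned above):

    # Map a nominal feature (eye color) to a number so it can go into a feature vector.
    # The spacing encodes a judgment: blue is "closer" to green than it is to brown.
    EYE_COLOR = {'blue': 0.0, 'green': 0.5, 'brown': 1.0}

    def eye_color_feature(color):
        return EYE_COLOR[color]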

185
00:13:20,980 --> 00:13:27,450
Once we've converted things to
numbers, we then have to go

186
00:13:27,450 --> 00:13:34,840
back to our old friend of
scaling, which is often called

187
00:13:34,840 --> 00:13:36,090
normalization.

188
00:13:41,940 --> 00:13:47,300
Very often we try and contrive
to have every feature range

189
00:13:47,300 --> 00:13:51,280
between 0 and 1, for example,
so that everything is

190
00:13:51,280 --> 00:13:54,870
normalized to the same
dynamic range,

191
00:13:54,870 --> 00:13:57,040
and then we can compare.

192
00:13:57,040 --> 00:13:59,290
Is that the right thing to do?

193
00:13:59,290 --> 00:14:02,500
Not necessarily, because you
might consider some features

194
00:14:02,500 --> 00:14:04,950
more important than others
and want to give

195
00:14:04,950 --> 00:14:06,730
them a greater weight.

196
00:14:06,730 --> 00:14:08,050
And, again, that's something
we'll come

197
00:14:08,050 --> 00:14:09,300
back to and look at.

198
00:14:11,680 --> 00:14:13,270
All this is a bit abstract.

199
00:14:13,270 --> 00:14:15,720
I now want to look
at an example.

200
00:14:15,720 --> 00:14:19,730
Let's look at the example
of clustering mammals.

201
00:14:19,730 --> 00:14:23,210
There are, essentially, an
unbounded number of features

202
00:14:23,210 --> 00:14:28,650
you could use, size at birth,
gestation period, lifespan,

203
00:14:28,650 --> 00:14:33,240
length of tail, speed,
eating habits.

204
00:14:33,240 --> 00:14:35,150
You name it.

205
00:14:35,150 --> 00:14:38,420
The choice of features and
weighting will, of course,

206
00:14:38,420 --> 00:14:42,450
have an enormous impact on
what clusters you get.

207
00:14:42,450 --> 00:14:47,910
If you choose size, humans might
appear in one cluster.

208
00:14:47,910 --> 00:14:53,740
If you choose eating habits,
they might appear in another.

209
00:14:53,740 --> 00:14:57,270
How should you choose which
features you want?

210
00:14:57,270 --> 00:15:01,920
You have to begin by choosing,
thinking about the reason

211
00:15:01,920 --> 00:15:04,200
you're doing the clustering
in the first place.

212
00:15:04,200 --> 00:15:08,005
What is it you're trying to
learn about the mammals?

213
00:15:10,510 --> 00:15:13,550
As an example, I'm going
to choose the

214
00:15:13,550 --> 00:15:17,580
objective of eating habits.

215
00:15:17,580 --> 00:15:19,940
I want to cluster mammals
somehow based

216
00:15:19,940 --> 00:15:22,340
upon what they eat.

217
00:15:22,340 --> 00:15:25,610
But I want to do that, and
here's a very important thing

218
00:15:25,610 --> 00:15:30,310
about what we often see in
learning, without any direct

219
00:15:30,310 --> 00:15:32,172
information about
what they eat.

220
00:15:34,830 --> 00:15:41,720
Typically, when we're using
machine learning, we're trying

221
00:15:41,720 --> 00:15:47,420
to learn about something
for which we have

222
00:15:47,420 --> 00:15:51,010
limited or no data.

223
00:15:51,010 --> 00:15:56,100
Remember when we talked about
learning, I talked about

224
00:15:56,100 --> 00:15:58,950
learning which was
supervised, in which we had

225
00:15:58,950 --> 00:16:04,510
some data, and unsupervised,
in which, essentially, we

226
00:16:04,510 --> 00:16:07,530
don't have any labels.

227
00:16:07,530 --> 00:16:13,790
So let's say we don't have any
labels about what mammals eat,

228
00:16:13,790 --> 00:16:18,770
but we do know a lot about
the mammals themselves.

229
00:16:18,770 --> 00:16:23,380
And, in fact, the hypothesis I'm
going to start with here

230
00:16:23,380 --> 00:16:31,570
is that you can infer people's
or creatures' eating habits

231
00:16:31,570 --> 00:16:37,260
from their dental records,
or their dentition.

232
00:16:37,260 --> 00:16:41,100
Because over time, we have evolved,
all creatures have

233
00:16:41,100 --> 00:16:47,020
evolved, to have teeth that are
related to what they eat,

234
00:16:47,020 --> 00:16:48,680
as we can see.

235
00:16:48,680 --> 00:16:56,150
So I managed to procure a
database of dentition for

236
00:16:56,150 --> 00:16:57,400
various mammals.

237
00:17:02,070 --> 00:17:03,325
There's the laser pointer.

238
00:17:06,470 --> 00:17:10,980
So what I've got here
is the number of

239
00:17:10,980 --> 00:17:12,020
different kinds of teeth.

240
00:17:12,020 --> 00:17:17,099
So the right top incisors, the
right bottom incisors, molars,

241
00:17:17,099 --> 00:17:19,040
et cetera, pre-molars.

242
00:17:19,040 --> 00:17:21,460
Don't worry if you don't know
about teeth very much.

243
00:17:21,460 --> 00:17:23,859
I don't know very much.

244
00:17:23,859 --> 00:17:26,150
And then for each animal,
I have the number of

245
00:17:26,150 --> 00:17:27,400
each kind of tooth.

246
00:17:29,850 --> 00:17:32,590
Actually, I don't have it for
this particular mammal, but

247
00:17:32,590 --> 00:17:33,970
these two I do.

248
00:17:33,970 --> 00:17:35,720
I don't even remember
what they are.

249
00:17:35,720 --> 00:17:36,970
They're cute.

250
00:17:39,910 --> 00:17:40,010
All right.

251
00:17:40,010 --> 00:17:46,200
So I've got that database, and
now I want to try and see what

252
00:17:46,200 --> 00:17:47,560
happens when I cluster them.

253
00:17:51,540 --> 00:17:57,450
The code to do this is not very
complicated, but I should

254
00:17:57,450 --> 00:17:58,930
make a confession about it.

255
00:18:10,330 --> 00:18:12,830
Last night, I won't
say I learned it.

256
00:18:12,830 --> 00:18:15,480
I was reminded of a lesson that
I've often preached in

257
00:18:15,480 --> 00:18:19,410
6.00, which is that it's not good to
get your programming done at

258
00:18:19,410 --> 00:18:20,810
the last minute.

259
00:18:20,810 --> 00:18:23,810
So as I was debugging this code
at 2:00 and 3:00 in the

260
00:18:23,810 --> 00:18:28,100
morning today, I was realizing
how inefficient I am at

261
00:18:28,100 --> 00:18:29,600
debugging at that hour.

262
00:18:29,600 --> 00:18:31,990
Maybe for you guys that's
the shank of the day.

263
00:18:31,990 --> 00:18:33,610
For me, it's too late.

264
00:18:33,610 --> 00:18:38,310
I think it all works, but I was
certainly not at my best

265
00:18:38,310 --> 00:18:42,570
as I was debugging last night.

266
00:18:42,570 --> 00:18:42,900
All right.

267
00:18:42,900 --> 00:18:48,160
But at the moment, I don't want
you to spend time working

268
00:18:48,160 --> 00:18:50,550
on the code itself.

269
00:18:50,550 --> 00:18:54,550
I would like you to think a
little bit about the overall

270
00:18:54,550 --> 00:18:58,750
class structure of the code,
which I've got on the first

271
00:18:58,750 --> 00:19:00,000
page of the handout.

272
00:19:02,490 --> 00:19:08,670
So at the bottom of my
hierarchy, I've got something

273
00:19:08,670 --> 00:19:16,440
called a point, and that's an
abstraction of the things to

274
00:19:16,440 --> 00:19:18,220
be clustered.

275
00:19:18,220 --> 00:19:23,780
And I've done it in quite a
generalized way, because, as

276
00:19:23,780 --> 00:19:27,110
you're going to see, the code
we're looking at today, I'm

277
00:19:27,110 --> 00:19:29,790
going to use not only for
clustering mammals but for

278
00:19:29,790 --> 00:19:32,880
clustering all sorts of
other things as well.

279
00:19:32,880 --> 00:19:37,520
So I decided to take the trouble
of building up a set

280
00:19:37,520 --> 00:19:40,550
of classes that would
be useful.

281
00:19:40,550 --> 00:19:46,860
And in this class, I can have
the name of a point, its

282
00:19:46,860 --> 00:19:48,580
original attributes.

283
00:19:48,580 --> 00:19:51,760
That is to say, its original feature
vector, an unscaled feature

284
00:19:51,760 --> 00:19:56,720
vector, and then whether or not
I choose to normalize it.

285
00:19:56,720 --> 00:19:59,580
I might have normalized
features as well.

286
00:19:59,580 --> 00:20:01,700
Again, I don't want you worry
too much about the

287
00:20:01,700 --> 00:20:03,820
details of the code.

288
00:20:03,820 --> 00:20:07,280
And then I have a distance
metric, and I'm just for the

289
00:20:07,280 --> 00:20:09,340
moment using simple Euclidean
distance.

290
00:20:12,130 --> 00:20:16,550
The next element in my
hierarchy, not yet a

291
00:20:16,550 --> 00:20:17,190
hierarchy--

292
00:20:17,190 --> 00:20:20,350
it's still flat--

293
00:20:20,350 --> 00:20:21,600
is a cluster.

294
00:20:29,210 --> 00:20:32,550
And so what a cluster is, you
can think of it as, at some

295
00:20:32,550 --> 00:20:35,990
abstract level, it's just going
to be a set of points,

296
00:20:35,990 --> 00:20:39,170
the points that are
in the cluster.

297
00:20:39,170 --> 00:20:42,605
But I've got some other
operations on it

298
00:20:42,605 --> 00:20:43,855
that will be useful.

299
00:20:46,260 --> 00:20:49,950
I can compute the distance
between two clusters, and as

300
00:20:49,950 --> 00:20:53,220
you'll see, I have single
linkage, max linkage,

301
00:20:53,220 --> 00:20:56,970
average, the three I talked
about last week.

302
00:20:56,970 --> 00:21:00,320
And also this notion
of a centroid.

303
00:21:00,320 --> 00:21:04,020
We'll come back to that when we
get to k-means clustering.

304
00:21:07,280 --> 00:21:09,910
We don't need to worry right
now about what that is.

305
00:21:15,210 --> 00:21:19,030
Then I'm going to have
a cluster set.

306
00:21:19,030 --> 00:21:20,715
That's another useful
data abstraction.

307
00:21:27,820 --> 00:21:31,740
And that's what you might guess
from its name, just a

308
00:21:31,740 --> 00:21:34,000
set of clusters.

309
00:21:34,000 --> 00:21:36,755
The most interesting operation
there is merge.

310
00:21:40,050 --> 00:21:43,070
As you saw, when we looked at
hierarchical clustering last

311
00:21:43,070 --> 00:21:47,270
week, the key step there is
merging two clusters.

312
00:21:47,270 --> 00:21:54,500
And in doing that, I'm going to
have a function called Find

313
00:21:54,500 --> 00:22:00,960
Closest, which given a metric
and a cluster, finds the

314
00:22:00,960 --> 00:22:05,890
cluster that is most similar to
that, to self, because as

315
00:22:05,890 --> 00:22:08,090
you, again, will recall from
hierarchical clustering,

316
00:22:08,090 --> 00:22:10,600
what I merge at
each step is the two

317
00:22:10,600 --> 00:22:11,850
most similar clusters.

318
00:22:14,580 --> 00:22:18,040
And then there's some details
about how it works, which

319
00:22:18,040 --> 00:22:21,080
again, we don't need to worry
about at the moment.

320
00:22:24,430 --> 00:22:30,250
And then I'm going to have a
subclass of point called

321
00:22:30,250 --> 00:22:51,070
Mammal, in which I will
represent each mammal by the

322
00:22:51,070 --> 00:22:53,395
dentition as we've
looked at before.

323
00:22:57,870 --> 00:23:02,960
Then pretty simply, we can do
a bunch of things with it.
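
A rough sketch of that class structure in Python follows; this is not the actual handout code, and the names and method bodies are guesses based on the description above.

    class Point(object):
        """An abstraction of the things to be clustered."""
        def __init__(self, name, originalAttrs, normalizedAttrs=None):
            self.name = name
            self.unscaledFeatures = originalAttrs
            self.features = normalizedAttrs if normalizedAttrs is not None else originalAttrs

        def distance(self, other):
            # For the moment, simple Euclidean distance between the feature vectors.
            return sum((a - b) ** 2 for a, b in zip(self.features, other.features)) ** 0.5

    class Cluster(object):
        """At some abstract level, just a set of points."""
        def __init__(self, points):
            self.points = points

        def singleLinkageDist(self, other):
            # Distance between the two closest members of the two clusters.
            return min(p.distance(q) for p in self.points for q in other.points)

        def maxLinkageDist(self, other):
            # Distance between the two farthest members of the two clusters.
            return max(p.distance(q) for p in self.points for q in other.points)

    class ClusterSet(object):
        """A set of clusters; the most interesting operation is merging the closest pair."""
        def __init__(self, clusters):
            self.clusters = clusters

        def findClosest(self, linkage):
            # Return the pair of distinct clusters that are most similar under linkage.
            best = None
            for c1 in self.clusters:
                for c2 in self.clusters:
                    if c1 is not c2 and (best is None or linkage(c1, c2) < best[0]):
                        best = (linkage(c1, c2), c1, c2)
            return best[1], best[2]

    class Mammal(Point):
        """A Point whose feature vector is the count of each kind of tooth (its dentition)."""
        pass

The reason for the layering is that Point can be subclassed (Mammal here, County later) while Cluster and ClusterSet are reused unchanged.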

324
00:23:06,010 --> 00:23:09,380
Before we look at the other
details of the code, I want to

325
00:23:09,380 --> 00:23:12,500
now run it and see
what we get.

326
00:23:12,500 --> 00:23:15,960
So I'm just going to use
hierarchical clustering now to

327
00:23:15,960 --> 00:23:21,360
cluster the mammals based upon
this feature vector, which

328
00:23:21,360 --> 00:23:24,650
will be a list of numbers
showing how many of each kind

329
00:23:24,650 --> 00:23:26,930
of tooth the mammals have.

330
00:23:26,930 --> 00:23:28,180
Let's see what we get.

331
00:23:37,850 --> 00:23:40,170
So it's doing the merging.

332
00:23:45,390 --> 00:23:50,750
So we can see the first step, it
merged beavers with ground

333
00:23:50,750 --> 00:23:54,980
hogs and it merged grey
squirrels with porcupines,

334
00:23:54,980 --> 00:23:57,120
wolves and bears.

335
00:23:57,120 --> 00:24:01,230
Various other kinds of things,
like jaguars and cougars, were

336
00:24:01,230 --> 00:24:03,650
a lot alike.

337
00:24:03,650 --> 00:24:06,610
Eventually, it starts doing
more complicated merges.

338
00:24:06,610 --> 00:24:10,470
It merges a cluster containing
only the river otter with one

339
00:24:10,470 --> 00:24:17,050
containing a marten and a
wolverine, beavers and ground

340
00:24:17,050 --> 00:24:21,220
hogs with squirrels and
porcupines, et cetera.

341
00:24:21,220 --> 00:24:25,940
And at the end, I had it
stop with two clusters.

342
00:24:25,940 --> 00:24:29,310
It came up with these
clusters.

343
00:24:29,310 --> 00:24:33,340
Now we can look at these
clusters and say, all right.

344
00:24:33,340 --> 00:24:34,420
What do we think?

345
00:24:34,420 --> 00:24:38,070
Have we learned anything
interesting?

346
00:24:38,070 --> 00:24:40,480
Do we see anything
in any of these--

347
00:24:40,480 --> 00:24:41,790
do we think it makes sense?

348
00:24:41,790 --> 00:24:45,560
Remember, our goal was to
cluster mammals based upon

349
00:24:45,560 --> 00:24:47,910
what they might eat.

350
00:24:47,910 --> 00:24:50,625
And we can ask, do we think
this corresponds to that?

351
00:24:54,400 --> 00:24:54,710
No.

352
00:24:54,710 --> 00:24:55,030
All right.

353
00:24:55,030 --> 00:24:55,950
Who-- somebody said--

354
00:24:55,950 --> 00:24:59,200
Now, why no?

355
00:24:59,200 --> 00:25:01,650
Go ahead.

356
00:25:01,650 --> 00:25:04,000
AUDIENCE: We've got-- like a
deer doesn't eat similar

357
00:25:04,000 --> 00:25:06,190
things as a dog.

358
00:25:06,190 --> 00:25:09,760
And we've got one type of bat in the
top cluster and a different

359
00:25:09,760 --> 00:25:11,140
kind of bat in the
bottom cluster.

360
00:25:11,140 --> 00:25:13,290
Seems like they would be
even closer together.

361
00:25:13,290 --> 00:25:14,610
PROFESSOR: Well, sorry.

362
00:25:14,610 --> 00:25:16,810
Yeah.

363
00:25:16,810 --> 00:25:21,650
A deer doesn't eat what a dog
eats, and for that matter, we

364
00:25:21,650 --> 00:25:26,050
have humans here, and while
some humans are by choice

365
00:25:26,050 --> 00:25:29,930
vegetarians, genetically,
humans are essentially

366
00:25:29,930 --> 00:25:30,740
carnivores.

367
00:25:30,740 --> 00:25:31,810
We know that.

368
00:25:31,810 --> 00:25:33,490
We eat meat.

369
00:25:33,490 --> 00:25:38,300
And here we are with a bunch
of herbivores, typically.

370
00:25:38,300 --> 00:25:40,620
Things are strange.

371
00:25:40,620 --> 00:25:43,260
By the way, bats might end up
being in different ones, because some

372
00:25:43,260 --> 00:25:47,590
bats eat fruit, other bats eat
insects, but who knows?

373
00:25:47,590 --> 00:25:53,200
So I'm not very happy.

374
00:25:53,200 --> 00:25:56,950
Why do you think we got this
clustering that maybe isn't

375
00:25:56,950 --> 00:25:58,200
helping us very much?

376
00:26:02,680 --> 00:26:07,210
Well, let's go look at
what we did here.

377
00:26:07,210 --> 00:26:08,480
Let's look at test 0.

378
00:26:13,050 --> 00:26:16,560
So I said I wanted
two clusters.

379
00:26:16,560 --> 00:26:19,820
I don't want it to print all
the steps along the way.

380
00:26:19,820 --> 00:26:22,670
I'm going to print the
history at the end.

381
00:26:22,670 --> 00:26:24,390
And scaling is identity.

382
00:26:27,900 --> 00:26:32,700
Well, let's go back and look
at some of the data here.

383
00:26:43,500 --> 00:26:46,590
What we can see is--

384
00:26:46,590 --> 00:26:49,660
or maybe we can't see too
quickly, looking at all this--

385
00:26:49,660 --> 00:26:55,570
is some kinds of teeth have
a relatively small range.

386
00:26:55,570 --> 00:26:58,130
Other kinds of teeth
have a big range.

387
00:27:00,820 --> 00:27:05,980
And so, at the moment, we're not
doing any normalization,

388
00:27:05,980 --> 00:27:09,050
and maybe what we're doing is
getting something distorted

389
00:27:09,050 --> 00:27:12,250
where we're only looking at a
certain kind of tooth because

390
00:27:12,250 --> 00:27:16,180
it has a larger dynamic range.

391
00:27:16,180 --> 00:27:27,250
And in fact, if we look at the
code, we can go back up and

392
00:27:27,250 --> 00:27:35,820
let's look at Build Mammal
Points and Read Mammal Data.

393
00:27:35,820 --> 00:27:41,670
So Build Mammal Points calls
Read Mammal Data, and then

394
00:27:41,670 --> 00:27:42,510
builds the points.

395
00:27:42,510 --> 00:27:46,150
So Read Mammal Data is the
interesting piece.

396
00:27:46,150 --> 00:27:55,150
And what we can see here
is, as we read it in--

397
00:27:55,150 --> 00:27:59,820
this is just simply reading
things in, ignoring comments,

398
00:27:59,820 --> 00:28:02,560
keeping track of things--

399
00:28:02,560 --> 00:28:07,350
and then we come down here,
I might do some scaling.

400
00:28:10,010 --> 00:28:18,160
So Point.Scale feature is using
the scaling argument.

401
00:28:18,160 --> 00:28:19,850
Where's that coming from?

402
00:28:25,430 --> 00:28:36,170
If we look at Mammal Teeth, here
from the mammal class, we

403
00:28:36,170 --> 00:28:39,950
see that there are two ways to
scale it, identity, where we

404
00:28:39,950 --> 00:28:42,810
just multiply every element
in the vector by 1.

405
00:28:45,350 --> 00:28:46,880
That doesn't change anything.

406
00:28:46,880 --> 00:28:50,220
Or what I've called
1 over max.

407
00:28:50,220 --> 00:28:53,940
And here, I've looked at the
maximum number of each kind of

408
00:28:53,940 --> 00:28:58,200
tooth and I'm dividing
1 by that.

409
00:28:58,200 --> 00:29:01,440
So here we could have up
to three of those.

410
00:29:01,440 --> 00:29:02,930
Here we could have
four of those.

411
00:29:02,930 --> 00:29:06,790
We could have six of this kind
of tooth, whatever it is.

412
00:29:06,790 --> 00:29:11,770
And so we can see, by dividing
by the max, I'm now putting

413
00:29:11,770 --> 00:29:17,610
all of the different kinds of
teeth on the same scale.

414
00:29:17,610 --> 00:29:19,600
I'm normalizing.

415
00:29:19,600 --> 00:29:24,340
And now we'll see, does that
make a difference?

416
00:29:24,340 --> 00:29:27,050
Well, since we're dividing
by 6 here and 3 here, it

417
00:29:27,050 --> 00:29:29,810
certainly could make
a difference.

418
00:29:29,810 --> 00:29:33,840
It's a significant scaling
factor, 2X.

419
00:29:33,840 --> 00:29:38,385
So let's go and change the
code, or change the test.

420
00:29:43,430 --> 00:29:50,370
And let's look at Test 0--

421
00:29:53,250 --> 00:29:55,080
0, not "O"--

422
00:29:55,080 --> 00:30:01,700
with scale set to 1 over max.

423
00:30:01,700 --> 00:30:04,960
You'll notice, by the way, that
rather than using some

424
00:30:04,960 --> 00:30:09,430
obscure code, like scale equals
12, I use strings so I

425
00:30:09,430 --> 00:30:11,410
remember what they are.

426
00:30:11,410 --> 00:30:16,100
It's, I think, a pretty useful
programming trick.

427
00:30:16,100 --> 00:30:16,385
Whoops.

428
00:30:16,385 --> 00:30:20,540
Did I use the wrong
name for this?

429
00:30:20,540 --> 00:30:21,790
Should be scaling?

430
00:30:34,360 --> 00:30:35,610
So off it's going.

431
00:30:40,190 --> 00:30:47,080
Now we get a different set of
things, and as far as I know,

432
00:30:47,080 --> 00:30:49,760
once we've scaled things, we
get what I think is a much

433
00:30:49,760 --> 00:30:53,650
more sensible pair, where I
think what we essentially have

434
00:30:53,650 --> 00:30:58,290
is the herbivores down here,
and the carnivores up here.

435
00:31:06,290 --> 00:31:06,335
OK.

436
00:31:06,335 --> 00:31:08,780
I don't care how much you
know about teeth.

437
00:31:08,780 --> 00:31:11,470
The point is scaling
can really matter.

438
00:31:11,470 --> 00:31:13,420
You have to look at it, and you
have to think about what

439
00:31:13,420 --> 00:31:15,160
you're doing.

440
00:31:15,160 --> 00:31:18,430
And the interesting thing here
is that without any direct

441
00:31:18,430 --> 00:31:22,010
evidence about what mammals
eat, we are able to use

442
00:31:22,010 --> 00:31:26,180
machine learning, clustering in
this case, to infer a new

443
00:31:26,180 --> 00:31:31,410
fact that we have some mammals
that are similar in what they

444
00:31:31,410 --> 00:31:37,000
eat, and some mammals that are
also similar, some groups.

445
00:31:37,000 --> 00:31:41,510
Now I can't infer from this
herbivores versus carnivores

446
00:31:41,510 --> 00:31:43,950
because I didn't have any
labels to start with.

447
00:31:43,950 --> 00:31:47,310
But what I can infer is that,
whatever they eat, there's

448
00:31:47,310 --> 00:31:50,870
something similar about these
animals, and something similar

449
00:31:50,870 --> 00:31:52,630
about these animals.

450
00:31:52,630 --> 00:31:55,110
And there's a difference between
the groups in C1 and

451
00:31:55,110 --> 00:31:57,620
the groups in C0.

452
00:31:57,620 --> 00:32:01,510
I can then go off and look at
some points in each of these

453
00:32:01,510 --> 00:32:06,540
and then try and figure out
how to label them later.

454
00:32:06,540 --> 00:32:12,070
OK, let's look at a different
data set, a far more

455
00:32:12,070 --> 00:32:14,160
interesting one, a richer one.

456
00:32:25,670 --> 00:32:27,840
Now, let's not look at
that version of it.

457
00:32:27,840 --> 00:32:30,050
That's too hard to read.

458
00:32:30,050 --> 00:32:39,500
Let's look at the Excel
spreadsheet.

459
00:32:39,500 --> 00:32:44,060
So this is a database I found
online of every county in the

460
00:32:44,060 --> 00:32:50,510
United States, and a bunch of
features about that county.

461
00:32:50,510 --> 00:32:52,010
So for each county
in the United

462
00:32:52,010 --> 00:32:54,360
States, we have its name.

463
00:32:56,890 --> 00:32:59,370
The first part of the name
is the state it's in.

464
00:32:59,370 --> 00:33:02,580
The second part of the name is
the name of the county, and a

465
00:33:02,580 --> 00:33:07,180
bunch of things, like the
average value of homes, how

466
00:33:07,180 --> 00:33:11,040
much poverty, its population
density, its population

467
00:33:11,040 --> 00:33:16,610
change, how many people are
over 65, et cetera.

468
00:33:16,610 --> 00:33:19,260
So the thing I want you to
notice, of course, is while

469
00:33:19,260 --> 00:33:23,850
everything is a number, the
scales are very different.

470
00:33:23,850 --> 00:33:28,830
It's a big difference between
the percent of something,

471
00:33:28,830 --> 00:33:33,120
which will go between 0 and
100, and the population

472
00:33:33,120 --> 00:33:38,530
density, which ranges over a
much larger dynamic range.

473
00:33:38,530 --> 00:33:43,130
So we can immediately suspect
that scaling is going to be an

474
00:33:43,130 --> 00:33:44,380
issue here.

475
00:33:46,570 --> 00:33:50,400
So we now have a bunch of code
that we can use that I've

476
00:33:50,400 --> 00:33:52,360
written to process this.

477
00:33:56,080 --> 00:34:07,760
It uses the same clusters that
we have here, except I've

478
00:34:07,760 --> 00:34:11,310
added a kind of Point
called the County.

479
00:34:11,310 --> 00:34:14,429
Looks very different from a
mammal, but the good news is I

480
00:34:14,429 --> 00:34:17,909
got to reuse a lot of my code.

481
00:34:17,909 --> 00:34:21,040
Now let's run a test.

482
00:34:21,040 --> 00:34:26,810
We'll go down here to Test 3,
and we'll see whether we can

483
00:34:26,810 --> 00:34:28,634
do hierarchical clustering
of the counties.

484
00:34:38,500 --> 00:34:40,050
Whoops.

485
00:34:40,050 --> 00:34:43,850
Test 3 wants the name
of what we're doing.

486
00:34:43,850 --> 00:34:44,940
So we'll give it the name.

487
00:34:44,940 --> 00:34:46,190
It's Counties.Text.

488
00:34:48,610 --> 00:34:52,114
I just exported the spreadsheet
as a text file.

489
00:34:55,110 --> 00:35:00,930
Well, we can wait a while for
this, but I'm not going to.

490
00:35:00,930 --> 00:35:04,480
Let's think about what we know
that hierarchical clustering

491
00:35:04,480 --> 00:35:08,590
and how long this is
likely to take.

492
00:35:08,590 --> 00:35:10,550
I'll give you a hint.

493
00:35:10,550 --> 00:35:16,870
There are approximately 3,100
counties in the United States.

494
00:35:16,870 --> 00:35:19,470
I'll bet none of you could
have guessed that number.

495
00:35:22,170 --> 00:35:25,160
How many comparisons do we have
to find the two counties

496
00:35:25,160 --> 00:35:26,700
that are most similar
to each other?

497
00:35:32,035 --> 00:35:36,400
Comparing each county with every
other county, how many

498
00:35:36,400 --> 00:35:38,340
comparisons is that
going to be?

499
00:35:38,340 --> 00:35:39,795
Yeah.

500
00:35:39,795 --> 00:35:41,670
AUDIENCE: It's 3,100 choose 2.

501
00:35:41,670 --> 00:35:42,540
PROFESSOR: Right.

502
00:35:42,540 --> 00:35:45,850
So that will be 3,100 squared.

503
00:35:45,850 --> 00:35:48,916
Thank you.

504
00:35:48,916 --> 00:35:53,076
And that's just the first
step in the clustering.
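
Roughly: 3,100 choose 2 is 3,100 times 3,099 divided by 2, about 4.8 million comparisons; 3,100 squared, about 9.6 million, is the same order of magnitude.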

505
00:35:53,076 --> 00:35:58,530
To perform the next merge, we'll
have to do it again.

506
00:36:01,090 --> 00:36:06,310
So in fact, as we've looked at
last time, it's going to be a

507
00:36:06,310 --> 00:36:11,010
very long and tedious process,
and one I'm not

508
00:36:11,010 --> 00:36:12,260
going to wait for.

509
00:36:14,460 --> 00:36:16,600
So I'm going to interrupt and
we're going to look at a

510
00:36:16,600 --> 00:36:17,850
smaller example.

511
00:36:22,970 --> 00:36:32,700
Here I've just got only the
counties in New England, a

512
00:36:32,700 --> 00:36:36,550
much smaller number than
3,100, and I'm going to

513
00:36:36,550 --> 00:36:40,820
cluster them using the exact
same clustering code we used

514
00:36:40,820 --> 00:36:42,430
for the mammals.

515
00:36:42,430 --> 00:36:44,090
It's just that the
points are now

516
00:36:44,090 --> 00:36:48,190
counties instead of mammals.

517
00:36:48,190 --> 00:36:51,350
And we got two clusters.

518
00:36:51,350 --> 00:36:54,550
Middlesex County in
Massachusetts happens to be

519
00:36:54,550 --> 00:36:57,430
the county in which
MIT is located.

520
00:36:57,430 --> 00:37:00,130
And all the others--

521
00:37:00,130 --> 00:37:03,420
well, you know, MIT is a pretty
distinctive place.

522
00:37:03,420 --> 00:37:06,560
Maybe that's what did it.

523
00:37:06,560 --> 00:37:08,630
I don't quite think so.

524
00:37:08,630 --> 00:37:13,180
Someone got a hypothesis about
why we got this rather strange

525
00:37:13,180 --> 00:37:15,440
clustering?

526
00:37:15,440 --> 00:37:21,700
And is it because Middlesex
contains MIT and Harvard both?

527
00:37:21,700 --> 00:37:24,310
This really surprised me, by the
way, when I first ran it.

528
00:37:24,310 --> 00:37:27,970
I said, how can this be?

529
00:37:27,970 --> 00:37:33,880
So I went and I started looking
at the data, and what

530
00:37:33,880 --> 00:37:41,130
I found is that Middlesex County
has about 600,000 more

531
00:37:41,130 --> 00:37:45,430
people than any other county
in New England.

532
00:37:45,430 --> 00:37:46,430
Who knew?

533
00:37:46,430 --> 00:37:48,500
I would have guessed Suffolk,
where Boston is, was the

534
00:37:48,500 --> 00:37:49,710
biggest county.

535
00:37:49,710 --> 00:37:52,550
But, in fact, Middlesex is
enormous relative to every

536
00:37:52,550 --> 00:37:54,820
other county in New England.

537
00:37:54,820 --> 00:37:58,690
And it turns out that difference
of 600,000, when I

538
00:37:58,690 --> 00:38:03,330
didn't scale things, just
swamped everything else.

539
00:38:03,330 --> 00:38:06,610
And so all I'm really getting
here is a clustering that

540
00:38:06,610 --> 00:38:10,650
depends on the population.

541
00:38:10,650 --> 00:38:13,520
Middlesex is big relative
to everything else and,

542
00:38:13,520 --> 00:38:15,130
therefore, that's what I get.

543
00:38:15,130 --> 00:38:18,350
And it ignores things like
education level and housing

544
00:38:18,350 --> 00:38:21,610
prices, and all those other
things because the differences

545
00:38:21,610 --> 00:38:27,190
are small relative to 600,000.

546
00:38:27,190 --> 00:38:31,160
Well, let's turn scaling on.

547
00:38:31,160 --> 00:38:34,405
To do that, I want to show you
how I do this scaling.

548
00:38:38,690 --> 00:38:41,230
I did not, given the number
of features and number of

549
00:38:41,230 --> 00:38:44,430
counties, do what I did for
mammals and count them by hand

550
00:38:44,430 --> 00:38:46,520
to see what the maximum was.

551
00:38:46,520 --> 00:38:49,290
I decided it would be a lot
faster even at 2:00 in the

552
00:38:49,290 --> 00:38:53,400
morning to write
code to do it.

553
00:38:53,400 --> 00:38:54,855
So I've got some code here.

554
00:38:58,210 --> 00:39:03,140
I've got Build County Points,
just like Build Mammal Points

555
00:39:03,140 --> 00:39:06,640
and Read County Data, like
Read Mammal Data.

556
00:39:06,640 --> 00:39:10,940
But the difference here is,
along the way, as I'm reading

557
00:39:10,940 --> 00:39:13,070
in each county, I'm keeping
track of the

558
00:39:13,070 --> 00:39:16,670
maximum for each feature.

559
00:39:16,670 --> 00:39:18,420
And then I'm just going
to just do the scaling

560
00:39:18,420 --> 00:39:19,950
automatically.

561
00:39:19,950 --> 00:39:23,820
So exactly the one over max
scaling I did for mammals'

562
00:39:23,820 --> 00:39:27,150
teeth, I'm going to do it for
counties, but I've just

563
00:39:27,150 --> 00:39:32,360
written some code to automate
that process because I knew I

564
00:39:32,360 --> 00:39:35,910
would never be able
to count them.

565
00:39:35,910 --> 00:39:37,360
All right, so now let's
see what happens if

566
00:39:37,360 --> 00:39:38,610
we run it that way.

567
00:39:42,600 --> 00:39:52,380
Test 3, New England, and
Scale equals True.

568
00:39:52,380 --> 00:39:54,910
I'm either scaling it or not,
is the way I wrote this one.

569
00:40:09,910 --> 00:40:12,710
And with the scaling on
again, I get a very

570
00:40:12,710 --> 00:40:13,760
different set of clusters.

571
00:40:13,760 --> 00:40:16,020
What have we got?

572
00:40:16,020 --> 00:40:18,506
Where's Middlesex?

573
00:40:18,506 --> 00:40:20,350
It's in one of these
2 clusters.

574
00:40:20,350 --> 00:40:21,130
Oh, here it is.

575
00:40:21,130 --> 00:40:23,970
It's C0.

576
00:40:23,970 --> 00:40:26,420
But it's with Fairfield,
Connecticut and Hartford,

577
00:40:26,420 --> 00:40:31,340
Connecticut and Providence,
Rhode Island.

578
00:40:31,340 --> 00:40:32,750
It's a different answer.

579
00:40:32,750 --> 00:40:35,420
Is it a better answer?

580
00:40:35,420 --> 00:40:37,350
It's not a meaningful
question, right?

581
00:40:37,350 --> 00:40:40,900
It depends what I'm trying to
infer, what we hope to learn

582
00:40:40,900 --> 00:40:44,040
from the clustering, and that's
a question we're going

583
00:40:44,040 --> 00:40:48,130
to come back to on Tuesday in
some detail with the counties,

584
00:40:48,130 --> 00:40:52,180
and look at how, by using
different kinds of scaling or

585
00:40:52,180 --> 00:40:55,610
different kinds of features, we
can learn different things

586
00:40:55,610 --> 00:40:59,170
about the counties
in this country.

587
00:40:59,170 --> 00:41:01,460
Before I do that, however,
I want to move

588
00:41:01,460 --> 00:41:04,550
away from New England.

589
00:41:04,550 --> 00:41:07,930
Remember we're focusing on New
England because it took too

590
00:41:07,930 --> 00:41:12,630
long to do hierarchical
clustering of 3,100 counties.

591
00:41:12,630 --> 00:41:14,300
But that's what I want to do.

592
00:41:14,300 --> 00:41:16,410
It's no good to just
say, I'm sorry.

593
00:41:16,410 --> 00:41:17,170
It took too long.

594
00:41:17,170 --> 00:41:18,770
I give up.

595
00:41:18,770 --> 00:41:21,770
Well, the good news is there
are other clustering

596
00:41:21,770 --> 00:41:26,040
mechanisms that are much
more efficient.

597
00:41:26,040 --> 00:41:31,360
We'll later see they, too,
have their own faults.

598
00:41:31,360 --> 00:41:36,720
But we're going to look at
k-means clustering, which has

599
00:41:36,720 --> 00:41:41,420
the big advantage of being fast
enough that we can run it

600
00:41:41,420 --> 00:41:43,750
on very big data sets.

601
00:41:43,750 --> 00:41:48,620
In fact, it is roughly linear
in the number of counties.

602
00:41:48,620 --> 00:41:52,730
And as we've seen before, when
n gets very large, anything

603
00:41:52,730 --> 00:41:58,270
that's worse than linear is
likely to be ineffective.

604
00:41:58,270 --> 00:42:00,420
So let's think about
how k-means works.

605
00:42:03,710 --> 00:42:07,870
Step one, is you choose k.

606
00:42:11,790 --> 00:42:15,430
k is the total number of
clusters you want to have when

607
00:42:15,430 --> 00:42:16,680
you're done.

608
00:42:18,830 --> 00:42:22,140
So you start by saying, I want
to take the counties and split

609
00:42:22,140 --> 00:42:25,020
them into k-clusters.

610
00:42:25,020 --> 00:42:29,400
2 clusters, 20 clusters, 100
clusters, 1,000 clusters.

611
00:42:29,400 --> 00:42:33,860
You have to choose k
in the beginning.

612
00:42:33,860 --> 00:42:38,520
And that it's one of the issues
that you have with

613
00:42:38,520 --> 00:42:42,640
k-means clustering is,
how do you choose k?

614
00:42:42,640 --> 00:42:46,500
We can talk about that later.

615
00:42:46,500 --> 00:43:02,630
Once I've chosen k, I choose k
points as initial centroids.

616
00:43:02,630 --> 00:43:10,070
You may remember earlier today
we saw this centroid method in

617
00:43:10,070 --> 00:43:12,200
class cluster.

618
00:43:12,200 --> 00:43:13,450
So what's a centroid?

619
00:43:19,400 --> 00:43:24,260
You've got a cluster, and in
the clusters, you've got a

620
00:43:24,260 --> 00:43:25,830
bunch of points scattered
around.

621
00:43:28,940 --> 00:43:32,640
The centroid you can think of
as, quote, "the average

622
00:43:32,640 --> 00:43:38,490
point," the center
of the cluster.

623
00:43:38,490 --> 00:43:41,480
The centroid need not be any of
the points in the cluster.

624
00:43:44,690 --> 00:43:46,630
So, again, you need
some metric.

625
00:43:46,630 --> 00:43:48,980
But let's say we're
using Euclidean.

626
00:43:48,980 --> 00:43:50,670
It's easy to see on the board.

627
00:43:50,670 --> 00:43:52,300
The centroid is kind of there.
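
As a quick sketch (not the lecture's code), under the Euclidean metric the centroid is just the per-feature average of the points in the cluster:

    def centroid(points):
        # Per-feature average of a list of equal-length feature vectors.
        n = len(points)
        return [sum(col) / n for col in zip(*points)]

    print(centroid([(0, 0), (2, 0), (1, 3)]))  # [1.0, 1.0], which is not itself one of the points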

628
00:43:57,440 --> 00:44:06,350
Now let's assume that we're
going to start by choosing

629
00:44:06,350 --> 00:44:10,340
k-point from the initial
set and labeling each

630
00:44:10,340 --> 00:44:11,590
of them as a centroid.

631
00:44:14,400 --> 00:44:16,740
We often--

632
00:44:16,740 --> 00:44:18,960
in fact, quite typically--

633
00:44:18,960 --> 00:44:20,225
choose these at random.

634
00:44:29,900 --> 00:44:34,110
So we now have k randomly chosen
points, each of which

635
00:44:34,110 --> 00:44:35,360
we're going to call centroid.

636
00:44:43,100 --> 00:44:51,310
The next step is to
assign each point

637
00:44:51,310 --> 00:44:52,560
to the nearest centroid.

638
00:45:00,770 --> 00:45:02,490
So we've got k-centroids.

639
00:45:02,490 --> 00:45:07,280
We usually choose a
small k, say 50.

640
00:45:07,280 --> 00:45:12,510
And now we have to compare each
of the 3,100 counties to

641
00:45:12,510 --> 00:45:16,860
each of the 50 centroids, and
put each one in the correct

642
00:45:16,860 --> 00:45:18,265
thing, in the closest.

643
00:45:20,940 --> 00:45:28,350
So it's 50 times 3,100, which
is a lot smaller number than

644
00:45:28,350 --> 00:45:31,960
3,100 squared.
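
Concretely, 50 times 3,100 is 155,000 distance computations per pass, versus roughly 9.6 million for a single all-pairs pass, about 60 times fewer.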

645
00:45:31,960 --> 00:45:34,630
So now I've got a clustering.

646
00:45:34,630 --> 00:45:38,380
Kind of strange, because what it
looks like depends on this

647
00:45:38,380 --> 00:45:40,230
random choice.

648
00:45:40,230 --> 00:45:45,580
So there's no reason to expect
that the initial assignment

649
00:45:45,580 --> 00:45:47,025
will give me anything
very useful.

650
00:45:51,390 --> 00:46:00,160
Step (4) is, for each
of the k-clusters,

651
00:46:00,160 --> 00:46:01,410
choose a new centroid.

652
00:46:17,910 --> 00:46:23,060
Now remember, I just chose
at random k-centroids.

653
00:46:23,060 --> 00:46:29,060
Now I actually have a cluster
with a bunch of points in it,

654
00:46:29,060 --> 00:46:33,390
so I could, for example, take
the average of those and

655
00:46:33,390 --> 00:46:34,640
compute a centroid.

656
00:46:37,270 --> 00:46:39,930
And I can either take the
average, or I can take the

657
00:46:39,930 --> 00:46:41,830
point nearest the average.

658
00:46:41,830 --> 00:46:43,080
It doesn't much matter.

659
00:46:48,190 --> 00:46:58,860
And then step (5) is one we've
looked at before, assign each

660
00:46:58,860 --> 00:47:02,270
point to nearest centroid.

661
00:47:02,270 --> 00:47:03,680
So now I'm going to get
a new clustering.

662
00:47:15,510 --> 00:47:27,460
And then, (6) is repeat
(4) and (5) until

663
00:47:27,460 --> 00:47:28,710
the change is small.

664
00:47:35,520 --> 00:47:41,610
So each time I do step (5), I
can keep track of how many

665
00:47:41,610 --> 00:47:46,500
points I've moved from one
cluster to another.

666
00:47:46,500 --> 00:47:52,760
Or each time I do step (4), I
can say how much have I moved

667
00:47:52,760 --> 00:47:54,010
the centroids?

668
00:47:56,550 --> 00:47:59,970
Each of those gives me a measure
of how much change the

669
00:47:59,970 --> 00:48:02,580
new iteration has produced.

670
00:48:02,580 --> 00:48:07,220
When I get to the point where
the iterations are not making

671
00:48:07,220 --> 00:48:08,540
much of a change--

672
00:48:08,540 --> 00:48:10,710
and we'll see what we
might mean by that--

673
00:48:10,710 --> 00:48:13,210
we stop and say, OK, we now
have a good clustering.
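
Here is a compact sketch of those steps in Python. It is illustrative only, not the code we will look at on Tuesday; it assumes the points are plain numeric tuples and uses Euclidean distance throughout.

    import random
    from math import dist  # Python 3.8+: Euclidean distance between two points

    def k_means(points, k, max_iterations=100, min_change=1e-9):
        # Steps (1)-(2): k is given; pick k of the points at random as the initial centroids.
        centroids = random.sample(points, k)
        clusters = []
        for _ in range(max_iterations):
            # Steps (3) and (5): assign each point to the nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Step (4): for each cluster, choose a new centroid (the average point).
            new_centroids = []
            for i, cluster in enumerate(clusters):
                if cluster:
                    new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
                else:
                    new_centroids.append(centroids[i])  # keep the old centroid for an empty cluster
            # Step (6): repeat until the centroids stop moving much.
            change = max(dist(old, new) for old, new in zip(centroids, new_centroids))
            centroids = new_centroids
            if change < min_change:
                break
        return clusters, centroids

Calling k_means(county_vectors, 50), where county_vectors is a hypothetical list of scaled county feature vectors, would cost on the order of 50 times 3,100 distance computations per pass rather than comparing every pair of counties.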

674
00:48:17,970 --> 00:48:21,000
So if we think of the complexity
each iteration is

675
00:48:21,000 --> 00:48:25,250
order k times n, where k is the number
of clusters, and n is

676
00:48:25,250 --> 00:48:27,060
the number of points.

677
00:48:27,060 --> 00:48:32,130
And then we do that step for
some number of iterations.

678
00:48:32,130 --> 00:48:35,690
So if the number of iterations
is small, it will converge

679
00:48:35,690 --> 00:48:38,070
quite quickly.

680
00:48:38,070 --> 00:48:42,520
And as we'll see, typically for
k-means, we don't need a

681
00:48:42,520 --> 00:48:47,090
lot of iterations to
get an answer.

682
00:48:47,090 --> 00:48:50,250
It's typically not proportional
to n, in

683
00:48:50,250 --> 00:48:54,010
particular, which is
very important.

684
00:48:54,010 --> 00:48:54,226
All right.

685
00:48:54,226 --> 00:48:57,910
Tuesday, we'll go over the code
for k-means clustering,

686
00:48:57,910 --> 00:49:01,570
and then have some fun playing
with counties and see what we

687
00:49:01,570 --> 00:49:04,470
can learn about where we live.

688
00:49:04,470 --> 00:49:04,642
All right.

689
00:49:04,642 --> 00:49:05,892
Thanks a lot.