1
00:00:00,040 --> 00:00:02,460
The following content is
provided under a Creative

2
00:00:02,460 --> 00:00:03,870
Commons license.

3
00:00:03,870 --> 00:00:06,910
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,910 --> 00:00:10,560
offer high quality educational
resources for free.

5
00:00:10,560 --> 00:00:13,460
To make a donation, or view
additional materials from

6
00:00:13,460 --> 00:00:19,290
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:19,290 --> 00:00:21,436
ocw.mit.edu.

8
00:00:21,436 --> 00:00:24,560
PROFESSOR: I want to go back to
where I stopped at the end

9
00:00:24,560 --> 00:00:30,370
of Tuesday's lecture, when you
let me pull a fast one on you.

10
00:00:30,370 --> 00:00:34,360
I ended up with a strong
statement that was

11
00:00:34,360 --> 00:00:37,250
effectively a lie.

12
00:00:37,250 --> 00:00:41,740
I told you that when we drop a
large enough number of pins,

13
00:00:41,740 --> 00:00:45,950
and do a large enough number of
trials, we can look at the

14
00:00:45,950 --> 00:00:51,780
small standard deviation we
get across trials and say,

15
00:00:51,780 --> 00:00:54,220
that means we have
a good answer.

16
00:00:54,220 --> 00:00:57,820
It doesn't change much.

17
00:00:57,820 --> 00:01:02,750
And I said, so we can tell you
that with 95% confidence, the

18
00:01:02,750 --> 00:01:08,170
answer lies between x and y,
where we had the two standard

19
00:01:08,170 --> 00:01:11,280
deviations from the mean.

20
00:01:11,280 --> 00:01:15,390
That's not actually true.

21
00:01:15,390 --> 00:01:19,630
I was confusing the notion
of a statistically sound

22
00:01:19,630 --> 00:01:23,240
conclusion with truth.

23
00:01:23,240 --> 00:01:26,520
The utility of every statistical
test rests on

24
00:01:26,520 --> 00:01:28,720
certain assumptions.

25
00:01:28,720 --> 00:01:30,370
So we talked about independence

26
00:01:30,370 --> 00:01:31,690
and things like that.

27
00:01:31,690 --> 00:01:36,490
But the key assumption is that
our simulation is actually a

28
00:01:36,490 --> 00:01:37,740
model of reality.

29
00:01:40,270 --> 00:01:43,530
You can recall that in designing
the simulation, we

30
00:01:43,530 --> 00:01:48,280
looked at the Buffon-Laplace
mathematics and did a little

31
00:01:48,280 --> 00:01:52,270
algebra from which we derived
the code, wrote the code, ran

32
00:01:52,270 --> 00:01:55,460
the simulation, looked at the
results, did the statistical

33
00:01:55,460 --> 00:01:58,450
analysis, and smiled.

34
00:01:58,450 --> 00:02:03,180
Well, suppose I had made
a coding error.

35
00:02:03,180 --> 00:02:06,760
So, for example, instead of
that 4 there-- which the

36
00:02:06,760 --> 00:02:08,889
algebra said we should have--

37
00:02:08,889 --> 00:02:10,869
I had mistakenly typed a 2.

38
00:02:15,560 --> 00:02:17,910
Not an impossible error.

39
00:02:17,910 --> 00:02:24,780
Now if we run it, what we're
going to see here is that it

40
00:02:24,780 --> 00:02:29,450
converges quite quickly, it
gives me a small standard

41
00:02:29,450 --> 00:02:33,980
deviation, and I can feel very
confident that my answer that

42
00:02:33,980 --> 00:02:38,970
pi is somewhere around 1.569.

43
00:02:38,970 --> 00:02:41,940
Well, it isn't of course.

44
00:02:41,940 --> 00:02:46,580
We know that that's nowhere
close to the value of pi.

45
00:02:46,580 --> 00:02:49,620
But there's nothing wrong
with my statistics.

46
00:02:49,620 --> 00:02:54,660
It's just that my statistics are
about the simulation, not

47
00:02:54,660 --> 00:02:58,230
about pi itself.
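
The kind of coding error the professor describes is easy to sketch. The code below is not the lecture's Buffon-Laplace program; it is a minimal Monte Carlo pi estimator (random points in the unit square against a quarter circle) in which the `scale` parameter plays the role of the mistyped constant: the algebra demands 4, and typing 2 halves the answer without disturbing the statistics at all.

```python
import random

def estimate_pi(num_points, scale=4.0):
    # Fraction of random points in the unit square that land inside
    # the quarter circle, times `scale`.  The math says scale must
    # be 4; passing 2 models the kind of typo described here.
    inside = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return scale * inside / num_points

random.seed(17)
print(estimate_pi(100_000))       # roughly 3.14
print(estimate_pi(100_000, 2.0))  # tightly clustered around 1.57 -- not pi
```

Both runs converge with a small standard deviation across trials; only a sanity check against reality exposes the second one as wrong.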

48
00:02:58,230 --> 00:03:01,480
So what's the moral here?

49
00:03:01,480 --> 00:03:06,120
Before believing the results of
any simulation, we have to

50
00:03:06,120 --> 00:03:10,720
have confidence that our
conceptual model is correct.

51
00:03:10,720 --> 00:03:14,000
And that we have correctly
implemented

52
00:03:14,000 --> 00:03:16,720
that conceptual model.

53
00:03:16,720 --> 00:03:19,220
How can we do that?

54
00:03:19,220 --> 00:03:22,860
Well, one thing we can
do is test our

55
00:03:22,860 --> 00:03:25,480
results against reality.

56
00:03:25,480 --> 00:03:31,550
So if I ran this and I said pi
is about 1.57, I could go draw

57
00:03:31,550 --> 00:03:36,810
a circle, and I could crudely
measure the circumference, and

58
00:03:36,810 --> 00:03:38,810
I would immediately know
I'm nowhere close

59
00:03:38,810 --> 00:03:41,340
to the right answer.

60
00:03:41,340 --> 00:03:43,070
And that's the right
thing to do.

61
00:03:43,070 --> 00:03:46,450
And in fact, what a scientist
does when they use a

62
00:03:46,450 --> 00:03:50,390
simulation model to derive
something, they always run

63
00:03:50,390 --> 00:03:54,040
some experiments to see whether
their derived result

64
00:03:54,040 --> 00:03:59,360
is actually at least
plausibly correct.

65
00:03:59,360 --> 00:04:02,610
Statistics are good to show
that we've got the little

66
00:04:02,610 --> 00:04:05,790
details right at the end,
but we've got to do a

67
00:04:05,790 --> 00:04:08,200
sanity check first.

68
00:04:08,200 --> 00:04:11,850
So that's a really important
moral to keep in mind.

69
00:04:11,850 --> 00:04:15,930
Don't get seduced by a
statistical test and confuse

70
00:04:15,930 --> 00:04:19,829
that with truth.

71
00:04:19,829 --> 00:04:25,670
All right, I now want to move
on to look at some more

72
00:04:25,670 --> 00:04:30,890
examples that do the same kind
of thing we've been doing.

73
00:04:30,890 --> 00:04:33,470
And in fact, what we're going
to be looking at is the

74
00:04:33,470 --> 00:04:36,405
interplay between physical
reality--

75
00:04:39,810 --> 00:04:42,880
some physical system, just
in the real world--

76
00:04:49,720 --> 00:04:55,210
some theoretical models of
the physical system, and

77
00:04:55,210 --> 00:04:56,505
computational models.

78
00:05:04,620 --> 00:05:07,410
Because this is really the
way modern science and

79
00:05:07,410 --> 00:05:09,690
engineering is done.

80
00:05:09,690 --> 00:05:13,710
We start with some physical
situation--

81
00:05:13,710 --> 00:05:17,080
and by physical I don't mean
it has to be bricks and

82
00:05:17,080 --> 00:05:19,070
mortar, or physics,
or biology.

83
00:05:19,070 --> 00:05:21,640
The physical situation
could be the stock

84
00:05:21,640 --> 00:05:23,220
market, if you will--

85
00:05:23,220 --> 00:05:26,130
some real situation
in the world.

86
00:05:26,130 --> 00:05:30,200
We use some theory to give us
some insight into that, and

87
00:05:30,200 --> 00:05:33,620
when the theory gets too
complicated or doesn't get us

88
00:05:33,620 --> 00:05:38,050
all the way to the answer,
we use computation.

89
00:05:38,050 --> 00:05:41,230
And I now want to talk about
how those things relate to

90
00:05:41,230 --> 00:05:42,750
each other.

91
00:05:42,750 --> 00:05:46,370
So imagine, for example, that
you're a bright student in

92
00:05:46,370 --> 00:05:49,940
high school biology, chemistry,
or physics--

93
00:05:49,940 --> 00:05:53,760
a situation probably all of
you have been in.

94
00:05:53,760 --> 00:05:57,380
You perform some experiment to
the best of your ability.

95
00:05:57,380 --> 00:05:59,860
But you've done the math and
you know your experimental

96
00:05:59,860 --> 00:06:03,500
results don't actually
match the theory.

97
00:06:03,500 --> 00:06:06,810
What should you do?

98
00:06:06,810 --> 00:06:09,500
Well I suspect you've all
been in this situation.

99
00:06:09,500 --> 00:06:13,570
You could just turn in the
results and risk getting

100
00:06:13,570 --> 00:06:17,140
criticized for poor laboratory
technique.

101
00:06:17,140 --> 00:06:19,330
Some of you may have
done this.

102
00:06:19,330 --> 00:06:21,770
More likely what you've done
is you've calculated the

103
00:06:21,770 --> 00:06:27,050
correct results and turned
those in, risking some

104
00:06:27,050 --> 00:06:29,860
suspicion that they're
too good to be true.

105
00:06:29,860 --> 00:06:32,720
But being smart guys, I suspect
what all of you did in

106
00:06:32,720 --> 00:06:36,690
high school is you calculated
the correct results, looked at

107
00:06:36,690 --> 00:06:40,690
your experimental results, and
met somewhere in between to

108
00:06:40,690 --> 00:06:43,250
introduce a little error, but
not look too foolish.

109
00:06:43,250 --> 00:06:47,560
Have any of you cheated that
way in high school?

110
00:06:47,560 --> 00:06:48,850
Yeah well all right.

111
00:06:48,850 --> 00:06:51,550
We have about two people
who would admit it.

112
00:06:51,550 --> 00:06:55,040
The rest of you are either
exceedingly honorable, or just

113
00:06:55,040 --> 00:06:56,640
don't want to admit it.

114
00:06:56,640 --> 00:06:58,960
I confess, I had fudged
experimental

115
00:06:58,960 --> 00:07:00,450
results in high school.

116
00:07:00,450 --> 00:07:03,730
But no longer. I've
seen the truth.

117
00:07:03,730 --> 00:07:08,260
All right, to do this correctly
you need to have a

118
00:07:08,260 --> 00:07:12,710
sense of how best to model not
only reality, but also

119
00:07:12,710 --> 00:07:15,380
experimental errors.

120
00:07:15,380 --> 00:07:19,480
Typically, the best way to model
experimental errors--

121
00:07:19,480 --> 00:07:21,240
and we need to do this
even when we're not

122
00:07:21,240 --> 00:07:23,390
attempting to cheat--

123
00:07:23,390 --> 00:07:29,330
is to assume some sort of random
perturbation of the

124
00:07:29,330 --> 00:07:31,800
actual data.

125
00:07:31,800 --> 00:07:35,850
And in fact, one of the key
steps forward, which was

126
00:07:35,850 --> 00:07:39,690
really Gauss' big contribution,
was to say we

127
00:07:39,690 --> 00:07:44,030
can typically model experimental
error as normally

128
00:07:44,030 --> 00:07:48,380
distributed, as a Gaussian
distribution.

129
00:07:48,380 --> 00:07:50,440
So let's look at an example.

130
00:07:50,440 --> 00:07:53,110
Let's consider a spring.

131
00:07:53,110 --> 00:07:56,890
Not the current time of year, or
a spring of water, but the

132
00:07:56,890 --> 00:08:00,060
kind of spring you looked
at in 8.01.

133
00:08:00,060 --> 00:08:03,630
The things you compress with
some force then they expand,

134
00:08:03,630 --> 00:08:05,690
or you stretch, and then
they contract.

135
00:08:08,270 --> 00:08:09,770
Springs are great things.

136
00:08:09,770 --> 00:08:15,750
We use them in our cars, our
mattresses, seat belts.

137
00:08:15,750 --> 00:08:20,930
We use them to launch
projectiles, lots of things.

138
00:08:20,930 --> 00:08:23,930
And in fact, as we'll see later,
they're frequently

139
00:08:23,930 --> 00:08:25,610
occurring in biology as well.

140
00:08:28,600 --> 00:08:31,560
I don't want to belabor
this, I presume

141
00:08:31,560 --> 00:08:32,900
you've all taken 8.01.

142
00:08:32,900 --> 00:08:36,200
Do they still do springs
in 8.01?

143
00:08:36,200 --> 00:08:38,380
Yes, good, all right.

144
00:08:38,380 --> 00:08:42,100
So as you know, in 1676--

145
00:08:42,100 --> 00:08:43,909
maybe you didn't
know the date--

146
00:08:43,909 --> 00:08:49,190
the British physicist, Robert
Hooke, formulated Hooke's Law

147
00:08:49,190 --> 00:08:51,940
to explain the behavior
of springs.

148
00:08:51,940 --> 00:08:56,430
And the law is very simple,
it's f equals minus kx.

149
00:09:03,840 --> 00:09:08,810
In other words, the force, f,
stored in the spring is

150
00:09:08,810 --> 00:09:12,990
linearly related to x, the
distance the spring has been

151
00:09:12,990 --> 00:09:14,265
either compressed
or stretched.

152
00:09:16,850 --> 00:09:20,170
OK, so that's Hooke's law,
you've all seen that.

153
00:09:20,170 --> 00:09:23,180
The law holds true for a wide
variety of materials and

154
00:09:23,180 --> 00:09:27,170
systems including many
biological systems.

155
00:09:27,170 --> 00:09:29,620
Of course, it does
not hold for an

156
00:09:29,620 --> 00:09:33,360
arbitrarily large force.

157
00:09:33,360 --> 00:09:37,660
All springs have an elastic
limit and if you stretch them

158
00:09:37,660 --> 00:09:40,580
beyond that the law fails.

159
00:09:40,580 --> 00:09:43,620
Has anyone here ever broken
a Slinky that way?

160
00:09:43,620 --> 00:09:46,030
Where you've just taken the
spring and stretched it so

161
00:09:46,030 --> 00:09:48,220
much that it's no
longer useful.

162
00:09:48,220 --> 00:09:52,780
Well you've exceeded
its elastic limit.

163
00:09:52,780 --> 00:09:58,280
The proportionality constant
here, k, is

164
00:09:58,280 --> 00:09:59,530
called the spring constant.

165
00:10:05,160 --> 00:10:08,580
And every spring has
a constant, k, that

166
00:10:08,580 --> 00:10:10,700
explains its behavior.

167
00:10:10,700 --> 00:10:13,170
If the spring is stiff like
the suspension in an

168
00:10:13,170 --> 00:10:16,800
automobile, k is big.

169
00:10:19,400 --> 00:10:22,430
If the spring is not stiff like
the spring in a ballpoint

170
00:10:22,430 --> 00:10:25,500
pen, k is small.

171
00:10:25,500 --> 00:10:29,390
The negative sign is there to
indicate that the force

172
00:10:29,390 --> 00:10:33,150
exerted by the spring is in the
opposite direction of the

173
00:10:33,150 --> 00:10:34,510
displacement.

174
00:10:34,510 --> 00:10:37,950
If you pull a spring down,
the force exerted by the

175
00:10:37,950 --> 00:10:41,410
spring is going up.

176
00:10:41,410 --> 00:10:45,950
Knowing the spring constant of
a spring is actually a matter

177
00:10:45,950 --> 00:10:48,890
of considerable practical
importance.

178
00:10:48,890 --> 00:10:52,600
It's used to do things like
calibrate scales--

179
00:10:52,600 --> 00:10:55,650
one can use to weigh
oneself, if one

180
00:10:55,650 --> 00:10:59,220
wants to know the truth--

181
00:10:59,220 --> 00:11:03,840
atomic force microscopes,
lots of kinds of things.

182
00:11:03,840 --> 00:11:07,230
And in fact, recently people
have started worrying about

183
00:11:07,230 --> 00:11:11,680
thinking that you should model
DNA as a spring, and finding

184
00:11:11,680 --> 00:11:14,510
the spring constant for
DNA turns out to be of

185
00:11:14,510 --> 00:11:19,670
considerable use in some
biological experiments.

186
00:11:19,670 --> 00:11:24,550
All right, so generations of
students have learned to

187
00:11:24,550 --> 00:11:29,050
estimate springs using this
very simple experiment.

188
00:11:29,050 --> 00:11:30,880
You've probably most of
you have done this.

189
00:11:36,770 --> 00:11:40,130
Get a picture up here,
all right.

190
00:11:40,130 --> 00:11:45,580
So what you do is you take a
spring and you hang it on some

191
00:11:45,580 --> 00:11:51,230
sort of apparatus, and then you
put a weight of known mass

192
00:11:51,230 --> 00:11:55,250
at the bottom of the spring, and
you measure how much the

193
00:11:55,250 --> 00:11:56,500
spring has stretched.

194
00:11:58,700 --> 00:12:02,730
You then can do the math,
using f equals minus kx.

195
00:12:02,730 --> 00:12:06,440
We also have to know that f
equals m times a, mass times

196
00:12:06,440 --> 00:12:09,210
acceleration.

197
00:12:09,210 --> 00:12:12,570
We know that on this planet at
least the acceleration due to

198
00:12:12,570 --> 00:12:18,630
gravity is roughly 9.81 meters
per second per second, and we

199
00:12:18,630 --> 00:12:22,490
can just do the algebra and
we can calculate k.

200
00:12:22,490 --> 00:12:26,650
So we hang one weight in the
spring, we measure it, we say,

201
00:12:26,650 --> 00:12:28,400
we're done.

202
00:12:28,400 --> 00:12:31,350
We now know what k is
for that spring.

203
00:12:31,350 --> 00:12:33,800
Not so easy, of course, to do
this experiment if the spring

204
00:12:33,800 --> 00:12:35,890
is a strand of DNA.

205
00:12:35,890 --> 00:12:37,890
So you need a slightly
more complicated

206
00:12:37,890 --> 00:12:40,120
apparatus to do that.

207
00:12:44,140 --> 00:12:48,820
This would be all well and
good if we didn't have

208
00:12:48,820 --> 00:12:52,920
experimental error, but we do.

209
00:12:52,920 --> 00:12:55,710
In any experiment we typically
have errors.

210
00:12:55,710 --> 00:12:59,820
So what people do instead is
rather than hanging one weight

211
00:12:59,820 --> 00:13:03,880
on the spring, they hang
different weights--

212
00:13:03,880 --> 00:13:05,620
weights of different mass--

213
00:13:05,620 --> 00:13:07,835
they wait for the spring to stop
moving and they measure

214
00:13:07,835 --> 00:13:12,220
it, and now they have
a series of points.

215
00:13:12,220 --> 00:13:15,660
And they assume that, well I've
got some errors and if we

216
00:13:15,660 --> 00:13:18,990
believe that our errors are
normally distributed some will

217
00:13:18,990 --> 00:13:21,120
be positive, some will
be negative.

218
00:13:21,120 --> 00:13:23,280
And if we do enough experiments
it will kind of

219
00:13:23,280 --> 00:13:28,910
all balance out and we'll be
able to actually get a good

220
00:13:28,910 --> 00:13:33,630
estimate of the spring
constant, k.

221
00:13:33,630 --> 00:13:39,680
I did such an experiment, put
the results in a file.

222
00:13:39,680 --> 00:13:41,470
This is just a format
of the file.

223
00:13:41,470 --> 00:13:44,050
The first line tells us what
it is, it's the distance in

224
00:13:44,050 --> 00:13:47,340
meters and a mass
in kilograms.

225
00:13:47,340 --> 00:13:51,230
And then I just have the two
things separated by a space,

226
00:13:51,230 --> 00:13:52,880
in this case.

227
00:13:52,880 --> 00:13:59,460
So my first experiment, the
distance I measured was 0.0865

228
00:13:59,460 --> 00:14:07,110
and the weight was
0.1 kilograms.
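
The algebra just described, for a single hanging weight, is k = F/x = m·g/x. A one-line sketch, using the first measurement from the file (0.0865 m of stretch under a 0.1 kg mass):

```python
def spring_constant(mass_kg, displacement_m, g=9.81):
    # Hooke's law with a hanging weight: |F| = k*x and F = m*g,
    # so k = m*g / x, in newtons per meter.
    return mass_kg * g / displacement_m

print(spring_constant(0.1, 0.0865))  # about 11.3 N/m from this one point
```

A single noisy measurement can differ noticeably from the true k, which is exactly why the lecture hangs many different weights and fits the whole data set instead.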

229
00:14:07,110 --> 00:14:10,580
All right, so I've now got
the data, so that's

230
00:14:10,580 --> 00:14:11,960
the physical reality.

231
00:14:11,960 --> 00:14:15,110
I've done my experiment.

232
00:14:15,110 --> 00:14:20,230
I've done some theory telling
me how to calculate k.

233
00:14:20,230 --> 00:14:26,000
And now I'm going to put them
together and write some code.

234
00:14:26,000 --> 00:14:27,250
So let's look at the code.

235
00:14:33,680 --> 00:14:37,570
I think we'll skip over this, and
I'll comment this out, so

236
00:14:37,570 --> 00:14:42,070
we don't see pi get
estimated over and over again.

237
00:14:42,070 --> 00:14:46,200
So the first piece of code is
pretty simple, it's just

238
00:14:46,200 --> 00:14:48,140
getting the data.

239
00:14:48,140 --> 00:14:51,100
And again, this is typically the
way one ought to structure

240
00:14:51,100 --> 00:14:52,320
these things.

241
00:14:52,320 --> 00:14:57,280
I/O, input/output, is typically
done in a separate

242
00:14:57,280 --> 00:15:00,890
function so that if the format
of the data were changed, I'd

243
00:15:00,890 --> 00:15:03,210
only have to change this,
and not the rest of my

244
00:15:03,210 --> 00:15:05,160
computation.

245
00:15:05,160 --> 00:15:10,340
It opens the file, discards the
header, and then uses a

246
00:15:10,340 --> 00:15:20,360
split to get the x values and
the y values, all right.

247
00:15:20,360 --> 00:15:23,440
So now I just get all the
distances and all the masses--

248
00:15:23,440 --> 00:15:27,170
not the x's and the y's yet,
just distances and masses.

249
00:15:27,170 --> 00:15:30,220
Then I close the file
and return them.
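
The function itself isn't reproduced in the transcript; here is a minimal sketch consistent with the description (one header line, then a space-separated distance in meters and mass in kilograms per line — the exact function and file names are assumptions):

```python
def get_data(file_name):
    # Read the measurement file: discard the one-line header, then
    # split each remaining line into a distance (m) and a mass (kg).
    distances, masses = [], []
    with open(file_name) as data_file:
        data_file.readline()        # discard the header
        for line in data_file:
            d, m = line.split()     # two fields separated by a space
            distances.append(float(d))
            masses.append(float(m))
    return distances, masses
```

The lecture's version opens and closes the file explicitly; a `with` block is the idiomatic way to get the same effect.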

250
00:15:30,220 --> 00:15:32,860
Nothing that you haven't
seen before.

251
00:15:32,860 --> 00:15:35,360
Nothing that you won't get to
write again, and again, and

252
00:15:35,360 --> 00:15:39,530
again, similar kinds
of things.

253
00:15:39,530 --> 00:15:41,340
Then I plot the data.

254
00:15:41,340 --> 00:15:44,760
So here we see something that's
a little bit different

255
00:15:44,760 --> 00:15:47,870
from what we've seen before.

256
00:15:47,870 --> 00:15:49,980
So the first thing I do
is I got my x and

257
00:15:49,980 --> 00:15:52,560
y by calling getData.

258
00:15:52,560 --> 00:15:55,690
Then I do a type conversion.

259
00:15:55,690 --> 00:16:00,540
What getData is returning
is a list.

260
00:16:00,540 --> 00:16:02,830
I'm here going to convert
a list to another

261
00:16:02,830 --> 00:16:06,130
type called an array.

262
00:16:06,130 --> 00:16:12,020
This is a type implemented by
a class supplied by PyLab

263
00:16:12,020 --> 00:16:14,690
which is built on top of
something called NumPy, which

264
00:16:14,690 --> 00:16:16,730
is where it comes from.

265
00:16:16,730 --> 00:16:20,960
An array is kind
of like a list.

266
00:16:20,960 --> 00:16:24,250
It's a sequence of things.

267
00:16:24,250 --> 00:16:28,620
There are some list
methods that are not

268
00:16:28,620 --> 00:16:32,310
available, like append, but
it's got some other things

269
00:16:32,310 --> 00:16:34,460
that are extremely valuable.

270
00:16:34,460 --> 00:16:36,660
For example, I can
do point-wise

271
00:16:36,660 --> 00:16:39,770
operations on an array.

272
00:16:39,770 --> 00:16:44,560
So if I multiply an array by
3, what that does is it

273
00:16:44,560 --> 00:16:49,420
multiplies each element by 3.

274
00:16:49,420 --> 00:16:53,690
If I multiply one array
by another, it

275
00:16:53,690 --> 00:16:56,970
multiplies them element by element.
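
These point-wise operations are easy to demonstrate. PyLab's array type is NumPy's ndarray, so importing numpy directly shows the same behavior:

```python
import numpy as np  # pylab's array type comes from NumPy

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

print(a * 3)     # multiplies each element by 3 -> [3. 6. 9.]
print(a * b)     # element-by-element product   -> [10. 40. 90.]
print(a * 9.81)  # e.g. masses in kg -> forces in newtons
```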

276
00:16:56,970 --> 00:17:01,480
OK, so they're very valuable
for these kinds of things.

277
00:17:01,480 --> 00:17:06,640
Typically, in Python, one starts
with a list, because

278
00:17:06,640 --> 00:17:09,619
lists are more convenient to
build up incrementally than

279
00:17:09,619 --> 00:17:13,359
arrays, and then converts them
to an array so that you can do

280
00:17:13,359 --> 00:17:15,869
the math on them.

281
00:17:15,869 --> 00:17:18,550
For those of you who've seen
MATLAB you're very familiar

282
00:17:18,550 --> 00:17:22,339
with the concept of what
Python calls an array.

283
00:17:22,339 --> 00:17:27,230
Those of you who know C or
Pascal, what they call an array

284
00:17:27,230 --> 00:17:28,930
has nothing to do with
what Python or

285
00:17:28,930 --> 00:17:30,890
PyLab calls an array.

286
00:17:30,890 --> 00:17:34,040
So it can be a little
bit confusing.

287
00:17:34,040 --> 00:17:37,470
Any rate, I convert
them to arrays.

288
00:17:37,470 --> 00:17:40,940
And then what I'll do here, now
that I have an array, I'll

289
00:17:40,940 --> 00:17:46,810
multiply my x values by the
acceleration due to gravity,

290
00:17:46,810 --> 00:17:50,750
this constant 9.81.

291
00:17:50,750 --> 00:17:52,290
And then I'm just going
to plot them.

292
00:17:55,230 --> 00:17:56,970
All right, so let's see
what we get here.

293
00:18:12,150 --> 00:18:14,230
So here I've now plotted
the measure

294
00:18:14,230 --> 00:18:15,480
displacement of the spring.

295
00:18:19,280 --> 00:18:26,600
Force in Newtons, that's the
standard international unit

296
00:18:26,600 --> 00:18:27,430
for measuring force.

297
00:18:27,430 --> 00:18:30,390
It's the amount of force needed
to accelerate a mass of

298
00:18:30,390 --> 00:18:36,260
1 kilogram at a rate of 1 meter
per second per second.

299
00:18:36,260 --> 00:18:38,420
So I've plotted the force
in Newtons against

300
00:18:38,420 --> 00:18:41,125
the distance in meters.

301
00:18:43,760 --> 00:18:45,010
OK.

302
00:18:47,150 --> 00:18:50,390
Now I can go and calculate k.

303
00:18:55,530 --> 00:18:59,870
Well, how am I going
to do that?

304
00:18:59,870 --> 00:19:07,800
Well, before I do that, I'm
going to do something to see

305
00:19:07,800 --> 00:19:11,185
whether or not my data
is sensible.

306
00:19:21,850 --> 00:19:28,090
What we often do, is we have
a theoretical model and the

307
00:19:28,090 --> 00:19:33,340
model here is that the data
should fall on a line, roughly

308
00:19:33,340 --> 00:19:36,155
speaking, modulo experimental
errors.

309
00:19:39,070 --> 00:19:43,260
I'm going to now find out
what that line is.

310
00:19:43,260 --> 00:19:46,955
Because if I know that line,
I can compute k.

311
00:19:49,470 --> 00:19:52,360
How does k relate
to that line?

312
00:19:52,360 --> 00:19:55,960
So I plot a line.

313
00:19:55,960 --> 00:19:58,590
And now I can look at the
slope of that line, how

314
00:19:58,590 --> 00:20:01,160
quickly it's changing.

315
00:20:01,160 --> 00:20:03,595
And k will be simply the
inverse of that.

316
00:20:07,550 --> 00:20:08,800
How do I get the line?

317
00:20:11,180 --> 00:20:17,510
Well, I'm going to find a line
that is the best approximation

318
00:20:17,510 --> 00:20:20,150
to the points I have.

319
00:20:20,150 --> 00:20:33,020
So if, for example, I have two
points, a point here and a

320
00:20:33,020 --> 00:20:37,300
point here, I know I
can, quote, fit a

321
00:20:37,300 --> 00:20:38,800
line to that curve--

322
00:20:38,800 --> 00:20:40,280
to those points--

323
00:20:40,280 --> 00:20:42,300
it will always be perfect.

324
00:20:42,300 --> 00:20:44,890
It will be a line.

325
00:20:44,890 --> 00:20:47,460
So this is what's
called a fit.

326
00:20:47,460 --> 00:20:53,110
Now if I have a bunch of points
sort of scattered

327
00:20:53,110 --> 00:20:59,290
around, I then have to figure
out, OK, what line is the

328
00:20:59,290 --> 00:21:01,740
closest to those points?

329
00:21:01,740 --> 00:21:03,670
What fits it the best?

330
00:21:03,670 --> 00:21:08,420
And I might say, OK, it's
a line like this.

331
00:21:08,420 --> 00:21:12,300
But in order to do that, in
order to fit a line to more

332
00:21:12,300 --> 00:21:16,500
than two points, I need
some measure of the

333
00:21:16,500 --> 00:21:18,960
goodness of the fit.

334
00:21:18,960 --> 00:21:23,950
Because what I want to choose
here is the best fit.

335
00:21:23,950 --> 00:21:29,050
What line is the best
approximation of the data I've

336
00:21:29,050 --> 00:21:31,390
actually got?

337
00:21:31,390 --> 00:21:42,590
But in order to do that, I need
some objective function

338
00:21:42,590 --> 00:21:49,120
that tells me how good
is a particular fit.

339
00:21:49,120 --> 00:21:52,880
It lets me compare two
fits so that I can

340
00:21:52,880 --> 00:21:54,680
choose the best one.

341
00:21:58,050 --> 00:22:05,820
OK, now if we want to look at
that we have to ask, what

342
00:22:05,820 --> 00:22:09,030
should that be?

343
00:22:09,030 --> 00:22:11,000
There are lots of
possibilities.

344
00:22:11,000 --> 00:22:14,040
One could say, all right let's
find the line that goes

345
00:22:14,040 --> 00:22:16,790
through the most points,
that actually

346
00:22:16,790 --> 00:22:18,880
touches the most points.

347
00:22:18,880 --> 00:22:23,850
The problem with that is it's
really hard, and may be

348
00:22:23,850 --> 00:22:27,860
totally irrelevant, and in fact
you may not find a line

349
00:22:27,860 --> 00:22:30,410
that touches more
than one point.

350
00:22:30,410 --> 00:22:33,870
So we need something
different.

351
00:22:33,870 --> 00:22:38,080
And there is a standard measure
that's typically used

352
00:22:38,080 --> 00:22:40,350
and that's called the
least squares fit.

353
00:22:48,520 --> 00:22:52,940
That's the objective function
that's almost always used in

354
00:22:52,940 --> 00:22:56,690
measuring how good any curve--
or how well, excuse me, any

355
00:22:56,690 --> 00:22:58,805
curve fits a set of points.

356
00:23:01,820 --> 00:23:09,842
What it looks like is the sum
from L equals 0 to L equals

357
00:23:09,842 --> 00:23:20,360
the len of the observed points
minus 1, just because of the

358
00:23:20,360 --> 00:23:22,750
way things will work
in Python.

359
00:23:22,750 --> 00:23:32,840
But the key thing is what we're
summing is the observed

360
00:23:32,840 --> 00:23:44,253
at point L minus the predicted
at point L-squared.

361
00:23:48,110 --> 00:23:51,490
And since we're looking for the
least squares fit, we want

362
00:23:51,490 --> 00:23:55,220
to minimize that.

363
00:23:55,220 --> 00:23:59,670
The smallest difference
we can get.
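
The objective function just described can be written down directly. A small sketch, where observed and predicted are equal-length sequences of y values:

```python
def sum_squared_error(observed, predicted):
    # The least-squares objective: the sum over every point of
    # (observed - predicted)**2.  Squaring discards the sign of the
    # error; smaller totals mean better fits.
    total = 0.0
    for i in range(len(observed)):
        total += (observed[i] - predicted[i]) ** 2
    return total

print(sum_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.5]))  # 0.5
```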

364
00:23:59,670 --> 00:24:02,630
So there's some things
to notice about this.

365
00:24:02,630 --> 00:24:08,870
Once we have a, quote, fit, in
this case a line, for every x

366
00:24:08,870 --> 00:24:14,280
value the fit predicts
a y value.

367
00:24:14,280 --> 00:24:14,530
Right?

368
00:24:14,530 --> 00:24:16,620
That's what our model does.

369
00:24:16,620 --> 00:24:20,140
Our model in this case will take
the independent variable,

370
00:24:20,140 --> 00:24:27,290
x, the mass, and predict the
dependent variable, the

371
00:24:27,290 --> 00:24:28,540
displacement.

372
00:24:30,860 --> 00:24:33,450
But in addition to the predicted
values, we have the

373
00:24:33,450 --> 00:24:38,000
observed values, these guys.

374
00:24:38,000 --> 00:24:40,550
And now we just measure the
difference between the

375
00:24:40,550 --> 00:24:44,770
predicted and the observed,
square it, and notice by

376
00:24:44,770 --> 00:24:49,190
squaring the difference we have
discarded whether it's

377
00:24:49,190 --> 00:24:51,665
above or below the line--
because we don't care, we just

378
00:24:51,665 --> 00:24:53,120
care how far it's
from the line.

379
00:24:55,650 --> 00:24:59,220
And then we sum all of those up
and the smaller we can make

380
00:24:59,220 --> 00:25:00,695
that, the better our fit is.

381
00:25:03,670 --> 00:25:05,790
Makes sense?

382
00:25:05,790 --> 00:25:09,510
So now how do we find
the best fit?

383
00:25:09,510 --> 00:25:10,800
Well, there's several different

384
00:25:10,800 --> 00:25:12,810
methods you could use.

385
00:25:12,810 --> 00:25:15,080
You can actually do this
using Newton's method.

386
00:25:18,770 --> 00:25:22,600
Under many conditions there are
analytical solutions, so

387
00:25:22,600 --> 00:25:25,080
you don't have to use
approximation you can just

388
00:25:25,080 --> 00:25:26,330
compute it.

389
00:25:26,330 --> 00:25:31,610
And the best news of all, it's
built into PyLab. So that's how

390
00:25:31,610 --> 00:25:33,030
you actually do it.

391
00:25:33,030 --> 00:25:37,500
You call the PyLab function
that does it for you.

392
00:25:37,500 --> 00:25:40,310
That function is
called polyfit.

393
00:25:48,010 --> 00:25:50,873
polyfit takes three arguments.

394
00:25:53,630 --> 00:26:02,470
It takes all of the observed X
values, all of the observed Y

395
00:26:02,470 --> 00:26:06,885
values, and the degree
of the polynomial.

396
00:26:14,420 --> 00:26:17,410
So I've been talking about
fitting lines.

397
00:26:17,410 --> 00:26:21,720
As we'll see, polyfit can be
used to fit polynomials of

398
00:26:21,720 --> 00:26:24,130
arbitrary degree to data.

399
00:26:24,130 --> 00:26:25,900
So you can fit a line,
you can fit a

400
00:26:25,900 --> 00:26:28,400
parabola, you can fit cubic.

401
00:26:28,400 --> 00:26:30,640
I don't know what it's called,
you can fit a 10th order

402
00:26:30,640 --> 00:26:34,250
polynomial, whatever
you choose here.

403
00:26:37,220 --> 00:26:41,030
And then it returns
some values.

404
00:26:41,030 --> 00:26:53,220
So if we think about it being
a line, we know that it's

405
00:26:53,220 --> 00:26:57,585
defined by the y value is
equal to ax plus b.

406
00:27:00,680 --> 00:27:05,000
Some constant times the x value
plus b, the y-intercept.

407
00:27:08,960 --> 00:27:11,510
So now let's look at it.

408
00:27:11,510 --> 00:27:21,130
We see here in fitData, what
I've done is I've gotten my

409
00:27:21,130 --> 00:27:25,100
values as before, and now I'm
going to say, a,b equals

410
00:27:25,100 --> 00:27:30,400
pyLab.polyfit of xVals,
yVals, and 1.

411
00:27:30,400 --> 00:27:35,250
Since I'm looking for a
line, the degree is 1.

412
00:27:35,250 --> 00:27:38,740
Once I've got that, I can then
compute the estimated y

413
00:27:38,740 --> 00:27:43,050
values, a times pyLab.array.

414
00:27:43,050 --> 00:27:46,230
I'm turning the x values into
an array, actually I didn't

415
00:27:46,230 --> 00:27:49,710
need to do that since I'd
already done it, that's okay--

416
00:27:49,710 --> 00:27:52,590
plus b.

417
00:27:52,590 --> 00:27:57,520
And now I'll plot it and, by the
way, now that I've got my

418
00:27:57,520 --> 00:28:01,460
line, I can also compute k.
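
The whole fit can be sketched end to end. The measurements below are made up for illustration, not the lecture's data file, and pylab's polyfit is numpy's:

```python
import numpy as np

masses = np.array([0.05, 0.10, 0.15, 0.20, 0.25])          # kg (made up)
distances = np.array([0.023, 0.046, 0.068, 0.092, 0.114])  # m (made up)

forces = masses * 9.81                   # convert masses to forces, in newtons
a, b = np.polyfit(forces, distances, 1)  # degree 1: distance = a*force + b
est_distances = a * forces + b           # the fitted line's predicted y values
k = 1.0 / a                              # slope is x/F, so k = F/x = 1/slope
print(round(k, 1))                       # about 21.5 N/m for this fake data
```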

419
00:28:06,020 --> 00:28:07,270
And let's see what we get.

420
00:28:27,460 --> 00:28:33,580
All right, I fit a line, and
I've got a linear fit, and I

421
00:28:33,580 --> 00:28:37,460
said my spring constant
k is 21 point --

422
00:28:37,460 --> 00:28:39,940
I've rounded it to 5 digits
just so it would fit

423
00:28:39,940 --> 00:28:41,865
nicely on my plot.

424
00:28:44,710 --> 00:28:45,960
OK.

425
00:28:48,170 --> 00:28:54,230
The method that's used to do
this in PyLab is called a

426
00:28:54,230 --> 00:28:55,480
linear regression.

427
00:29:00,880 --> 00:29:03,200
Now you might think it's called
linear regression

428
00:29:03,200 --> 00:29:07,760
because I just used it to find a
line, but in fact that's not

429
00:29:07,760 --> 00:29:10,010
why it's called linear
regression.

430
00:29:10,010 --> 00:29:12,870
Because we can use linear
regression to find a parabola,

431
00:29:12,870 --> 00:29:17,500
or a cubic, or anything else.

432
00:29:17,500 --> 00:29:23,680
The reason it's called linear,
well let's look at an example.

433
00:29:23,680 --> 00:29:26,546
So if I wanted a parabola, I
would have y equals ax-squared

434
00:29:26,546 --> 00:29:27,796
plus bx plus c.

435
00:29:35,070 --> 00:29:41,340
We think of the variables, the
independent variables, as

436
00:29:41,340 --> 00:29:43,515
x-squared and x.

437
00:29:46,300 --> 00:29:54,720
And y is indeed a linear
function of those variables,

438
00:29:54,720 --> 00:29:56,165
because we're adding terms.
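That point can be checked directly: a degree-2 polyfit gives the same answer as ordinary linear least squares on the "variables" x squared, x, and 1. A sketch with made-up, exactly-parabolic data:

```python
import numpy as np

# Illustrative data lying exactly on y = 3x^2 + 2x + 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 3.0 * x**2 + 2.0 * x + 1.0

# The usual call: fit a degree-2 polynomial.
p = np.polyfit(x, y, 2)

# The same fit posed as a *linear* least-squares problem:
# y is a linear combination of the independent variables x**2, x, and 1.
A = np.column_stack([x**2, x, np.ones_like(x)])
q, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
```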

439
00:29:58,710 --> 00:30:01,380
Not important that you
understand the details, it is

440
00:30:01,380 --> 00:30:03,970
important that you know that
linear regression can be used

441
00:30:03,970 --> 00:30:06,615
to find polynomials
other than lines.

442
00:30:12,160 --> 00:30:17,990
All right, so we
got this done.

443
00:30:17,990 --> 00:30:21,510
Should we be happy?

444
00:30:21,510 --> 00:30:24,820
We can look at this, we fit
the best line to this data

445
00:30:24,820 --> 00:30:28,735
point, we computed
k, are we done?

446
00:30:34,970 --> 00:30:39,020
Well I'm kind of concerned,
because when I look at my

447
00:30:39,020 --> 00:30:47,670
picture it is the best line I
can fit to this, but wow it's

448
00:30:47,670 --> 00:30:50,750
not a very good fit in
some sense, right.

449
00:30:50,750 --> 00:30:53,730
I look at that line, the
points are pretty

450
00:30:53,730 --> 00:30:55,890
far away from it.

451
00:30:55,890 --> 00:30:58,640
And if it's not a good fit, then
I have to be suspicious

452
00:30:58,640 --> 00:31:03,380
about my value of k, which is
derived from having the model

453
00:31:03,380 --> 00:31:05,820
I get by doing this fit.

454
00:31:05,820 --> 00:31:08,645
Well, all right, let's
try something else.

455
00:31:11,330 --> 00:31:20,340
Let's look at FitData1, which in
addition to doing a linear

456
00:31:20,340 --> 00:31:22,580
fit, I'm going to
fit a cubic --

457
00:31:25,280 --> 00:31:27,310
partly to show you
how to do it.

458
00:31:27,310 --> 00:31:32,980
Here I'm going to say abcd
equals pyLab.polyfit of xVals,

459
00:31:32,980 --> 00:31:36,320
yVals and 3 instead of 1.
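A sketch of that cubic fit, again with hypothetical numbers in place of the file's data (the flattening shape loosely mimics the over-stretched spring):

```python
import numpy as np

# Hypothetical data that flattens out at the end (illustrative only).
xVals = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
yVals = np.array([0.0, 0.9, 1.6, 2.6, 3.0, 3.2, 3.3])

a, b, c, d = np.polyfit(xVals, yVals, 3)                # degree 3: a cubic
estYVals = a * xVals**3 + b * xVals**2 + c * xVals + d  # the model's y values
```

Because the line is a special case of the cubic, the cubic's squared error can never be worse than the line's on the same data.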

460
00:31:36,320 --> 00:31:39,160
So it's a more complex
function.

461
00:31:39,160 --> 00:31:43,020
Let's see what that gives us.

462
00:31:43,020 --> 00:31:45,250
First let me comment that out.

463
00:31:48,960 --> 00:31:53,250
So we're going to now compare
visually what we get when we

464
00:31:53,250 --> 00:31:57,310
get a line fit versus we get a
cubic fit to the same data.

465
00:32:04,680 --> 00:32:10,660
Well it looks to me like a
cubic is a much better

466
00:32:10,660 --> 00:32:14,010
description of the data, a much
better model of the data,

467
00:32:14,010 --> 00:32:15,260
than a line.

468
00:32:19,960 --> 00:32:21,990
Pretty good.

469
00:32:21,990 --> 00:32:23,970
Well, should I be
happy with this?

470
00:32:27,260 --> 00:32:29,880
Well, let's ask ourselves in
one question, why are we

471
00:32:29,880 --> 00:32:31,940
building the model?

472
00:32:31,940 --> 00:32:35,480
We're building the model so that
we can better understand

473
00:32:35,480 --> 00:32:37,180
the spring.

474
00:32:37,180 --> 00:32:40,920
One of the things we often do
with models is use them to

475
00:32:40,920 --> 00:32:43,780
predict values that we have not
been able to run in our

476
00:32:43,780 --> 00:32:46,250
experiments.

477
00:32:46,250 --> 00:32:48,970
So, for example, if you're
building a model of a nuclear

478
00:32:48,970 --> 00:32:52,970
reactor you might want to know
what happens when the power is

479
00:32:52,970 --> 00:32:56,500
turned off for some
period of time.

480
00:32:56,500 --> 00:32:59,000
In fact, if you read today's
paper you noticed they've just

481
00:32:59,000 --> 00:33:01,800
done a simulation model of a
nuclear reactor, in, I think,

482
00:33:01,800 --> 00:33:05,720
Tennessee, and discovered that
if it lost power for more than

483
00:33:05,720 --> 00:33:08,120
two days, it would start
to look like the

484
00:33:08,120 --> 00:33:11,080
nuclear reactors in Japan.

485
00:33:11,080 --> 00:33:13,010
Not a very good thing.

486
00:33:13,010 --> 00:33:14,710
But of course, that's
not an experiment

487
00:33:14,710 --> 00:33:17,230
anyone wants to run.

488
00:33:17,230 --> 00:33:19,770
No one wants to blow up this
nuclear reactor just to see

489
00:33:19,770 --> 00:33:21,190
what happens.

490
00:33:21,190 --> 00:33:25,770
So they do use a simulation
model to predict what would

491
00:33:25,770 --> 00:33:28,840
happen in an experiment
you can't run.

492
00:33:28,840 --> 00:33:33,380
So let's use our model here
to do some predictions.

493
00:33:40,730 --> 00:33:44,350
So here I've taken the same
program, I've called it

494
00:33:44,350 --> 00:33:49,720
FitData2, but what I've done
is I've added a point.

495
00:33:49,720 --> 00:33:54,350
So instead of just looking at
the x values, I'm looking at

496
00:33:54,350 --> 00:34:00,220
something I'm calling extended
x, where I've added a weight

497
00:34:00,220 --> 00:34:06,370
of 1 and a 1/2 kilos to the
spring just to see what would

498
00:34:06,370 --> 00:34:11,110
happen, what the model
would predict.

499
00:34:11,110 --> 00:34:13,940
And other than that, everything
is the same.

500
00:34:26,838 --> 00:34:29,230
Oops, what's happened here?

501
00:34:37,560 --> 00:34:39,710
Probably shouldn't be computing
k here with a

502
00:34:39,710 --> 00:34:40,960
non-linear model.

503
00:34:45,250 --> 00:34:48,969
All right, why is it not?

504
00:34:48,969 --> 00:34:51,670
Come on, there it is.

505
00:34:51,670 --> 00:34:56,169
And now we have to un-comment
this out, un-comment this.

506
00:35:04,470 --> 00:35:09,990
Well it fit the existing data
pretty darn well, but it has a

507
00:35:09,990 --> 00:35:13,180
very strange prediction here.

508
00:35:13,180 --> 00:35:15,640
If you think about our
experiment, it's predicting

509
00:35:15,640 --> 00:35:20,010
not only that the spring stopped
stretching, but that

510
00:35:20,010 --> 00:35:23,810
it goes to above where
it started.

511
00:35:23,810 --> 00:35:27,150
Highly unlikely in
a physical world.

512
00:35:27,150 --> 00:35:33,570
So what we see here is that
while I can easily fit a curve

513
00:35:33,570 --> 00:35:38,430
to the data, it fits it
beautifully, it turns out to

514
00:35:38,430 --> 00:35:40,025
have very bad predictive
value.

515
00:35:43,470 --> 00:35:45,460
What's going on here?

516
00:35:45,460 --> 00:35:51,130
Well, I started this whole
endeavor under an assumption

517
00:35:51,130 --> 00:35:54,930
that there was some theory about
springs, Hooke's law,

518
00:35:54,930 --> 00:35:58,260
and that it should be
a linear model.

519
00:35:58,260 --> 00:36:02,620
Just because my data maybe
didn't fit that theory,

520
00:36:02,620 --> 00:36:05,700
doesn't mean I should just fit
an arbitrary curve and see

521
00:36:05,700 --> 00:36:06,950
what happens.

522
00:36:08,840 --> 00:36:12,780
It is the case that if you're
willing to get a high enough

523
00:36:12,780 --> 00:36:15,070
degree polynomial, you can
get a pretty good fit

524
00:36:15,070 --> 00:36:17,690
to almost any data.

525
00:36:17,690 --> 00:36:19,920
But that doesn't
prove anything.

526
00:36:19,920 --> 00:36:21,170
It's not useful.

527
00:36:23,920 --> 00:36:26,990
It's one of the reasons why when
I read papers I always

528
00:36:26,990 --> 00:36:29,550
like to see the raw data.

529
00:36:29,550 --> 00:36:31,910
I hate it when I read a
technical paper and it just

530
00:36:31,910 --> 00:36:34,600
shows me the curve that they fit
to the data, rather than

531
00:36:34,600 --> 00:36:42,950
the data, because it's easy to
get to the wrong place here.

532
00:36:42,950 --> 00:36:49,160
So let's for the moment
ignore the curves and

533
00:36:49,160 --> 00:36:51,930
look at the raw data.

534
00:36:51,930 --> 00:36:54,970
What do we see here about
the raw data?

535
00:36:54,970 --> 00:37:02,110
Well, it looks like at the
end it's flattening out.

536
00:37:02,110 --> 00:37:06,870
Well, that violates Hooke's law,
which says I should have

537
00:37:06,870 --> 00:37:09,170
a linear relationship.

538
00:37:09,170 --> 00:37:12,660
Suddenly it stopped
being linear.

539
00:37:12,660 --> 00:37:14,590
Have we violated Hooke's law?

540
00:37:18,520 --> 00:37:21,070
Have I done something so strange
that maybe I should

541
00:37:21,070 --> 00:37:24,190
just give up on this
experiment?

542
00:37:24,190 --> 00:37:25,420
What's the deal here?

543
00:37:25,420 --> 00:37:28,950
So, does this data contradict
Hooke's law?

544
00:37:28,950 --> 00:37:30,930
Let me ask that question.

545
00:37:30,930 --> 00:37:32,180
Yes or no?

546
00:37:34,070 --> 00:37:35,320
Who says no?

547
00:37:37,550 --> 00:37:41,711
AUDIENCE: Hooke's law applies
only for small displacements.

548
00:37:41,711 --> 00:37:44,110
PROFESSOR: Well, not
necessarily small.

549
00:37:44,110 --> 00:37:46,875
But only up to an
elastic limit.

550
00:37:46,875 --> 00:37:48,767
AUDIENCE: Which is in the scheme
of infinitely small.

551
00:37:48,767 --> 00:37:51,505
PROFESSOR: Compared to
infinity [INAUDIBLE].

552
00:37:51,505 --> 00:37:54,135
AUDIENCE: Yes, sorry, up to the
limit where the linearity

553
00:37:54,135 --> 00:37:54,460
breaks down.

554
00:37:54,460 --> 00:37:58,140
PROFESSOR: Exactly right.

555
00:37:58,140 --> 00:38:00,762
Oh, I overthrew my hand here.

556
00:38:00,762 --> 00:38:02,654
AUDIENCE: I'll get it.

557
00:38:02,654 --> 00:38:06,290
PROFESSOR: Pick it up
on your way out.

558
00:38:06,290 --> 00:38:07,310
Exactly, it doesn't.

559
00:38:07,310 --> 00:38:10,880
It just says, probably I
exceeded the elastic limit of

560
00:38:10,880 --> 00:38:13,890
my spring in this experiment.

561
00:38:13,890 --> 00:38:21,920
Well now, let's go back and
let's go back to our original

562
00:38:21,920 --> 00:38:42,330
code and see what happens if I
discard the last six points,

563
00:38:42,330 --> 00:38:43,420
where it's flattened out.

564
00:38:43,420 --> 00:38:46,900
The points that seem to be where
I've exceeded the limit.

565
00:38:46,900 --> 00:38:48,315
So I can easily do that.

566
00:38:51,640 --> 00:38:52,895
Do this little coding hack.

567
00:38:56,210 --> 00:38:58,520
It's so much easier to do
experiments with code than

568
00:38:58,520 --> 00:39:01,810
with physical objects.

569
00:39:01,810 --> 00:39:03,060
Now let's see what we get.

570
00:39:19,820 --> 00:39:22,920
Well, we get something that's
visually a much better fit.

571
00:39:26,620 --> 00:39:28,695
And we get a very different
value of k.

572
00:39:32,630 --> 00:39:35,760
So we're a lot happier here.

573
00:39:35,760 --> 00:39:38,810
And if I fit cubic to this you
would find that the cubic and

574
00:39:38,810 --> 00:39:43,940
the line actually look
a lot alike.

575
00:39:43,940 --> 00:39:50,220
So this is a good
thing, I guess.

576
00:39:50,220 --> 00:39:57,520
On the other hand, how do we
know which line is a better

577
00:39:57,520 --> 00:40:03,180
representation of physical
reality, a better model?

578
00:40:03,180 --> 00:40:09,240
After all, I could delete all
the points except any two and

579
00:40:09,240 --> 00:40:12,100
then I would get a line that was
a perfect fit, R-squared

580
00:40:12,100 --> 00:40:17,110
-- you know the mean squared
error -- would be 0, right?

581
00:40:17,110 --> 00:40:19,350
Because you can fit a line
to any two points.

582
00:40:23,890 --> 00:40:26,340
So again, we're seeing that we
have a question here that

583
00:40:26,340 --> 00:40:29,240
can't be answered
by statistics.

584
00:40:29,240 --> 00:40:33,120
It's not just a question
of how good my fit is.

585
00:40:33,120 --> 00:40:37,600
I have to go back
to the theory.

586
00:40:37,600 --> 00:40:43,820
And what my theory tells me is
that it should be linear, and

587
00:40:43,820 --> 00:40:46,800
I have a theoretical
justification of discarding

588
00:40:46,800 --> 00:40:49,060
those last six points.

589
00:40:49,060 --> 00:40:51,350
It's plausible that I
exceeded the limit.

590
00:40:54,400 --> 00:40:57,960
I don't have a theoretical
justification of deleting six

591
00:40:57,960 --> 00:41:00,750
arbitrary points somewhere in
the middle that I didn't

592
00:41:00,750 --> 00:41:04,550
happen to like because they
didn't fit the data.

593
00:41:04,550 --> 00:41:10,040
So again, the theme that I'm
getting to is this interplay

594
00:41:10,040 --> 00:41:12,650
between physical reality --

595
00:41:12,650 --> 00:41:14,300
in this case the experiment--

596
00:41:14,300 --> 00:41:17,390
the theoretical model -- in
this case Hooke's law--

597
00:41:17,390 --> 00:41:21,360
and my computational model --
the line I fit to the

598
00:41:21,360 --> 00:41:24,820
experimental data.

599
00:41:24,820 --> 00:41:29,910
OK, let's continue down this
path and I want to look at

600
00:41:29,910 --> 00:41:33,710
another experiment, also with
a spring but this is a

601
00:41:33,710 --> 00:41:36,080
different spring.

602
00:41:36,080 --> 00:41:38,520
Maybe I'll bring in that spring
in the next lecture and

603
00:41:38,520 --> 00:41:39,770
show it to you.

604
00:41:39,770 --> 00:41:41,260
This spring is a
bow and arrow.

605
00:41:41,260 --> 00:41:44,120
Actually the bow
is the spring.

606
00:41:44,120 --> 00:41:47,200
Anyone here ever shot
a bow and arrow?

607
00:41:47,200 --> 00:41:51,260
Well what you know is the
bow has the limbs in it.

608
00:41:51,260 --> 00:41:55,630
And when you pull back the
string, you are putting force

609
00:41:55,630 --> 00:41:58,750
in the limbs, which are
essentially a spring.

610
00:41:58,750 --> 00:42:02,560
And when you release, the spring
goes back to the place

611
00:42:02,560 --> 00:42:07,545
it wants to be and fires the
projectile on some trajectory.

612
00:42:12,760 --> 00:42:18,690
I now am interested in looking
at the trajectory followed by

613
00:42:18,690 --> 00:42:20,840
such a projectile.

614
00:42:20,840 --> 00:42:26,410
This, by the way, is where a
lot of this math came from.

615
00:42:26,410 --> 00:42:29,200
People were looking at
projectiles, not typically of

616
00:42:29,200 --> 00:42:33,390
bows, but of artillery shells,
where the force there was the

617
00:42:33,390 --> 00:42:37,710
force of some chemical
reaction.

618
00:42:37,710 --> 00:42:40,680
OK, so once again I've
got some data.

619
00:42:50,250 --> 00:42:54,880
In a file, similar
kind of format.

620
00:42:54,880 --> 00:42:58,160
And I'm going to read that
data in and plot it.

621
00:42:58,160 --> 00:42:59,460
So let's do that.

622
00:43:10,040 --> 00:43:14,310
So I'm going to get my
trajectory data.

623
00:43:14,310 --> 00:43:18,540
The way I did this, by the way,
is I actually did this

624
00:43:18,540 --> 00:43:19,120
experiment.

625
00:43:19,120 --> 00:43:25,580
I fired four arrows from
different distances and

626
00:43:25,580 --> 00:43:29,980
measured the mean height
of the four.

627
00:43:29,980 --> 00:43:34,720
So I'm getting at heights
1, 2, 3, and 4.

628
00:43:34,720 --> 00:43:36,140
Again, don't worry about this.

629
00:43:36,140 --> 00:43:38,640
And then I'm going
to try some fits.

630
00:43:38,640 --> 00:43:40,000
And let's see what
we get here.

631
00:43:57,770 --> 00:44:05,160
So I got my data inches from
launch point, and inches above

632
00:44:05,160 --> 00:44:06,410
launch point.

633
00:44:08,950 --> 00:44:11,073
And then I fit a line to it.

634
00:44:11,073 --> 00:44:13,600
And you can see there's a little
point way down here in

635
00:44:13,600 --> 00:44:16,480
the corner.

636
00:44:16,480 --> 00:44:19,690
The launch point and the target
were actually at the

637
00:44:19,690 --> 00:44:22,450
same height for this
experiment.

638
00:44:22,450 --> 00:44:26,480
And not surprisingly, the bow
was angled up, I guess, the

639
00:44:26,480 --> 00:44:28,710
arrow went up, and then
it came down, and

640
00:44:28,710 --> 00:44:31,010
ended up in the target.

641
00:44:31,010 --> 00:44:32,580
I fit a line to it.

642
00:44:32,580 --> 00:44:35,890
That's the best line I can
fit to these points.

643
00:44:35,890 --> 00:44:40,300
Well, it's not real good.

644
00:44:40,300 --> 00:44:45,390
So let's pretend I didn't know
anything about projectiles.

645
00:44:45,390 --> 00:44:52,020
I can now use computation to try
and understand the theory.

646
00:44:52,020 --> 00:44:53,570
Assume I didn't know
the theory.

647
00:44:53,570 --> 00:44:56,770
And what the theory tells me
here, or what the computation

648
00:44:56,770 --> 00:45:00,100
tells me, the theory that the
arrow travels in a straight

649
00:45:00,100 --> 00:45:01,440
line is not a very good one.

650
00:45:04,240 --> 00:45:08,150
All right, this does not
actually conform at all to the

651
00:45:08,150 --> 00:45:12,440
data, I probably should reject
this theory that says the

652
00:45:12,440 --> 00:45:14,870
arrow goes straight.

653
00:45:14,870 --> 00:45:17,120
If you looked at the arrows,
by the way, in a short

654
00:45:17,120 --> 00:45:19,310
distance it would kind of look
to your eyes like it was

655
00:45:19,310 --> 00:45:21,340
actually going straight.

656
00:45:21,340 --> 00:45:25,520
But in fact, physics tells us
it can't and the model tells

657
00:45:25,520 --> 00:45:27,670
us it didn't.

658
00:45:27,670 --> 00:45:29,150
All right let's try
a different one.

659
00:45:32,620 --> 00:45:36,970
Let's compare the linear
fit to a quadratic fit.

660
00:45:36,970 --> 00:45:39,985
So now I'm using polyfit
with a degree of 2.

661
00:45:44,530 --> 00:45:45,780
See what we get here.

662
00:45:48,100 --> 00:45:52,770
Well our eyes tell us it's not
a perfect fit, but it's a lot

663
00:45:52,770 --> 00:45:56,430
better fit, right.

664
00:45:56,430 --> 00:46:00,470
So this is suggesting that maybe
the arrow is traveling

665
00:46:00,470 --> 00:46:02,365
in a parabola, rather than
a straight line.

666
00:46:06,840 --> 00:46:10,770
The next question is, our eyes
tell us it's better.

667
00:46:10,770 --> 00:46:13,420
How much better?

668
00:46:13,420 --> 00:46:17,570
How do we go about measuring
which fit is better?

669
00:46:21,330 --> 00:46:25,370
Recall that we started by saying
what polyfit is doing

670
00:46:25,370 --> 00:46:29,230
is minimizing the mean
square error.

671
00:46:29,230 --> 00:46:32,090
So one way to compare two fits
would be to say what's the

672
00:46:32,090 --> 00:46:34,600
mean square error of the line?

673
00:46:34,600 --> 00:46:37,570
What's the mean square error
of the parabola?

674
00:46:37,570 --> 00:46:39,860
Well, pretty clear
it's going to be

675
00:46:39,860 --> 00:46:42,470
smaller for the parabola.

676
00:46:42,470 --> 00:46:46,930
So that would tell us OK
it is a better fit.

677
00:46:46,930 --> 00:46:52,790
And in fact computing the mean
square error is a good way to

678
00:46:52,790 --> 00:46:57,380
compare the fit of two
different curves.
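A comparison along those lines might be sketched like this, with made-up trajectory-shaped numbers (the helper name is mine, not the lecture's):

```python
import numpy as np

def meanSquaredError(measured, estimated):
    """Average squared difference between the data and a model's estimates."""
    measured = np.asarray(measured)
    estimated = np.asarray(estimated)
    return ((estimated - measured) ** 2).mean()

# Hypothetical trajectory-shaped data: the arrow rises, then falls.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 3.1, 4.2, 3.0, 0.1])

lineEst = np.polyval(np.polyfit(x, y, 1), x)   # best-fit line
parabEst = np.polyval(np.polyfit(x, y, 2), x)  # best-fit parabola
```

On data shaped like this, the parabola's mean squared error comes out smaller than the line's, which is exactly the comparison being described.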

679
00:46:57,380 --> 00:47:02,810
On the other hand, it's not
particularly useful for

680
00:47:02,810 --> 00:47:07,720
telling us the goodness of the
fit in absolute terms.

681
00:47:07,720 --> 00:47:10,220
So I can tell you that the
parabola is better than the

682
00:47:10,220 --> 00:47:15,610
line, but in some sense mean
square error can't be used to

683
00:47:15,610 --> 00:47:19,075
tell me how good it is
in an absolute sense.

684
00:47:21,880 --> 00:47:24,070
Why is that so?

685
00:47:24,070 --> 00:47:27,610
It's because mean
square error --

686
00:47:27,610 --> 00:47:31,400
there's a lower bound 0, but
there's no upper bound.

687
00:47:34,950 --> 00:47:37,920
It can go arbitrarily high.

688
00:47:37,920 --> 00:47:41,250
And that is not so good for
something where we're trying

689
00:47:41,250 --> 00:47:45,160
to measure things.

690
00:47:45,160 --> 00:47:48,880
So instead, what we typically
use is something called the

691
00:47:48,880 --> 00:47:50,215
coefficient of determination.

692
00:48:09,450 --> 00:48:11,620
Usually written, for
reasons you'll see

693
00:48:11,620 --> 00:48:12,970
shortly, as r squared.

694
00:48:18,720 --> 00:48:22,940
So the coefficient of
determination, R-squared, is

695
00:48:22,940 --> 00:48:36,100
equal to 1 minus the estimated
error EE over MV, which is the

696
00:48:36,100 --> 00:48:39,570
variance in the measured data.

697
00:48:39,570 --> 00:48:43,200
So we're comparing the ratio
of the estimated error, our

698
00:48:43,200 --> 00:48:47,860
best estimate of the error,
and a measurement of how

699
00:48:47,860 --> 00:48:50,970
variable the data is
to start with.

700
00:48:58,440 --> 00:49:03,010
As we'll see, this value is
always less than 1, less than

701
00:49:03,010 --> 00:49:06,650
or equal to 1, and therefore
R-squared is always going to

702
00:49:06,650 --> 00:49:10,260
be between 0 and 1.

703
00:49:10,260 --> 00:49:13,930
Which gives us a nice way of
thinking about it in an

704
00:49:13,930 --> 00:49:16,980
absolute sense.

705
00:49:16,980 --> 00:49:20,920
All right, so where
are these values?

706
00:49:20,920 --> 00:49:22,920
How do we compute them?

707
00:49:22,920 --> 00:49:27,570
Well, I'm going to explain it
the easiest way I know, which

708
00:49:27,570 --> 00:49:29,195
is by showing you the code.

709
00:49:33,100 --> 00:49:37,450
So I have the measured values
and the estimated values.

710
00:49:37,450 --> 00:49:43,550
The estimated error
is going to be--

711
00:49:43,550 --> 00:49:49,240
I take estimated value, the
value given me by the model,

712
00:49:49,240 --> 00:49:51,710
subtract the measured value,
and square it and

713
00:49:51,710 --> 00:49:52,960
then I just sum them.

714
00:49:55,940 --> 00:49:58,410
All right, this is like what
we looked at for the mean

715
00:49:58,410 --> 00:50:01,960
square error, but I'm not
computing the mean, right.

716
00:50:01,960 --> 00:50:06,400
I'm getting the total of
the estimated errors.

717
00:50:06,400 --> 00:50:10,910
I can then get the measured
mean, which is the measured

718
00:50:10,910 --> 00:50:15,760
sum, divided by the length
of the measurement.

719
00:50:15,760 --> 00:50:19,060
That gives me the mean
of the measured data.

720
00:50:19,060 --> 00:50:22,740
And then my measured variance is
going to be the mean of the

721
00:50:22,740 --> 00:50:30,480
measured data minus each point
of the measured data squared,

722
00:50:30,480 --> 00:50:31,730
and then summing that.

723
00:50:34,340 --> 00:50:36,880
So just as we looked at before
when we looked at the

724
00:50:36,880 --> 00:50:40,230
coefficient of variation, and
standard deviation, by

725
00:50:40,230 --> 00:50:44,210
comparing how far things stray
from the mean, that tells us

726
00:50:44,210 --> 00:50:47,380
how much variance there
is in the data.

727
00:50:47,380 --> 00:50:50,440
And then I'll return
1 minus that.
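Following that description, the function might look like this (a sketch; the variable names are mine):

```python
def rSquared(measured, estimated):
    """Coefficient of determination: R**2 = 1 - EE/MV.

    EE: total squared error between the model's estimates and the data
        (summed, not averaged).
    MV: total squared deviation of the measured data from its own mean,
        a measure of how variable the data is to start with.
    """
    estimatedError = sum((e - m) ** 2 for m, e in zip(measured, estimated))
    measuredMean = sum(measured) / float(len(measured))
    measuredVariability = sum((m - measuredMean) ** 2 for m in measured)
    return 1 - estimatedError / measuredVariability
```

A perfect fit gives 1, and a model no better than always predicting the mean gives 0.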

728
00:50:50,440 --> 00:50:55,600
OK, Tuesday we'll go look
at this in more detail.

729
00:50:55,600 --> 00:50:56,850
Thank you.