1
00:00:00,030 --> 00:00:02,430
The following content is
provided under a Creative

2
00:00:02,430 --> 00:00:03,830
Commons license.

3
00:00:03,830 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,860 --> 00:00:10,560
offer high-quality educational
resources for free.

5
00:00:10,560 --> 00:00:13,410
To make a donation or view
additional materials from

6
00:00:13,410 --> 00:00:17,190
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:17,190 --> 00:00:18,440
ocw.mit.edu.

8
00:00:22,530 --> 00:00:23,030
PROFESSOR: OK.

9
00:00:23,030 --> 00:00:26,790
So today we're going to continue
on with some of the

10
00:00:26,790 --> 00:00:28,260
design patterns that we started

11
00:00:28,260 --> 00:00:30,040
talking about last week.

12
00:00:30,040 --> 00:00:34,780
So to recap, there are really
four common steps to taking a

13
00:00:34,780 --> 00:00:36,380
program and then parallelizing
it.

14
00:00:36,380 --> 00:00:39,480
Often you're starting off with
a program that's designed or

15
00:00:39,480 --> 00:00:41,060
written in a sequential
manner.

16
00:00:41,060 --> 00:00:44,420
And what you want to do is find
tasks in the program --

17
00:00:44,420 --> 00:00:47,230
and these are sort of
independent work pieces that

18
00:00:47,230 --> 00:00:49,260
you are going to be able
to decompose from

19
00:00:49,260 --> 00:00:50,950
your sequential code.

20
00:00:50,950 --> 00:00:52,520
You're going to group
tasks together

21
00:00:52,520 --> 00:00:54,930
into threads or processes.

22
00:00:54,930 --> 00:00:57,480
And then you'll essentially map
each one of these threads

23
00:00:57,480 --> 00:00:59,640
or processes down to the
actual hardware.

24
00:00:59,640 --> 00:01:02,200
And that will get you,
eventually when these programs

25
00:01:02,200 --> 00:01:04,500
run, the concurrency
and the performance

26
00:01:04,500 --> 00:01:05,750
speedups that you want.

27
00:01:08,680 --> 00:01:12,120
So as a reminder of what I
talked about last week in

28
00:01:12,120 --> 00:01:15,000
terms of finding the task or
finding the concurrency, you

29
00:01:15,000 --> 00:01:16,920
start off with an application.

30
00:01:16,920 --> 00:01:19,130
You come up with a block
level diagram.

31
00:01:19,130 --> 00:01:23,120
And from that you sort of try
to understand where the time

32
00:01:23,120 --> 00:01:25,520
is spent in the computations
and what are some typical

33
00:01:25,520 --> 00:01:27,860
patterns for how the
computations are carried out.

34
00:01:27,860 --> 00:01:31,680
So we talked about task
decomposition or sort of

35
00:01:31,680 --> 00:01:34,115
independent tasks or tasks that
might be different that

36
00:01:34,115 --> 00:01:35,540
the application is
carrying out.

37
00:01:35,540 --> 00:01:38,310
So in the MPEG encoder, we
looked at decoding the motion

38
00:01:38,310 --> 00:01:42,960
vectors for temporal compression
versus spatial

39
00:01:42,960 --> 00:01:43,470
compression.

40
00:01:43,470 --> 00:01:48,050
It does sort of substantially
different work.

41
00:01:48,050 --> 00:01:49,820
We talked about data
decomposition.

42
00:01:49,820 --> 00:01:53,690
So if you're doing a process
-- so if you have some work

43
00:01:53,690 --> 00:01:56,740
that's really consuming a large
chunk of data, and you

44
00:01:56,740 --> 00:01:59,160
realize that it's applying the
same kind of work to each of

45
00:01:59,160 --> 00:02:02,640
those data pieces, then you can
partition your data into

46
00:02:02,640 --> 00:02:06,480
smaller subsets and apply
the same function

47
00:02:06,480 --> 00:02:07,730
over and over again.

48
00:02:13,970 --> 00:02:16,860
So in the motion compensation
phase, that's one example

49
00:02:16,860 --> 00:02:19,590
where you can replicate the
function and split up the data

50
00:02:19,590 --> 00:02:22,230
stream in different ways
and have these

51
00:02:22,230 --> 00:02:23,920
tasks proceed in parallel.

52
00:02:23,920 --> 00:02:25,250
So that's data decomposition.

53
00:02:25,250 --> 00:02:28,120
And then we talked a little
bit about sort of making a

54
00:02:28,120 --> 00:02:29,880
case for a pipeline
decomposition.

55
00:02:29,880 --> 00:02:32,530
So you have a data assembly
line or producer-consumer

56
00:02:32,530 --> 00:02:35,270
chains, and you essentially
want to recognize those in

57
00:02:35,270 --> 00:02:39,140
your computation and make it so
that you can exploit them

58
00:02:39,140 --> 00:02:41,220
eventually when you're
doing your mapping

59
00:02:41,220 --> 00:02:44,030
down to actual hardware.

60
00:02:44,030 --> 00:02:46,210
But what does it mean for
two tasks to actually be

61
00:02:46,210 --> 00:02:47,700
concurrent?

62
00:02:47,700 --> 00:02:49,980
And how do you know that you
can safely actually run two

63
00:02:49,980 --> 00:02:51,230
tasks in parallel?

64
00:02:51,230 --> 00:02:54,320
So there's something I crudely
went over last time.

65
00:02:54,320 --> 00:02:58,180
So as to make it more concrete,
highlighting

66
00:02:58,180 --> 00:03:01,960
Bernstein's condition, which
says that given two tasks, if

67
00:03:01,960 --> 00:03:06,630
the input set to one task is
different from or does not

68
00:03:06,630 --> 00:03:11,530
intersect with the output set
of another, and vice versa,

69
00:03:11,530 --> 00:03:13,910
and neither task sort of
updates the same data

70
00:03:13,910 --> 00:03:17,660
structures in memory, then
there's really no dependency

71
00:03:17,660 --> 00:03:18,470
issues between them.

72
00:03:18,470 --> 00:03:21,650
You can run them safely
in parallel.
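
Stated as sets (this is the standard formulation of Bernstein's condition, with R and W denoting a task's read and write sets): tasks T1 and T2 can safely run in parallel when R(T1) ∩ W(T2) = ∅, W(T1) ∩ R(T2) = ∅, and W(T1) ∩ W(T2) = ∅.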

73
00:03:21,650 --> 00:03:26,150
So task T1 and T2, if all the
data that's consumed by T1, so

74
00:03:26,150 --> 00:03:29,840
all the data elements that are
read by T1 are different from

75
00:03:29,840 --> 00:03:33,020
the ones that are written by
T2, then you have --

76
00:03:33,020 --> 00:03:37,130
you know, if T2 is running in
parallel, there's really no

77
00:03:37,130 --> 00:03:40,240
problem with T1 because
it's updating the

78
00:03:40,240 --> 00:03:41,360
orthogonal data set.

79
00:03:41,360 --> 00:03:45,060
Similarly for T2 and T1, any
outputs are different.

80
00:03:45,060 --> 00:03:49,480
So as an example, let's say
you have two tasks.

81
00:03:49,480 --> 00:03:52,240
In T1 you're doing some
basic statements.

82
00:03:52,240 --> 00:03:54,850
And these could be essentially
more coarse grained.

83
00:03:54,850 --> 00:03:56,580
There could be a lot more
computation in here.

84
00:03:56,580 --> 00:04:00,390
I just simplified it for
the illustration.

85
00:04:00,390 --> 00:04:02,980
So you have task a
equals x plus y.

86
00:04:02,980 --> 00:04:06,300
And task two does b
equals x plus z.

87
00:04:06,300 --> 00:04:09,660
So if we look at the read set
for T1, these are all the

88
00:04:09,660 --> 00:04:13,170
variables or data structures
or addresses these that are

89
00:04:13,170 --> 00:04:14,490
read by the first task.

90
00:04:14,490 --> 00:04:16,320
So that's x and y here.

91
00:04:16,320 --> 00:04:19,480
And all the data that's written
or produced by T1.

92
00:04:19,480 --> 00:04:22,320
So here we're just producing
one data value.

93
00:04:22,320 --> 00:04:24,830
And that's going into
location A.

94
00:04:24,830 --> 00:04:26,670
Similarly we can come up
with the read set and

95
00:04:26,670 --> 00:04:28,510
write set for T2.

96
00:04:28,510 --> 00:04:32,220
And so that's shown on here.

97
00:04:32,220 --> 00:04:35,530
So we have -- task T2 has x
plus z in its read set.

98
00:04:35,530 --> 00:04:37,440
And it produces one
data value, b.

99
00:04:37,440 --> 00:04:40,375
If we take the intersection of
the read and write sets for

100
00:04:40,375 --> 00:04:43,550
the different tasks,
then they're empty.

101
00:04:43,550 --> 00:04:45,440
I read something completely
different than what's produced

102
00:04:45,440 --> 00:04:47,510
in this task and vice versa.

103
00:04:47,510 --> 00:04:49,190
And they write to
two completely

104
00:04:49,190 --> 00:04:50,530
different memory locations.

105
00:04:50,530 --> 00:04:53,020
So I can essentially parallelize
these or run these

106
00:04:53,020 --> 00:04:54,810
two tasks in parallel.
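
As a rough illustration of the example just worked through (the OpenMP framing and the wrapper function are my own sketch, not the lecture's code), the two tasks below have disjoint write sets, and neither one writes anything the other reads, so they satisfy Bernstein's condition and can run as parallel sections:

    #include <omp.h>

    /* Sketch only: Read(T1) = {x, y}, Write(T1) = {a};
       Read(T2) = {x, z}, Write(T2) = {b}.
       All the relevant intersections are empty, so the two
       tasks may execute concurrently. */
    void run_tasks(int x, int y, int z, int *a, int *b)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            { *a = x + y; }   /* task T1 */

            #pragma omp section
            { *b = x + z; }   /* task T2 */
        }
    }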

107
00:04:54,810 --> 00:04:58,260
So you can extend
this analysis.

108
00:04:58,260 --> 00:05:01,470
And compilers can actually use
this condition to determine

109
00:05:01,470 --> 00:05:03,300
when two tasks can be
parallelized if you're doing

110
00:05:03,300 --> 00:05:05,380
automatic parallelization.

111
00:05:05,380 --> 00:05:07,960
And you'll probably hear more
about these later on.

112
00:05:10,680 --> 00:05:13,990
And so what I focused on last
time were the finding

113
00:05:13,990 --> 00:05:16,530
concurrency patterns.

114
00:05:16,530 --> 00:05:20,630
And I had identified sort of
four design spaces based on

115
00:05:20,630 --> 00:05:24,310
the work that's outlined in the
book by Mattson, Sanders,

116
00:05:24,310 --> 00:05:26,160
and Massingill.

117
00:05:26,160 --> 00:05:31,750
And so starting with two
large sort of concepts.

118
00:05:31,750 --> 00:05:35,210
The first helps you figure out
how you're going to actually

119
00:05:35,210 --> 00:05:36,310
express your algorithm.

120
00:05:36,310 --> 00:05:38,500
So first you find your
concurrency and then you

121
00:05:38,500 --> 00:05:39,640
organize in some way.

122
00:05:39,640 --> 00:05:42,240
And so we're going to talk about
that in more detail.

123
00:05:42,240 --> 00:05:44,790
And then once you've organized
your tasks in some way that

124
00:05:44,790 --> 00:05:47,800
actually expresses your overall
computation, you need

125
00:05:47,800 --> 00:05:52,670
some software construction
utilities or data structures

126
00:05:52,670 --> 00:05:55,810
or mechanisms for actually
orchestrating computations for

127
00:05:55,810 --> 00:05:57,190
which they have also abstracted

128
00:05:57,190 --> 00:05:58,560
out some common patterns.

129
00:05:58,560 --> 00:06:01,840
And so I'll briefly talk
about these as well.

130
00:06:01,840 --> 00:06:03,870
And so on your algorithm
expression side, these are

131
00:06:03,870 --> 00:06:07,740
essentially conceptualization
steps that help you abstract

132
00:06:07,740 --> 00:06:08,500
out your problem.

133
00:06:08,500 --> 00:06:12,490
And you may in fact think
about your algorithm

134
00:06:12,490 --> 00:06:14,710
expression in different ways to
expose different kinds of

135
00:06:14,710 --> 00:06:17,960
concurrency or to be able to
explore different ways of

136
00:06:17,960 --> 00:06:20,610
mapping the concurrency
to hardware.

137
00:06:20,610 --> 00:06:23,540
And so for construction it's
more about actual engineering

138
00:06:23,540 --> 00:06:25,200
and implementation.

139
00:06:25,200 --> 00:06:26,550
So here you're actually thinking
about what do the

140
00:06:26,550 --> 00:06:28,210
data structures look like?

141
00:06:28,210 --> 00:06:30,050
What is the communication
pattern going to look like?

142
00:06:30,050 --> 00:06:32,660
Am I going to use things
like MPI or OpenMP?

143
00:06:32,660 --> 00:06:36,270
What does that help me with
in terms of doing my

144
00:06:36,270 --> 00:06:37,520
implementation?

145
00:06:40,320 --> 00:06:42,800
So given a collection of
concurrent tasks -- so you've

146
00:06:42,800 --> 00:06:46,080
done your first step in your
four design patterns.

147
00:06:46,080 --> 00:06:47,510
You know, what is
your next step?

148
00:06:47,510 --> 00:06:49,390
And that's really mapping
those tasks that you've

149
00:06:49,390 --> 00:06:52,600
identified down to some sort
of execution units.

150
00:06:52,600 --> 00:06:54,660
So threads are very common.

151
00:06:54,660 --> 00:06:56,970
This is essentially what we've
been using on Cell.

152
00:06:56,970 --> 00:06:59,960
We take our computation and we
wrap it into SPE threads and

153
00:06:59,960 --> 00:07:03,500
then we can execute
those at run time.

154
00:07:03,500 --> 00:07:05,900
So some things to keep in mind
-- although you shouldn't over

155
00:07:05,900 --> 00:07:09,600
constrain yourself in terms
of these considerations.

156
00:07:09,600 --> 00:07:12,150
What is the magnitude of
your parallelism that

157
00:07:12,150 --> 00:07:12,580
you're going to get?

158
00:07:12,580 --> 00:07:15,030
You know, do you want hundreds
or thousands of threads?

159
00:07:15,030 --> 00:07:18,630
Or do you want something
on the order of tens?

160
00:07:18,630 --> 00:07:22,030
And this is because you don't
want to overwhelm the intended

161
00:07:22,030 --> 00:07:23,540
system that you're
going to run on.

162
00:07:23,540 --> 00:07:27,010
So we talked about yesterday on
Cell processor, if you're

163
00:07:27,010 --> 00:07:29,960
creating a lot more than six
threads, then you can create

164
00:07:29,960 --> 00:07:32,660
problems or you essentially
don't get extra parallelism

165
00:07:32,660 --> 00:07:35,710
because each thread is running
to completion on each SPE.

166
00:07:35,710 --> 00:07:38,140
And context switch overhead
is extremely high.

167
00:07:38,140 --> 00:07:41,010
So you don't want to spend too
much engineering cost to come

168
00:07:41,010 --> 00:07:42,960
up with an algorithm
implementation that's

169
00:07:42,960 --> 00:07:45,940
massively scalable to hundreds
or thousands of threads when

170
00:07:45,940 --> 00:07:47,670
you can't actually exploit it.

171
00:07:47,670 --> 00:07:49,940
But that doesn't mean that you
should over constrain your

172
00:07:49,940 --> 00:07:53,000
implementation to where if now
I want to take your code and

173
00:07:53,000 --> 00:07:55,900
run it on a different machine, I
essentially have to redesign

174
00:07:55,900 --> 00:07:58,410
or re-engineer the
complete process.

175
00:07:58,410 --> 00:08:00,646
So you want to avoid tendencies
to over constrain

176
00:08:00,646 --> 00:08:02,000
the implementation.

177
00:08:02,000 --> 00:08:04,340
And you want to leave your code
in a way that's malleable

178
00:08:04,340 --> 00:08:07,580
so that you can easily make
changes to sort of factor in

179
00:08:07,580 --> 00:08:10,340
new platforms that you want
to run on or new machine

180
00:08:10,340 --> 00:08:14,500
architecture features that you
might want to exploit.

181
00:08:14,500 --> 00:08:17,670
So there are three major
organization principles I'm

182
00:08:17,670 --> 00:08:20,060
going to talk about.

183
00:08:20,060 --> 00:08:22,280
And none of these should be sort
of foreign to you at this

184
00:08:22,280 --> 00:08:24,950
point because we've talked about
them in different ways

185
00:08:24,950 --> 00:08:28,510
in the recitations or in
previous lectures.

186
00:08:28,510 --> 00:08:31,240
And it's really, what is it that
determines sort of the

187
00:08:31,240 --> 00:08:33,290
algorithm structure based on the
set of tasks that you're

188
00:08:33,290 --> 00:08:36,070
actually carrying out
in your computation?

189
00:08:36,070 --> 00:08:38,180
And so there's the principle
that says,

190
00:08:38,180 --> 00:08:40,450
organize things by tasks.

191
00:08:40,450 --> 00:08:42,030
I'm going to talk to that.

192
00:08:42,030 --> 00:08:43,980
And then there's a principle
that says, well, organize

193
00:08:43,980 --> 00:08:46,700
things by how you're doing
the data decomposition.

194
00:08:46,700 --> 00:08:50,090
So in this case how you're
actually distributing the data

195
00:08:50,090 --> 00:08:53,260
or how the data is laid out
in memory, or how you're

196
00:08:53,260 --> 00:08:55,510
partitioning the data to
actually compute on it

197
00:08:55,510 --> 00:08:58,330
dictates how you should actually
organize your actual

198
00:08:58,330 --> 00:08:59,420
computation.

199
00:08:59,420 --> 00:09:01,780
And then there's organize
by flow of data.

200
00:09:01,780 --> 00:09:04,870
And this is something you'll
hear about more in the next

201
00:09:04,870 --> 00:09:06,540
lecture where we're talking
about streaming.

202
00:09:06,540 --> 00:09:12,210
But in this pattern if there
are specific sort of

203
00:09:12,210 --> 00:09:16,230
computations that take advantage
of high bandwidth

204
00:09:16,230 --> 00:09:18,560
flow of data between
computations, you might want

205
00:09:18,560 --> 00:09:20,520
to exploit that for
concurrency.

206
00:09:20,520 --> 00:09:23,540
And we'll talk about
that as well.

207
00:09:23,540 --> 00:09:24,100
OK.

208
00:09:24,100 --> 00:09:27,070
So here's a design diagram
for how you can

209
00:09:27,070 --> 00:09:28,660
actually go through this process.

210
00:09:28,660 --> 00:09:31,250
So you can ask yourself
a set of questions.

211
00:09:31,250 --> 00:09:35,425
If I want to organize things
by tasks, then there are

212
00:09:35,425 --> 00:09:38,530
essentially two main clusters or
two main computations, two

213
00:09:38,530 --> 00:09:40,010
main patterns.

214
00:09:40,010 --> 00:09:43,700
If the code is recursive, then
you essentially want to apply

215
00:09:43,700 --> 00:09:47,220
a divide and conquer pattern
or divide and conquer

216
00:09:47,220 --> 00:09:48,430
organization.

217
00:09:48,430 --> 00:09:49,780
If it's not recursive,

218
00:09:49,780 --> 00:09:54,990
then you essentially want
to do task parallelism.

219
00:09:54,990 --> 00:09:56,400
So in task parallelism --

220
00:09:56,400 --> 00:09:58,620
you know, I've listed
two examples here.

221
00:09:58,620 --> 00:09:59,850
But really any of the things
that we've talked

222
00:09:59,850 --> 00:10:01,810
about in the past fit.

223
00:10:01,810 --> 00:10:03,940
Ray computation, ray tracing.

224
00:10:03,940 --> 00:10:05,950
So here you're shooting rays
through a scene to try to

225
00:10:05,950 --> 00:10:09,420
determine how to render it.

226
00:10:09,420 --> 00:10:11,950
And really each ray is a
separate and independent

227
00:10:11,950 --> 00:10:13,570
computation step.

228
00:10:13,570 --> 00:10:16,070
In molecular dynamics you're
trying to determine the

229
00:10:16,070 --> 00:10:17,660
non-bonded force calculations.

230
00:10:17,660 --> 00:10:20,160
There are some dependencies,
but really you can do each

231
00:10:20,160 --> 00:10:23,340
calculation for one molecule
or for one atom

232
00:10:23,340 --> 00:10:25,080
independent of any other.

233
00:10:25,080 --> 00:10:27,600
And then there are sort of the
global dependence of having to

234
00:10:27,600 --> 00:10:30,530
update or communicate across all
those molecules that sort

235
00:10:30,530 --> 00:10:33,350
of reflect new positions
in the system.

236
00:10:33,350 --> 00:10:38,500
So the common factors here are
your tasks are associated with

237
00:10:38,500 --> 00:10:39,850
iterations of a loop.

238
00:10:39,850 --> 00:10:43,460
And you can distribute, you
know, each process --

239
00:10:43,460 --> 00:10:44,760
each processor can
do a different

240
00:10:44,760 --> 00:10:47,150
iteration of the loop.

241
00:10:47,150 --> 00:10:51,170
And often you know sort of what
the tasks are before you

242
00:10:51,170 --> 00:10:52,850
actually start your
computation.

243
00:10:52,850 --> 00:10:55,550
Although in some cases, like
in ray tracing, you might

244
00:10:55,550 --> 00:10:59,040
generate more and more threads
as you go along, or more and

245
00:10:59,040 --> 00:11:01,130
more computations because as
the ray is shooting off,

246
00:11:01,130 --> 00:11:05,680
you're calculating
new reflections.

247
00:11:05,680 --> 00:11:08,800
And that creates sort
of extra work.

248
00:11:08,800 --> 00:11:11,330
But largely you have these
independent tasks that you can

249
00:11:11,330 --> 00:11:14,680
encapsulate in threads
and you run them.

250
00:11:14,680 --> 00:11:17,650
And this is sort of -- it might
appear subtle, but there

251
00:11:17,650 --> 00:11:20,730
are algorithm classes where not
all tasks essentially need

252
00:11:20,730 --> 00:11:23,160
to complete for you to
arrive at a solution.

253
00:11:23,160 --> 00:11:24,462
You know, in some cases
you might converge to

254
00:11:24,462 --> 00:11:25,960
an acceptable solution.

255
00:11:25,960 --> 00:11:31,240
And you don't actually need to
go through and exercise all

256
00:11:31,240 --> 00:11:33,630
the computation that's
outstanding for you to say the

257
00:11:33,630 --> 00:11:34,760
program is done.

258
00:11:34,760 --> 00:11:36,230
So there will be a
tricky issue --

259
00:11:36,230 --> 00:11:38,700
I'll revisit this just briefly
later on -- is how do you

260
00:11:38,700 --> 00:11:40,280
determine if your program
has actually

261
00:11:40,280 --> 00:11:43,300
terminated or has completed?

262
00:11:43,300 --> 00:11:47,910
In divide and conquer, this is
really for recursive programs.

263
00:11:47,910 --> 00:11:51,340
You know, you can think of a
well-known sorting algorithm,

264
00:11:51,340 --> 00:11:53,900
merge sort, that classically
fits into this kind of

265
00:11:53,900 --> 00:11:57,350
picture, where you have some
really large array of data

266
00:11:57,350 --> 00:11:58,280
that you want to sort.

267
00:11:58,280 --> 00:12:01,330
You keep subdividing into
smaller and smaller chunks

268
00:12:01,330 --> 00:12:03,610
until you can do local
reorderings.

269
00:12:03,610 --> 00:12:06,320
And then you start merging
things together.

270
00:12:06,320 --> 00:12:09,460
So this gives you sort of a way
to take a problem, divide

271
00:12:09,460 --> 00:12:12,620
it into subproblems. And then
you can split the data at some

272
00:12:12,620 --> 00:12:13,990
point and then you join
it back together.

273
00:12:13,990 --> 00:12:15,540
You merge it.
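
A bare-bones sketch of that split/solve/join shape (a sum is used here instead of the full merge sort so the example stays self-contained; it is my illustration, not the lecture's code): the two recursive calls touch disjoint halves of the array, so they could be forked as parallel tasks and joined afterwards.

    /* Divide and conquer over a[lo..hi), assuming hi > lo.
       The two recursive calls are independent, so they could
       run as forked tasks; the addition at the end is the join. */
    long dc_sum(const int *a, int lo, int hi)
    {
        if (hi - lo == 1)
            return a[lo];                     /* base case */
        int mid = lo + (hi - lo) / 2;         /* split */
        long left  = dc_sum(a, lo, mid);      /* could be a child task */
        long right = dc_sum(a, mid, hi);
        return left + right;                  /* join the subresults */
    }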

274
00:12:15,540 --> 00:12:20,160
You might see things like fork
and merge or fork and join

275
00:12:20,160 --> 00:12:22,330
used instead of split
and join.

276
00:12:22,330 --> 00:12:25,790
I've used the terminology that
sort of melds well with some

277
00:12:25,790 --> 00:12:28,210
of the concepts we use in
streaming that you'll see in

278
00:12:28,210 --> 00:12:29,780
the next lecture.

279
00:12:29,780 --> 00:12:32,580
And so in these kinds of
programs, it's not always the

280
00:12:32,580 --> 00:12:35,510
case that each subproblem will
have essentially the same

281
00:12:35,510 --> 00:12:36,710
amount of work to do.

282
00:12:36,710 --> 00:12:41,050
You might need more dynamic load
balancing because each

283
00:12:41,050 --> 00:12:42,390
subproblem --

284
00:12:42,390 --> 00:12:45,210
how you distribute the data
might lead you to do more work

285
00:12:45,210 --> 00:12:47,720
in one problem than
in the other.

286
00:12:47,720 --> 00:12:50,950
So as opposed to some of the
other mechanisms where static

287
00:12:50,950 --> 00:12:53,220
load balancing will work
really well --

288
00:12:53,220 --> 00:12:56,340
and to remind you, static load
balancing essentially says,

289
00:12:56,340 --> 00:12:58,610
you have some work, you assign
it to each of the processors.

290
00:12:58,610 --> 00:13:02,180
And you're going to be
relatively happy with how each

291
00:13:02,180 --> 00:13:04,000
processor's sort of
utilization is

292
00:13:04,000 --> 00:13:05,240
going to be over time.

293
00:13:05,240 --> 00:13:07,430
Nobody's going to be too
overwhelmed with the amount of

294
00:13:07,430 --> 00:13:08,830
work they have to do.

295
00:13:08,830 --> 00:13:11,630
In this case, you might end up
with needing some things for

296
00:13:11,630 --> 00:13:14,580
dynamic load balancing that
says, I'm unhappy with the

297
00:13:14,580 --> 00:13:16,700
work performance
or utilization.

298
00:13:16,700 --> 00:13:19,140
Some processors are more idle
than the others, so you might

299
00:13:19,140 --> 00:13:21,570
want to essentially redistribute
things.

300
00:13:21,570 --> 00:13:24,060
So what we'll talk about --
you know, how does this

301
00:13:24,060 --> 00:13:29,310
concept of divide and conquer
parallelization pattern work

302
00:13:29,310 --> 00:13:30,910
into the actual implementation?

303
00:13:30,910 --> 00:13:35,310
You know, how do I actually
implement a divide and conquer

304
00:13:35,310 --> 00:13:37,060
organization?

305
00:13:37,060 --> 00:13:40,590
The next organization is
organized by data.

306
00:13:40,590 --> 00:13:43,450
So here you have some
computation --

307
00:13:43,450 --> 00:13:44,370
not sure why it's flickering.

308
00:13:44,370 --> 00:13:45,398
AUDIENCE: Check your --

309
00:13:45,398 --> 00:13:47,455
maybe your VGA cables
aren't in good.

310
00:13:50,540 --> 00:14:00,450
PROFESSOR: So in the organize by
data, you essentially want

311
00:14:00,450 --> 00:14:03,400
to apply this if you have a
lot of computation that's

312
00:14:03,400 --> 00:14:06,040
using a shared global data
structure or that's going to

313
00:14:06,040 --> 00:14:08,590
update a central
data structure.

314
00:14:08,590 --> 00:14:11,940
So in molecular dynamics, for
example, you have a huge array

315
00:14:11,940 --> 00:14:15,150
that records the position of
each of the molecules.

316
00:14:15,150 --> 00:14:17,310
And while you can do the
coarse calculations

317
00:14:17,310 --> 00:14:21,160
independently, eventually all
the parallel tasks have to

318
00:14:21,160 --> 00:14:23,700
communicate with the central
data structure and say, here

319
00:14:23,700 --> 00:14:25,980
are the new locations for
all the molecules.

320
00:14:25,980 --> 00:14:29,070
And so that has to go into
a central repository.

321
00:14:29,070 --> 00:14:32,770
And there are different kinds
of sort of decompositions

322
00:14:32,770 --> 00:14:36,160
within this organization.

323
00:14:36,160 --> 00:14:40,650
If your data structure is
recursive, so a linked list or a

324
00:14:40,650 --> 00:14:42,750
tree or a graph, then
you can apply the

325
00:14:42,750 --> 00:14:44,630
recursive data pattern.

326
00:14:44,630 --> 00:14:48,250
If it's not, if it's linear,
like an array or a vector,

327
00:14:48,250 --> 00:14:50,450
then you apply geometric
decomposition.

328
00:14:50,450 --> 00:14:53,560
And you've essentially seen
geometric decomposition.

329
00:14:53,560 --> 00:14:56,990
These were some of the labs
that you've already done.

330
00:14:56,990 --> 00:14:59,850
And so the example from
yesterday's recitation, you're

331
00:14:59,850 --> 00:15:04,070
doing an n-body simulation
terms of who is gravitating

332
00:15:04,070 --> 00:15:06,650
towards who, you're calculating
the forces between

333
00:15:06,650 --> 00:15:07,670
pairs of objects.

334
00:15:07,670 --> 00:15:11,060
And depending on the force that
each object feels, you

335
00:15:11,060 --> 00:15:16,170
calculate a new motion vector.

336
00:15:16,170 --> 00:15:20,160
And you use that to update the
position of each body in your,

337
00:15:20,160 --> 00:15:22,790
say, galaxy that you're
simulating.

338
00:15:22,790 --> 00:15:25,940
And so what we talked about
yesterday was given an array

339
00:15:25,940 --> 00:15:30,050
of positions, each processor
gets a sub-chunk of that

340
00:15:30,050 --> 00:15:31,050
position array.

341
00:15:31,050 --> 00:15:34,280
And it knows how to
calculate sort of

342
00:15:34,280 --> 00:15:35,240
locally, based on that.

343
00:15:35,240 --> 00:15:38,500
And then you might also
communicate local chunks to do

344
00:15:38,500 --> 00:15:39,860
more scalable computations.
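
A small sketch of that geometric decomposition (the function name and the ceiling-division choice are assumptions for illustration, not the lab code): each of nprocs workers owns one contiguous slice of the position array and does the force calculations for just that slice.

    /* Block-partition n bodies across nprocs workers.
       Worker 'rank' owns the index range [*start, *end). */
    void my_block(int n, int nprocs, int rank, int *start, int *end)
    {
        int chunk = (n + nprocs - 1) / nprocs;            /* ceiling division */
        *start = rank * chunk < n ? rank * chunk : n;
        *end   = (rank + 1) * chunk < n ? (rank + 1) * chunk : n;
    }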

345
00:15:43,930 --> 00:15:46,850
And recursive data structures are
a little bit more tricky.

346
00:15:46,850 --> 00:15:49,910
So at face value you might think
that there's really no

347
00:15:49,910 --> 00:15:51,580
kind of parallelism you
can get out of a

348
00:15:51,580 --> 00:15:52,970
recursive data structure.

349
00:15:52,970 --> 00:15:55,310
So if you're iterating over a
list and you want to get the

350
00:15:55,310 --> 00:15:58,040
sum, well, you know, I just need
to go through the list.

351
00:15:58,040 --> 00:16:01,020
Can I really parallelize that?

352
00:16:01,020 --> 00:16:04,720
There are, however,
opportunities where you can

353
00:16:04,720 --> 00:16:07,100
reshape the computation in
a way that exposes the

354
00:16:07,100 --> 00:16:08,130
concurrency.

355
00:16:08,130 --> 00:16:11,220
And often what this comes down
to is you're going to do more

356
00:16:11,220 --> 00:16:16,000
work, but it's OK because you're
going to finish faster.

357
00:16:16,000 --> 00:16:18,930
So this kind of work/concurrency
tradeoff, I'm

358
00:16:18,930 --> 00:16:21,740
going to illustrate
with an example.

359
00:16:21,740 --> 00:16:27,450
So in this application
we have some graphs.

360
00:16:27,450 --> 00:16:29,650
And for each node in
a graph, we want to

361
00:16:29,650 --> 00:16:31,170
know what is its root?

362
00:16:31,170 --> 00:16:34,620
So this works well when you have
a forest where not all

363
00:16:34,620 --> 00:16:37,040
the graphs are connected and
given a node you want to know

364
00:16:37,040 --> 00:16:39,750
who is the root of this graph.

365
00:16:39,750 --> 00:16:42,580
So what we can do is essentially
have more

366
00:16:42,580 --> 00:16:45,700
concurrency by changing the way
we actually think about

367
00:16:45,700 --> 00:16:46,630
the algorithm.

368
00:16:46,630 --> 00:16:49,670
So rather than starting with
each node and then, in a

369
00:16:49,670 --> 00:16:53,000
directed graph, following
its successor --

370
00:16:53,000 --> 00:16:55,340
so this is essentially order
n, because for each node we

371
00:16:55,340 --> 00:16:57,600
have to follow n links --

372
00:16:57,600 --> 00:16:59,580
we can think about it slightly
differently.

373
00:16:59,580 --> 00:17:02,510
So what if rather than finding
the successor and then finding

374
00:17:02,510 --> 00:17:05,560
that successor's successor, at
each computational step we

375
00:17:05,560 --> 00:17:07,040
start with a node and
we say who is

376
00:17:07,040 --> 00:17:09,640
your successor's successor?

377
00:17:09,640 --> 00:17:14,000
So we can converge in this
example in three steps.

378
00:17:14,000 --> 00:17:18,160
So from five to six we can say
who is its successor?

379
00:17:18,160 --> 00:17:21,200
So who is the successor's
successor of five?

380
00:17:21,200 --> 00:17:22,590
And that would be two.

381
00:17:22,590 --> 00:17:25,150
And similarly you can do that
for seven and so on.

382
00:17:25,150 --> 00:17:27,520
And so you keep asking
the question.

383
00:17:27,520 --> 00:17:29,540
So you can distribute all
these data structures,

384
00:17:29,540 --> 00:17:31,970
repeatedly ask these questions
of all these n nodes,

385
00:17:31,970 --> 00:17:34,840
and it leads you to an order log
n solution versus an order

386
00:17:34,840 --> 00:17:36,770
n solution.
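
A minimal sketch of that successor's-successor idea, often called pointer jumping (the array layout is my assumption: succ[i] holds node i's successor and a root points to itself): every node keeps replacing its successor with its successor's successor, and all n updates in one round are independent, so a parallel version converges in about log n rounds.

    /* After O(log n) rounds, succ[i] is the root of node i's tree.
       Each round does O(n) extra work, but the n updates inside a
       round are independent; a real parallel version would write
       them into a separate buffer rather than updating in place. */
    void find_roots(int *succ, int n)
    {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int i = 0; i < n; i++) {     /* parallelizable loop */
                if (succ[i] != succ[succ[i]]) {
                    succ[i] = succ[succ[i]];
                    changed = 1;
                }
            }
        }
    }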

387
00:17:36,770 --> 00:17:39,310
But what have I done
in each step?

388
00:17:39,310 --> 00:17:43,850
Well, I've actually created
more work for myself and I've sort of

389
00:17:43,850 --> 00:17:48,220
increased the amount of work
that I'm doing by order n.

390
00:17:48,220 --> 00:17:49,310
Right there.

391
00:17:49,310 --> 00:17:51,370
Right.

392
00:17:51,370 --> 00:17:52,260
Yes.

393
00:17:52,260 --> 00:17:56,680
Because I'm essentially, for
each node, doing n queries, you

394
00:17:56,680 --> 00:17:58,720
know, who's your successor's
successor?

395
00:17:58,720 --> 00:18:01,460
Whereas in a sequential case,
you know, I just need to do it

396
00:18:01,460 --> 00:18:02,980
once for each node.

397
00:18:02,980 --> 00:18:05,070
And that works really well.

398
00:18:05,070 --> 00:18:07,790
So most strategies based on
this pattern of actually

399
00:18:07,790 --> 00:18:11,390
decomposing your computation
according to a recursive pattern

400
00:18:11,390 --> 00:18:16,860
lead you to doing much more work
or some increase in the

401
00:18:16,860 --> 00:18:18,510
amount of work.

402
00:18:18,510 --> 00:18:20,080
But you win this back
because you can

403
00:18:20,080 --> 00:18:21,590
decrease your execution time.

404
00:18:21,590 --> 00:18:23,260
And so this is a good
tradeoff that you

405
00:18:23,260 --> 00:18:24,990
might want to consider.

406
00:18:24,990 --> 00:18:27,860
AUDIENCE: In the first one
order n was sequential?

407
00:18:27,860 --> 00:18:28,520
PROFESSOR: Yeah, yeah.

408
00:18:28,520 --> 00:18:30,970
It's a typo.

409
00:18:30,970 --> 00:18:32,220
Yeah.

410
00:18:34,630 --> 00:18:39,130
So organize by flow or organize
by flow of data.

411
00:18:39,130 --> 00:18:41,560
And this is essentially
the pipeline model.

412
00:18:41,560 --> 00:18:44,270
And we talked about this again
in some of the recitations in

413
00:18:44,270 --> 00:18:47,960
terms of SPE to SPE
communication.

414
00:18:47,960 --> 00:18:50,370
Or do you want to organize
based on event-based

415
00:18:50,370 --> 00:18:52,440
mechanisms?

416
00:18:52,440 --> 00:18:56,050
So what these really come down
to is, well, how regular is

417
00:18:56,050 --> 00:18:57,750
the flow of data in
your application?

418
00:18:57,750 --> 00:19:01,445
If you have regular, let's say,
one-way flow through a

419
00:19:01,445 --> 00:19:02,940
stable computation path --

420
00:19:02,940 --> 00:19:05,690
so I've set up my sort of
algorithm structure.

421
00:19:05,690 --> 00:19:08,190
Data is flowing through
it at a regular rate.

422
00:19:08,190 --> 00:19:10,990
The computation graph isn't
changing very much.

423
00:19:10,990 --> 00:19:13,490
Then I can essentially pipeline
things really well.

424
00:19:13,490 --> 00:19:15,870
And this could be a linear chain
of computation or it

425
00:19:15,870 --> 00:19:17,040
could be sort of nonlinear.

426
00:19:17,040 --> 00:19:20,780
There could be branches
in the graph.

427
00:19:20,780 --> 00:19:25,740
And I can use that in a way to
exploit pipeline parallelism.

428
00:19:25,740 --> 00:19:29,440
If I don't have sort of this
nice, regular structure, it

429
00:19:29,440 --> 00:19:33,400
could be events that are
created at run time.

430
00:19:33,400 --> 00:19:37,450
So, for example, you're a
car wash attendant and a

431
00:19:37,450 --> 00:19:38,470
new car comes in.

432
00:19:38,470 --> 00:19:42,480
So you have to find a garage to
assign to it and then turn

433
00:19:42,480 --> 00:19:45,180
on the car wash machine.

434
00:19:45,180 --> 00:19:48,540
So if dynamic threads are
created based on sensory input

435
00:19:48,540 --> 00:19:51,100
that comes in, then you might
want to use an events-based

436
00:19:51,100 --> 00:19:52,150
coordination.

437
00:19:52,150 --> 00:19:54,250
You have irregular
computation.

438
00:19:54,250 --> 00:19:56,680
The computation might vary based
on the data that comes

439
00:19:56,680 --> 00:19:58,420
into your system.

440
00:19:58,420 --> 00:20:03,460
And you might have unpredictable
data flow.

441
00:20:03,460 --> 00:20:07,690
So in the pipeline model, the
things to consider is the

442
00:20:07,690 --> 00:20:10,610
pipeline throughput versus
the pipeline latency.

443
00:20:10,610 --> 00:20:13,720
So the amount of concurrency
in a pipeline is really

444
00:20:13,720 --> 00:20:15,090
limited by the number
of stages.

445
00:20:15,090 --> 00:20:16,450
This is nothing new.

446
00:20:16,450 --> 00:20:19,950
You've seen this, for example,
in superscalar pipelines.

447
00:20:19,950 --> 00:20:23,460
And just as in this case, as in
the case of an architecture

448
00:20:23,460 --> 00:20:25,870
pipeline, the amount of time
it takes you to fill the

449
00:20:25,870 --> 00:20:28,160
pipeline and the amount of time
it takes you to drain the

450
00:20:28,160 --> 00:20:29,940
pipeline can essentially
limit your parallelism.

451
00:20:29,940 --> 00:20:32,550
So you want those to be really
small compared to the actual

452
00:20:32,550 --> 00:20:35,930
computation that you spend
in your pipeline.
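
As a back-of-the-envelope version of that tradeoff (the numbers are illustrative, not from the lecture): with S equal-length stages of time t each, pushing N items through takes roughly (S + N - 1) * t instead of the sequential N * S * t, so the speedup is N * S / (S + N - 1), which only approaches S when N is large enough that the fill and drain time is negligible.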

453
00:20:35,930 --> 00:20:38,470
And the performance metric is
usually the throughput.

454
00:20:38,470 --> 00:20:39,790
How much data can you
pump through your

455
00:20:39,790 --> 00:20:43,630
pipeline per unit time?

456
00:20:43,630 --> 00:20:46,150
So in video encoding, you know,
it's the frames per

457
00:20:46,150 --> 00:20:48,130
second that you can produce.

458
00:20:48,130 --> 00:20:50,150
And the pipeline latency,
though, is important,

459
00:20:50,150 --> 00:20:52,760
especially in a real-time
application where you need a

460
00:20:52,760 --> 00:20:54,470
result every 10 milliseconds.

461
00:20:54,470 --> 00:20:57,240
You know, your pacemaker for
example has to produce a beep

462
00:20:57,240 --> 00:21:00,990
or a signal to your heart
at specific rates.

463
00:21:00,990 --> 00:21:03,780
So you need to consider what
is your pipeline throughput

464
00:21:03,780 --> 00:21:05,000
versus your pipeline latency?

465
00:21:05,000 --> 00:21:07,590
And that can actually determine
how many stages you

466
00:21:07,590 --> 00:21:10,250
might want to actually decompose
or organize your

467
00:21:10,250 --> 00:21:12,660
application in.

468
00:21:12,660 --> 00:21:15,790
And in the event-based
coordination, these are

469
00:21:15,790 --> 00:21:19,120
interactions of tasks over
unpredictable intervals.

470
00:21:19,120 --> 00:21:22,920
And you're more prone to sort
of deadlocks in these

471
00:21:22,920 --> 00:21:23,600
applications.

472
00:21:23,600 --> 00:21:26,830
Because you might have cyclic
dependencies where one event

473
00:21:26,830 --> 00:21:29,500
can't proceed until it gets
data from another event.

474
00:21:29,500 --> 00:21:32,450
But it can't proceed until it
gets data from another event.

475
00:21:32,450 --> 00:21:35,000
You can create sort of these
complex interactions that

476
00:21:35,000 --> 00:21:36,780
often lead to deadlock.

477
00:21:36,780 --> 00:21:39,580
So you have to sort of be very
careful in structuring things

478
00:21:39,580 --> 00:21:44,680
together so you don't end up
with feedback loops that block

479
00:21:44,680 --> 00:21:47,130
computation progress.

480
00:21:47,130 --> 00:21:50,510
So given sort of these three
organizational structures that

481
00:21:50,510 --> 00:21:53,880
say, you know, I can organize
my computation by task or by

482
00:21:53,880 --> 00:21:57,120
the flow of data or by sort of
the pipeline nature of the

483
00:21:57,120 --> 00:21:59,440
computation, what are the
supporting structures?

484
00:21:59,440 --> 00:22:00,800
How do I actually
implement these?

485
00:22:00,800 --> 00:22:03,510
And so there are many
different supporting

486
00:22:03,510 --> 00:22:04,150
structures.

487
00:22:04,150 --> 00:22:08,880
I've identified sort of four
that occur most often in

488
00:22:08,880 --> 00:22:11,190
literature and in
books and common

489
00:22:11,190 --> 00:22:13,600
terminology that's used.

490
00:22:13,600 --> 00:22:18,090
And so those are SPMD, loop
parallelism, the master/worker

491
00:22:18,090 --> 00:22:19,500
pattern, and the fork/join
pattern.

492
00:22:22,620 --> 00:22:26,070
In the SPMD pattern, you're
talking about a single

493
00:22:26,070 --> 00:22:29,070
program, multiple
data concept.

494
00:22:29,070 --> 00:22:31,000
So here you just have
one program.

495
00:22:31,000 --> 00:22:33,680
You write it once and then you
assign it to each of your

496
00:22:33,680 --> 00:22:34,900
processors to run.

497
00:22:34,900 --> 00:22:35,940
So it's the same program.

498
00:22:35,940 --> 00:22:38,490
It just runs on different
machines.

499
00:22:38,490 --> 00:22:42,710
Now each program or each
instance of the code can have

500
00:22:42,710 --> 00:22:44,490
different control flow
that it takes.

501
00:22:44,490 --> 00:22:46,270
So just because they're running
the same program

502
00:22:46,270 --> 00:22:48,830
doesn't mean the computation
is happening in lock step.

503
00:22:48,830 --> 00:22:53,910
That would be a sort of a SIMD
or vector-like computation.

504
00:22:53,910 --> 00:22:56,550
In this model you can actually
take independent control flow.

505
00:22:56,550 --> 00:22:59,810
It could be different behavior
in each instance of the code.

506
00:22:59,810 --> 00:23:02,140
But you're running the
same code everywhere.

507
00:23:02,140 --> 00:23:04,550
So this is slightly different,
for example, from what you've

508
00:23:04,550 --> 00:23:09,480
seen on Cell, where you have
the PPE thread that creates

509
00:23:09,480 --> 00:23:10,540
SPE threads.

510
00:23:10,540 --> 00:23:13,720
Sometimes the SPE threads are
the same, but it's not always

511
00:23:13,720 --> 00:23:15,560
the case that the PPE
threads and the SPE

512
00:23:15,560 --> 00:23:17,610
threads are the same.

513
00:23:17,610 --> 00:23:21,110
So in the SPMD model there are
really five steps that you do.

514
00:23:21,110 --> 00:23:25,020
You initialize sort of your
computation in the world of

515
00:23:25,020 --> 00:23:27,250
sort of code instances that
you're going to run.

516
00:23:27,250 --> 00:23:29,730
And for each one you obtain
a unique identifier.

517
00:23:29,730 --> 00:23:32,250
And this usually helps them
being able to determine who

518
00:23:32,250 --> 00:23:37,250
needs to communicate with who
or ordering dependencies.

519
00:23:37,250 --> 00:23:41,560
And you run the same program
on each processor.

520
00:23:41,560 --> 00:23:44,030
And what you need to do in this
case is also distribute

521
00:23:44,030 --> 00:23:45,860
your data between each
of the different

522
00:23:45,860 --> 00:23:48,030
instances of your code.

523
00:23:48,030 --> 00:23:50,870
And once, you know, each program
is running, it's

524
00:23:50,870 --> 00:23:53,300
computing on its data,
eventually you need to

525
00:23:53,300 --> 00:23:54,570
finalize in some way.

526
00:23:54,570 --> 00:23:57,790
And so that might mean doing a
reduction to communicate all

527
00:23:57,790 --> 00:24:03,160
the data to one processor to
actually output the value.

528
00:24:03,160 --> 00:24:06,620
And so we saw in SPMD an example
for the numerical

529
00:24:06,620 --> 00:24:09,120
integration for calculating
pi.

530
00:24:09,120 --> 00:24:14,000
And if you remember, so we had
this very simple C loop.

531
00:24:14,000 --> 00:24:19,780
And we showed the MPI
implementation of the C loop.

532
00:24:19,780 --> 00:24:23,070
And so in this code, what we're
doing is we're trying to

533
00:24:23,070 --> 00:24:24,760
determine different intervals.

534
00:24:24,760 --> 00:24:27,980
And for each interval we're
going to calculate a value and

535
00:24:27,980 --> 00:24:33,420
then in the MPI program we're
essentially deciding how big

536
00:24:33,420 --> 00:24:36,640
an interval each process
should run.

537
00:24:36,640 --> 00:24:37,770
So it's the same program.

538
00:24:37,770 --> 00:24:39,640
It runs on every single
machine or

539
00:24:39,640 --> 00:24:41,150
every single processor.

540
00:24:41,150 --> 00:24:46,290
And each processor determines
based on its ID which interval

541
00:24:46,290 --> 00:24:49,340
of the actual integration
to do.

542
00:24:49,340 --> 00:24:51,320
And so in this model
we're distributing

543
00:24:51,320 --> 00:24:53,000
work relatively evenly.

544
00:24:53,000 --> 00:24:56,400
Each processor is doing a
specific chunk that starts at

545
00:24:56,400 --> 00:24:57,930
say some index i.

546
00:24:57,930 --> 00:25:01,810
And if I have 10 processors,
I'm doing 100 steps.

547
00:25:01,810 --> 00:25:06,260
Then you're doing i, i plus
10, i plus 20 and so on.

548
00:25:06,260 --> 00:25:08,320
But I can do a different
distribution.

549
00:25:08,320 --> 00:25:10,050
So the first is a block
distribution.

550
00:25:10,050 --> 00:25:12,490
I can do something called
a cyclic distribution.

551
00:25:12,490 --> 00:25:15,420
So in a cyclic distribution, I
distribute work sort of in a

552
00:25:15,420 --> 00:25:18,080
round robin fashion or
some other mechanism.

553
00:25:18,080 --> 00:25:21,940
So here, you know,
each processor --

554
00:25:21,940 --> 00:25:24,780
sorry.

555
00:25:24,780 --> 00:25:28,480
In the block distribution I sort
of start at interval i

556
00:25:28,480 --> 00:25:31,380
and I go --

557
00:25:31,380 --> 00:25:32,660
sorry.

558
00:25:32,660 --> 00:25:36,180
So each processor gets one
entire slice here.

559
00:25:36,180 --> 00:25:38,260
So I start here and I go
through to completion.

560
00:25:38,260 --> 00:25:41,760
I start here and go through
to completion.

561
00:25:41,760 --> 00:25:45,100
In a cyclic distribution I might
do smaller slices of

562
00:25:45,100 --> 00:25:47,670
each one of those intervals.

563
00:25:47,670 --> 00:25:52,320
And so I greyed out the
components for the block

564
00:25:52,320 --> 00:25:55,040
distribution to show you that
for a contrast here.
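
A small SPMD sketch of those two distributions (myid, numprocs, and the do_interval helper are placeholders for illustration, not the actual lab code): every copy of the program runs the same loop, and its rank decides which of the n iterations it owns.

    /* Each of numprocs identical program instances runs this code;
       'myid' is the instance's rank, 0..numprocs-1. */

    /* Block distribution: one contiguous slice per instance
       (assuming numprocs divides n evenly). */
    int chunk = n / numprocs;
    for (int i = myid * chunk; i < (myid + 1) * chunk; i++)
        do_interval(i);               /* placeholder per-interval work */

    /* Cyclic distribution: iterations dealt out round robin. */
    for (int i = myid; i < n; i += numprocs)
        do_interval(i);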

565
00:25:58,070 --> 00:26:01,610
There are some challenges
in the SPMD model.

566
00:26:01,610 --> 00:26:02,980
And that is how do you actually

567
00:26:02,980 --> 00:26:05,030
split your data correctly?

568
00:26:05,030 --> 00:26:08,410
You have to distribute your data
in a way that, you know,

569
00:26:08,410 --> 00:26:11,690
doesn't increase contention on
your memory system, where each

570
00:26:11,690 --> 00:26:14,900
actual processor that's assigned
the computation has

571
00:26:14,900 --> 00:26:18,630
data locally to actually
operate on.

572
00:26:18,630 --> 00:26:21,480
And you want to achieve an
even work distribution.

573
00:26:21,480 --> 00:26:23,800
You know, do you need a dynamic
load balancing scheme

574
00:26:23,800 --> 00:26:26,740
or can you use an alternative
pattern

575
00:26:26,740 --> 00:26:28,020
if that's not suitable?

576
00:26:30,740 --> 00:26:34,750
So the second pattern, as
opposed to the SPMD pattern is

577
00:26:34,750 --> 00:26:36,100
loop parallelism pattern.

578
00:26:36,100 --> 00:26:39,220
In this case, this is the best
suited when you actually have

579
00:26:39,220 --> 00:26:42,870
a programming model or a program
that you can't really

580
00:26:42,870 --> 00:26:44,880
change a whole lot or that
you don't really want to

581
00:26:44,880 --> 00:26:46,590
change a whole lot.

582
00:26:46,590 --> 00:26:48,680
Or you have a programming model
that allows you to sort

583
00:26:48,680 --> 00:26:52,790
of identify loops that take up
most of the computation and

584
00:26:52,790 --> 00:26:55,160
then insert annotations or some
ways to automatically

585
00:26:55,160 --> 00:26:57,350
parallelize those loops.

586
00:26:57,350 --> 00:27:01,660
So we saw in the OpenMP example,
you have some loops

587
00:27:01,660 --> 00:27:03,580
you can insert these
pragmas that say,

588
00:27:03,580 --> 00:27:05,000
this loop is parallel.

589
00:27:05,000 --> 00:27:08,390
And the compiler and the run-time
system can

590
00:27:08,390 --> 00:27:11,270
automatically partition this
loop into smaller chunks.

591
00:27:11,270 --> 00:27:15,880
And then each chunk can
compute in parallel.
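
A minimal sketch of that annotation style (the loop body is a made-up placeholder): a single pragma on the hot loop lets the OpenMP compiler and run time split the iterations into chunks without restructuring the surrounding code.

    #include <omp.h>

    /* The run time partitions the iteration space into chunks
       and hands one chunk to each thread. */
    void scale(double *a, double s, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = s * a[i];
    }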

592
00:27:15,880 --> 00:27:19,200
And you might apply this scheme
in different ways

593
00:27:19,200 --> 00:27:21,870
depending on how well you
understand your code.

594
00:27:21,870 --> 00:27:25,390
Are you running on a shared
memory machine?

595
00:27:25,390 --> 00:27:27,490
You can't afford to do a whole
lot of restructuring.

596
00:27:27,490 --> 00:27:29,440
Communication costs might
be really expensive.

597
00:27:33,480 --> 00:27:36,910
In the master/worker pattern,
this is really starting to get

598
00:27:36,910 --> 00:27:41,040
closer to what we've
done with the Cell

599
00:27:41,040 --> 00:27:43,000
recitations in the Cell labs.

600
00:27:43,000 --> 00:27:46,850
You have some world of
independent tasks and the

601
00:27:46,850 --> 00:27:50,460
master essentially running and
distributing each of these

602
00:27:50,460 --> 00:27:53,490
tasks to different processors.

603
00:27:53,490 --> 00:27:57,240
So in this case you'd get
several advantages that you

604
00:27:57,240 --> 00:27:58,150
can leverage.

605
00:27:58,150 --> 00:28:00,830
If each of your tasks are varied
in nature -- and they

606
00:28:00,830 --> 00:28:03,390
might finish at different
times or they require

607
00:28:03,390 --> 00:28:06,140
different kinds of resources,
you can use this model to sort

608
00:28:06,140 --> 00:28:10,640
of view your machine as sort of
a non-symmetric processor.

609
00:28:10,640 --> 00:28:12,090
Not everybody is the same.

610
00:28:12,090 --> 00:28:15,040
And you can use this model
really well for that.

611
00:28:15,040 --> 00:28:18,790
So you can distribute these and
then you can do dynamic

612
00:28:18,790 --> 00:28:19,390
load balancing.

613
00:28:19,390 --> 00:28:21,670
Because as processors --

614
00:28:21,670 --> 00:28:25,120
as workers finish you can ship
them more and more data.

615
00:28:25,120 --> 00:28:35,580
So it has some particularly
relevant properties for

616
00:28:35,580 --> 00:28:38,200
heterogeneous computations, but
it's also really good for

617
00:28:38,200 --> 00:28:40,140
when you have a whole lot
of parallelism in your

618
00:28:40,140 --> 00:28:40,790
application.

619
00:28:40,790 --> 00:28:42,400
So something called
embarrassingly parallel

620
00:28:42,400 --> 00:28:46,200
problems. So ray tracing,
molecular dynamics, a lot of

621
00:28:46,200 --> 00:28:48,780
scientific applications have
these massive levels of

622
00:28:48,780 --> 00:28:49,580
parallelism.

623
00:28:49,580 --> 00:28:52,060
And you can use this essentially
work-queue based

624
00:28:52,060 --> 00:28:54,690
mechanism that says I have all
these tasks and I'll just

625
00:28:54,690 --> 00:28:58,700
dispatch them to workers
and compute.
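
A rough sketch of that work-queue mechanism (the shared counter, the GCC atomic builtin, and the do_task placeholder are my assumptions, not the lab code): the master publishes a pool of task indices and every worker keeps claiming the next one until the pool is empty, which is what gives the dynamic load balancing.

    extern void do_task(int t);       /* placeholder for the real work */

    static int next_task = 0;         /* shared work-queue cursor */

    /* Atomically claim the next task index, or -1 if none remain. */
    static int grab_task(int num_tasks)
    {
        int t = __sync_fetch_and_add(&next_task, 1);
        return t < num_tasks ? t : -1;
    }

    static void worker_loop(int num_tasks)
    {
        int t;
        while ((t = grab_task(num_tasks)) != -1)
            do_task(t);
    }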

626
00:28:58,700 --> 00:29:00,970
And as I pointed out earlier,
you know, when do you define

627
00:29:00,970 --> 00:29:03,040
your entire computation
to have completed?

628
00:29:03,040 --> 00:29:05,160
You know, sometimes you're
computing a result until

629
00:29:05,160 --> 00:29:09,210
you've reached some result.

630
00:29:09,210 --> 00:29:13,080
And often you're willing to
accept a result within some

631
00:29:13,080 --> 00:29:14,720
range of error.

632
00:29:14,720 --> 00:29:16,155
And you might have some
more threads that

633
00:29:16,155 --> 00:29:17,290
are still in flight.

634
00:29:17,290 --> 00:29:19,860
Do you terminate your
computation then or not?

635
00:29:19,860 --> 00:29:21,580
What are some issues with
synchronization?

636
00:29:21,580 --> 00:29:24,020
If you have so many threads that
are running together, you

637
00:29:24,020 --> 00:29:26,900
know, does the communication
between them to send out these

638
00:29:26,900 --> 00:29:29,970
control messages say, I'm done,
start to overwhelm you?

639
00:29:33,660 --> 00:29:36,240
In the fork/join pattern --

640
00:29:36,240 --> 00:29:40,610
this is really not conceptually
too different in

641
00:29:40,610 --> 00:29:46,570
my mind from the master/worker
model, and also very relevant

642
00:29:46,570 --> 00:29:49,840
to what we've done with Cell.

643
00:29:49,840 --> 00:29:52,300
The main difference might be
that you have tasks that are

644
00:29:52,300 --> 00:29:54,610
dynamically created.

645
00:29:54,610 --> 00:29:56,960
So in the embarrassingly
parallel case, you actually

646
00:29:56,960 --> 00:29:59,170
know the world of all your
potential tasks that you're

647
00:29:59,170 --> 00:30:00,280
going to run in parallel.

648
00:30:00,280 --> 00:30:02,920
In the fork/join model
some new computation might

649
00:30:02,920 --> 00:30:06,720
come up as a result of, say,
an event-based mechanism.

650
00:30:06,720 --> 00:30:09,870
So a task might be created
dynamically and then later

651
00:30:09,870 --> 00:30:11,760
terminated or they
might complete.

652
00:30:11,760 --> 00:30:13,900
And so new ones come
up as a result.
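
A minimal fork/join sketch with POSIX threads (child_work is just a stand-in for whatever task gets created dynamically): the parent forks its children and cannot get past the join point until every child has completed.

    #include <pthread.h>

    static void *child_work(void *arg) { return arg; }   /* stand-in task */

    void fork_join(int nchildren)
    {
        pthread_t child[nchildren];
        for (int i = 0; i < nchildren; i++)          /* fork */
            pthread_create(&child[i], NULL, child_work, NULL);
        for (int i = 0; i < nchildren; i++)          /* join */
            pthread_join(child[i], NULL);
        /* the parent resumes only once all children have joined */
    }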

653
00:30:13,900 --> 00:30:21,660
AUDIENCE: It almost seems like
you are forking the task in

654
00:30:21,660 --> 00:30:23,910
the forking model.

655
00:30:23,910 --> 00:30:25,758
And then keep assigning
tasks to that.

656
00:30:25,758 --> 00:30:28,203
The fork/join model you
just keep forking at

657
00:30:28,203 --> 00:30:29,670
first virtual box.

658
00:30:29,670 --> 00:30:31,501
Might not be completely
matched to the number of

659
00:30:31,501 --> 00:30:32,751
processors available.

660
00:30:43,430 --> 00:30:44,680
fork them out.

661
00:30:46,910 --> 00:30:52,420
PROFESSOR: So the process that's
creating all these

662
00:30:52,420 --> 00:30:56,250
threads or that's doing all the
forking is often known as

663
00:30:56,250 --> 00:30:58,600
the parent and the
tasks that are

664
00:30:58,600 --> 00:31:00,240
generated are the children.

665
00:31:00,240 --> 00:31:04,580
And eventually essentially the
parent can't continue or can't

666
00:31:04,580 --> 00:31:07,760
resume until its children have
sort of completed or have

667
00:31:07,760 --> 00:31:10,130
reached the join point.

668
00:31:10,130 --> 00:31:15,730
And so those are really some of
the models that we've seen

669
00:31:15,730 --> 00:31:18,870
already, in a lot of cases in
the recitations and labs for

670
00:31:18,870 --> 00:31:20,720
how you run your computations.

671
00:31:20,720 --> 00:31:23,100
And some of you have already
discovered these and actually

672
00:31:23,100 --> 00:31:25,230
are thinking about how your
projects should be sort of

673
00:31:25,230 --> 00:31:30,160
parallelized for your
actual Cell demos.

674
00:31:30,160 --> 00:31:32,610
Some of the other things that
I'm just going to talk about

675
00:31:32,610 --> 00:31:34,020
are communication patterns.

676
00:31:34,020 --> 00:31:37,300
So two lectures ago you saw,
for example, that you have

677
00:31:37,300 --> 00:31:40,430
point to point communication
or you have broadcast

678
00:31:40,430 --> 00:31:41,400
communication.

679
00:31:41,400 --> 00:31:43,430
So in point to point
communication, you have two

680
00:31:43,430 --> 00:31:45,010
tasks that need to
communicate.

681
00:31:45,010 --> 00:31:47,720
And they can send explicit
messages to each other.

682
00:31:47,720 --> 00:31:49,762
These could be control messages
that say I'm done or

683
00:31:49,762 --> 00:31:50,940
I'm waiting for data.

684
00:31:50,940 --> 00:31:53,560
Or they could be data messages
that actually ships you a

685
00:31:53,560 --> 00:31:55,640
particular data element
that you might need.

686
00:31:55,640 --> 00:31:58,250
And again we've seen
this with Cell.

687
00:31:58,250 --> 00:32:01,320
Broadcast says, you know,
I have some result that

688
00:32:01,320 --> 00:32:02,050
everybody needs.

689
00:32:02,050 --> 00:32:05,570
And so I send that out to
everybody by some mechanism.

690
00:32:05,570 --> 00:32:09,530
There is no real broadcast
mechanism on Cell.

691
00:32:09,530 --> 00:32:12,000
The concept I'm going to talk
about though is the reduction

692
00:32:12,000 --> 00:32:15,900
mechanism, which really is the
inverse of the broadcast. So

693
00:32:15,900 --> 00:32:18,070
in the broadcast I have a data
element I need to send to

694
00:32:18,070 --> 00:32:19,060
everybody else.

695
00:32:19,060 --> 00:32:23,090
In the reduction, all of you
have data that I need or all

696
00:32:23,090 --> 00:32:25,370
of us have data that
somebody else needs.

697
00:32:25,370 --> 00:32:27,990
So what we need to do is
collectively bring that data

698
00:32:27,990 --> 00:32:34,730
together or group it together
and generate an end result.

699
00:32:34,730 --> 00:32:40,160
So a simple example of a
reduction, you have some array

700
00:32:40,160 --> 00:32:42,530
of elements that you want
to add together.

701
00:32:42,530 --> 00:32:45,240
And sort of the result
of the collective

702
00:32:45,240 --> 00:32:47,800
operation is the end sum.

703
00:32:47,800 --> 00:32:52,790
So you have an array of four
elements, A0, A1, A2, and A3.

704
00:32:52,790 --> 00:32:54,080
And you can do a serial
reduction.

705
00:32:54,080 --> 00:32:56,810
I can take A0 and
add it to A1.

706
00:32:56,810 --> 00:32:59,360
And that gives me a result.

707
00:32:59,360 --> 00:33:02,030
And I can take A2 and
add that to it.

708
00:33:02,030 --> 00:33:03,900
And I can take A3 and
add that to it.

709
00:33:03,900 --> 00:33:06,530
And so at the end I'll have
sort of calculated the sum

710
00:33:06,530 --> 00:33:09,030
from A0 to A3.

711
00:33:09,030 --> 00:33:12,880
So this is essentially -- the
serial reduction applies when

712
00:33:12,880 --> 00:33:15,080
your operation is
non-associative.

713
00:33:15,080 --> 00:33:17,450
So the addition is
associative.

714
00:33:17,450 --> 00:33:22,100
So in this case I can actually
do something more intelligent.

715
00:33:22,100 --> 00:33:24,280
And I think we talked about
that last time.

716
00:33:24,280 --> 00:33:26,470
I'm going to show you
some more examples.

717
00:33:26,470 --> 00:33:29,980
And often sort of the end result
follows a broadcast. It

718
00:33:29,980 --> 00:33:31,130
says, here is the end result.

719
00:33:31,130 --> 00:33:32,530
Who are all the people
that need it?

720
00:33:32,530 --> 00:33:35,030
I'll sort of broadcast
that out so that

721
00:33:35,030 --> 00:33:36,820
everybody has the result.

722
00:33:36,820 --> 00:33:39,320
If your operation isn't
associative, then you're

723
00:33:39,320 --> 00:33:41,790
essentially limited to
a serial process.

724
00:33:41,790 --> 00:33:45,130
And so that's not very good from
a performance standpoint.

725
00:33:48,860 --> 00:33:50,890
Some of the tricks you can apply
for actually getting

726
00:33:50,890 --> 00:33:54,030
performance out of your
reduction is to go to a

727
00:33:54,030 --> 00:33:55,840
tree-based reduction model.

728
00:33:55,840 --> 00:33:57,200
So this might be very obvious.

729
00:33:57,200 --> 00:34:01,010
Rather than doing A0 and A1
and then adding A2 to that

730
00:34:01,010 --> 00:34:03,650
result, I can do A0
and A1 together.

731
00:34:03,650 --> 00:34:05,860
In parallel I can
do A2 and A3.

732
00:34:05,860 --> 00:34:07,830
And then I can get those results
and add them together.

733
00:34:07,830 --> 00:34:12,150
So rather than doing n steps
I can do log n steps.
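A sketch of the tree for the four-element example, assuming the additions at the same level can be assigned to different processors:

    /* Tree-based reduction: log2(4) = 2 levels instead of 3 serial steps. */
    double tree_sum(const double A[4]) {
        double left  = A[0] + A[1];   /* level 1, could run in parallel ... */
        double right = A[2] + A[3];   /* ... with this one */
        return left + right;          /* level 2 */
    }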

734
00:34:12,150 --> 00:34:15,240
So this is particularly
attractive when only one task

735
00:34:15,240 --> 00:34:16,100
needs the result.

736
00:34:16,100 --> 00:34:19,280
So in the MPI program when we're
doing the integration to

737
00:34:19,280 --> 00:34:21,920
calculate pi, you know, one
processor needs to print out

738
00:34:21,920 --> 00:34:23,170
that value of pi.
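That is the pattern MPI_Reduce captures: every rank contributes a partial value and only the root ends up with the combined result. A sketch with illustrative names (local_pi, rank), assuming MPI is initialized:

    /* Reduction to a single task: only rank 0 receives the final sum. */
    double local_pi = /* this rank's partial integral */ 0.0;
    double pi = 0.0;
    MPI_Reduce(&local_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);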

739
00:34:25,440 --> 00:34:29,020
But if you have a computation
where more than one process

740
00:34:29,020 --> 00:34:31,170
actually needs the result of
the reduction, there's

741
00:34:31,170 --> 00:34:35,170
actually a better mechanism you
can use that's sort of a

742
00:34:35,170 --> 00:34:38,060
better alternative to the
tree-based reduction followed

743
00:34:38,060 --> 00:34:39,950
by a broadcast. So you can do a

744
00:34:39,950 --> 00:34:44,010
recursive doubling reduction.

745
00:34:44,010 --> 00:34:47,110
So at the end here, every
process will have the result

746
00:34:47,110 --> 00:34:50,550
of the reduction without having
done the broadcast. So

747
00:34:50,550 --> 00:34:54,950
we can start off as with the
tree-based and add up A0 and

748
00:34:54,950 --> 00:34:56,380
A1 together.

749
00:34:56,380 --> 00:35:00,110
But what we do is for each
process that has a value, we

750
00:35:00,110 --> 00:35:01,960
sort of do a local exchange.

751
00:35:01,960 --> 00:35:05,100
So from here we communicate
the value to here.

752
00:35:05,100 --> 00:35:06,900
And from here we communicate
the value to here.

753
00:35:06,900 --> 00:35:09,900
And so now these two processors
that had the value

754
00:35:09,900 --> 00:35:13,700
independently both have
a local sum, A0 to A1.

755
00:35:13,700 --> 00:35:17,140
And similarly we can
make the symmetric

756
00:35:17,140 --> 00:35:19,740
computation on the other side.

757
00:35:19,740 --> 00:35:23,340
And now we can communicate data
from these two processors

758
00:35:23,340 --> 00:35:25,610
here to come up with
the end --

759
00:35:30,150 --> 00:35:31,630
PROFESSOR: It was there.

760
00:35:31,630 --> 00:35:32,720
All right.

761
00:35:32,720 --> 00:35:33,420
Must have been lost
in the animation.

762
00:35:33,420 --> 00:35:37,000
So you actually do that the
other way as well so that you

763
00:35:37,000 --> 00:35:42,050
have the sum A0 to A3 on all
the different processors.

764
00:35:42,050 --> 00:35:45,350
Sorry about the lost
animation.

765
00:35:45,350 --> 00:35:45,620
OK.

766
00:35:45,620 --> 00:35:47,870
So this is better than the
tree-based approach with a

767
00:35:47,870 --> 00:35:51,070
broadcast because you
end up with local

768
00:35:51,070 --> 00:35:53,410
results of your reduction.
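This pattern, where every task ends up holding the reduced value, is what an all-reduce provides; in MPI the call is MPI_Allreduce, and many implementations use recursive doubling underneath. A sketch, with my_value as a hypothetical per-rank contribution:

    /* All-reduce: every rank gets the global sum, no separate broadcast. */
    double local = my_value;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* global now holds the sum over all ranks */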

769
00:35:53,410 --> 00:35:58,780
And rather than doing the
broadcast following the

770
00:35:58,780 --> 00:36:01,900
tree-based reduction, which takes
n steps, everybody ends up with

771
00:36:01,900 --> 00:36:03,000
the result in order n steps,

772
00:36:03,000 --> 00:36:06,450
versus an order 2n

773
00:36:06,450 --> 00:36:10,590
process for the tree-based reduction
plus broadcast.

774
00:36:10,590 --> 00:36:13,200
AUDIENCE: On the Cell processor
but not in general.

775
00:36:13,200 --> 00:36:15,250
PROFESSOR: Not in general.

776
00:36:15,250 --> 00:36:17,900
It depends on sort of the
architectural mechanism that

777
00:36:17,900 --> 00:36:20,860
you have for your network.

778
00:36:20,860 --> 00:36:23,220
If you have

779
00:36:23,220 --> 00:36:26,205
a broadcast mechanism -- say a
bus-based architecture where

780
00:36:26,205 --> 00:36:28,970
you can deposit a local value and
everybody can pull that value off --

781
00:36:28,970 --> 00:36:31,410
then, yeah, it can be
more efficient.

782
00:36:31,410 --> 00:36:33,960
Or on optical networks, you
can broadcast the data and

783
00:36:33,960 --> 00:36:35,480
everybody can just
pick it up.

784
00:36:38,910 --> 00:36:40,340
OK.

785
00:36:40,340 --> 00:36:44,860
So summarizing all the different
patterns, so here

786
00:36:44,860 --> 00:36:47,490
these are the actual mechanisms
that you would use

787
00:36:47,490 --> 00:36:50,800
for how you would implement
the different patterns.

788
00:36:50,800 --> 00:36:53,130
So in SPMD you would write the
same program for every processor.

789
00:36:53,130 --> 00:36:56,700
In loop parallelism you have
your program and you might

790
00:36:56,700 --> 00:36:59,380
annotate sort of some pragmas
that tell you how to

791
00:36:59,380 --> 00:37:01,200
parallelize your computation.
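As a sketch of what such an annotation might look like -- OpenMP-style, not something specific to Cell; a, b, n, and sum are illustrative names:

    /* Loop parallelism: a pragma tells the compiler/runtime to split
       the iterations across threads and combine the partial sums. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];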

792
00:37:01,200 --> 00:37:04,740
In the master/worker model you
might have sort of a master

793
00:37:04,740 --> 00:37:07,690
that's going to create threads
and you actually know --

794
00:37:07,690 --> 00:37:10,800
you might sort of have a very
good idea of what is the kind

795
00:37:10,800 --> 00:37:12,780
of work you're going to have
to do in each thread.

796
00:37:12,780 --> 00:37:16,340
In the fork/join model you
have more dynamism.

797
00:37:16,340 --> 00:37:19,280
So you might create threads
on the fly.

798
00:37:19,280 --> 00:37:25,890
And you apply these
based on what is

799
00:37:25,890 --> 00:37:28,490
more suited in terms of
implementation to each of the

800
00:37:28,490 --> 00:37:31,190
different patterns for how you
actually organize your data.

801
00:37:31,190 --> 00:37:34,020
So in the task parallelism
model, this is where you have

802
00:37:34,020 --> 00:37:37,410
a pool of threads that you know
you're going to calculate

803
00:37:37,410 --> 00:37:39,290
or that you're going to use
for your computation.

804
00:37:39,290 --> 00:37:43,610
And really you can use largely
any one of these models.

805
00:37:43,610 --> 00:37:45,340
So I used a ranking
system where four

806
00:37:45,340 --> 00:37:46,340
stars is really good.

807
00:37:46,340 --> 00:37:51,192
One star is sort of bad, and no
star means not well suited.

808
00:37:51,192 --> 00:37:53,320
AUDIENCE: Sort of in
Cell because the

809
00:37:53,320 --> 00:37:55,220
inherent master there.

810
00:37:55,220 --> 00:37:58,110
Sometimes master/worker might
get a little bit more of a bias

811
00:37:58,110 --> 00:37:59,910
than this one.

812
00:37:59,910 --> 00:38:01,210
PROFESSOR: Right, so --

813
00:38:01,210 --> 00:38:04,600
AUDIENCE: You don't have to pay
the cost of having a master --

814
00:38:04,600 --> 00:38:04,950
PROFESSOR: Right.

815
00:38:04,950 --> 00:38:06,050
Right.

816
00:38:06,050 --> 00:38:10,230
Although you could use the
Cell master to do regular

817
00:38:10,230 --> 00:38:11,660
computations as well.

818
00:38:11,660 --> 00:38:14,390
But, yes.

819
00:38:14,390 --> 00:38:18,700
So and the divide and conquer
model, you know, might be

820
00:38:18,700 --> 00:38:20,930
especially well suited for a
fork and join because you're

821
00:38:20,930 --> 00:38:23,580
creating all these recursive
subproblems. They might be

822
00:38:23,580 --> 00:38:24,080
heterogeneous.

823
00:38:24,080 --> 00:38:26,160
Depending on the nature of the computation
that you do, you

824
00:38:26,160 --> 00:38:28,850
might have more subproblems
created dynamically.

825
00:38:28,850 --> 00:38:30,300
Fork/join really works
well for that.

826
00:38:30,300 --> 00:38:33,220
And in fact, you know, the
subproblem structure that I

827
00:38:33,220 --> 00:38:35,400
showed, the graph of
sort of division

828
00:38:35,400 --> 00:38:40,260
and then merging, works really
well with the fork/join model.

829
00:38:40,260 --> 00:38:45,260
In the recursive, in the
geometric decomposition --

830
00:38:45,260 --> 00:38:48,420
this is essentially your lab one
exercise and the things we

831
00:38:48,420 --> 00:38:50,330
went over yesterday
in the recitation.

832
00:38:50,330 --> 00:38:54,860
You're taking data and you're
partitioning over multiple

833
00:38:54,860 --> 00:38:56,960
processors to actually
compute in parallel.

834
00:38:56,960 --> 00:39:00,510
So this could be an SPMD
implementation or it could be

835
00:39:00,510 --> 00:39:02,570
a loop parallelism
implementation,

836
00:39:02,570 --> 00:39:04,880
which we didn't do.
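A sketch of that geometric decomposition in an SPMD style, where each processor derives its own chunk of the data from its id; id, nprocs, n, data, and process are illustrative names:

    /* Geometric decomposition: processor `id` of `nprocs` works on its
       own contiguous chunk of an n-element array. */
    int chunk = (n + nprocs - 1) / nprocs;          /* ceiling division */
    int start = id * chunk;
    int end   = (start + chunk < n) ? start + chunk : n;
    for (int i = start; i < end; i++)
        process(&data[i]);                          /* per-element work */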

837
00:39:04,880 --> 00:39:07,280
Less suitable, the master/worker
and fork/join,

838
00:39:07,280 --> 00:39:10,960
often because the geometric
decomposition applies some

839
00:39:10,960 --> 00:39:13,510
distribution to the data which
has static properties that you

840
00:39:13,510 --> 00:39:15,640
can exploit in various ways.

841
00:39:15,640 --> 00:39:17,420
So you don't need to
pay the overhead of

842
00:39:17,420 --> 00:39:21,320
master/worker or fork/join.

843
00:39:21,320 --> 00:39:25,670
Recursive data structures sort
of have very specific models

844
00:39:25,670 --> 00:39:27,320
that you can run with.

845
00:39:27,320 --> 00:39:32,950
Largely master/worker is a
decent implementation choice.

846
00:39:32,950 --> 00:39:36,180
SPMD is another.

847
00:39:36,180 --> 00:39:38,220
And you're going to hear more
about sort of the pipeline

848
00:39:38,220 --> 00:39:40,620
mechanism in the next talk so
I'm not going to talk about

849
00:39:40,620 --> 00:39:42,120
that very much.

850
00:39:42,120 --> 00:39:44,710
Event-based coordination,
largely dynamic.

851
00:39:44,710 --> 00:39:47,090
So fork/join works
really well.

852
00:39:47,090 --> 00:39:48,780
So one --

853
00:39:48,780 --> 00:39:50,870
AUDIENCE: When you're buffering
them you could do

854
00:39:50,870 --> 00:39:52,570
master/worker with pipelining?

855
00:39:52,570 --> 00:39:54,960
PROFESSOR: Yes, so next slide.

856
00:39:54,960 --> 00:39:58,860
So sort of these choices or
these tradeoffs aren't really

857
00:39:58,860 --> 00:39:59,570
orthogonal.

858
00:39:59,570 --> 00:40:01,670
You can actually combine
them in different ways.

859
00:40:01,670 --> 00:40:05,430
And in a lot of applications
what you might find is that

860
00:40:05,430 --> 00:40:08,710
the different patterns compose
hierarchically.

861
00:40:08,710 --> 00:40:12,540
And you actually want that
in various ways --

862
00:40:12,540 --> 00:40:13,610
for various reasons.

863
00:40:13,610 --> 00:40:17,530
So in the MPEG example, you
know, we had tasks here, and within

864
00:40:17,530 --> 00:40:21,720
each task identified
some pipeline stages.

865
00:40:21,720 --> 00:40:23,680
You know, here I have some
data parallelism so I can

866
00:40:23,680 --> 00:40:27,590
apply the loop pattern here.

867
00:40:27,590 --> 00:40:30,400
And what I want to do is
actually in my computation

868
00:40:30,400 --> 00:40:32,870
sort of express these different
mechanisms so I can

869
00:40:32,870 --> 00:40:34,910
understand sort of different
tradeoffs.

870
00:40:34,910 --> 00:40:37,670
And for really large
applications, there might be

871
00:40:37,670 --> 00:40:40,630
different patterns that are
well suited for the actual

872
00:40:40,630 --> 00:40:41,910
computation that I'm doing.

873
00:40:41,910 --> 00:40:45,860
So I can combine things like
pipelining with a task-based

874
00:40:45,860 --> 00:40:49,270
mechanism or data parallelism
to actually get really good

875
00:40:49,270 --> 00:40:50,660
performance speedups.

876
00:40:50,660 --> 00:40:53,850
And one of the things that
might strike you is, well,

877
00:40:53,850 --> 00:40:55,740
heck, this is a whole lot of
work that I have to do to

878
00:40:55,740 --> 00:40:58,810
actually get my code in the
right way so that I can

879
00:40:58,810 --> 00:41:01,050
actually take advantage of
my parallel architecture.

880
00:41:01,050 --> 00:41:03,020
You know, I have to conceptually
think about the

881
00:41:03,020 --> 00:41:04,250
question the right way.

882
00:41:04,250 --> 00:41:07,270
I have to maybe restructure my
computation in different ways

883
00:41:07,270 --> 00:41:09,790
to actually exploit
parallelism.

884
00:41:09,790 --> 00:41:11,260
Data distribution
is really hard.

885
00:41:11,260 --> 00:41:12,810
I have to get that right.

886
00:41:12,810 --> 00:41:14,970
Synchronization issues
might be a problem.

887
00:41:14,970 --> 00:41:16,330
And how much buffering
do I need to do

888
00:41:16,330 --> 00:41:18,020
between different tasks?

889
00:41:18,020 --> 00:41:20,210
So the thing you're going to
hear about in the next talk

890
00:41:20,210 --> 00:41:22,820
is, well, what if these things
really fall out naturally from

891
00:41:22,820 --> 00:41:27,800
the way you actually write the
program, and if the way you

892
00:41:27,800 --> 00:41:30,360
actually write your program
matches really well with the

893
00:41:30,360 --> 00:41:32,530
intuitive, sort of natural

894
00:41:32,530 --> 00:41:35,000
conceptualization of the problem.

895
00:41:35,000 --> 00:41:37,410
And so I'll leave Bill
to talk about that.

896
00:41:37,410 --> 00:41:40,190
And I'm going to stop here.

897
00:41:40,190 --> 00:41:41,883
Any questions?

898
00:41:41,883 --> 00:41:43,099
AUDIENCE: We can take in
some questions and

899
00:41:43,099 --> 00:41:44,349
then everybody --

900
00:41:50,650 --> 00:41:52,460
AUDIENCE: You talked about
fork and join.

901
00:41:52,460 --> 00:41:57,735
When you have a parent thread
that spawns off to a child

902
00:41:57,735 --> 00:42:01,403
thread, how do you keep your
parent thread from

903
00:42:01,403 --> 00:42:03,970
using up the SPE?

904
00:42:03,970 --> 00:42:07,290
PROFESSOR: So you have a
fork/join where you have --

905
00:42:07,290 --> 00:42:11,780
AUDIENCE: For most of the parents,
it might be the PPE.

906
00:42:11,780 --> 00:42:17,430
And so if you just do fork/join,
it might not really

907
00:42:17,430 --> 00:42:20,220
use the PPE unless you can, you
know, you have some time and

908
00:42:20,220 --> 00:42:22,670
you let it do some of the
task and come back.

909
00:42:22,670 --> 00:42:26,859
AUDIENCE: So for our purposes
we shouldn't spawn off new

910
00:42:26,859 --> 00:42:29,110
threads from the SPEs?

911
00:42:29,110 --> 00:42:29,810
PROFESSOR: So, yeah.

912
00:42:29,810 --> 00:42:32,390
So most of the threads
that are spawned off

913
00:42:32,390 --> 00:42:34,160
are done by the PPE.

914
00:42:34,160 --> 00:42:35,560
So you have these --

915
00:42:35,560 --> 00:42:39,200
in fact there was a good walkthrough
in recitation yesterday.

916
00:42:39,200 --> 00:42:39,950
You have the PPE.

917
00:42:39,950 --> 00:42:43,065
Essentially it sends messages to
the SPEs that say, create

918
00:42:43,065 --> 00:42:45,190
these threads and start
running them.

919
00:42:45,190 --> 00:42:46,400
Here's the data for them.

920
00:42:46,400 --> 00:42:48,950
And then these threads
run on the SPEs.

921
00:42:48,950 --> 00:42:50,680
And they just do local
computation.

922
00:42:50,680 --> 00:42:53,570
And then they send messages
back to the PPE that says,

923
00:42:53,570 --> 00:42:54,050
we're done.

924
00:42:54,050 --> 00:42:56,422
So that essentially implements
the join mechanism.
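That create-then-wait structure is the join half of fork/join. A generic sketch using POSIX threads -- on Cell the workers would actually be SPE contexts started through the SPE runtime library, but the shape is the same:

    #include <pthread.h>

    #define NWORKERS 6

    /* Each worker does its local computation and returns ("I'm done"). */
    static void *worker(void *arg) {
        /* ... local computation on this worker's piece of the data ... */
        return NULL;
    }

    void fork_and_join(void) {
        pthread_t tid[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)      /* fork: create workers */
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)      /* join: wait for all */
            pthread_join(tid[i], NULL);
    }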

925
00:42:56,422 --> 00:42:58,390
AUDIENCE: On the other hand,
if you are doing something

926
00:42:58,390 --> 00:43:05,430
like a master/slave way, and
then the SPE can send a

927
00:43:05,430 --> 00:43:09,690
message and deliver another
job to the PPE,

928
00:43:09,690 --> 00:43:11,830
which feeds the master.

929
00:43:11,830 --> 00:43:13,930
If the SPE sees there's some more
computation, it can say, OK,

930
00:43:13,930 --> 00:43:16,360
look, put this into your queue
and keep sending

931
00:43:16,360 --> 00:43:19,740
messages and so the master can
look at that and update it.

932
00:43:19,740 --> 00:43:22,935
So, you know, it's not only
master who has to fork off but

933
00:43:22,935 --> 00:43:24,670
the slaves also.

934
00:43:24,670 --> 00:43:28,470
They still can send
information back.

935
00:43:28,470 --> 00:43:34,180
So you can think about
something like very

936
00:43:34,180 --> 00:43:36,250
confident that way.

937
00:43:36,250 --> 00:43:43,340
There are eight -- like if six
SPEs are running and you first

938
00:43:43,340 --> 00:43:47,440
get something in there and an SPE
says divide it, it will take one

939
00:43:47,440 --> 00:43:50,066
task and run that, and send the
other one to the master, saying,

940
00:43:50,066 --> 00:43:53,070
finish it and here's my ID.

941
00:43:53,070 --> 00:43:55,560
Send me the message
when it's done.

942
00:43:55,560 --> 00:43:56,510
And so you fork that
and wait.

943
00:43:56,510 --> 00:43:59,740
So you can assume you can
do something like that.

944
00:43:59,740 --> 00:44:05,210
So it's almost master/slave but
the coordination is there.

945
00:44:05,210 --> 00:44:11,215
The trouble with normal
fork/join is if you create too

946
00:44:11,215 --> 00:44:12,420
many threads.

947
00:44:12,420 --> 00:44:14,190
You are in like a thread hell
because there are too many

948
00:44:14,190 --> 00:44:15,640
things to run.

949
00:44:15,640 --> 00:44:17,760
I don't know -- can you context switch on the SPE?

950
00:44:17,760 --> 00:44:19,650
PROFESSOR: No.

951
00:44:19,650 --> 00:44:22,035
AUDIENCE: So you can't even do
that because of some physical

952
00:44:22,035 --> 00:44:22,300
limitation.

953
00:44:22,300 --> 00:44:28,410
You can't take up 1000
threads unless you run another

954
00:44:28,410 --> 00:44:34,180
master/slave thing yourself,
because that's 1000 threads

955
00:44:34,180 --> 00:44:36,900
on top of your SPEs.

956
00:44:36,900 --> 00:44:39,820
And that's going to
be locked threads.

957
00:44:39,820 --> 00:44:40,310
PROFESSOR: Yeah.

958
00:44:40,310 --> 00:44:43,420
Context switching on the
SPEs is very expensive.

959
00:44:43,420 --> 00:44:47,980
So on the PlayStation
3 you have six SPEs

960
00:44:47,980 --> 00:44:48,810
available to you.

961
00:44:48,810 --> 00:44:52,020
So if you have a lot more than
six threads that you've

962
00:44:52,020 --> 00:44:55,190
created, essentially each
one runs to completion.

963
00:44:55,190 --> 00:44:58,210
And then you swap that out and
you bring in -- well, that

964
00:44:58,210 --> 00:44:59,090
terminates.

965
00:44:59,090 --> 00:45:02,200
You deallocate it from the SPE
and you bring in a new thread.

966
00:45:02,200 --> 00:45:05,770
If you actually want to do more
thread-like dynamic load

967
00:45:05,770 --> 00:45:08,720
balancing on the SPEs, it's
not well suited for that.

968
00:45:08,720 --> 00:45:09,720
Just because the --

969
00:45:09,720 --> 00:45:13,466
AUDIENCE: The best model there
is master/slave. Because the

970
00:45:13,466 --> 00:45:14,990
PPE [UNINTELLIGIBLE PHRASE]

971
00:45:14,990 --> 00:45:15,370
the master part.

972
00:45:15,370 --> 00:45:18,080
It will run more sequential
code.

973
00:45:18,080 --> 00:45:22,670
And when there's parallel work
-- it will give it to you and

974
00:45:22,670 --> 00:45:25,820
use the work queue model
type and send stuff into the SPEs

975
00:45:25,820 --> 00:45:27,290
and feed that.

976
00:45:27,290 --> 00:45:30,750
So work queue type models
can be used there.

977
00:45:30,750 --> 00:45:31,140
PROFESSOR: Yeah.

978
00:45:31,140 --> 00:45:33,820
And the SPMD model might not
work really well because you

979
00:45:33,820 --> 00:45:37,030
have this heterogeneity in the
actual hardware, right.

980
00:45:37,030 --> 00:45:39,530
So if I'm taking the same
program running on the SPE

981
00:45:39,530 --> 00:45:43,360
versus the PPE, that code
might not be -- so I

982
00:45:43,360 --> 00:45:45,010
essentially have to specialize
the code.

983
00:45:45,010 --> 00:45:47,670
And that starts to deviate
away from the SPMD model.

984
00:45:47,670 --> 00:45:51,565
AUDIENCE: I think most
of the code you write for

985
00:45:51,565 --> 00:45:55,020
Cell will probably
be master/worker.

986
00:45:55,020 --> 00:45:56,995
And if you try to do something
other than that, you should think

987
00:45:56,995 --> 00:45:59,556
hard about why that's the case.

988
00:46:02,690 --> 00:46:04,140
PROFESSOR: You can
do fork/join but

989
00:46:04,140 --> 00:46:05,120
you know, it's --

990
00:46:05,120 --> 00:46:07,490
AUDIENCE: I mean you can't --

991
00:46:07,490 --> 00:46:09,410
because you don't have
virtualization.

992
00:46:09,410 --> 00:46:11,938
If you fork too much where are
you going to put those?

993
00:46:11,938 --> 00:46:12,404
PROFESSOR: Right.

994
00:46:12,404 --> 00:46:14,790
Sometimes you fork --

995
00:46:14,790 --> 00:46:15,685
AUDIENCE: Yeah, but in that
sense you -- should

996
00:46:15,685 --> 00:46:17,220
you fork too much?

997
00:46:17,220 --> 00:46:19,330
To keep work you want
the master.

998
00:46:19,330 --> 00:46:20,590
You can fork things --

999
00:46:20,590 --> 00:46:22,770
you can do virtual fork and send
the work to the master

1000
00:46:22,770 --> 00:46:24,070
and say, here, I forked
something.

1001
00:46:24,070 --> 00:46:26,350
Here's the work.

1002
00:46:26,350 --> 00:46:30,370
I mean, the key thing is
do the simplest thing.

1003
00:46:30,370 --> 00:46:33,320
I mean, you guys have
two weeks left.

1004
00:46:33,320 --> 00:46:36,590
And if you try doing anything
complicated, you might end up

1005
00:46:36,590 --> 00:46:38,730
with a big mess that's
undebuggable.

1006
00:46:38,730 --> 00:46:39,780
Just do simple things.

1007
00:46:39,780 --> 00:46:45,400
And I can vouch, parallelism
is hard.

1008
00:46:45,400 --> 00:46:47,600
Debugging parallel code
is even harder.

1009
00:46:47,600 --> 00:46:51,300
So if you're sort of trying to
push the limits on the

1010
00:46:51,300 --> 00:46:55,345
complexity of messages going
all over the place and the

1011
00:46:55,345 --> 00:46:57,615
three different types of
parallelism all trying to

1012
00:46:57,615 --> 00:46:59,040
compete in there.

1013
00:46:59,040 --> 00:47:00,935
Just do the simple thing.

1014
00:47:00,935 --> 00:47:01,882
Just get the simple
thing working.

1015
00:47:01,882 --> 00:47:05,005
First get the sequential code
working and keep adding more

1016
00:47:05,005 --> 00:47:05,650
and more on top.

1017
00:47:05,650 --> 00:47:08,430
And then make sure that
at each level it works.

1018
00:47:08,430 --> 00:47:10,350
The problem with parallelism
is that, because things are

1019
00:47:10,350 --> 00:47:12,020
nondeterministic, some bugs
might only show up sometimes.

1020
00:47:12,020 --> 00:47:13,950
Data might be hard.

1021
00:47:13,950 --> 00:47:17,770
But design absolutely matters.

1022
00:47:17,770 --> 00:47:21,000
Another thing I think,
especially for doing demos and

1023
00:47:21,000 --> 00:47:24,570
stuff, would be nice, would be to
have a knob that basically

1024
00:47:24,570 --> 00:47:25,510
you can tune.

1025
00:47:25,510 --> 00:47:26,180
So you can say, OK, no SPEs.

1026
00:47:26,180 --> 00:47:29,624
Everything running on the
PPE, one SPE, two SPEs.

1027
00:47:29,624 --> 00:47:32,410
So you can actually see
hopefully in your code how

1028
00:47:32,410 --> 00:47:36,693
things move for the demo part.

1029
00:47:43,200 --> 00:47:43,770
PROFESSOR: You had a question?

1030
00:47:43,770 --> 00:47:46,550
All right.

1031
00:47:46,550 --> 00:47:48,130
We'll take a brief
break and do the

1032
00:47:48,130 --> 00:47:49,380
quizzes in the meantime.