1
00:00:04,500 --> 00:00:06,570
In this recitation
we will apply some

2
00:00:06,570 --> 00:00:09,210
of the ideas from Moneyball
to data from the National

3
00:00:09,210 --> 00:00:14,180
Basketball Association--
that is, the NBA.

4
00:00:14,180 --> 00:00:16,710
So the first thing we'll
do is read in the data

5
00:00:16,710 --> 00:00:18,930
and learn about it.

6
00:00:18,930 --> 00:00:23,360
The data we have is located
in the file NBA_train

7
00:00:23,360 --> 00:00:27,780
and contains data from all
teams in season since 1980,

8
00:00:27,780 --> 00:00:31,280
except for ones with
less than 82 games.

9
00:00:31,280 --> 00:00:35,590
So I'll read this in
to the variable NBA,

10
00:00:35,590 --> 00:00:36,930
NBA = read.csv("NBA_train.csv").

11
00:00:48,710 --> 00:00:49,210
OK.

12
00:00:49,210 --> 00:00:50,550
So we've read it in.

13
00:00:50,550 --> 00:00:52,340
And let's explore it
a little bit using

14
00:00:52,340 --> 00:00:54,460
the str command, str(NBA).

15
00:00:58,950 --> 00:00:59,740
All right.

16
00:00:59,740 --> 00:01:01,290
So this is our data frame.

17
00:01:01,290 --> 00:01:06,110
We have 835 observations
of 20 variables.

18
00:01:06,110 --> 00:01:09,750
Let's take a look at what
some of these variables are.

19
00:01:09,750 --> 00:01:13,020
SeasonEnd is the year
the season ended.

20
00:01:13,020 --> 00:01:15,350
Team is the name of the team.

21
00:01:15,350 --> 00:01:18,200
And playoffs is a binary
variable for whether or not

22
00:01:18,200 --> 00:01:20,870
a team made it to the
playoffs that year.

23
00:01:20,870 --> 00:01:26,430
If they made it to the playoffs
it's a 1, if not it's a 0.

24
00:01:26,430 --> 00:01:30,620
W stands for the number
of regular season wins.

25
00:01:30,620 --> 00:01:35,680
PTS stands for points scored
during the regular season.

26
00:01:35,680 --> 00:01:38,610
oppPTS stands for
opponent points

27
00:01:38,610 --> 00:01:41,990
scored during the
regular season.

28
00:01:41,990 --> 00:01:46,420
And then we've got quite
a few variables that

29
00:01:46,420 --> 00:01:49,170
have the variable name
and then the same variable

30
00:01:49,170 --> 00:01:51,580
with an 'A' afterwards.

31
00:01:51,580 --> 00:02:00,140
So we've got FG and FGA, X2P,
X2PA, X3P, X3PA, FT, and FTA.

32
00:02:00,140 --> 00:02:02,780
So what this notation
is, is it means

33
00:02:02,780 --> 00:02:05,720
if there is an 'A' it means
the number that were attempted.

34
00:02:05,720 --> 00:02:08,090
And if not it means the
number that we're successful.

35
00:02:08,090 --> 00:02:12,860
So for example FG is the number
of successful field goals,

36
00:02:12,860 --> 00:02:15,220
including two and
three pointers.

37
00:02:15,220 --> 00:02:18,240
Whereas FGA is the number
of field goal attempts.

38
00:02:18,240 --> 00:02:22,579
So this also contains the number
of unsuccessful field goals.

39
00:02:22,579 --> 00:02:27,829
So FGA will always be a
bigger number than FG.

40
00:02:27,829 --> 00:02:31,340
The next pair is
for two pointers.

41
00:02:31,340 --> 00:02:34,130
The number of successful
two pointers and the number

42
00:02:34,130 --> 00:02:35,610
attempted.

43
00:02:35,610 --> 00:02:40,980
The pair after that, right down
here is for three pointers,

44
00:02:40,980 --> 00:02:43,640
the number successful
and the number attempted.

45
00:02:43,640 --> 00:02:46,370
And the next pair
is for free throws,

46
00:02:46,370 --> 00:02:50,380
the number successful
and the number attempted.

47
00:02:50,380 --> 00:02:53,900
Now you'll notice, actually,
that the two pointer and three

48
00:02:53,900 --> 00:02:58,430
pointer variables have
an 'X' in front of them.

49
00:02:58,430 --> 00:03:02,590
Well, this isn't because we had
an 'X' in the original data.

50
00:03:02,590 --> 00:03:04,620
In fact, if you were
to open up the csv

51
00:03:04,620 --> 00:03:09,920
file of the original data, it
would just say, 2P and 2PA,

52
00:03:09,920 --> 00:03:14,510
and, 3P and 3PA, without
the 'X' in front.

53
00:03:14,510 --> 00:03:16,180
The reason there's
an 'X' in front of it

54
00:03:16,180 --> 00:03:18,360
is because when
we load it into R,

55
00:03:18,360 --> 00:03:23,110
R doesn't like it when a
variable begins with a number.

56
00:03:23,110 --> 00:03:25,870
So if a variable
begins with a number

57
00:03:25,870 --> 00:03:28,790
it will put an 'X'
in front of it.

58
00:03:28,790 --> 00:03:29,650
This is fine.

59
00:03:29,650 --> 00:03:31,030
It's just something
we need to be

60
00:03:31,030 --> 00:03:36,370
mindful of when we're
dealing with variables in R.

61
00:03:36,370 --> 00:03:38,600
So moving on to the
rest of our variables.

62
00:03:38,600 --> 00:03:40,970
We've got ORB and DRB.

63
00:03:40,970 --> 00:03:44,850
These are offensive
and defensive rebounds.

64
00:03:44,850 --> 00:03:48,120
AST stands for assists.

65
00:03:48,120 --> 00:03:51,130
STL for steals.

66
00:03:51,130 --> 00:03:54,020
BLK stands for blocks.

67
00:03:54,020 --> 00:03:55,900
And TOV stands for turnovers.

68
00:03:58,820 --> 00:04:02,410
Don't worry if you're
not a basketball expert

69
00:04:02,410 --> 00:04:05,090
and don't understand exactly
the difference between each

70
00:04:05,090 --> 00:04:06,520
of these variables.

71
00:04:06,520 --> 00:04:08,150
But we just wanted
to familiarize you

72
00:04:08,150 --> 00:04:12,760
with some common basketball
statistics that are recorded.

73
00:04:12,760 --> 00:04:14,250
And explain the
labeling notation

74
00:04:14,250 --> 00:04:17,060
that we use in our data.

75
00:04:17,060 --> 00:04:20,769
We'll go on to use these
variables in the next video.