1
00:00:04,730 --> 00:00:08,680
To download the data that we'll
be working with in this video,

2
00:00:08,680 --> 00:00:12,900
click on the hyperlink given
in the text above this video.

3
00:00:12,900 --> 00:00:14,070
Don't use Internet Explorer.

4
00:00:14,070 --> 00:00:20,050
Chrome, Safari, or Firefox
should all work fine.

5
00:00:20,050 --> 00:00:21,700
After you click
on the hyperlink,

6
00:00:21,700 --> 00:00:25,020
it will take you to a
page that looks like this.

7
00:00:25,020 --> 00:00:28,280
Go ahead and copy all
the text on this page

8
00:00:28,280 --> 00:00:37,090
by first selecting all of it and
then hitting Control C on a PC

9
00:00:37,090 --> 00:00:39,680
or Command C on a Mac.

10
00:00:39,680 --> 00:00:42,740
Then go to a simple
text editor, like

11
00:00:42,740 --> 00:00:46,600
Notepad on a PC or
TextEdit on a Mac,

12
00:00:46,600 --> 00:00:50,480
and paste what you just
copied into the text editor

13
00:00:50,480 --> 00:00:55,060
with Control V on a PC
or Command V on a Mac.

14
00:00:55,060 --> 00:01:03,820
Then go ahead and save this
file with the name movielens.txt,

15
00:01:03,820 --> 00:01:05,220
for text.

16
00:01:05,220 --> 00:01:10,620
Save this somewhere that you can
easily navigate to in R. Now,

17
00:01:10,620 --> 00:01:13,570
let's switch to R
and load our data.

18
00:01:13,570 --> 00:01:17,370
First, in your R console,
navigate to the directory

19
00:01:17,370 --> 00:01:19,170
where you just saved that file.

20
00:01:28,660 --> 00:01:31,110
And click OK.

21
00:01:31,110 --> 00:01:34,750
Now, to load our data, we'll
be using a slightly different

22
00:01:34,750 --> 00:01:36,520
command this time.

23
00:01:36,520 --> 00:01:38,550
Our data is not a CSV file.

24
00:01:38,550 --> 00:01:40,840
It's a text file,
where the entries

25
00:01:40,840 --> 00:01:43,560
are separated by a vertical bar.

26
00:01:43,560 --> 00:01:46,750
So we'll call our
data set movies,

27
00:01:46,750 --> 00:01:50,509
and then we'll use the
read.table function, where

28
00:01:50,509 --> 00:01:55,620
the first argument is the name
of our data file in quotes.

29
00:01:55,620 --> 00:02:00,000
The second argument
is header=FALSE.

30
00:02:00,000 --> 00:02:02,170
This is because our
data doesn't have

31
00:02:02,170 --> 00:02:05,440
a header or a variable name row.

32
00:02:05,440 --> 00:02:13,150
And then the next
argument is sep="|",

33
00:02:13,150 --> 00:02:17,870
which can be found above the
Enter key on your keyboard.

34
00:02:17,870 --> 00:02:20,920
We need one more argument,
which is quote="\"".

35
00:02:28,790 --> 00:02:31,650
Close the parentheses,
and hit Enter.

36
00:02:31,650 --> 00:02:34,400
That last argument just
makes sure that our text

37
00:02:34,400 --> 00:02:37,270
was read in properly.
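As a rough sketch of the command just described, the example below reads a small vertical-bar-separated file. The two sample rows, the temporary file, and the name movies_demo are made up for illustration and are not the real MovieLens data. Setting quote="\"" tells R to treat only double quotes as quoting characters, so an apostrophe in a title is read as ordinary text.

```r
# Two made-up rows in a vertical-bar layout (NOT the real MovieLens file).
sample_lines <- c(
  "1|Toy Story (1995)|01-Jan-1995|0|1",
  "2|Schindler's List (1993)|01-Jan-1993|1|0"
)

# Write them to a temporary text file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# header=FALSE: there is no row of variable names.
# sep="|":      entries are separated by a vertical bar.
# quote="\"":   only the double quote is a quoting character,
#               so the apostrophe in row 2 is kept as plain text.
movies_demo <- read.table(path, header = FALSE, sep = "|", quote = "\"")

str(movies_demo)
```

With the real file, the first argument would be "movielens.txt" in place of the temporary path.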

38
00:02:37,270 --> 00:02:39,340
Let's take a look at the
structure of our data

39
00:02:39,340 --> 00:02:40,900
using the str function.

40
00:02:46,400 --> 00:02:53,210
We have 1,682 observations
of 24 different variables.

41
00:02:53,210 --> 00:02:57,530
Since our variables didn't have
names (header=FALSE),

42
00:02:57,530 --> 00:03:03,090
R just labeled them with
V1, V2, V3, et cetera.

43
00:03:03,090 --> 00:03:05,640
But from the
MovieLens documentation,

44
00:03:05,640 --> 00:03:07,810
we know what these
variables are.

45
00:03:07,810 --> 00:03:11,830
So we'll go ahead and add in
the column names ourselves.

46
00:03:11,830 --> 00:03:18,270
To do this, start by typing
colnames, for column names,

47
00:03:18,270 --> 00:03:20,320
and then in parentheses,
the name of our data

48
00:03:20,320 --> 00:03:24,010
set, movies, and then
equals, and we'll

49
00:03:24,010 --> 00:03:26,410
use the c function,
where we're going

50
00:03:26,410 --> 00:03:28,520
to list all of the
variable names,

51
00:03:28,520 --> 00:03:32,500
each of them in double quotes
and separated by commas.

52
00:03:32,500 --> 00:03:38,840
So first, we have "ID", the
ID of the movie, then "Title",

53
00:03:38,840 --> 00:03:50,590
"ReleaseDate",
"VideoReleaseDate", "IMDB",

54
00:03:50,590 --> 00:03:54,090
"Unknown"-- this is
the unknown genre--

55
00:03:54,090 --> 00:04:02,030
and then our 18 other genres--
"Action", "Adventure",

56
00:04:02,030 --> 00:04:14,620
"Animation", "Children's",
"Comedy", "Crime",

57
00:04:14,620 --> 00:04:27,620
"Documentary", "Drama",
"Fantasy", "FilmNoir",

58
00:04:27,620 --> 00:04:45,730
"Horror", "Musical", "Mystery",
"Romance", "SciFi", "Thriller",

59
00:04:45,730 --> 00:04:47,630
"War", and "Western".

60
00:04:50,320 --> 00:04:50,820
Go

61
00:04:50,820 --> 00:04:54,690
ahead and close the
parentheses, and hit Enter.
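Here is a minimal sketch of the full colnames command, run on a stand-in data frame with one made-up row and 24 unnamed columns. The name movies_demo is hypothetical, and writing the children's genre as "Childrens" (no apostrophe) is an assumption made here so that the name can be used after a $ sign.

```r
# Stand-in for the data set: one made-up row, 24 columns labeled V1..V24.
movies_demo <- as.data.frame(matrix(0, nrow = 1, ncol = 24))

# Assign the 24 variable names, each in double quotes, separated by commas.
# "Childrens" drops the apostrophe (an assumption, see the note above).
colnames(movies_demo) = c(
  "ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown",
  "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime",
  "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical",
  "Mystery", "Romance", "SciFi", "Thriller", "War", "Western"
)

# The columns now carry the names given above instead of V1, V2, ...
str(movies_demo)
```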

62
00:04:54,690 --> 00:04:56,450
Let's see what our
data looks like now

63
00:04:56,450 --> 00:05:00,780
using the str function again.

64
00:05:00,780 --> 00:05:03,980
We can see that we have the
same number of observations

65
00:05:03,980 --> 00:05:06,780
and the same number of
variables, but each of them

66
00:05:06,780 --> 00:05:10,660
now has the name
that we just gave.

67
00:05:10,660 --> 00:05:14,900
We won't be using the ID,
release date, video release

68
00:05:14,900 --> 00:05:17,600
date, or IMDB variables.

69
00:05:17,600 --> 00:05:19,850
So let's go ahead
and remove them.

70
00:05:19,850 --> 00:05:24,780
To do this, we type the name
of our data set-- movies$--

71
00:05:24,780 --> 00:05:27,630
the name of the variable
we want to remove,

72
00:05:27,630 --> 00:05:31,460
and then just say =NULL,
in capital letters.

73
00:05:31,460 --> 00:05:35,400
This would just remove the
variable from our data set.

74
00:05:35,400 --> 00:05:43,260
Let's repeat this with
release date, video release

75
00:05:43,260 --> 00:05:49,080
date, and IMDB.

76
00:05:55,280 --> 00:05:58,520
And there are a few duplicate
entries in our data set,

77
00:05:58,520 --> 00:06:02,100
so we'll go ahead and remove
them with the unique function.

78
00:06:02,100 --> 00:06:03,480
So just type the
name of our data

79
00:06:03,480 --> 00:06:04,680
set, movies = unique(movies).
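The two clean-up steps can be sketched on a tiny made-up data frame. The rows and the name movies_demo are hypothetical and only a few of the columns are shown. Notice that the second and third rows only become exact duplicates once the ID column is gone.

```r
# Made-up example: rows 2 and 3 differ only in their ID.
movies_demo <- data.frame(
  ID = c(1, 2, 3),
  Title = c("Movie A", "Movie B", "Movie B"),
  ReleaseDate = c("01-Jan-1995", "01-Jan-1996", "01-Jan-1996"),
  Action = c(1, 0, 0),
  Comedy = c(0, 1, 1)
)

# Assigning NULL to a variable removes it from the data frame.
movies_demo$ID = NULL
movies_demo$ReleaseDate = NULL

# unique keeps one copy of each distinct row.
movies_demo = unique(movies_demo)

# Two rows and three variables remain.
str(movies_demo)
```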

80
00:06:10,530 --> 00:06:12,660
Let's take a look at
our data one more time.

81
00:06:16,300 --> 00:06:21,910
Now, we have 1,664 observations,
a few fewer than before,

82
00:06:21,910 --> 00:06:26,600
and 20 variables-- the title
of the movie, the unknown genre

83
00:06:26,600 --> 00:06:32,000
label, and then the
18 other genre labels.

84
00:06:32,000 --> 00:06:34,880
In this video, we've
seen one example

85
00:06:34,880 --> 00:06:37,830
of how to prepare data
taken from the internet

86
00:06:37,830 --> 00:06:39,480
to work with it in R.

87
00:06:39,480 --> 00:06:42,130
In the next video,
we'll use this data

88
00:06:42,130 --> 00:06:46,060
set to cluster our movies
using hierarchical clustering.