Lab 1: getting to know iris
Premise
This is an R Markdown Notebook. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Goals
- get iris data
- know basic info about it
- plot iris and play with it
Possible “solution”
Get data
iris is so common that it is a global variable preloaded in the R environment. Hence you can “access” it by just writing iris.
iris
In a notebook, you see the result nicely formatted. Try executing it in the console to see the “raw” result.
In a more common case, you load the data from a file or directly from the network. In the former case, you usually use read.table and its variants (read.csv, read.csv2).
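As a minimal sketch of the file case, the chunk below writes iris to a temporary CSV file and reads it back. The temporary file and the names csv_path and d_file are assumptions made only to keep the example self-contained; with real data you would pass the path of your own file.

```r
# Hypothetical file: dump iris to a temporary CSV so that there is something to load
csv_path = tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# read.csv expects "," as field separator and "." as decimal mark;
# read.csv2 expects ";" and "," instead
d_file = read.csv(csv_path, stringsAsFactors = TRUE)
dim(d_file)
```

read.csv also accepts a URL in place of a local path, which covers the network case.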
Basic exploration of the data
Let’s use another variable for working on iris: d stands for data.
d = iris
Note that d is an identifier of a variable whose value is a “dataframe”, i.e., a table in which each column holds values of a single type (different columns may have different types).
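One possible way to check the type of each column is sketched below; the name col_types is an assumption introduced just for this example.

```r
d = iris

# Compact overview of the structure: one line per column
str(d)

# Class of each column: four numeric columns and one factor
col_types = sapply(d, class)
col_types
```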
See its size, in three different ways.
dim(d)
[1] 150 5
nrow(d)
[1] 150
ncol(d)
[1] 5
Ok, we got it: it’s a \(150 \times 5\) table.
For an overview of the content, use summary.
summary(d)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
For each of the numerical variables (columns) of the dataframe, basic descriptive statistics are shown; for categorical variables, the number of occurrences of each category (level) is shown.
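The same numbers can be obtained one piece at a time. The chunk below is a small sketch for a single numerical column and for the categorical one; the names sl_mean and species_counts are assumptions introduced here.

```r
d = iris

# Numerical column: mean and quartiles, as reported by summary
sl_mean = mean(d$Sepal.Length)
sl_mean
quantile(d$Sepal.Length)   # min, 1st quartile, median, 3rd quartile, max

# Categorical column: number of rows for each level
species_counts = table(d$Species)
species_counts
```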
For just showing the names of the variables, use names.
names(d)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
names can also be used for changing the names:
names(d)[2] = "sw"
summary(d)
  Sepal.Length         sw         Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
(Now let’s restore the original name).
d = iris
names(d)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Basic graphical exploration
plot is an R function that can be used to plot almost everything. It is an example of polymorphism: its behavior changes depending on what it is invoked with.
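In R this kind of polymorphism is commonly obtained with generic functions that dispatch on the class of their argument. The chunk below is a toy sketch of the mechanism: describe is a hypothetical function invented for this example (it is not part of R), defined only to show how one name can behave differently for a numeric vector, a dataframe, and anything else.

```r
# Hypothetical generic: picks a method based on the class of x
describe = function(x, ...) UseMethod("describe")

# Method used for numeric vectors
describe.numeric = function(x, ...) {
  paste("numeric vector of length", length(x))
}

# Method used for dataframes
describe.data.frame = function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}

# Fallback for every other class
describe.default = function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(c(1, 2, 3))
describe(iris)
describe("hello")
```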
Examples of plot usage:
plot(c(1,2,3,4,3,2,1))
x = seq(0, 10, 0.01)
plot(x, sin(x))
Can it be used with our d too?
plot(d)
Yes, and it produces something useful: a matrix of scatterplots, one for each pair of variables. That is really useful for our current purpose (exploring the data).
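For a dataframe with several columns, plot draws the same kind of scatterplot matrix that pairs draws. A possible variation, sketched below, calls pairs directly on the four numerical columns and colors the points by species; the name species_index is an assumption introduced for the example.

```r
d = iris

# Map each species to an integer (1, 2, 3), used as a color index
species_index = as.integer(d$Species)

# Scatterplot matrix of the numerical columns only, colored by species
pairs(d[, 1:4], col = species_index)
```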
Yet, serious plotting is usually done with different tools. The most common one is ggplot.
Plotting with ggplot
ggplot (the actual package name is ggplot2) is a package that is itself a member of a larger family of R packages: tidyverse. Install it with install.packages("tidyverse") (not done in this notebook, since it may be tricky; note that you do not really need the full tidyverse for doing the things we’ll do) and load it.
require("tidyverse")
Loading required package: tidyverse
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy
  print.tbl_sql
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
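As noted above, the full tidyverse is not strictly needed here. A lighter alternative, sketched under the assumption that the three individual packages are installed, is to load only what the rest of this lab uses.

```r
# Load only the packages used in the plots of this lab
library(ggplot2)  # ggplot, aes, geom_point, geom_boxplot, geom_violin, facet_grid
library(dplyr)    # filter and the %>% pipe
library(tidyr)    # pivot_longer
```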
Trivial plot with ggplot
Key point: before doing a plot,
- first, decide what is the question the plot is going to (attempt to) answer
- then, design the plot accordingly
Question: are the sepal length and sepal width somehow related for the species Setosa?
Design: the two variables on the axes of a scatterplot, one point for each Setosa sample in d.
d %>% filter(Species=="setosa") %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()
New question: is this dependency different among the three species?
d %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()
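A rough numerical companion to the two scatterplots is the correlation between sepal length and sepal width, computed separately for each species. This is only a sketch in base R: Pearson correlation captures linear dependency only, and the names by_species and r_by_species are assumptions introduced here.

```r
d = iris

# One dataframe per species
by_species = split(d, d$Species)

# Pearson correlation between sepal length and sepal width, for each species
r_by_species = sapply(by_species, function(s) cor(s$Sepal.Length, s$Sepal.Width))
r_by_species
```

Values closer to 1 suggest a stronger (increasing) linear relation; comparing the three values gives a first hint about whether the dependency differs among species.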
Less trivial
Question: are the distributions of the variables different among species?
Design: one panel per variable, each with three boxplots (one per species).
d %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=Species, y=value)) + geom_boxplot() + facet_grid(.~name, scales="free")
Maybe even better with violin-plots.
d %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=Species, y=value)) + geom_violin() + facet_grid(.~name, scales="free")
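Both plots rely on pivot_longer to reshape d before plotting. The sketch below, which assumes the tidyr package is installed, shows the reshaped table on its own: every row of d becomes four rows, one for each numerical variable, with the variable name in the column name and the measurement in the column value. The name long is an assumption introduced for the example.

```r
library(tidyr)

d = iris

# From wide (one column per variable) to long (one row per measurement)
long = pivot_longer(d, cols = !Species)

dim(long)
head(long)
```

With the data in this shape, facet_grid can use name to make one panel per variable.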
Take-home
- Versicolor and Virginica appear harder to tell apart than Setosa.
- Petal-related attributes appear more useful (for inferring the species) than sepal-related ones.
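One rough way to put numbers next to these observations is to compare the mean of each variable across species. This is only a sketch: means say nothing about how much the distributions overlap, which is what the boxplots and violin plots show. The name species_means is an assumption introduced here.

```r
d = iris

# Mean of every numerical variable, for each species
species_means = aggregate(. ~ Species, data = d, FUN = mean)
species_means
```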