Lab 1: let’s get to know iris
Premise
This is an R Markdown Notebook. Try executing a chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Goals
- get iris data
- know basic info about it
- plot iris and play with it
Possible “solution”
Get data
iris is so common that it comes preloaded with R: it is part of the datasets package, which is attached by default. Hence you can “access” it by just writing iris.
iris
In a notebook, you see the result nicely formatted. Try executing it in the console to see the “raw” result.
In a more common case, you load the data from a file or directly from
the network. In the former case, you usually use read.table
and its variants (read.csv, read.csv2).
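As a minimal sketch of the file case, the chunk below writes iris to a temporary CSV file and reads it back with read.csv. The temporary file and the names tmp and d_csv are only illustrative; with real data you would pass your own path or URL. stringsAsFactors = TRUE is set here so that Species comes back as a factor, as in the built-in dataset.

```r
# Write iris to a temporary CSV file (illustrative path), then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

# read.csv expects ',' as separator and '.' as decimal mark;
# read.csv2 is the variant for ';' as separator and ',' as decimal mark
d_csv <- read.csv(tmp, stringsAsFactors = TRUE)
dim(d_csv)
```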
Basic exploration of the data
Let’s use another variable for working on iris:
d stands for data.
d = iris
Note that d is an identifier of a variable whose value is a “dataframe”, i.e., a table in which each column holds values of a single type (different columns may have different types).
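One way to check the type of each column is sketched below, using base R functions only. It starts from a fresh copy of iris so that it can be run on its own.

```r
d <- iris

# One class per column: four numeric columns and one factor
sapply(d, class)

# Compact overview of the structure: size, column types, first values
str(d)
```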
See its size, in three different ways.
dim(d)
[1] 150 5
nrow(d)
[1] 150
ncol(d)
[1] 5
Ok, we got it: it’s a \(150 \times 5\) table.
For an overview of the content, use summary.
summary(d)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
For each numerical variable (column) of the dataframe, basic descriptive statistics are shown; for each categorical variable, the number of occurrences of each value that actually appears is shown.
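The entries of the summary can also be reproduced one column at a time, which may help clarify what each of them means. A small sketch with base R functions, starting from a fresh copy of iris:

```r
d <- iris

# Descriptive statistics for one numerical column
mean(d$Sepal.Length)
quantile(d$Sepal.Length)  # minimum, quartiles, and maximum

# Number of occurrences of each value of the categorical column
table(d$Species)
```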
To show just the names of the variables, use names.
names(d)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
names can also be used for changing the names:
names(d)[2] = "sw"
summary(d)
  Sepal.Length         sw         Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
(Now let’s restore the original name).
d = iris
names(d)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
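Renaming by position, as done above with index 2, silently renames the wrong column if the column order ever changes. A possible alternative is to select the column by its current name. The sketch below works on a separate copy (d2, a name introduced here only for illustration), so d is left untouched.

```r
d2 <- iris

# Pick the column to rename by its current name instead of its index
names(d2)[names(d2) == "Sepal.Width"] <- "sw"
names(d2)
```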
Basic graphical exploration
plot is an R function that can be used to plot almost
everything. It is an example of polymorphism: its behavior changes
depending on what it is invoked with.
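In R this kind of polymorphism is commonly obtained with S3 generic functions, which choose a method based on the class of their first argument; plot is one such generic. The sketch below defines a toy generic to show the mechanism. The name describe and its methods are hypothetical and exist only for this illustration.

```r
# A toy generic: UseMethod() dispatches on the class of the first argument
describe <- function(x, ...) UseMethod("describe")

describe.numeric <- function(x, ...) {
  paste("numeric vector of length", length(x))
}
describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(c(1, 2, 3))  # uses describe.numeric
describe(iris)        # uses describe.data.frame
describe("hello")     # falls back to describe.default

# The methods currently available for plot() can be listed in the same spirit
methods(plot)
```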
Examples of plot usages
plot(c(1,2,3,4,3,2,1))
x = seq(0, 10, 0.01)
plot(x, sin(x))
Can it be used with our d too?
plot(d)
Yes, and it produces a scatterplot matrix, with one panel for each pair of variables. This is really useful for our current purpose (exploring the data).
Yet, serious plotting is usually done with different tools. The most common one is ggplot2.
Plotting with ggplot
ggplot2 (the package that provides the ggplot function) is a member of a larger family of R packages: tidyverse. Install it with install.packages("tidyverse") (not done in this notebook, as it may be tricky; note that you do not really need the full tidyverse package for the things we’ll do) and load it.
require("tidyverse")
Loading required package: tidyverse
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy
  print.tbl_sql
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
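If installing the whole tidyverse is a problem, a lighter alternative is to load only the packages that the rest of this lab actually uses. This is a sketch that assumes ggplot2, dplyr, and tidyr are already installed.

```r
# Load only the packages used in the rest of this lab
library(ggplot2)  # ggplot(), aes(), geom_point(), geom_boxplot(), facet_grid()
library(dplyr)    # filter() and the %>% pipe
library(tidyr)    # pivot_longer()
```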
Trivial plot with ggplot
Key point: before doing a plot,
- first, decide what is the question the plot is going to (attempt to) answer
- then, design the plot accordingly
Question: are the sepal length and sepal width somehow related for the species Setosa?
Design: the two variables on the axes of a scatterplot, one point for each Setosa sample in d.
d %>% filter(Species=="setosa") %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()
New question: does this relationship differ among the three species?
d %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()
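The visual impression from the scatterplots can be complemented with a number. The sketch below uses the Pearson correlation coefficient, which is a choice made here for illustration and captures only linear association. It is computed once per species and once on all samples together.

```r
d <- iris

# Pearson correlation between sepal length and sepal width, per species
by_species <- sapply(
  split(d, d$Species),
  function(s) cor(s$Sepal.Length, s$Sepal.Width)
)
by_species

# The same correlation computed on all samples, ignoring the species
overall <- cor(d$Sepal.Length, d$Sepal.Width)
overall
```

Comparing the per-species values with the pooled one is a quick way to see why coloring the points by species matters.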
Less trivial
Question: Are the distributions of the variables different among species? Design: one panel for each variable, each with three boxplots (one per species)
d %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=Species, y=value)) + geom_boxplot() + facet_grid(.~name, scales="free")
Maybe even better with violin plots, which also show the shape of each distribution.
d %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=Species, y=value)) + geom_violin() + facet_grid(.~name, scales="free")
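Both plots above rely on pivot_longer to reshape the table from “wide” to “long” format, so that the variable name becomes a column that facet_grid can use. The sketch below shows only the reshaping step; it assumes that the tidyr package is installed, and long is a name introduced here for illustration.

```r
library(tidyr)

# Every column except Species is stacked into a name/value pair
long <- pivot_longer(iris, cols = !Species)

# One row per (sample, variable) pair: 150 samples times 4 variables
dim(long)
names(long)  # "name" holds the former column names, "value" the measurements
head(long, 4)
```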
Take-home
- Versicolor and Virginica appear harder to tell apart than Setosa.
- Petal-related attributes appear more useful (for inferring the species) than sepal-related ones.
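These points were read off the plots. As a rough numerical cross-check, one option is to compare the per-species medians of each variable, i.e., the center lines of the boxplots. This is a sketch with base R; med is a name introduced here for illustration.

```r
# Median of each numerical variable within each species
med <- aggregate(. ~ Species, data = iris, FUN = median)
med
```

Variables whose medians are far apart across species, relative to their spread in the boxplots, are the ones likely to help most in telling the species apart.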