Premise

This is an R Markdown Notebook. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

Goals

  1. get iris data
  2. know basic info about it
  3. plot iris and play with it

Possible “solution”

Get data

iris is so common that it ships with R: it belongs to the datasets package, which is attached by default. Hence you can “access” it by just writing iris.

iris

In a notebook, you see the result nicely formatted. Try executing it in the console to see the “raw” result.

In a more common case, you load the data from a file or directly from the network. In the former case, you usually use read.table and its variants (read.csv, read.csv2).
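As a minimal sketch of the file case, the chunk below writes iris to a temporary CSV file and reads it back with read.csv. The temporary file and the names tmp and d_file are only for illustration; with real data you would pass the path of your own file.

```r
# Hypothetical round trip: write iris to a temporary CSV file, then load it
tmp = tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

# stringsAsFactors = TRUE turns text columns (here Species) into factors
d_file = read.csv(tmp, stringsAsFactors = TRUE)
dim(d_file)

# Remove the temporary file
unlink(tmp)
```

read.csv2 works the same way, but expects semicolon-separated values with a comma as the decimal mark.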

Basic exploration of the data

Let’s use another variable for working on iris: d stands for data.

d = iris

Note that d is the identifier of a variable whose value is a “dataframe”, i.e., a table in which each column holds values of a single type.
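To check the “one type per column” property, one option is to ask for the class of each column. This sketch uses sapply and str, two base R functions that are not used elsewhere in this notebook; col_classes is just an illustrative name.

```r
d = iris

# Class of each column: four numeric columns and one factor
col_classes = sapply(d, class)
col_classes

# Compact overview of the structure: type and first values of each column
str(d)
```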

See its size, in three different ways.

dim(d)
[1] 150   5
nrow(d)
[1] 150
ncol(d)
[1] 5

Ok, we got it: it’s a \(150 \times 5\) table.

For an overview of the content, use summary.

summary(d)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

For each numerical variable (column) of the dataframe, basic descriptive statistics are shown; for each categorical variable, the number of occurrences of each value (each level of the factor) is shown.

For just showing the names of the variables, use names.

names(d)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

names can also be used for changing the names:

names(d)[2] = "sw"
summary(d)
  Sepal.Length         sw         Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

Now let’s restore the original names by copying iris into d again.

d = iris
names(d)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

Basic graphical exploration

plot is an R function that can be used to plot almost everything. It is an example of polymorphism: its behavior changes depending on what it is invoked with.
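A rough way to see this polymorphism at work: plot is a generic function, and R picks a method based on the class of the first argument. The sketch below uses getS3method (from the utils package) only to look up which methods exist; it does not draw anything, and m_df and m_default are illustrative names.

```r
# plot() dispatches on the class of its first argument
class(iris)          # "data.frame"
class(c(1, 2, 3))    # "numeric"

# Method used for data frames
m_df = getS3method("plot", "data.frame")

# Fallback method, used when no class-specific method exists
# (e.g., for plain numeric vectors)
m_default = getS3method("plot", "default")

is.function(m_df) && is.function(m_default)
```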

Examples of plot usages

plot(c(1,2,3,4,3,2,1))

x = seq(0, 10, 0.01)
plot(x, sin(x))

Can it be used with our d too?

plot(d)

Yes, and it produces something useful: a matrix of scatterplots, one for each pair of variables. That is really useful for our current purpose (exploring the data).

Yet, serious plotting is usually done with different tools. The most common one is ggplot2.

Plotting with ggplot

ggplot2 is a package that belongs to a larger family of R packages: tidyverse. Install it with install.packages("tidyverse") (not done in this notebook, since it may be tricky; note that you do not really need the full tidyverse for what we’ll do) and load it.

require("tidyverse")
Loading required package: tidyverse
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Trivial plot with ggplot

Key point: before doing a plot,

  • first, decide what is the question the plot is going to (attempt to) answer
  • then, design the plot accordingly

Question: are the sepal length and sepal width somehow related for the species Setosa?

Design: the two variables on the axes of a scatterplot, one point for each Setosa sample in d.

d %>% filter(Species=="setosa") %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()

New question: is this relationship different among the three species?

d %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()

Less trivial

Question: are the distributions of the variables different among species?

Design: one boxplot per species, in a separate panel for each variable.

d %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=Species, y=value)) + geom_boxplot() + facet_grid(.~name, scales="free")

Maybe even better with violin plots.

d %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=Species, y=value)) + geom_violin() + facet_grid(.~name, scales="free")
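Both chunks above first reshape d with pivot_longer. As a sketch of what that step does on its own (the name long is just for illustration, and only the tidyr package is loaded here): the four numerical columns are stacked into a pair of columns, name and value, so that facet_grid can make one panel per original variable.

```r
library(tidyr)

# Stack the four numerical columns: one row per (flower, variable) pair
long = pivot_longer(iris, cols = !Species)

names(long)   # "Species" "name" "value"
dim(long)     # 600 rows (150 flowers x 4 variables), 3 columns
head(long, 4)
```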

Take-home

  • Versicolor and Virginica appear harder to tell apart than Setosa.
  • Petal-related attributes appear more useful (for inferring the species) than sepal-related ones.
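
One way to double-check these points with numbers rather than plots, as a sketch using only base R (split, sapply and range are not used elsewhere in this notebook, and the variable names are illustrative): compare, for each species, the range of a petal variable with the range of a sepal variable.

```r
# Rows: min and max; columns: species
petal_ranges = sapply(split(iris$Petal.Length, iris$Species), range)
sepal_ranges = sapply(split(iris$Sepal.Length, iris$Species), range)
petal_ranges
sepal_ranges

# Is every Setosa petal shorter than every Versicolor petal?
petal_separates = petal_ranges[2, "setosa"] < petal_ranges[1, "versicolor"]

# Same check on sepal length
sepal_separates = sepal_ranges[2, "setosa"] < sepal_ranges[1, "versicolor"]

c(petal = petal_separates, sepal = sepal_separates)
```

On iris, the petal check is TRUE and the sepal check is FALSE: petal length alone separates Setosa from Versicolor, while the sepal lengths of the two species overlap.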