# Data viz I {#sec-dataviz}
```{r include=FALSE}
library(tidyverse)
library(palmerpenguins)
library(patchwork)
# Layers
# https://intro2r.com/the-start-of-the-end.html
```
## Intended Learning Outcomes {.unnumbered}
By the end of this chapter, you should be able to:
- explain the layered grammar of graphics
- choose an appropriate plot for categorical variables
- create a basic version of an appropriate plot
- apply additional layers to modify the appearance of the plot
It is time to think about selecting the most appropriate plot for your data. Different types of variables call for different kinds of plots: the right choice depends on how many variables you want to plot and what their data types are. In this chapter, we will focus on **plots for categorical data**. Next week, we will explore plots for continuous variables and learn which plots work best when combining continuous and categorical data.
## [Individual Walkthrough]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}
## Building plots
We’ll be using the `ggplot2` package to create data visualisations. It’s part of the `tidyverse` suite of packages. Although many people refer to it simply as `ggplot`, its official name is `ggplot2`.
::: grid
::: g-col-6
**ggplot2** uses a layered grammar of graphics, where plots are constructed through a series of layers. You start with a base layer (by calling `ggplot`), then add **data** and **aesthetics**, followed by selecting the appropriate **geometries** for the plot.
These first three layers will give you the simplest version of a complete plot. However, you can enhance the plot’s clarity and appearance by adding additional layers such as **scales**, **facets**, **coordinates**, **labels** and **themes**.
:::
::: g-col-6
![](images/gglayers.png){width="90%"}
:::
:::
To give you a brief overview of the layering system, we will use the `palmerpenguins` package ([https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/){target="_blank"}). This dataset contains information about penguins, including bill length and depth, flipper length, body mass, and more.
```{r}
head(penguins)
```
Let’s build a basic scatterplot to show the relationship between flipper length (`flipper_length_mm`) and body mass (`body_mass_g`). We will further customise the plots in subsequent sections, but for now, this will provide a quick overview of the different layers.
* **Layer 1** creates the base plot that we build upon.
* **Layer 2** adds the `data` and some `aesthetics`:
    * The data is passed as the first argument.
    * Aesthetics are added via the `mapping` argument, where you define your variables (e.g., x only, or both x and y). This also lets you map other properties to variables, such as the colour for a grouping variable.
* **Layer 3** adds geometries, or `geom_?` for short. This tells ggplot how to display the data points. Remember to add these layers with a `+`, rather than using a pipe (`%>%`). You can also add multiple geoms if needed, for example, combining a violin plot with a boxplot.
* **Layer 4** includes `scale_?` functions, which let you customise aesthetics like colour. You can do much more with scales, but we'll explore that later.
* **Layer 5** introduces facets, such as `facet_wrap()`, allowing you to add an extra dimension to your plot by showing the relationship you are interested in for each level of a categorical variable.
* **Layer 6** involves coordinates, where `coord_cartesian()` controls the limits for the x- and y-axes (`xlim` and `ylim`), enabling you to zoom in or out of the plot.
* **Layer 7** helps you modify axis labels.
* **Layer 8** controls the overall style of the plot, including background colour, text size, and borders. `ggplot2` provides several predefined themes, such as `theme_classic()`, `theme_bw()`, `theme_minimal()`, and `theme_light()`.
Click on the tabs below to see how each layer contributes to refining the plot.
::: {.panel-tabset}
## Layer 1
```{r}
ggplot()
```
There’s not much to see at this stage - this is basically an empty plot layer.
## Layer 2
```{r}
ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm))
```
You won’t see any data points yet because we haven’t specified how to display them. However, we have mapped the aesthetics, indicating that we want to plot `body_mass_g` on the x-axis and `flipper_length_mm` on the y-axis. This also sets the axis titles, as well as the axis values and breakpoints.
::: callout-tip
You don’t need to include `data =` or `mapping =` if you keep those arguments in the default order. Similarly, the first column name you enter in the `aes()` function will automatically be interpreted as the x variable, and the second as y, so you can omit specifying `x` and `y` if you prefer.
```{r eval = FALSE}
ggplot(penguins, aes(body_mass_g, flipper_length_mm))
```
will give you the same output as the code above.
:::
## Layer 3
```{r}
ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm, colour = sex)) +
geom_point()
```
Here we are telling `ggplot` to add a scatterplot. You may notice a warning indicating that some rows were removed due to missing values.
The `colour` aesthetic colours the points according to a grouping variable (in this case, `sex`). If you want all the points to be black, showing only two variables rather than three, simply omit the `colour` argument.
## Layer 4
```{r}
ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm, colour = sex)) +
geom_point() +
# changes colour palette
scale_colour_brewer(palette = "Dark2") +
# add breaks from 2500 to 6500 in increasing steps of 500
scale_x_continuous(breaks = seq(from = 2500, to = 6500, by = 500))
```
The `scale_?` functions allow us to modify the color palette of the plot, adjust axis breaks, and more. You could change the axis labels within `scale_x_continuous()` as well or leave it for Layer 7.
## Layer 5
```{r}
ggplot(data = penguins, mapping = aes(x=body_mass_g, y=flipper_length_mm, colour=sex)) +
geom_point() +
scale_colour_brewer(palette = "Dark2") +
# split main plot up into different subplots by species
facet_wrap(~ species)
```
In this step, we’re using faceting to split the plot by species.
## Layer 6
```{r}
ggplot(data = penguins, mapping = aes(x=body_mass_g, y=flipper_length_mm, colour=sex)) +
geom_point() +
scale_colour_brewer(palette = "Dark2") +
facet_wrap(~ species) +
# limits the range of the y axis
coord_cartesian(ylim = c(0, 250))
```
Here we adjust the limits of the y-axis to zoom out of the plot. If you want to zoom in or out of the x-axis, you can add the `xlim` argument to the `coord_cartesian()` function.
## Layer 7
```{r}
ggplot(data = penguins, mapping = aes(x=body_mass_g, y=flipper_length_mm, colour=sex)) +
geom_point() +
scale_colour_brewer(palette = "Dark2") +
facet_wrap(~ species) +
labs(x = "Body Mass (in g)", # labels the x axis
y = "Flipper length (in mm)", # labels the y axis
colour = "Sex") # labels the grouping variable in the legend
```
You can change the axis labels using the `labs()` function, or you can modify them when adjusting the scales (e.g., within the `scale_x_continuous()` function).
## Layer 8
```{r}
ggplot(data = penguins, mapping = aes(x=body_mass_g, y=flipper_length_mm, colour=sex)) +
geom_point() +
scale_colour_brewer(palette = "Dark2") +
facet_wrap(~ species) +
labs(x = "Body Mass (in g)",
y = "Flipper length (in mm)",
colour = "Sex") +
# add a theme
theme_classic()
```
The `theme_classic()` function is applied to change the overall appearance of the plot.
:::
::: callout-important
You need to stick to the first three layers to create your base plot. Everything else is optional, meaning you don’t need to use all eight layers. Additionally, layers 4-8 can be added in any order (more or less), whereas layers 1-3 must follow a fixed sequence.
:::
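The "more or less" in the note above matters most for themes. The following is a minimal sketch on the `penguins` data, with object names (`p_base`, `p_lost`, `p_kept`) made up for this illustration. It assumes that a complete theme such as `theme_classic()` replaces any `theme()` tweak added before it, so the tweak needs to come afterwards.

```{r}
library(tidyverse)
library(palmerpenguins)

# Layers 1-3: the base plot (object names are illustrative only)
p_base <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, colour = sex)) +
  geom_point()

# theme() tweak first, complete theme second:
# theme_classic() resets the legend position
p_lost <- p_base +
  theme(legend.position = "bottom") +
  theme_classic()

# complete theme first, tweak second:
# the legend stays at the bottom
p_kept <- p_base +
  theme_classic() +
  theme(legend.position = "bottom")

# print both plots one after the other to compare them
p_lost
p_kept
```

The other optional layers (scales, facets, labels) are usually less sensitive to order, but it is worth comparing the output whenever two layers touch the same part of the plot.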
## Activity 1: Set-up and data for today
* We are still working with the data from Pownall et al. (2023), so **open your project**.
* However, let’s start with a fresh R Markdown file: **Create a new `.Rmd` file** and save it in your project folder. Give it a meaningful name (e.g., "chapter_04.Rmd" or "04_data_viz.Rmd"). If you need guidance, refer to @sec-rmd. Delete everything below line 12, but keep the setup code chunk.
* We previously aggregated the data in @sec-wrangling and @sec-wrangling2. If you want a fresh copy, download the data here: [data_prp_for_ch4.csv](data/data_prp_for_ch4.csv "download"). Make sure to place the csv file in the project folder.
* If you need a reminder about the data and variables, check the codebook or refer back to @sec-download_data_ch1.
## Activity 2: Load in libraries, read in data, and adjust data types
Today, we will be using the `tidyverse` package and the dataset `data_prp_for_ch4.csv`.
```{r eval=FALSE}
## packages
???
## data
data_prp_viz <- read_csv(???)
```
::: {.callout-caution collapse="true" icon="false"}
## Solution
```{r eval=FALSE}
library(tidyverse)
data_prp_viz <- read_csv("data_prp_for_ch4.csv")
```
:::
```{r include=FALSE}
## I basically have to have 2 code chunks since I tell them to put the data files next to the project, and mine are in a separate folder called data - unless I'll turn this into a fixed path
library(tidyverse)
data_prp_viz <- read_csv("data/data_prp_for_ch4.csv")
```
As mentioned in @sec-familiarise, it is always a good idea to take a glimpse at the data to see how many variables and observations are in the dataset, as well as the data types.
::: {.callout-note collapse="true" icon="false"}
## glimpse output
```{r}
glimpse(data_prp_viz)
```
:::
We can see that some of the categorical data in `data_prp_viz` was read in as numeric variables, which means R treats them as continuous. This will haunt us big time when building the plots. We would be better off addressing these changes in the dataset before we start plotting (and potentially getting frustrated with R and data viz in general).
Let’s convert some of the categorical variables into factors. We’ll use the `factor()` function, which requires the `variable` to convert, the `levels` (where we can re-order them as needed), and the corresponding `labels`.
```{r}
data_prp_viz <- data_prp_viz %>%
mutate(Gender = factor(Gender,
levels = c(2, 1, 3),
labels = c("females", "males", "non-binary")),
Secondyeargrade = factor(Secondyeargrade,
levels = c(1, 2, 3, 4, 5),
labels = c("≥ 70% (1st class grade)", "60-69% (2:1 grade)", "50-59% (2:2 grade)", "40-49% (3rd class)", "< 40%")),
Plan_prereg = factor(Plan_prereg,
levels = c(1, 3, 2),
labels = c("Yes", "Unsure", "No")),
Closely_follow = factor(Closely_follow,
levels = c(2, 3),
labels = c("Followed it somewhat", "Followed it exactly")),
Research_exp = factor(Research_exp),
Pre_reg_group = factor(Pre_reg_group))
```
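To see what `levels` and `labels` do, here is a small sketch on a toy vector. The codes in `gender_codes` are made up for illustration and are not taken from the dataset. The idea is that `levels` lists the raw codes in the order you want the categories to appear, and `labels` supplies the matching names in that same order.

```{r}
# Toy vector of numeric codes (made up for illustration, not from the dataset)
gender_codes <- c(1, 2, 2, 3, 1, NA)

gender_factor <- factor(gender_codes,
                        levels = c(2, 1, 3),  # raw codes, in the order we want them displayed
                        labels = c("females", "males", "non-binary"))  # names, in the same order as levels

# the level order is the order used on plot axes and in legends
levels(gender_factor)

# counts per category, including missing values
table(gender_factor, useNA = "ifany")
```

Any code that is not listed in `levels` would be turned into `NA`, so it is worth checking the counts after converting.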
## Activity 3: Barchart (`geom_bar()`)
A barchart is the best choice when you want to plot a single categorical variable.
For example, let’s say we want to count some demographic data, such as gender. To visualise the gender counts, we would use a **barplot**. This is done with `geom_bar()` in the third layer. Since the counting is done automatically in the background, the `aes()` function only requires an x value (i.e., the name of your variable).
```{r fig-bc-base, fig.cap="Default barchart"}
ggplot(data_prp_viz, aes(x = Gender)) +
geom_bar()
```
This is the base plot done. You can customise it by adding different layers. For example, the **labels** could be clearer, or you might want to add a splash of **colour**. Click on the tabs below to see examples of additional customisations, and try applying them to your base plot in your own `.Rmd` file.
::: {.panel-tabset}
## Colour
We can change the colour by adding a `fill` argument in the `aes()`. If we want to modify these colours further, we would add a `scale_fill_?` function as an extra layer. If you have specific colours in mind, you would use `scale_fill_manual()`, or if you prefer to stick with pre-defined options like viridis, you can use `scale_fill_viridis_d()`.
```{r}
ggplot(data_prp_viz, aes(x = Gender, fill = Gender)) +
geom_bar() +
# customise colour
scale_fill_viridis_d()
```
## Axes labels & margins
The x-axis label is fine, but the categories need to be relabelled. You can achieve this with the `scale_x_discrete()` function and the `labels =` argument. Just make sure the labels follow the order of the factor levels (here: females, males, non-binary).
There is also a gap between the bottom of the chart and the bars that looks a bit odd. You can remove it by using the `expansion()` function.
```{r}
ggplot(data_prp_viz, aes(x = Gender, fill = Gender)) +
geom_bar() +
scale_fill_viridis_d() +
# changing group labels on the breaks of the x axis
scale_x_discrete(labels = c("Female", "Male", "Non-Binary")) +
scale_y_continuous(
# changing name of the y axis
name = "Count",
# remove the space below the bars (first number), but keep a tiny bit (5%) above (second number)
expand = expansion(mult = c(0, 0.05))
)
```
## Legend
The legend does not add any useful information because the labels are already provided on the x-axis. We can remove the legend by adding the argument `guide = "none"` to the `scale_fill` function.
```{r}
ggplot(data_prp_viz, aes(x = Gender, fill = Gender)) +
geom_bar() +
scale_fill_viridis_d(
# remove the legend
guide = "none") +
scale_x_discrete(labels = c("Female", "Male", "Non-Binary")) +
scale_y_continuous(
name = "Count",
expand = expansion(mult = c(0, 0.05))
)
```
## Themes
Let's experiment with the themes. For this plot we have chosen `theme_minimal()`.
```{r}
ggplot(data_prp_viz, aes(x = Gender, fill = Gender)) +
geom_bar() +
scale_fill_viridis_d(
guide = "none") +
scale_x_discrete(labels = c("Female", "Male", "Non-Binary")) +
scale_y_continuous(
name = "Count",
expand = expansion(mult = c(0, 0.05))
) +
# pick a theme
theme_minimal()
```
:::
## Activity 4: Column plot (`geom_col()`)
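If the counts had already been summarised for you, `geom_bar()` with its default settings would not work, because it would count the rows of the summary table instead of using the stored counts. Instead, you’d need to use `geom_col()` to display the pre-aggregated data.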
```{r}
gender_count <- data_prp_viz %>%
count(Gender)
gender_count
```
The mapping for `geom_col()` requires both **x** and **y** aesthetics. In this example, **x** would represent the categorical variable (e.g., `Gender`), while **y** would refer to the column storing the summarised values (e.g., `n`). Notice how the axis title now reflects `n` instead of `count` in the base version.
```{r fig-col, fig.cap="Column plot with different coloured bars"}
ggplot(gender_count, aes(x = Gender, y = n, fill = Gender)) +
geom_col()
```
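To check that the two geoms really draw the same bars, here is a minimal sketch on toy data. The tibble `toy` and the object names are made up for illustration. It uses `layer_data()`, a `ggplot2` helper that returns the values a layer actually draws, to compare the bar heights.

```{r}
library(tidyverse)

# Toy data, made up for illustration
toy <- tibble(group = c("a", "a", "a", "b", "b", "c"))

# geom_bar() counts the rows per group for us
p_bar <- ggplot(toy, aes(x = group)) +
  geom_bar()

# geom_col() plots counts we computed ourselves
toy_count <- toy %>%
  count(group)

p_col <- ggplot(toy_count, aes(x = group, y = n)) +
  geom_col()

# compare the bar heights of both plots
layer_data(p_bar)$y
layer_data(p_col)$y
```

Both calls should return the same heights, which is why the choice between `geom_bar()` and `geom_col()` depends only on whether your data is already summarised.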
::: {.callout-note icon="false"}
## Your Turn: Make the column plot pretty
The other layers to change the colour scheme, axes labels and margins, removing the legend and altering the theme require exactly the same functions as with the barchart above. Test yourself to see if you can...
* [ ] change the colour scheme (e.g., viridis or [any other colour palettes](https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/){target="_blank"})
* [ ] remove the legend
* [ ] change the titles of the x and y axes
* [ ] make the bars start directly on the x-axis
* [ ] add a theme of your liking
::: {.callout-tip collapse="true"}
## Possible solution code for the column plot (with a different colour palette and a different theme)
```{r}
ggplot(gender_count, aes(x = Gender, y = n, fill = Gender)) +
  geom_col() +
  # replaced viridis with the brewer palette
  scale_fill_brewer(
    palette = "Set1", # try "Set2" or "Dark2" for some variety
    guide = "none") + # legend removed
  # labels of the categories changed (same order as the factor levels)
  scale_x_discrete(labels = c("Female", "Male", "Non-Binary")) +
  scale_y_continuous(
    # change y axis label
    name = "Count",
    # starts bars on x axis without any gaps but leaves some space at the top (this time 10%)
    expand = expansion(mult = c(0, 0.1))
  ) +
  # different theme
  theme_light()
```
:::
:::
## Activity 5: Stacked, Percent Stacked, and Grouped Barchart {#sec-adv_bar}
When dealing with **two categorical variables**, you have three options for displaying barcharts: the "normal" **Stacked Barchart** (the default option), a **Percent Stacked Barchart**, or a **Grouped Barchart**.
For this activity, we will explore the variable `Plan_prereg`, which measures whether students planned to pre-register their undergraduate dissertation at time point 1, and `Pre_reg_group`, which tracks whether they actually followed through with a pre-registration for their dissertation.
One way to display this data is by creating either a **Stacked Barchart** (the default) or a **Percent Stacked Barchart**. In both cases, the subgroups are displayed on top of each other. To make comparison easier, we will place the two plots side by side and move the legend to the bottom of the chart.
```{r, eval=FALSE}
## Stacked barchart
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
geom_bar() + # no position argument added
theme(legend.position = "bottom") # move legend to the bottom
## Percent stacked barchart
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
geom_bar(position = "fill") + # add position argument here
theme(legend.position = "bottom") # move legend to the bottom
```
```{r, fig-barcharts_stacked, fig.cap="Stacked barchart (left), and Percent stacked barchart (right)", echo=FALSE}
## Stacked barchart
bc_stacked <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
  geom_bar() + # no position argument added
  theme(legend.position = "bottom") + # move legend to the bottom
  guides(fill = guide_legend(nrow = 1)) # display legend in 1 row
## Percent stacked barchart
bc_percent <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
  geom_bar(position = "fill") + # add position argument here
  theme(legend.position = "bottom") + # move legend to the bottom
  guides(fill = guide_legend(nrow = 1)) # display legend in 1 row
bc_stacked + bc_percent + plot_layout(nrow = 1)
```
In the **stacked barchart** (@fig-barcharts_stacked, left plot), you can display participant numbers. From this, we can see that the highest number of students were unsure whether they wanted to pre-register their dissertation, followed closely by those who answered "yes." We also see that the number of students who did not end up with a pre-registered dissertation (blue category) is the same for both those who had planned to pre-register and those who did not want to. However, since the "No" category has significantly fewer participants than the other two, it’s difficult to tell if the ratio remains consistent across all three groups.
If we want to highlight this ratio, a **Percent Stacked Barchart** (@fig-barcharts_stacked, right plot) would be more appropriate. This plot shows that approximately 80% of the students who had planned to pre-register their dissertations, 50% of the students who were initially unsure, and only 33% of the students who had no plan to pre-register ended up with a pre-registered dissertation. BUT! We would lose the information about the raw values in the sample.
**It’s all a trade-off, and the plot you choose depends on the "story" you want the data to tell.**
::: callout-note
The position argument `position = "stack"` is the default. Adding this argument to the code for the left plot in @fig-barcharts_stacked would produce the same plot as leaving the argument out.
:::
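If you want both the raw counts and the percentages, you can compute them yourself. This is a minimal sketch on toy data: the tibble `toy` and its values are made up for illustration and are not the Pownall et al. data. It counts each combination and then divides by the total within each bar, which is the share that `position = "fill"` displays.

```{r}
library(tidyverse)

# Toy data, made up for illustration (not the Pownall et al. data)
toy <- tibble(
  planned = c("Yes", "Yes", "Yes", "Yes", "No", "No"),
  prereg  = c("Yes", "Yes", "Yes", "No",  "Yes", "No")
)

toy_props <- toy %>%
  # raw counts: what the stacked barchart shows
  count(planned, prereg) %>%
  # share within each bar: what the percent stacked barchart shows
  group_by(planned) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

toy_props
```

A small table like this next to a percent stacked barchart is one way to keep the sample sizes visible.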
The other option is a **Grouped Barchart**, which displays the bars next to each other. You can achieve this by changing the `position` argument to `"dodge"`. You can see the default version of the plot in @fig-barchart_grouped on the left, and one with additional layers on the right.
Instead of using a pre-existing colour palette, we manually changed the colours using hex codes. These are some of the colours Gaby used in her PhD thesis, but you can:
* create your own colour hex codes by using [this website](https://www.hexcolortool.com/){target="_blank"}, OR
* use pre-defined colour names like "green" or "purple" instead. See a full list [here](https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/){target="_blank"}.
Feel free to explore.
Since the legend title for the second plot is a bit long, we displayed the legend content across two rows by adding the layer `guides(fill = guide_legend(nrow = 2))` at the end.
```{r eval=FALSE}
## Default grouped barchart
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
geom_bar(position = "dodge") + # add position argument here
theme(legend.position = "bottom") # move legend to the bottom
## Prettier grouped barchart
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
geom_bar(position = "dodge") + # add position argument here
# changing labels for x, y, and fill category - alternative method
labs(x = "Pre-registration planned", y = "Count", fill = "Pre-registered dissertation") +
# manual colour change for values
scale_fill_manual(values = c('#648FFF', '#DC267F'),
labels = c("Yes", "No")) +
scale_y_continuous(
# remove the space below the bars, but keep a tiny bit (5%) above
expand = expansion(mult = c(0, 0.05))
) +
# pick a theme
theme_classic() +
# need to move this following line to the end otherwise the `theme_*` overrides it
theme(legend.position = "bottom") +
# display across 2 rows
guides(fill = guide_legend(nrow = 2))
```
```{r fig-barchart_grouped, fig.cap="Default grouped barchart (left) and one with a few more layers added (right)", echo=FALSE}
gbc_default <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
geom_bar(position = "dodge") + # add position argument here
theme(legend.position = "bottom") # move legend to the bottom
## Prettier grouped barchart
gbc_pretty <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Pre_reg_group)) +
geom_bar(position = "dodge") + # add position argument here
# changing labels for x, y, and fill category - alternative method
labs(x = "Pre-registration planned", y = "Count", fill = "Pre-registered dissertation") +
# manual colour change for values
scale_fill_manual(values = c('#648FFF', '#DC267F'),
labels = c("Yes", "No")) +
scale_y_continuous(
# remove the space below the bars, but keep a tiny bit (5%) above
expand = expansion(mult = c(0, 0.05))
) +
# pick a theme
theme_classic() +
# need to move this following line to the end otherwise the `theme_*` overrides it
theme(legend.position = "bottom") +
# display across 2 rows
guides(fill = guide_legend(nrow = 2))
gbc_default + gbc_pretty + plot_layout(nrow = 1)
```
::: {.callout-tip collapse="true" icon="false"}
## Special case: Categorical variables with missing values
If we had chosen a different categorical variable that contains missing values, such as `Closely_follow`, our plots would have included those missing values by default. To change the colour of the missing value bars, you would need to specify this using the `na.value =` argument within the `scale_fill_?()` function (here, `scale_fill_manual()`). Here’s an example of a grouped barchart.
```{r eval=FALSE}
# default grouped barchart with missing values
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "dodge") +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 3)) # display across 3 rows
## Prettier grouped barchart with missing values
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "dodge") +
labs(x = "Pre-registration planned", y = "Count", fill = "Pre-registration followed") +
# manual colour change for values of the factor and the NA responses
scale_fill_manual(values = c('#648FFF', '#DC267F'), na.value = '#FFB000') +
scale_y_continuous(
expand = expansion(mult = c(0, 0.05))
) +
theme_classic() +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 3)) # display across 3 rows
```
```{r fig-barchart_grouped_na, fig.cap="Default grouped barchart (left) and one with a few more layers added (right) for a variable with missing values", echo=FALSE}
gbc_default <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "dodge") +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 3)) # display across 3 rows
## Prettier grouped barchart with missing values
gbc_pretty <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "dodge") +
labs(x = "Pre-registration planned", y = "Count", fill = "Pre-registration followed") +
# manual colour change for values of the factor and the NA responses
scale_fill_manual(values = c('#648FFF', '#DC267F'), na.value = '#FFB000') +
scale_y_continuous(
expand = expansion(mult = c(0, 0.05))
) +
theme_classic() +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 3)) # display across 3 rows
gbc_default + gbc_pretty + plot_layout(nrow = 1)
```
If you don’t want the missing values to appear in the plot, you will need to do some data wrangling to remove them first. The function for this is `drop_na()`. Here we applied `drop_na()` to `Closely_follow` only.
```{r}
# remove NA
prereg_plan_follow <- data_prp_viz %>%
select(Code, Plan_prereg, Closely_follow) %>%
drop_na(Closely_follow)
```
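The difference between naming a column in `drop_na()` and leaving it empty can be sketched on a toy tibble. The data and column names below are made up for illustration.

```{r}
library(tidyverse)

# Toy data, made up for illustration
toy <- tibble(
  id     = 1:4,
  plan   = c("Yes", NA, "No", "Unsure"),
  follow = c("exactly", "somewhat", NA, "exactly")
)

# drops only the rows where `follow` is missing
drop_na(toy, follow)

# drops every row that has a missing value in any column
drop_na(toy)
```

Naming the column keeps as much data as possible, which is why we only listed `Closely_follow` above.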
::: {.callout-note collapse="true" icon="false"}
## check NAs have been removed
```{r}
# check NA have been removed
prereg_plan_follow %>%
distinct(Plan_prereg, Closely_follow) %>%
arrange(Plan_prereg, Closely_follow)
```
:::
But keep in mind that removing missing values could misrepresent the data, e.g., by giving a wrong impression about proportions. As a comparison...
```{r eval=FALSE}
# with NA
ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "fill") + # add position argument here
theme(legend.position = "bottom") + # move legend to the bottom
guides(fill = guide_legend(nrow = 2)) # display across 2 rows
# without NA
ggplot(prereg_plan_follow, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "fill") + # add position argument here
theme(legend.position = "bottom") + # move legend to the bottom
guides(fill = guide_legend(nrow = 2)) # display across 2 rows
```
```{r fig-barchart_na_no_na, echo=FALSE, fig.cap="Percent stacked barchart with (left) and without missing values (right)"}
# with NA
comp_a <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "fill") + # add position argument here
theme(legend.position = "bottom") + # move legend to the bottom
guides(fill = guide_legend(nrow = 2)) # display across 2 rows
# without NA
comp_b <- ggplot(prereg_plan_follow, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "fill") + # add position argument here
theme(legend.position = "bottom") + # move legend to the bottom
guides(fill = guide_legend(nrow = 2)) # display across 2 rows
comp_a + comp_b + plot_layout(nrow = 1)
```
:::
## Activity 6: Save your plots
You can save your figures using the `ggsave()` function, which will save them to your project folder.
There are two ways to use `ggsave()`. If you don’t specify which plot to save, by **default** it will **save the last plot you created**. In our case, the last plot was the one without `NA` from the special case scenario (@fig-barchart_na_no_na). However, if you did not follow along with the special case scenario, your last plot will be the grouped bar chart on the right from @fig-barchart_grouped.
```{r eval=FALSE}
ggsave(filename = "last_plot.png")
```
```{r include=FALSE}
comp_b
ggsave(filename = "images/last_plot.png")
```
::: {.callout-note collapse="true" icon="false"}
## Our last plot saved

:::
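To see that default in action, here is a minimal sketch. It assumes a stand-in plot built from the `mpg` data that ships with `ggplot2`; the names `demo_plot` and `out_file` are made up for illustration, and the file is written to a temporary folder rather than your project folder.

```{r eval=FALSE}
library(ggplot2)

# stand-in plot for illustration only (not one of the plots from this chapter)
demo_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# hypothetical file location inside R's temporary folder
out_file <- file.path(tempdir(), "last_plot_demo.png")

# no plot argument: ggsave() falls back to last_plot(),
# i.e., the plot that was created most recently
ggsave(filename = out_file)

# check that the file was written
file.exists(out_file)
```

Relying on the last plot is convenient, but it is easy to save the wrong figure by accident if you created another plot in between.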
The second option is to save the plot as an object and then refer to that object within `ggsave()`. For example, let’s save the grouped barchart that contained missing values (@fig-barchart_grouped) as an object called `grouped_bar`.
```{r}
grouped_bar <- ggplot(data_prp_viz, aes(x = Plan_prereg, fill = Closely_follow)) +
geom_bar(position = "dodge") +
labs(x = "Pre-registration planned", y = "Count", fill = "Pre-registration followed") +
# manual colour change for values of the factor and the NA responses
scale_fill_manual(values = c('#648FFF', '#DC267F'), na.value = '#FFB000') +
scale_y_continuous(
expand = expansion(mult = c(0, 0.05))
) +
theme_classic() +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 3)) # display across 3 rows
```
Then, you can run the following line:
```{r eval=FALSE}
ggsave(filename = "grouped_bar.png",
plot = grouped_bar)
```
```{r echo=FALSE}
ggsave(filename = "images/grouped_bar.png",
plot = grouped_bar)
```
The `filename` is the name you want your PNG file to have, and `plot` refers to the name of the plot object.
::: {.callout-note collapse="true" icon="false"}
## Our saved `grouped_bar.png` would look like this:

:::
This is the plot saved with the default settings. If you like it, feel free to keep it as is. However, if it seems a bit "off", you can adjust the width, height, and units (e.g., "cm", "mm", "in", "px"). You might need to experiment with the dimensions until it feels about right.
```{r eval=FALSE}
ggsave(filename = "grouped_bar2.png",
plot = grouped_bar,
width = 16, height = 9, units = "cm")
```
```{r echo=FALSE}
ggsave(filename = "images/grouped_bar2.png",
plot = grouped_bar,
width = 16, height = 9, units = "cm")
```
::: {.callout-note collapse="true" icon="false"}
## `grouped_bar2.png`: the same plot with different dimensions

:::
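Besides the dimensions, `ggsave()` also lets you set the resolution via `dpi`, and it picks the file format from the extension of `filename`. The sketch below is only an illustration: it uses a stand-in plot built from the `mpg` data that ships with `ggplot2`, and the object and file names (`demo_plot`, `png_file`, `pdf_file`) are assumptions rather than anything from our project.

```{r eval=FALSE}
library(ggplot2)

# stand-in plot for illustration only
demo_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# hypothetical file locations inside R's temporary folder
png_file <- file.path(tempdir(), "demo_plot_300dpi.png")
pdf_file <- file.path(tempdir(), "demo_plot.pdf")

# PNG with an explicit resolution of 300 dots per inch
ggsave(filename = png_file,
       plot = demo_plot,
       width = 16, height = 9, units = "cm",
       dpi = 300)

# same plot as a PDF; the format is taken from the ".pdf" extension
ggsave(filename = pdf_file,
       plot = demo_plot,
       width = 16, height = 9, units = "cm")
```

A higher `dpi` produces a larger PNG file with more pixels, which may be useful for print. A PDF is a vector format, so it stays sharp when you zoom in.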
## [Pair-coding]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}
### Task 1: Open the R project for the lab {.unnumbered}
### Task 2: Create a new `.Rmd` file {.unnumbered}
... and name it something useful. If you need help, have a look at @sec-rmd.
### Task 3: Load in the library and read in the data {.unnumbered}
The data should already be in your project folder. If you want a fresh copy, you can download the data again here: [data_pair_coding](data/data_pair_coding.zip "download").
We are using the package `tidyverse` today, and the datafile we should read in is `dog_data_clean_wide.csv`.
```{r reading in data for me, echo=FALSE, message=FALSE}
library(tidyverse)
dog_data_wide <- read_csv("data/dog_data_clean_wide.csv")
```
### Task 4: Create an appropriate plot {.unnumbered}
Pick **one or two categorical variables** from the Binfet dataset and **choose one of the appropriate plot types**. Things to think about:
* [ ] Select your categorical variable(s): `GroupAssignment`, `Year_of_Study`, `Live_Pets`, and/or `Consumer_BARK`
* [ ] Decide on the plot you want to display: barchart, stacked barchart, percent stacked barchart, or grouped barchart
* [ ] You may need to convert your variables into factors
* [ ] Think about what you want to do with missing data
* [ ] Pick a colour scheme (manual or pre-defined colour palette)
* [ ] Tidy the axes labels
* [ ] Decide whether you need a legend or not, and if so, where you would want to place it
* [ ] Remove the gap between the bottom of the chart and the bars
* [ ] Pick a theme
::: {.callout-caution collapse="true" icon="false"}
## Possible solution for a plot with 1 categorical variable
**Converting some variables into factors**
```{r}
dog_data_wide <- dog_data_wide %>%
mutate(Year_of_Study = factor(Year_of_Study,
levels = c("First", "Second", "Third", "Fourth", "Fifth or above")))
```
**Now we can plot**
```{r}
ggplot(dog_data_wide, aes(x = Year_of_Study, fill = Year_of_Study)) +
geom_bar() +
scale_fill_brewer(
palette = "Dark2",
guide = "none") +
scale_x_discrete(name = "Year of Study") +
scale_y_continuous(name = "Count",
expand = expansion(mult = c(0, 0.05))) +
theme_classic()
```
:::
::: {.callout-caution collapse="true" icon="false"}
## Possible solution for a plot with 2 categorical variables
**Converting some variables into factors**
```{r}
dog_data_wide <- dog_data_wide %>%
mutate(GroupAssignment = factor(GroupAssignment,
levels = c("Direct", "Indirect", "Control")))
```
**Now we can plot**
```{r}
ggplot(dog_data_wide, aes(x = GroupAssignment , fill = Live_Pets)) +
geom_bar(position = "fill") +
labs(x = "Experimental Group", y = "Proportion", fill = "Pets at Home") +
scale_fill_manual(values = c('deeppink', 'springgreen2'), na.value = 'orangered',
labels = c("Yes", "No")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
theme_classic() +
theme(legend.position = "bottom")
```
:::
## [Test your knowledge]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}
Let's go back to the `palmerpenguins` package ([https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/){target="_blank"}), and assume you have the following data available:
```{r message=FALSE}
library(palmerpenguins)
penguin_selection <- penguins %>%
group_by(species, island) %>%
summarise(penguin_count = n())
penguin_selection
```
### Knowledge check {.unnumbered}
#### Question 1 {.unnumbered}
What `geom` would you use to plot penguin count for each species? `r mcq(c(x = "geom_bar", answer = "geom_col"))`
#### Question 2 {.unnumbered}
What mapping would you use to display penguin count across species? `r longmcq(c(x = "aes(x = penguin_count, y = species)", answer = "aes(x = species, y = penguin_count)", x = "aes(x = species)", x = "aes(x = penguin_count)"))`
#### Question 3 {.unnumbered}
What `geom` would you use to count the number of species on each island? `r mcq(c(answer = "geom_bar", x = "geom_col"))`
#### Question 4 {.unnumbered}
What mapping would you use to display the count of species per island? `r longmcq(c(x = "aes(x = island, y = species)", x = "aes(x = species, y = island)", answer = "aes(x = island)", x = "aes(x = species)"))`
::: {.callout-caution collapse="true" icon="false"}
## Explain these answers
**Question 1**: `geom_col()` is the appropriate choice for bar charts with predefined y-values, such as `penguin_count`.
**Question 2**: The correct aesthetic mapping places the categorical variable (`species`) on the x-axis and the numeric variable (number of observed penguins) on the y-axis. Using `aes(x = penguin_count, y = species)` would flip the axes, placing the number of penguins on the x-axis and species on the y-axis, which doesn’t match the conventional structure of a bar chart.
**Question 3**: `geom_bar()` is the appropriate choice when you want to automatically count the number of observations within each category, such as counting the number of penguin species on each island.
**Question 4**: For a simple count of species per island, you only need to map the categorical variable (`island`) to the x-axis. The y-axis will automatically represent counts when using `geom_bar()`.
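If you want to try the two approaches yourself, the answers above can be sketched as follows. This is a minimal example rather than the only possible solution; the object names `plot_col` and `plot_bar` are made up for illustration, and `.groups = "drop"` is added here only to ungroup the summary table.

```{r eval=FALSE}
library(tidyverse)
library(palmerpenguins)

# summary table: one row per species-island combination
penguin_selection <- penguins %>%
  group_by(species, island) %>%
  summarise(penguin_count = n(), .groups = "drop")

# Questions 1 and 2: the counts already exist as a column, so we use geom_col()
plot_col <- ggplot(penguin_selection, aes(x = species, y = penguin_count)) +
  geom_col()

# Questions 3 and 4: geom_bar() counts the rows (here: species) per island
plot_bar <- ggplot(penguin_selection, aes(x = island)) +
  geom_bar()

plot_col
plot_bar
```

Note that `geom_col()` stacks rows that share the same x-value, so a species found on several islands is shown as one bar with the total height.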
:::
### Error mode {.unnumbered}
Some of the code chunks contain mistakes and result in errors, while others do not produce the expected results. Your task is to identify any issues, explain why they occurred, and, if possible, fix them.
#### Question 5 {.unnumbered}
We want to plot the number of penguins across the different islands.
```{r error=TRUE}
ggplot(penguins, aes(x = islands)) +
geom_bar()
```
What does this error message mean and how do you fix it?
::: {.callout-caution collapse="true" icon="false"}
## Explain the solution
The error message consists of 2 parts. Part 1 is perhaps a bit trickier to interpret, but part 2 gives some useful hints:
* *"Aesthetics must be either length 1 or the same as the data (344)"*: This means that the variable mapped to `x` should either be a constant (like a single value) or a column that has 344 entries (matching the number of rows in the penguins dataset).
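* *"Fix the following mappings: `x`"*: The issue is specifically with the `x` aesthetic. The `penguins` dataset has no column called `islands`, so R looks outside the data and finds the built-in `islands` dataset instead. That object is a vector with 48 entries, which matches neither length 1 nor the 344 rows of `penguins`.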
To check the `penguins` data, you can use `glimpse()`.
```{r}
glimpse(penguins)
```
To fix the error, you need to correct the column name. The correct column in the `penguins` dataset is called `island` (without the "s" at the end). The `island` column has 344 entries, just like the rest of the dataset, so the mapping now works properly.
```{r}
ggplot(penguins, aes(x = island)) +
geom_bar()
```
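If you are unsure whether a column exists, you can also check the names directly. The following lines are a small sketch of such a check; the helper objects `has_islands_column` and `has_island_column` are made up for illustration.

```{r eval=FALSE}
library(palmerpenguins)

# is there a column with that exact name in penguins?
has_islands_column <- "islands" %in% names(penguins)  # FALSE
has_island_column  <- "island" %in% names(penguins)   # TRUE

# R ships with a built-in object called islands, which has 48 entries
length(datasets::islands)  # 48

# penguins has 344 rows, so the lengths do not match
nrow(penguins)  # 344
```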
:::
#### Question 6 {.unnumbered}
Next, we want to create a grouped bar chart displaying species per island, using the viridis color palette.
```{r error=TRUE}
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "dodge") +
scale_fill_viridis()
```
What does this error message mean and how do you fix it?
::: {.callout-caution collapse="true" icon="false"}
## Explain the solution
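The error tells us that R could not find a function called `scale_fill_viridis()`. A function with that name exists in the separate `viridis` package, but not in `ggplot2`. The `ggplot2` function for a discrete (categorical) fill variable is called `scale_fill_viridis_d()`.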
FIX: correct the function name to display the grouped bar chart with the viridis color palette.
```{r}
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "dodge") +
scale_fill_viridis_d()
```
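The `_d` at the end stands for "discrete". `ggplot2` also offers `scale_fill_viridis_c()` and `scale_colour_viridis_c()` for continuous variables, and the `option` argument switches between the palettes of the viridis family. The sketch below is for illustration only; the object names `plot_discrete` and `plot_continuous` and the choice of variables are assumptions.

```{r eval=FALSE}
library(ggplot2)
library(palmerpenguins)

# discrete fill (species is a factor): use the _d version
plot_discrete <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "plasma")  # a different viridis palette

# continuous colour (body mass is numeric): use the _c version
plot_continuous <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm,
                                        colour = body_mass_g)) +
  geom_point() +
  scale_colour_viridis_c()

plot_discrete
plot_continuous
```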
:::
#### Question 7 {.unnumbered}
We want to create a grouped bar chart showing the number of penguins on each island, broken down by year.
```{r error=TRUE}
ggplot(penguins, aes(x = island, fill = year)) +
geom_bar(position = "dodge")
```
Hmmm. We got a plot, but certainly not the one we intended. The warning message mentions something about the grouping structure and gives some additional hints.
::: {.callout-caution collapse="true" icon="false"}
## Explain the solution
The variable mapped to `fill` needs to be a factor. R helpfully asks whether we forgot to convert a numerical variable into a factor. Let's check how `year` is stored in the `penguins` data using the `glimpse()` function.
```{r}
glimpse(penguins)
```
Indeed, `year` is currently stored as a numeric (integer) variable. To fix this, we need to convert `year` to a factor. We can do this directly within the `ggplot()` function.
```{r}
ggplot(penguins, aes(x = island, fill = as.factor(year))) +