The ggtree package by Guangchuang
Yu is an excellent resource for drawing information-rich visualizations
of phylogenetic trees. (See
https://guangchuangyu.github.io/software/ggtree/gallery/ for a gallery
of figures generated using ggtree
.) ggtree
employs the same grammar
as ggplot
, so writing code looks quite similar:
# load the package without all the preamble
suppressMessages(require(ggtree))
#> Warning: package 'ggtree' was built under R version 4.0.3
# generate a random tree
set.seed(1999); phy <- rtree(50)
# draw the tree gg-style!
ggtree(phy) + theme_tree2() + geom_tiplab(size=3)
Since I am accustomed to base R, however, I find this syntax unfamiliar
and frustrating, and was motivated to start implementing similar plot
functions in ggfree
. Of course there are several other packages in R
that provide plotting functions for phylogenetic tree objects. For
example, the
ape package —
which defines the phylo
S3 object
class that many packages for phylogenetics will accept as inputs —
adapts the generic S3 plot
function for trees (called plot.phylo
).
However, it is not straight-forward to annotate the resulting plot with
additional information, such as highlighting a subset of internal nodes.
This is because there are two things happening when we call
plot.phylo
:
-
the method maps the nodes and edges (branches) of the tree to the
x
andy
coordinates according to one of several layout algorithms, and; -
the method draws the tree on these coordinates in the active plot device.
Consequently, the x
and y
coordinates are not readily available for
the user to add additional information to the plot. (I’ll demonstrate
why it’s nice to have access to these data in subsequent sections.)
ggfree
separates these tasks into different functions. Generating the
coordinates for a given layout algorithm is handled by the function
tree.layout
, which returns an S3
object that I call phyloLayout
.
require(ggfree)
#> Loading required package: ggfree
#> Loading required package: ape
#>
#> Attaching package: 'ape'
#> The following object is masked from 'package:ggtree':
#>
#> rotate
#>
#> Attaching package: 'ggfree'
#> The following object is masked from 'package:ape':
#>
#> unroot
L <- tree.layout(phy)
plot(L)
axis(side=1) # add a horizontal axis like ggtree example
There are clearly some differences here. For instance, the order of tips
is different because ggtree
has automatically ladderized the phylogeny
by default. We can amend this using the ape:ladderize
function:
phy2 <- ladderize(phy, right=F)
plot(tree.layout(phy2), col='black', lwd=1, cex=0.75, mar=c(2,0,0,0))
axis(side=1, cex.axis=0.75, mgp=c(3,0.5,0))
Let’s have a look at what’s getting returned when we call tree.layout
:
L <- tree.layout(phy2)
L # calls print.phyloLayout function
#> Phylogenetic layout using a rectangular algorithm
#> with 50 tips and 49 internal nodes.
#>
#> $nodes (99 nodes with 4 attributes):
#> label n.tips x y
#> 1 t40 0 3.032877 22
#> 2 t17 0 2.763512 23
#> 3 t49 0 3.252531 24
#> 4 t12 0 3.039627 25
#> 5 t27 0 6.064988 46
#> 6 t10 0 6.747390 47
#>
#> $edges (98 edges with 8 attributes):
#> parent child length isTip x0 x1 y0 y1
#> 1 51 87 0.91742815 FALSE 0.0000000 0.9174281 4.484375 4.484375
#> 2 87 88 0.98563767 FALSE 0.9174281 1.9030658 1.750000 1.750000
#> 3 88 37 0.50608669 TRUE 1.9030658 2.4091525 1.000000 1.000000
#> 4 88 89 0.54734647 FALSE 1.9030658 2.4504123 2.500000 2.500000
#> 5 89 38 0.49303928 TRUE 2.4504123 2.9434516 2.000000 2.000000
#> 6 89 39 0.08740325 TRUE 2.4504123 2.5378155 3.000000 3.000000
Here we can see that a phyloLayout
object contains two data frames for
the nodes and edges of the tree, respectively. (Only the first six rows
of each are being displayed because this print
function is calling
head
for each data frame.) The nodes
data frame contains x and y
coordinates, and the edges
data frame contains two sets of x and y
coordinates. In addition, any node- or edge-level attributes that you
happened to attach to the original phylo
object are automatically
carried over to these data frames. These are the main outputs of the
layout algorithm.
ggfree
provides three popular layout algorithms for phylogenetic
trees. We specify which layout algorithm we want to use by passing a
type
argument to the function tree.layout
:
-
Rectangular/slanted trees. The above trees have been plotted using a rectangular layout. The horizontal lines (branches) represent the amount of time or expected amount of evolution along each lineage. By convention, time “flows” forward from the left to right of the figure. The vertical lines are only used to connect linegaes to their common ancestors and to separate lineages in the vertical dimension - their lengths do not represent any information.
A slanted tree uses the same coordinate information as a rectangular layout, except that lineages are connected directly to their ancestral nodes without vertical line segments:
plot(tree.layout(phy, type='s', unscaled=TRUE), cex=0.5, mar=c(0,0,0,3))
-
Circular (radial) trees. This is essentially the rectangular layout algorithm in a polar coordinate system. Time flows outward from the origin (centre) of the plot.
plot(tree.layout(phy, type='o'))
-
Unrooted trees. Like many other R packages that provide plot functions for trees,
ggtree
implements Joe Felsenstein’s equal-angle layout algorithm for unrooted trees. Note that I’ve turned off tip labels because they become cluttered in this layout:plot(tree.layout(phy, type='u'), label='n', mar=c(0,0,0,0))
To demonstrate how we can reproduce plots generated by ggtree
with
base R graphics, I’ve taken the example data sets from the ggtree
distribution
(https://github.com/YuLab-SMU/ggtree/blob/master/inst/examples/) and
reprocessed them using base R, with the exception of read.nexus
from
the ape
package:
# read maximum clade credibility tree
flu <- read.nexus('MCC_FluA_H3.tree')
# extract dN/dS ratios from text file
mlc <- readLines('mlc.txt')
nwk <- mlc[ which(grepl("w ratios", mlc)) + 1 ]
nwk <- gsub("#", ":", nwk)
nwk <- gsub(" ", "", nwk)
dnds <- read.tree(text=nwk)
# annotate flu tree with dN/dS outcomes
flu <- ladderize(flu, right=F)
dnds <- ladderize(dnds, right=F)
flu$dnds <- dnds$edge.length
# export the result for loading with `ggfree`
save(flu, file="flu.RData")
Here is the example code from ggtree
to draw the phylogeny annotated
with host species and genotype (with a heatmap):
cols <- scale_color(merged_tree, "dN_vs_dS", low="#0072B2", high="#D55E00",
interval=seq(0, 1.5, length.out=100))
p <- ggtree(merged_tree, size=.8, mrsd="2013-01-01", ndigits = 2, color=cols)
p <- add_colorbar(p, cols, font.size=3)
## add annotation of clade posterior and colored in darkgreen
p <- p + geom_text(aes(label=posterior), color="darkgreen", vjust=-.1,
hjust=-.03, size=1.8)
## add annotation of amino acid substitution inferred by joint probabilities
p <- p + geom_text(aes(x=branch, label=joint_AA_subs), vjust=-.03, size=1.8)
## get the tip labels
tip <- get.tree(merged_tree)$tip.label
## create a new data.frame containing information of host species.
host <- rep("Human", length(tip))
host[grep("Swine", tip)] <- "Swine"
host.df <- data.frame(taxa=tip, host=factor(host))
## use %<+% operator to attach the information to the tree view
p <- p %<+% host.df
## add circles in tips and tip labels align to the right hand side.
## after the attachment via %<+% operator,
## we can use 'host' information to color circles and labels of tips.
p <- p + geom_tippoint(aes(color=host), size=2) +
geom_tiplab(aes(color=host), align=TRUE, size=3, linesize=.3)
## rescale the color to blue (for human) and red (for swine)
p <- p + scale_color_manual(values = c("#377EB8", "#E41A1C"), guide='none')
## setting the coordinate system
p <- p + theme_tree2() + scale_y_continuous(expand=c(0, 0.6)) + xlab("Time")
## setting size of axis text and title.
p <- p + theme(axis.text.x=element_text(size=10), axis.title.x =
element_text(size=12))
p <- gheatmap(p, genotype, width=.4, offset=7, colnames=F) %>% scale_x_ggtree
p <- p + scale_fill_brewer(palette="Set2")
p <- p+theme(legend.text=element_text(size=8), legend.key.height=unit(.5,
"cm"),
legend.key.width=unit(.4, "cm"),
legend.position=c(.13, y=.945))
Note that I’ve omitted some of the code for loading and reformatting the
data for brevity and for a more equitable comparison to ggfree
syntax.
# now let's try the same thing with ggfree
L <- tree.layout(flu, 'r')
plot(L, cex=0.6, type='n', mar=c(3,1,0,20), label='n')
# map dN/dS values to colours
pal <- colorRampPalette(c('#0072B2', '#D55E00'))(20)
breaks <- c(seq(0, 1.5, length.out=19), 1000)
col <- pal[as.integer(cut(L$edges$dnds, breaks=breaks))]
# draw labels
host <- ifelse(grepl('Swine', L$nodes$label), '#E41A1C', '#377EB8')
text(L, align=TRUE, cex=0.75, col=host)
# draw tree
lines(L, col=col, lwd=3)
axis(side=1, at=seq(0, 20, 5), labels=seq(1990, 2010, 5), line=-2)
# draw points on tips
points(L, pch=20, col=host, cex=ifelse(L$nodes$n.tips==0, 1.5, 0))
# map colors from edges to nodes
index <- L$edges$child[L$edges$isTip]
draw.guidelines(L, col=host[index])
# load genotype data (example from ggtree)
path <- system.file("extdata/Genotype.txt", package="ggfree")
geno <- read.table(path, header=T, sep='\t', na.strings='')
geno <- geno[match(flu$tip.label, row.names(geno)), ]
# draw boxes
require(RColorBrewer)
#> Loading required package: RColorBrewer
col <- brewer.pal(3, 'Set2')
image(L, geno, xlim=c(30, 37), col=col, cex.axis=0.75, line=-2)
This end result is pretty similar to the ggtree
example (Figure 5 from
this paper), except that
we’re not embedding small images and not annotating the internal nodes
with amino acid substitutions (which I find rather messy and
uninformative!).
The ape
package has several examples of phylogenetic trees that
relates birds at different taxonomic levels. I’m going to use one that
relates birds at the level of families.
# load the example tree
data("bird.families")
plot(tree.layout(bird.families, type='o'), cex.lab=0.8)
I’d like to annotate this tree with the number of named species per family. Since these data are not provided with this tree object, I had to search around for some supplementary data. I settled on the dataset published by the non-governmental conservation organization BirdLife International, where I retrieved the ZIP archive including an Excel spreadsheet that I converted to a CSV and processed in R. The resulting data frame covers substantially more bird families than our phylogeny:
path <- system.file("extdata/birdlife.csv", package='ggfree')
birds <- read.csv(path, row.names=1)
head(birds)
#> Family Count
#> 1 Acanthisittidae 4
#> 2 Acanthizidae 72
#> 3 Accipitridae 295
#> 4 Acrocephalidae 76
#> 5 Aegithalidae 17
#> 6 Aegithinidae 4
nrow(birds)
#> [1] 243
Let’s try to match up family names:
index <- match(bird.families$tip.label, birds$Family)
# some family names failed to match
bird.families$tip.label[is.na(index)]
#> [1] "Dendrocygnidae" "Bucorvidae" "Rhinopomastidae" "Dacelonidae"
#> [5] "Cerylidae" "Centropidae" "Coccyzidae" "Crotophagidae"
#> [9] "Neomorphidae" "Batrachostomidae" "Eurostopodidae" "Chionididae"
#> [13] "Eopsaltriidae"
# There's only 13 of these missing, so I'm willing to manually restore them.
# I *think* Dacelonidae is the tree kingfishers subfamily...
missing <- data.frame(
Family=c("Dendrocygnidae", "Bucorvidae", "Rhinopomastidae", "Dacelonidae", "Cerylidae", "Centropidae", "Coccyzidae", "Crotophagidae", "Neomorphidae", "Batrachostomidae", "Eurostopodidae", "Chionididae", "Eopsaltriidae"),
Count=c(8, 2, 3, 70, 9, 10, 13, 4, 6, 5, 3, 2, 44)
)
birds <- rbind(birds, missing)
index <- match(bird.families$tip.label, birds$Family)
Now we’re ready to use this information to annotate the tree:
# the count distribution is more even if we do a log-transform
bins <- as.integer(cut(log(birds$Count[index]), breaks=8))
require(RColorBrewer)
pal <- brewer.pal(9, 'Blues')[2:9]
L <- tree.layout(bird.families, type='o')
plot(L, cex.lab=0.7, offset=2, mar=rep(5,4), col='chocolate')
image(L, z=as.matrix(bins), xlim=c(28.5,30), col=pal)