Different counts in phyloseq object vs data.frame + ggplot2 #1376

bathyscapher · 2020-08-04T09:57:49Z

Dear joey711,

I run into an unexpected behaviour: coming from the same phyloseq object, the counts in plot_bar look dissimilar to those generated from a data.frame with ggplot2.

Starting point is this phyloseq object:

> smp
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 452 taxa and 160 samples ]
sample_data() Sample Data:       [ 160 samples by 23 sample variables ]
tax_table()   Taxonomy Table:    [ 452 taxa by 6 taxonomic ranks ]

...that with

smp.log <- transform_sample_counts(smp, function(otu) {log1p(otu)})

plot_bar(smp.log, x = "Class", fill = "Site") +
  geom_bar(aes(color = Site, fill = Site), stat = "identity",
           position = "stack") +
  facet_grid( ~ Domain, scales = "free", space = "free") +
  theme(legend.position = "top") +
  xlab("") +
  ylab("ln+1(Abundance) [%]") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

... yields this plot of ln+1 transformed read counts

To gain more flexibility, I converted the phyloseq object smp into a data.frame:

tax <- as.data.frame(smp@tax_table@.Data)
otu <- t(as.data.frame(otu_table(smp)))

smp.df <- merge(tax, otu, by = 0, all = TRUE)
rownames(smp.df) <- smp.df$Row.names
colnames(smp.df)[1] <- "Taxa"

... with dimensions:

> dim(smp.df)
[1] 452 167

So far, so good. Further, again a ln+1 transformation with a subsequent conversion into a long format and restoration of some metadata from the sample IDs:

smp.df[, -c(1:7)] <- log1p(smp.df[, -c(1:7)])

smp.df.m <- reshape2::melt(smp.df, id.vars = c("Taxa", "Domain",
                                                      "Phylum", "Class",
                                                      "Order", "Family",
                                                      "Genus"),
                               variable.name = "FullID", value.name = "Count")
smp.df.m$Site <- substr(smp.df.m$FullID, 1, 2)
smp.df.m$Site <- ordered(smp.df.m$Site,
                           levels = c("CB", "LT", "LE", "LV", "LM"))

... allows to plot the very same barplots directly with ggplot:

ggplot(smp.df.m, aes(x = Class, y = Count, fill = Site)) +
  geom_bar(stat = "identity") +
  facet_grid( ~ Domain, scales = "free", space = "free") +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("ln+1(Abundance) [%]") +
  xlab("")

... and yield this similar, but slightly different plot:

The counts are distinctly and always higher in the phyloseq approach. This is maybe clearer with aggregate:

> aggregate(Count ~ Class, data = smp.df.m, FUN = sum)
                                 Class        Count
1                       Acidimicrobiia  0.004355303
2                       Acidobacteriae  0.294341520
3                       Actinobacteria  0.381871643
4                  Alphaproteobacteria 28.411744445
5                         Anaerolineae  0.023555151
6                             Babeliae  0.009931957
7                              Bacilli  0.094647975
8                          Bacteroidia 28.127169898
9                           Bdelloidea  0.021980542
10                     Bdellovibrionia  0.696601756
11                      Bicosoecida_cl  0.513898510
12                      Blastocatellia  0.018884731
13                 Blastocladiomycetes  0.005309363
14                        Branchiopoda  0.043169334
15                     Campylobacteria  0.003890279
16                      Cercomonadidae  0.020020627
17                          Chlamydiae  0.014261845
18                       Chlorophyceae  4.764854839
19                    Choanoflagellida  0.433964103
20                         Chromadorea  2.983740776
21                       Chrysophyceae 24.013766364
22                    Chthonomonadetes  0.029224961
23                    Chytridiomycetes  1.114656382
24                          Clitellata  0.017599472
25                          Clostridia  0.241672826
26                      Coriobacteriia  0.002587673
27                    Cryptophyceae_cl  0.808402938
28                      Cyanobacteriia  0.004642427
29                 Cystobasidiomycetes  0.273621094
30                    Desulfovibrionia  0.003290145
31                         Dinophyceae  0.005431654
32                            Discosea  0.478278176
33                           Euglenida  2.073871107
34                        Eutardigrada  0.499949209
35                      Fimbriimonadia  0.028017848
36                 Gammaproteobacteria 73.053148191
37                    Gemmatimonadetes  0.002417882
38                          Holophagae  0.004763798
39                          Imbricatea  4.848615365
40                      Incertae_Sedis  0.203811372
41                  Intramacronucleata 20.346342998
42                       Kinetoplastea  9.942740949
43                   Malasseziomycetes  0.028507168
44                         Monogononta  0.001839331
45                          Myxococcia  0.016153610
46                       Negativicutes  0.031457376
47                     Nitrososphaeria  0.002209479
48 Nucleariidae_and_Fonticula_group_cl  0.008895341
49                         Oligoflexia  0.093392009
50                         Perkinsidae  0.003257338
51               Peronosporomycetes_cl  0.346673428
52                       Phycisphaerae  0.005929233
53                      Planctomycetes  0.044320291
54                           Polyangia  0.002585521
55                       Rhabditophora  0.342925681
56                     Saccharomycetes  2.385294963
57                          Tetramitia  1.385767081
58                        Thecofilosea  0.639731651
59                    Trebouxiophyceae  2.362078500
60                     Tremellomycetes  5.754122950
61                    Verrucomicrobiae  0.359597885
62                    Vicinamibacteria  0.019765229

As control, convert the ln+1 transformed phyloseq object into a data.frame:

tax.log <- as.data.frame(smp.log@tax_table@.Data)
otu.log <- t(as.data.frame(otu_table(smp.log)))

smp.df.log.m <- merge(tax.log, otu.log, by = 0, all = TRUE)
rownames(smp.df.log.m) <- smp.df.log.m$Row.names
colnames(smp.df.log.m)[1] <- "Taxa"

... and compare:

> aggregate(Count ~ Class, data = smp.df.log.m, FUN = sum)
                                 Class        Count
1                       Acidimicrobiia  0.004374341
2                       Acidobacteriae  0.297461237
3                       Actinobacteria  0.390738936
4                  Alphaproteobacteria 31.108635644
5                         Anaerolineae  0.023682891
6                             Babeliae  0.009968175
7                              Bacilli  0.096179130
8                          Bacteroidia 32.258376283
9                           Bdelloidea  0.022146068
10                     Bdellovibrionia  0.715886080
11                      Bicosoecida_cl  0.546559720
12                      Blastocatellia  0.019188304
13                 Blastocladiomycetes  0.005337678
14                        Branchiopoda  0.045102204
15                     Campylobacteria  0.003895521
16                      Cercomonadidae  0.020204573
17                          Chlamydiae  0.014340958
18                       Chlorophyceae  6.176733400
19                    Choanoflagellida  0.502132909
20                         Chromadorea  4.180737261
21                       Chrysophyceae 36.034411670
22                    Chthonomonadetes  0.029607917
23                    Chytridiomycetes  1.396314348
24                          Clitellata  0.017913817
25                          Clostridia  0.242862298
26                      Coriobacteriia  0.002590202
27                    Cryptophyceae_cl  1.261008857
28                      Cyanobacteriia  0.004648014
29                 Cystobasidiomycetes  0.296732475
30                    Desulfovibrionia  0.003293526
31                         Dinophyceae  0.005446133
32                            Discosea  0.747785243
33                           Euglenida  2.649113627
34                        Eutardigrada  0.657231874
35                      Fimbriimonadia  0.028170217
36                 Gammaproteobacteria 82.496346010
37                    Gemmatimonadetes  0.002420996
38                          Holophagae  0.004786582
39                          Imbricatea  5.974678427
40                      Incertae_Sedis  0.219539667
41                  Intramacronucleata 25.491293114
42                       Kinetoplastea 12.956833682
43                   Malasseziomycetes  0.029328645
44                         Monogononta  0.001842719
45                          Myxococcia  0.016241119
46                       Negativicutes  0.031623114
47                     Nitrososphaeria  0.002214370
48 Nucleariidae_and_Fonticula_group_cl  0.008975058
49                         Oligoflexia  0.094216947
50                         Perkinsidae  0.003267977
51               Peronosporomycetes_cl  0.410704946
52                       Phycisphaerae  0.005964563
53                      Planctomycetes  0.044588886
54                           Polyangia  0.002587698
55                       Rhabditophora  0.505408121
56                     Saccharomycetes  3.198730010
57                          Tetramitia  1.912039783
58                        Thecofilosea  0.879960542
59                    Trebouxiophyceae  3.053906946
60                     Tremellomycetes  7.339475733
61                    Verrucomicrobiae  0.365067198
62                    Vicinamibacteria  0.019929726

So, finally, the question is where this deviance stem from. Do you have any clue what is wrong here? IMO the two approaches should yield exactly the same result...

Also, in the phyloseq plot, close-to-zero numbers are displayed much more pronounced, while they are invisible in the ggplot2 version. Is an artefact or even a desired property to emphasize the low counts?

I hope I could make my statement somewhat comprehensible (just a bit long maybe ;) ),
Best & thank you,
Bathyscapher

The text was updated successfully, but these errors were encountered:

ycl6 · 2021-08-04T22:17:31Z

Hi @bathyscapher, I wonder if you've worked it out why the abundance from the two methods were different in your case? I've tried it with demo dataset GlobalPatterns and found the abundance to be exectly the same with both methods.

One thing I found odd is how the tax and otu were combined in your script, where the OTU table was transposed and then merged with the Tax table, but this will not create the a data.frame with the correct information.

tax <- as.data.frame(smp@tax_table@.Data)
otu <- t(as.data.frame(otu_table(smp)))
smp.df <- merge(tax, otu, by = 0, all = TRUE)

With demo:

library(phyloseq)
library(ggplot2)
data(GlobalPatterns)

# prune OTUs that are not present in at least 100% of the samples to reduce total taxa
gp <- prune_taxa(taxa_sums(GlobalPatterns) > length(sample_names(GlobalPatterns)), GlobalPatterns)
gp.log <- transform_sample_counts(gp, function(otu) {log1p(otu)})

p1 = plot_bar(gp.log, x = "Phylum", fill = "SampleType") +
        geom_bar(aes(color = SampleType, fill = SampleType), stat = "identity", position = "stack") +
        theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) +
        xlab("") + ylab("ln+1(Abundance) [%]")

tax <- as.data.frame(tax_table(gp))
otu <- as.data.frame(otu_table(gp))
gp.df.log <- cbind(tax, otu)
gp.df.log[, -c(1:7)] <- log1p(gp.df.log[, -c(1:7)])

gp.df.log.m <- reshape2::melt(gp.df.log, id.vars = colnames(gp.df.log)[1:7], 
                              variable.name = "Sample", value.name = "Count")
gp.df.log.m <- merge(gp.df.log.m, data.frame(sample_data(gp)[,"SampleType"]), by.x = "Sample", by.y = "row.names")
gp.df.log.m$SampleType <- factor(gp.df.log.m$SampleType, levels = levels(sample_data(gp)$SampleType))

p2 = ggplot(gp.df.log.m, aes(x = Phylum, y = Count)) +
        geom_bar(aes(color = SampleType, fill = SampleType), stat = "identity") +
        theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) +
        xlab("") + ylab("ln+1(Abundance) [%]")

# Compare plots
#gridExtra::grid.arrange(p1, p2, ncol = 1)

head(aggregate(Count ~ Phylum, data = gp.df.log.m, FUN = sum), 10)
            Phylum       Count
1         ABY1_OD1    13.28367
2              AC1    15.57499
3    Acidobacteria 11995.80383
4   Actinobacteria 18263.85000
5              AD3   106.65813
6  Armatimonadetes   529.13889
7    Bacteroidetes 27559.40199
8             BRC1    78.84175
9       Caldithrix    51.38637
10          CCM11b    23.85461

tax.log <- as.data.frame(tax_table(gp.log))
otu.log <- as.data.frame(otu_table(gp.log))
gp.df.log2 <- cbind(tax.log, otu.log)

gp.df.log2.m <- reshape2::melt(gp.df.log2, id.vars = colnames(gp.df.log2)[1:7], 
                               variable.name = "Sample", value.name = "Count")
gp.df.log2.m <- merge(gp.df.log2.m, data.frame(sample_data(gp)[,"SampleType"]), by.x = "Sample", by.y = "row.names")
gp.df.log2.m$SampleType <- factor(gp.df.log2.m$SampleType, levels = levels(sample_data(gp)$SampleType))

> head(aggregate(Count ~ Phylum, data = gp.df.log.m, FUN = sum), 10)
            Phylum       Count
1         ABY1_OD1    13.28367
2              AC1    15.57499
3    Acidobacteria 11995.80383
4   Actinobacteria 18263.85000
5              AD3   106.65813
6  Armatimonadetes   529.13889
7    Bacteroidetes 27559.40199
8             BRC1    78.84175
9       Caldithrix    51.38637
10          CCM11b    23.85461

> table(aggregate(Count ~ Phylum, data = gp.df.log2.m, FUN = sum)$Count ==
        aggregate(Count ~ Phylum, data = gp.df.log.m, FUN = sum)$Count)

TRUE
  56

R version 4.1.0
ggplot2_3.3.4 
phyloseq_1.36.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different counts in phyloseq object vs data.frame + ggplot2 #1376

Different counts in phyloseq object vs data.frame + ggplot2 #1376

bathyscapher commented Aug 4, 2020

ycl6 commented Aug 4, 2021

Different counts in phyloseq object vs data.frame + ggplot2 #1376

Different counts in phyloseq object vs data.frame + ggplot2 #1376

Comments

bathyscapher commented Aug 4, 2020

ycl6 commented Aug 4, 2021