New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Different counts in phyloseq object vs data.frame + ggplot2 #1376
Comments
Hi @bathyscapher, I wonder if you've worked it out why the abundance from the two methods were different in your case? I've tried it with demo dataset One thing I found odd is how the
With demo: library(phyloseq)
library(ggplot2)
data(GlobalPatterns)
# prune OTUs that are not present in at least 100% of the samples to reduce total taxa
gp <- prune_taxa(taxa_sums(GlobalPatterns) > length(sample_names(GlobalPatterns)), GlobalPatterns)
gp.log <- transform_sample_counts(gp, function(otu) {log1p(otu)})
p1 = plot_bar(gp.log, x = "Phylum", fill = "SampleType") +
geom_bar(aes(color = SampleType, fill = SampleType), stat = "identity", position = "stack") +
theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") + ylab("ln+1(Abundance) [%]")
tax <- as.data.frame(tax_table(gp))
otu <- as.data.frame(otu_table(gp))
gp.df.log <- cbind(tax, otu)
gp.df.log[, -c(1:7)] <- log1p(gp.df.log[, -c(1:7)])
gp.df.log.m <- reshape2::melt(gp.df.log, id.vars = colnames(gp.df.log)[1:7],
variable.name = "Sample", value.name = "Count")
gp.df.log.m <- merge(gp.df.log.m, data.frame(sample_data(gp)[,"SampleType"]), by.x = "Sample", by.y = "row.names")
gp.df.log.m$SampleType <- factor(gp.df.log.m$SampleType, levels = levels(sample_data(gp)$SampleType))
p2 = ggplot(gp.df.log.m, aes(x = Phylum, y = Count)) +
geom_bar(aes(color = SampleType, fill = SampleType), stat = "identity") +
theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") + ylab("ln+1(Abundance) [%]")
# Compare plots
#gridExtra::grid.arrange(p1, p2, ncol = 1) head(aggregate(Count ~ Phylum, data = gp.df.log.m, FUN = sum), 10)
Phylum Count
1 ABY1_OD1 13.28367
2 AC1 15.57499
3 Acidobacteria 11995.80383
4 Actinobacteria 18263.85000
5 AD3 106.65813
6 Armatimonadetes 529.13889
7 Bacteroidetes 27559.40199
8 BRC1 78.84175
9 Caldithrix 51.38637
10 CCM11b 23.85461 tax.log <- as.data.frame(tax_table(gp.log))
otu.log <- as.data.frame(otu_table(gp.log))
gp.df.log2 <- cbind(tax.log, otu.log)
gp.df.log2.m <- reshape2::melt(gp.df.log2, id.vars = colnames(gp.df.log2)[1:7],
variable.name = "Sample", value.name = "Count")
gp.df.log2.m <- merge(gp.df.log2.m, data.frame(sample_data(gp)[,"SampleType"]), by.x = "Sample", by.y = "row.names")
gp.df.log2.m$SampleType <- factor(gp.df.log2.m$SampleType, levels = levels(sample_data(gp)$SampleType)) > head(aggregate(Count ~ Phylum, data = gp.df.log.m, FUN = sum), 10)
Phylum Count
1 ABY1_OD1 13.28367
2 AC1 15.57499
3 Acidobacteria 11995.80383
4 Actinobacteria 18263.85000
5 AD3 106.65813
6 Armatimonadetes 529.13889
7 Bacteroidetes 27559.40199
8 BRC1 78.84175
9 Caldithrix 51.38637
10 CCM11b 23.85461 > table(aggregate(Count ~ Phylum, data = gp.df.log2.m, FUN = sum)$Count ==
aggregate(Count ~ Phylum, data = gp.df.log.m, FUN = sum)$Count)
TRUE
56
|
Dear joey711,
I run into an unexpected behaviour: coming from the same
phyloseq
object, the counts inplot_bar
look dissimilar to those generated from adata.frame
withggplot2
.Starting point is this
phyloseq
object:...that with
... yields this plot of ln+1 transformed read counts
To gain more flexibility, I converted the
phyloseq
objectsmp
into adata.frame
:... with dimensions:
So far, so good. Further, again a ln+1 transformation with a subsequent conversion into a long format and restoration of some metadata from the sample IDs:
... allows to plot the very same barplots directly with
ggplot
:... and yield this similar, but slightly different plot:
The counts are distinctly and always higher in the
phyloseq
approach. This is maybe clearer withaggregate
:As control, convert the ln+1 transformed
phyloseq
object into adata.frame
:... and compare:
So, finally, the question is where this deviance stem from. Do you have any clue what is wrong here? IMO the two approaches should yield exactly the same result...
Also, in the
phyloseq
plot, close-to-zero numbers are displayed much more pronounced, while they are invisible in theggplot2
version. Is an artefact or even a desired property to emphasize the low counts?I hope I could make my statement somewhat comprehensible (just a bit long maybe ;) ),
Best & thank you,
Bathyscapher
The text was updated successfully, but these errors were encountered: