Bioviz Under the Microscope, Part 1

Open-source biological visualization tools from dataviz.cafe

george s.
high stakes design

--

Above: Dengue virus (DENV) methyltransferase visualized in a JavaScript-based, dataviz.cafe-listed protein viewer called pv. These structures play a critical role in regulating gene expression and, in this particular case, can serve as early diagnostic markers of dengue hemorrhagic fever.

Biological visualization is becoming an increasingly sophisticated discipline, with visual analytics featuring prominently in contemporary life sciences, bioinformatics, and public health research around the world. In this post, we showcase some of the more noteworthy tools IQT Labs found when we used dataviz.cafe, IQT Labs’ newly launched collection of free and open-source software (FOSS) for visualization, to survey the open-source biovisualization landscape. In particular, we explore a handful of cutting-edge tools for visualizing genomic sequences and phylogenetic/evolutionary trees: Ideogram.js, Aequatus, ETE Toolkit, ggtree, and treeio. In an upcoming post, we’ll focus on software for visualizing molecular pathways, brain structures, and infectious disease transmission networks.

You can find these (and other) FOSS bioviz tools on the dataviz.cafe homepage by typing “bio” into the keyword search box. Alternatively, you can narrow down this subcategory by entering even more specialized search terms such as “gene,” “genetic,” “phylogenetic,” “chromosome,” “molecular,” “metabolism,” “neurology,” or “brain.”

We have chosen to highlight these particular categories and tools because we believe they constitute an illustrative cross-section of today’s biovisualization capabilities. The diversity and dynamism of biological data call for an equally varied visualization toolkit. In taking a deep dive into so many different bioviz disciplines, we hope to shed light on some important advances in scientific visualization and showcase the situational awareness value of dataviz.cafe.

Open-source tools for visualizing genomic data

Ideogram.js is a JavaScript library for displaying and animating chromosomes in a web browser. This software creates idiograms, a type of visualization that gene researchers have used for almost a century.

Above: 20,000+ genes visualized in Ideogram (grouped by chromosome); the red histograms show the distribution of individual genes within the human genome.
Above: a zoomed-in view of the same dataset focusing on chromosome 1 (the largest), annotated with a more detailed gene expression frequency histogram.

Idiograms are a practical way of looking for regularities and abnormalities in chromosomes, which can be valuable in the context of cancer research, studies of human aging, clinical sequencing, reference assembly, plant genetics, and agricultural biotechnology.

The idiogram first emerged as a graphical shorthand when microbiologists started staining sample chromosomes with special dyes, diagramming them, and highlighting where the stains tend to band. Because adenine-thymine (AT)-rich regions of a chromosome typically take up more stain than guanine-cytosine (GC)-rich regions, which tend to resist it, this technique allows researchers to distinguish regions dominated by each of these two complementary base pairs. As with other methods of data collection, grouping schemas, and norms for scientific communication, establishing consistent reporting procedures proved critical to the success of idiograms. By convention, idiograms arrange chromosomes in order of size, with arms aligned vertically and the shorter arm at the top.
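To make the staining logic concrete, here is a minimal JavaScript sketch that labels stretches of a sequence as AT-rich or GC-rich. The function names, the window size, the 50% threshold, and the toy sequence are all illustrative assumptions. Real banding patterns depend on chromatin structure and staining chemistry, not on base composition alone.

```javascript
// Fraction of G and C bases in a DNA string (case-insensitive).
function gcFraction(seq) {
  if (seq.length === 0) return 0;
  let gc = 0;
  for (const base of seq.toUpperCase()) {
    if (base === "G" || base === "C") gc += 1;
  }
  return gc / seq.length;
}

// Split a sequence into fixed-size windows and label each one.
// Windows with less than 50% GC are labeled "AT-rich" (a rough stand-in
// for regions that take up more stain); the rest are labeled "GC-rich".
function labelWindows(seq, windowSize) {
  const windows = [];
  for (let start = 0; start < seq.length; start += windowSize) {
    const chunk = seq.slice(start, start + windowSize);
    const gc = gcFraction(chunk);
    windows.push({ start, gc, label: gc < 0.5 ? "AT-rich" : "GC-rich" });
  }
  return windows;
}

// Toy sequence (hypothetical, not real genomic data).
const toySequence = "ATATTAAT" + "GCGGCCGC" + "ATGCATGC";
const bands = labelWindows(toySequence, 8);
console.log(bands.map((w) => `${w.start}: ${w.label}`).join("\n"));
```

The same windowing idea scales to real sequences, although a production tool would more likely read band positions from a curated cytogenetic dataset than compute them from raw bases.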

Once scientists agreed on this method of arrangement, idiograms quickly took off as a way of representing the relative size of individual chromosomes as well as their structure and composition. In short order, scientists began annotating idiograms, using color to highlight unknown bases and other areas of interest.
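The arrangement convention described above can be sketched in a few lines of JavaScript. The chromosome names and arm lengths below are hypothetical toy values, and ordering from largest to smallest is our assumption about what "in order of size" means in practice.

```javascript
// Each chromosome is described by the lengths of its two arms.
// Names and lengths are hypothetical toy values, in arbitrary units.
const chromosomes = [
  { name: "B", armLengths: [40, 90] },
  { name: "A", armLengths: [110, 60] },
  { name: "C", armLengths: [30, 35] },
];

// Apply the layout convention: the shorter arm goes on top,
// and chromosomes are ordered from largest to smallest (assumed order).
function arrangeIdiogram(chroms) {
  return chroms
    .map((c) => {
      const [shorter, longer] = [...c.armLengths].sort((a, b) => a - b);
      return { name: c.name, top: shorter, bottom: longer, total: shorter + longer };
    })
    .sort((a, b) => b.total - a.total);
}

const layout = arrangeIdiogram(chromosomes);
console.log(layout.map((c) => `${c.name}: top=${c.top}, bottom=${c.bottom}`).join("\n"));
```

Keeping the layout step separate from rendering means the same ordered list could feed any drawing code, whether SVG, canvas, or a plain table.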

Above: This idiogram shows large-scale insertions, deletions, and inversions, using color to group specific regions with significant variations. (In the live version, clicking or hovering over a given region hides the other annotations.)

Today, Ideogram.js brings the shorthand of idiograms to the modern web browser. Like many visualization tools in circulation today, it uses D3.js, a popular JavaScript library that is compatible with a wide variety of browser environments. For larger, more complex genome-wide datasets, Ideogram.js provides faceted search using Crossfilter.js, another versatile JavaScript library built to filter multivariate data. In addition, the tool comes with several wrappers that allow users to integrate Ideogram into other data science platforms like R and Jupyter Notebook.
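Faceted search means narrowing a set of annotations along several dimensions at once. The sketch below illustrates the idea in plain JavaScript. It is neither Crossfilter's API nor Ideogram's implementation, and the field names and records are hypothetical.

```javascript
// Hypothetical annotation records; values are illustrative only.
const annotations = [
  { name: "geneA", chr: "1", type: "protein-coding", expression: 12 },
  { name: "geneB", chr: "1", type: "pseudogene", expression: 0 },
  { name: "geneC", chr: "17", type: "protein-coding", expression: 48 },
  { name: "geneD", chr: "17", type: "protein-coding", expression: 3 },
];

// Keep only the records that pass every active facet.
// Each facet maps a field name to a predicate on that field's value.
function applyFacets(records, facets) {
  return records.filter((record) =>
    Object.entries(facets).every(([field, accepts]) => accepts(record[field]))
  );
}

// Count records per value of one field, e.g. to label facet checkboxes.
function countBy(records, field) {
  const counts = {};
  for (const record of records) {
    counts[record[field]] = (counts[record[field]] || 0) + 1;
  }
  return counts;
}

// Two facets active at once: gene type and a minimum expression level.
const selected = applyFacets(annotations, {
  type: (t) => t === "protein-coding",
  expression: (e) => e >= 10,
});
const perChromosome = countBy(selected, "chr");
console.log(selected.map((a) => a.name));
console.log(perChromosome);
```

This version rescans every record on each change, which is fine for a few thousand annotations. Libraries built for this job typically maintain indexes so that adjusting one facet does not require a full pass.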

Since launching in 2015, Ideogram has been used to visualize the genomes of humans and other primates, as well as fish, mice, chickens, insects, fungi, tropical plants, and staple crops like corn and rice. By default, Ideogram now includes access to several genome-wide datasets for these model organisms, defined through a concise snippet of JavaScript (example below).

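// Assumes the Ideogram.js library is already loaded on the page.
var ideogram = new Ideogram({
  organism: "human",
  annotations: [{
    name: "BRCA1",   // label shown for the annotation
    chr: "17",       // chromosome on which to draw it
    start: 43044294, // start coordinate, in base pairs
    stop: 43125482   // stop coordinate, in base pairs
  }]
});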

By providing expressive templates like the above and by including access to model organism data, efforts like Ideogram.js help lower barriers to adoption for biological visualization and, in turn, drive open innovation in the life sciences.
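Building on the snippet above, here is a minimal sketch of how a table of genes might be turned into the same kind of configuration object. It uses only the keys that appear in the snippet (organism, annotations, name, chr, start, stop). The helper name parseAnnotations and the second row of data are hypothetical.

```javascript
// Convert tab-separated rows (name, chr, start, stop) into annotation
// objects with the same keys used in the snippet above.
function parseAnnotations(tsv) {
  return tsv
    .trim()
    .split("\n")
    .map((line) => {
      const [name, chr, start, stop] = line.split("\t");
      return { name, chr, start: Number(start), stop: Number(stop) };
    });
}

// The first row reuses the BRCA1 coordinates from the snippet above;
// the second row is a made-up placeholder, not a real gene.
const rows = [
  ["BRCA1", "17", 43044294, 43125482].join("\t"),
  ["EXAMPLE_GENE", "1", 1000000, 1050000].join("\t"),
].join("\n");

const config = {
  organism: "human",
  annotations: parseAnnotations(rows),
};

console.log(JSON.stringify(config, null, 2));
// In a page that has loaded Ideogram.js, this object could then be
// passed to the constructor: new Ideogram(config)
```

Separating data preparation from the constructor call keeps the plain-data step easy to test outside a browser.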

Visualizing phylogenetic trees with open-source software

Aequatus is an open-source software library developed by the Earlham Institute (formerly The Genome Analysis Centre UK). Aequatus is similar to Ideogram in that it relies on D3 for rendering DNA insertions and deletions. Like Ideogram, it also includes access to reference genomes for several species. Aequatus differs, however, in its multiple capabilities for visualizing different species’ evolution and common ancestry, which brings us up a level of analysis and into another category of biovisualization: phylogenetic trees.

Above: screenshot from Aequatus showing alignments between homologous genes across different species

Aequatus looks for synteny, that is, regions of chromosomes with similar sets of genes positioned in roughly the same locations within the genome. Synteny can help determine whether two species share a common ancestor.
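As a rough illustration of the concept, the sketch below finds runs of genes that appear consecutively and in the same order in two species. This is a deliberately strict toy version, not the method Aequatus uses. The gene identifiers are hypothetical, each identifier is assumed to occur at most once per species, and real synteny detection tolerates gaps, inversions, and duplications.

```javascript
// Each genome is an ordered list of gene family identifiers along one
// chromosome. Identifiers are hypothetical; genes sharing an identifier
// are treated as homologous.
const speciesA = ["g1", "g2", "g3", "g7", "g4", "g5"];
const speciesB = ["g9", "g1", "g2", "g3", "g8", "g4", "g5"];

// Find runs of at least `minLength` genes that appear consecutively and
// in the same order in both lists (a simple collinear block).
function findSyntenyBlocks(a, b, minLength = 2) {
  const positionInB = new Map(b.map((gene, index) => [gene, index]));
  const blocks = [];
  let current = [];

  for (const gene of a) {
    const pos = positionInB.get(gene);
    const previous = current.length
      ? positionInB.get(current[current.length - 1])
      : undefined;
    if (pos !== undefined && (current.length === 0 || pos === previous + 1)) {
      current.push(gene); // extends the current run
    } else {
      if (current.length >= minLength) blocks.push(current);
      current = pos !== undefined ? [gene] : []; // start a new run
    }
  }
  if (current.length >= minLength) blocks.push(current);
  return blocks;
}

const blocks = findSyntenyBlocks(speciesA, speciesB);
console.log(blocks.map((block) => block.join(" ")).join("\n"));
```

In this toy input, the gene missing from the second species breaks the run, so the function reports two separate blocks rather than one long one.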

Importantly, Aequatus has a JavaScript integration with the Galaxy computational biology platform, a general-purpose research management system. By offering both a standalone web-based viewer and a more customizable plugin version, Aequatus makes visualization capabilities available to scientists who are not comfortable with elaborate programming tasks. This user-friendliness matters: in building dataviz.cafe, we were mindful that low-code and zero-code tools like these are valuable for users who are well-versed in a scientific discipline but have little or no software development expertise.

Exploring evolutionary relationships with phylogeny visualizations

Dataviz.cafe includes several other tools that use phylogenetic trees to visualize evolutionary relationships. These include ETE Toolkit, ggtree, and treeio.

Above: a “tree of life” visualization generated using ETE

ETE stands for Environment for Tree Exploration, and is enormously popular in the bioinformatics community, with more than 5,000 downloads per month. Perhaps best known for its “tree of life” representation (shown above), ETE can generate circular, semicircular, and rectilinear tree structures with a wide variety of annotation options.
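To give a sense of what a circular or semicircular layout involves, here is a minimal JavaScript sketch that assigns an angle to every node in a small tree. It is written in JavaScript to match the rest of this post and is not ETE's Python API or its layout algorithm. The tree, the function name, and the even-spacing rule are assumptions.

```javascript
// A small hypothetical tree; leaves have no `children` array.
const tree = {
  name: "root",
  children: [
    { name: "A" },
    {
      name: "inner",
      children: [{ name: "B" }, { name: "C" }, { name: "D" }],
    },
  ],
};

// Assign an angle (in degrees) to every node, modifying the tree in place.
// Leaves are spaced evenly across `arcDegrees` (360 for a circular tree,
// 180 for a semicircular one); each internal node sits at the mean angle
// of its children.
function assignAngles(root, arcDegrees = 360) {
  const leaves = [];
  (function collect(node) {
    if (!node.children || node.children.length === 0) leaves.push(node);
    else node.children.forEach(collect);
  })(root);

  const step = arcDegrees / leaves.length;
  leaves.forEach((leaf, i) => {
    leaf.angle = i * step;
  });

  (function place(node) {
    if (!node.children || node.children.length === 0) return node.angle;
    const angles = node.children.map(place);
    node.angle = angles.reduce((sum, a) => sum + a, 0) / angles.length;
    return node.angle;
  })(root);

  return root;
}

assignAngles(tree, 360);
console.log(JSON.stringify(tree, null, 2));
```

A rectilinear layout follows the same pattern with a vertical position in place of the angle. Radius, or horizontal position, would come from branch lengths or node depth.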

It also includes a module for testing evolutionary hypotheses. This feature lets users load sequence alignment data (highlighted in the genomic visualization section immediately above) and run multiple simulation models under different conditions. Using the output of these models, researchers can visually identify how different selective pressures, like habitat, diet, mating patterns, and predator encroachment, might have shaped the phylogeny of various branches of the tree.

Above: the phylogenetic tree of different H3 influenza viruses. This ggtree visualization groups host species by color, with humans in blue and swine in red.

Treeio and ggtree are two R-based alternatives to ETE, which is Python-based. Both are part of the rOpenSci and Bioconductor projects, which aim to address transparency and reproducibility in the life sciences. Together, treeio and ggtree generate visualizations similar to ETE's and, with a few minor exceptions, handle similar data inputs. Treeio can also convert between formats and link heterogeneous data sources, such as phenotypic, experimental, and clinical data, with phylogeny data.
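The two ideas in the paragraph above, format conversion and linking external data to a tree, can be sketched in JavaScript. This is not treeio's R API. We assume Newick as the input format because it is widely used for phylogenetic trees, although this post does not name specific formats. The strain names and host labels are hypothetical.

```javascript
// Parse a Newick string into nested { name, length, children } objects.
// Minimal sketch: no quoted labels, comments, or error recovery.
function parseNewick(text) {
  let i = 0;
  function parseNode() {
    const node = { name: "", length: null, children: [] };
    if (text[i] === "(") {
      i += 1; // skip "("
      node.children.push(parseNode());
      while (text[i] === ",") {
        i += 1; // skip ","
        node.children.push(parseNode());
      }
      i += 1; // skip ")"
    }
    // Read the optional label, then the optional ":length".
    let label = "";
    while (i < text.length && !"(),:;".includes(text[i])) label += text[i++];
    node.name = label.trim();
    if (text[i] === ":") {
      i += 1; // skip ":"
      let num = "";
      while (i < text.length && !"(),:;".includes(text[i])) num += text[i++];
      node.length = Number(num);
    }
    return node;
  }
  return parseNode();
}

// Attach external records to leaves by matching on the leaf name.
function attachMetadata(node, metadataByName) {
  if (node.children.length === 0 && metadataByName[node.name]) {
    node.metadata = metadataByName[node.name];
  }
  node.children.forEach((child) => attachMetadata(child, metadataByName));
  return node;
}

// Hypothetical tree and host labels, for illustration only.
const newick = "((strainA:0.1,strainB:0.2):0.05,strainC:0.3);";
const hosts = {
  strainA: { host: "human" },
  strainB: { host: "swine" },
  strainC: { host: "human" },
};
const annotated = attachMetadata(parseNewick(newick), hosts);
console.log(JSON.stringify(annotated, null, 2));
```

Once the tree is a plain nested object with metadata on its leaves, serializing it to JSON is one form of format conversion, and the host field could drive coloring by host species in a viewer.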

This overlap across different software packages illustrates an important lesson in free and open-source software (FOSS) for dataviz: for any given dataset, there is often more than one tool or method that allows users to generate a visualization of interest. Today’s abundance of scientific visualization software equips researchers, analysts, data scientists, and designers with multiple options suited to their preferred toolchains, data formats, and programming constraints. Comparing options is essential to making informed visualization tool choices, hence the value of a resource like dataviz.cafe.

Stay tuned for our next post, which will feature open-source tools for visualizing brain structures, molecular pathways, and infectious disease outbreaks.

Please note that software listings on dataviz.cafe and references in this blog series are for informational purposes only. This catalog of visualization tools refers to third-party sites, software packages, and code modules that are not maintained by In-Q-Tel/IQT Labs. Listing here does not constitute an endorsement or recommendation, and your usage of this information is subject to both IQT’s Privacy Policy and Terms of Use.

george s.
high stakes design

👨🏻‍💻 open-source data visualization at IQT Labs