Application of Circos to Genomics

Abstract

Circos is used for the identification and analysis of similarities and differences arising from comparisons of genomes. Circos is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies.

Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line and histogram plots, heat maps, tiles, connectors and text.

Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.

An interactive online version of Circos designed to visualize tabular data is available. Circos is licensed under GPL.

Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645.

The creation of Circos was motivated by a need to visualize structural variation within a genome. Initially, this variation was detected using BAC clones derived from tumor genomes: a clone with alignments to distant regions of the genome captured a rearrangement in the cancer genome. The positions of these alignments were drawn circularly, and the density of the alignments (clones sampled the genome redundantly) was taken as confirmation of the rearrangement.

Subsequently, we began using Circos to show relationships between the sequences of multiple genomes, thus visualizing sequence synteny and conservation. Typically a genome is characterized in several ways, each at a different resolution, and Circos was used to show the relationships between corresponding positions within these representations (e.g. sequence assembly and fingerprint map).

Specific features are included to help with viewing data on the genome. The genome is a large structure with localized regions of interest, frequently separated by large oceans of uninteresting sequence. To help visualize data in this context, Circos can create images with variable axis scaling, which magnifies selected genomic regions locally without cropping the rest of the genome. Scale smoothing ensures that the magnification level changes gradually rather than abruptly. In combination with axis breaks and custom ideogram order, the final image can be tuned to offer the clearest illustration of your data.
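The idea behind variable axis scaling can be sketched in a few lines of Python. This is only an illustration of the concept, not how Circos implements it: the function name `display_coordinate`, the `(start, end, scale)` zone tuples, and the piecewise-constant magnification are all assumptions made for the example, and scale smoothing is deliberately left out.

```python
from typing import List, Tuple

# Hypothetical zone description: (start, end, scale), half-open [start, end).
Zone = Tuple[int, int, float]


def display_coordinate(pos: int, zones: List[Zone], base_scale: float = 1.0) -> float:
    """Map a genomic position to a display coordinate.

    Stretches of the axis listed in `zones` are magnified by their own scale
    factor; everything else uses `base_scale`. Zones must be sorted by start
    and must not overlap. No smoothing is applied at zone boundaries.
    """
    display = 0.0
    cursor = 0  # genomic position up to which display length has been added
    for start, end, scale in zones:
        if pos <= start:
            break
        # Unmagnified stretch between the previous zone and this one.
        display += (start - cursor) * base_scale
        # Magnified stretch inside this zone, clipped at `pos`.
        display += (min(pos, end) - start) * scale
        cursor = min(pos, end)
    # Remaining unmagnified stretch up to `pos`.
    display += (pos - cursor) * base_scale
    return display
```

With a single zone `(100, 200, 10.0)`, positions before 100 are unchanged, while every base inside the zone takes ten times as much room on the axis. Nothing is cropped: positions after the zone are simply pushed further along.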

Circos is similar to chromowheel and, to a lesser extent, genopix.

Circos Archetype

Let's look at an image which typifies one kind of genomic data illustration — one with a large number of links and several high-resolution tracks placed on the outside. This image appeared in the Conde Nast Portfolio as part of an article about 23andMe.

Circos two-page spread from Conde Nast Portfolio
An illustration of the human genome showing the location of genes implicated in disease, regions with self-similarity, and regions with structural variation within populations. This graphic layers a variety of data types (links, heat maps, tiles, histograms) and is a good example of a Circos image. This graphic appeared in the Conde Nast Portfolio.

What is Shown?

The human genome comprises 22 pairs of autosomes (chromosomes 1–22) and one pair of sex chromosomes (X, Y). Individual chromosomes range from about 50 Mb (chr 21) to about 250 Mb (chr 1) and together make up the 3 Gb human genome.

This graphic shows the chromosomes arranged in a circle, drawn as wedges and marked with a length scale. Data placed outside the chromosome ring represent the degree of small- and large-scale variation between different populations at a given position in the genome.

Data placed on top of the chromosome ring highlights positions of genes implicated in disease, such as cancer, diabetes, and glaucoma. Data placed inside the ring links disease-related genes found in the same biochemical pathway (grey) and the degree of similarity for a subset of the genome (colored).

Detailed Caption

The graphic shows the human genome annotated with data related to genes implicated in disease, regions of variation found in various populations, and regions of similarity between chromosomes.

The 24 individual chromosomes (1, 2, 3, ..., 21, 22, X, Y) are arranged circularly (C) and represented by labeled ideograms (C3) on which the distance scale is displayed (C1).

Some chromosomes are shown at different physical scales to illustrate the rich pattern of the data (chr 2 at 3× zoom; chrs 18, 19, 20, 21, 22 at 2× zoom; chrs 3, 7, 17 at 10× zoom). Within each ideogram, cytogenetic bands are shown (C2). These are large-scale features used in cytogenetics to locate and reference gross changes.

On the outside of the ideograms, genomic variation between individuals and populations is represented by tracks A and B. The number of catalogued locations at which single base pair changes have been observed within populations is shown as a histogram (A). Large regions which have been seen to vary in size and copy number between individuals are marked in track B.

Locations of genes associated with disease are superimposed on the ideograms (D). Track D3 shows the location of genes implicated in cancer (very dark red), other diseases (dark red), and all other genes (red). Track D2 shows locations of genes implicated in lung, ovarian, breast, prostate, pancreatic, and colon cancer, colored in progressively darker shades of red. Track D1 marks positions of genes implicated in other diseases such as ataxia, epilepsy, glaucoma, heart disease, and neuropathy, colored in progressively darker shades of red, as well as diabetes (orange), deafness (green), and Alzheimer's disease (blue).

Grey lines (E) connect positions on ideograms associated with genes that participate in the same biochemical pathways. The shade of the link reflects the character of the gene: dark grey indicates that the gene is implicated in cancer, grey that it is implicated in disease, and light grey marks all other genes. Colored links (F) connect a subset of genomic region pairs that are highly similar and illustrate the deep level of similarity between genomic regions (about 50% of the genome is in so-called repeat regions, which appear in the genome multiple times and in a variety of locations).

References

Many of the data sets used in the figure are available through the genome browser at the University of California, Santa Cruz (UCSC). The data used in the figure were downloaded from the table browser for the human genome assembly (hg18, May 2006).

The data used (group/track) for each figure element are as follows:

  • C: mapping and sequencing tracks / chromosome band (ideogram).
  • A: variation and repeats / SNPs (v126). The histogram shows the number of SNPs per 1 Mb.
  • F: variation and repeats / segmental duplication. A small subset of segmental duplications is drawn, filtered by locations on chromosomes 2, 3, 7, 9. The choice of locations was motivated by the need for a visually balanced set of links.
  • B: variants in genome structure catalogued by the TCAG database.
  • D: locations of genes implicated in disease. Gene-to-disease mappings were done using the OMIM database.
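The SNP histogram in track A amounts to counting positions in fixed 1 Mb windows. A minimal Python sketch of that binning step follows, assuming the SNPs are already available as `(chromosome, position)` pairs. The function name `snp_density`, the half-open bin convention, and the output row layout are choices made for this example, not a description of the scripts actually used for the figure.

```python
from collections import Counter
from typing import Iterable, List, Tuple

BIN_SIZE = 1_000_000  # 1 Mb windows, matching the histogram described for track A


def snp_density(
    snps: Iterable[Tuple[str, int]], bin_size: int = BIN_SIZE
) -> List[Tuple[str, int, int, int]]:
    """Count SNPs per fixed-size bin.

    `snps` yields (chromosome, 0-based position) pairs. The result is a list
    of (chromosome, bin_start, bin_end, count) rows with half-open bins.
    Bins that contain no SNPs are omitted, and rows are sorted by chromosome
    name (as a string) and then by bin start.
    """
    counts = Counter((chrom, pos // bin_size) for chrom, pos in snps)
    return [
        (chrom, idx * bin_size, (idx + 1) * bin_size, n)
        for (chrom, idx), n in sorted(counts.items())
    ]
```

Each row can then be written out as one whitespace-separated line (chromosome, start, end, value) for plotting as a histogram.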

Gene-to-chromosome location mappings were done using the following data tables from UCSC:

  • gene and gene prediction tracks / UCSC genes
  • gene and gene prediction tracks / RefSeq genes

For example, genes implicated in diabetes were found by scanning for all genes and gene aliases that have the keyword "diabetes" in the OMIM entry. Subsequently, the list of gene names was cross-referenced with positional information from UCSC to obtain a final list of genomic positions.
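The two steps (keyword scan, then cross-reference by name or alias) can be sketched in Python as below. The `OmimRecord` structure and the `positions` lookup table are hypothetical simplifications invented for this example. Real OMIM entries and UCSC gene tables have richer formats that would need to be parsed first.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

Position = Tuple[str, int, int]  # (chromosome, start, end)


class OmimRecord(NamedTuple):
    """Hypothetical, simplified stand-in for one OMIM entry."""
    symbol: str
    aliases: List[str]
    text: str


def locate_genes(
    records: Iterable[OmimRecord],
    positions: Dict[str, Position],
    keyword: str,
) -> Dict[str, Position]:
    """Return positions of genes whose entry text mentions `keyword`.

    The match is a case-insensitive substring test. For each matching record
    the primary symbol is looked up first, then each alias in order; records
    with no positional match are dropped.
    """
    keyword = keyword.lower()
    located: Dict[str, Position] = {}
    for rec in records:
        if keyword not in rec.text.lower():
            continue
        for name in [rec.symbol, *rec.aliases]:
            if name in positions:
                located[rec.symbol] = positions[name]
                break
    return located
```

A plain substring match is a blunt tool: it would also pick up entries that mention the keyword only in passing, so a list produced this way is a starting point rather than a curated set.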

Track D3 shows all genes (red), OMIM genes (dark red), and genes from the Cancer Gene Census (very dark red), a manually curated subset of genes with strong evidence linking them to cancer. Tracks D1 and D2 show locations of genes associated with specific types of cancer, as well as with other diseases.

Links shown in E connect genes found in metabolic pathways catalogued by the KEGG database. For a given set of genes `g_1, g_2, g_3, ...` found in the same pathway, links are drawn between `g_i` and `g_{i+1}`.
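The chaining rule above (link `g_i` to `g_{i+1}`) can be sketched in Python as follows. The function names, the `positions` lookup table, and the decision to skip genes without a known position before chaining are assumptions made for this example. The six-column output line is likewise an assumed layout for a link record.

```python
from typing import Dict, List, Sequence, Tuple

Position = Tuple[str, int, int]  # (chromosome, start, end)
Link = Tuple[Position, Position]


def pathway_links(genes: Sequence[str], positions: Dict[str, Position]) -> List[Link]:
    """Chain the genes of one pathway: link g_i to g_{i+1}.

    Assumption: genes without a known position are dropped before chaining,
    so the remaining genes are linked in their original order.
    """
    located = [positions[g] for g in genes if g in positions]
    return list(zip(located, located[1:]))


def format_link(link: Link) -> str:
    """Render one link as a whitespace-separated line (assumed layout)."""
    (chr_a, start_a, end_a), (chr_b, start_b, end_b) = link
    return f"{chr_a} {start_a} {end_a} {chr_b} {start_b} {end_b}"
```

A pathway with n located genes yields n − 1 links under this rule, far fewer than the n(n − 1)/2 links that connecting every pair would produce, which helps keep the inside of the circle readable.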