
American Scientist Cover

American Scientist cover tearsheet.

This article provides insight into the design of the American Scientist Sept/Oct 2007 cover image and the data analysis done to create the image. The image accompanies the article Genetics and the Shape of Dogs by Elaine Ostrander.

Dogs and Humans — Not only best friends

Although dogs and humans differ in many ways (the curious lack of dignity in the canine species, for example, as anyone with a dog can attest), the human and dog genomes show great similarity. This similarity, called synteny when the comparison is made across species, exists because dog and human share a distant common ancestor.

Phylogenetic history of mammals. From The Ancestor's Tale.

Examination of the genomic sequence suggests that the dog and human diverged from a common ancestor about 90-100 million years ago (Springer MS, Murphy WJ, Eizirik E, O'Brien SJ: Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc Natl Acad Sci U S A 2003, 100:1056-1061) in the Cretaceous period.

Fluffy is not just your friend. He's also a relative.

Image Design

The magazine's cover image shows a visual summary of the similarity relationships between the human and dog genomes. The figure shows a subset of both genomes: human chromosomes are coded in blue, and dog chromosomes in orange. Regions of synteny (sequence similarity) are linked using grey ribbons.

The synteny ribbons were derived from millions of individual short sequence alignments, each of which marks a relatively small region of similar sequence in the dog and human genomes. Because of the large number of alignments, some way to collate them was necessary to control the complexity of the figure. To do this, neighbouring alignments were bundled into a single ribbon. Each ribbon therefore corresponds to a long-range gapped alignment, within which there are runs of similar sequence separated by gaps of dissimilar sequence.

The syntenic bundles between dog chromosome 15 and the human genome are highlighted in colour. This chromosome is of specific interest in the Ostrander lab (Ostrander EA, Wayne RK: The canine genome. Genome Res 2005, 15:1706-1716).

Visualizing Comparative Genomic Data

Early versions of the cover image. I was searching for visually interesting patterns in the bundles of synteny. The first panel (A) is extremely complex and, although colourful, carries little interpretable information. The next two panels (B, C) are more concise but inefficient and contain a lot of empty space. The last panel (D) contains a digestible amount of information but falls flat due to its lack of color. Color can be a powerful way to encode data, but too much color (A) overwhelms the eye.

Visually exploring comparative genomic data is difficult. A very large number of genomes have been sequenced or are in the process of being sequenced, the genomes themselves are large, and the similarity data is sparse. There have been many efforts to generate visual representations of genome-to-genome relationships. Circos is one such project.

The difficulty in generating graphical representations of comparative data quickly becomes apparent when one explores the data itself. According to the UCSC Genome Browser's Table Browser, regions of sequence similarity between dog and human number over 3,700,000. These pairs of related regions provide more than 1-fold coverage of the dog and human genomes, which is possible because the coordinates of similarity pairs overlap. The vast majority of the pairs are small regions (90% of regions are <400bp on dog and <330bp on human). Adjacent groups of such pairs are frequent, indicating contiguity in similarity across large regions of the genomes. However, long-range runs of similarity are broken up by gaps in similarity, or by runs of similarity to other regions.

Below is a small fraction of such data, relating regions on a dog chromosome (cfN, N=1..38,X) to regions on human chromosomes (hsM, M=1..22,X). Each row lists the dog chromosome, start, end and size, followed by the same four fields for the human region. Notice that two overlapping similarity pairs indicate that a region on cf1 shows sequence similarity to regions on both hs19 and hs3. Further downstream, a single region of homology between cf1 and hs2 breaks up the syntenic region between cf1:112Mb and hs19:51Mb.

cf1 112324823 112331694 6872 hs19 51713100 51718054 4955
cf1 112328159 112330938 2780 hs19 51715464 51716592 1129
cf1 112328700 112329092 393 hs3 198134679 198134817 139
cf1 112418235 112463291 45057 hs19 51320175 51443212 123038
cf1 112582829 112601354 18526 hs19 51115819 51121833 6015
cf1 112852364 112853418 1055 hs19 50799219 50801066 1848
cf1 113508037 113509480 1444 hs19 49956706 49958137 1432
cf1 113638063 113642450 4388 hs2 184178863 184181447 2585
cf1 113900245 113901596 1352 hs19 49504154 49505831 1678
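
Rows like these can be read as whitespace-separated records. The following is a minimal sketch in Python, assuming the eight columns are dog chromosome, start, end, size, human chromosome, start, end, size (the column meaning is inferred from the sample, and the names `SimilarityPair` and `parse_pair` are illustrative rather than part of Circos). It also checks that each size column equals end - start + 1, which holds for the rows used here.

```python
from typing import NamedTuple


class SimilarityPair(NamedTuple):
    """One dog-human similarity pair (field names are illustrative)."""
    dog_chr: str
    dog_start: int
    dog_end: int
    dog_size: int
    human_chr: str
    human_start: int
    human_end: int
    human_size: int


def parse_pair(line: str) -> SimilarityPair:
    """Parse one whitespace-separated row of the sample data."""
    f = line.split()
    return SimilarityPair(f[0], int(f[1]), int(f[2]), int(f[3]),
                          f[4], int(f[5]), int(f[6]), int(f[7]))


# Three rows copied from the sample above.
rows = """\
cf1 112324823 112331694 6872 hs19 51713100 51718054 4955
cf1 112328700 112329092 393 hs3 198134679 198134817 139
cf1 113638063 113642450 4388 hs2 184178863 184181447 2585
"""

pairs = [parse_pair(line) for line in rows.splitlines()]

# Sizes are inclusive: size = end - start + 1.
for p in pairs:
    assert p.dog_size == p.dog_end - p.dog_start + 1
    assert p.human_size == p.human_end - p.human_start + 1

# Human chromosomes hit by this stretch of cf1.
print(sorted({p.human_chr for p in pairs}))
```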

Attempting to draw all of these data results in a jumble that is difficult to interpret. Although zooming into pairs of regions offers detailed accounting of regions of homology, the large picture (the homology bundles) cannot be appreciated at this scale.

The challenge for the American Scientist figure was to depict sequence similarity between the dog and human genomes in a manner that was both informative and visually appealing. Due to the large number of individual regions of synteny, across a large range of sizes, some data filtering and collating was necessary to strike a balance between clarity and complexity. Because dog chromosome 15 is of particular interest to Elaine Ostrander, the author of the article, and her group, I wanted the figure to draw attention to the relationship between chromosome 15 and the human genome.

data processing

Effect of bin size on complexity of figure. Shown here is homology between dog chromosome 15 and the human genome. Results with bins of size 5, 10, 25 and 50kb are shown. The cover image used 100kb bins.

I started with the dog vs human sequence similarity data available from the UCSC Table Browser. These data were in pairs

dog_chr   dog_chr_start   dog_chr_end
human_chr human_chr_start human_chr_end

indicating sequence homology between region dog_chr_start-dog_chr_end on dog chromosome dog_chr and region human_chr_start-human_chr_end on human chromosome human_chr.

To limit the complexity in the data, I binned each data pair by dividing the dog genome into bins of 100kb. Within each bin, I examined each data pair and collated the target human regions associated with that dog genome bin. For each human chromosome, I computed coverage by syntenic regions and filled in any gaps between regions, as long as the gap was <0.25 of the size of the regions on either side. For a given bin and human chromosome, I created an intermediate list of the largest 5 human syntenic regions. For example, here are the 5 largest human regions syntenic to a bin on the dog genome at cf15:11.5Mb.
cf15 11500000 11600000 100000 hs18 49966279 49971915 5637
cf15 11500000 11600000 100000 hs6 29604229 29609758 5530
cf15 11500000 11600000 100000 hs19 33858498 33863344 4847
cf15 11500000 11600000 100000 hs6_cox_hap1 952557 957072 4516
cf15 11500000 11600000 100000 hs1 43380571 43384085 3515
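
The binning step might be sketched as follows. This is an illustration under assumptions, not the script used for the cover: each pair is assigned to the 100kb dog bin that contains its start coordinate, the gap-filling step is omitted, and the function name `largest_regions_per_bin` is hypothetical. The human coordinates in the example come from the listing above; the dog coordinates are invented for illustration.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # 100kb dog genome bins, as in the text
TOP_N = 5           # keep the largest 5 human regions


def largest_regions_per_bin(pairs, bin_size=BIN_SIZE, top_n=TOP_N):
    """Group pairs by dog bin and keep the largest human regions.

    Each pair is (dog_chr, dog_start, dog_end,
                  human_chr, human_start, human_end).
    Assumption: a pair belongs to the bin containing its dog start.
    """
    bins = defaultdict(list)
    for dog_chr, dog_start, _dog_end, hs_chr, hs_start, hs_end in pairs:
        bin_start = (dog_start // bin_size) * bin_size
        bins[(dog_chr, bin_start)].append((hs_chr, hs_start, hs_end))

    result = {}
    for key, regions in bins.items():
        # Sort by inclusive region size, largest first.
        regions.sort(key=lambda r: r[2] - r[1] + 1, reverse=True)
        result[key] = regions[:top_n]
    return result


pairs = [
    # Dog coordinates below are made up; human regions are from the text.
    ("cf15", 11_520_000, 11_525_000, "hs6", 29_604_229, 29_609_758),
    ("cf15", 11_510_000, 11_515_000, "hs18", 49_966_279, 49_971_915),
    ("cf15", 11_640_000, 11_642_000, "hs1", 43_380_571, 43_384_085),
]
binned = largest_regions_per_bin(pairs)
for (dog_chr, bin_start), regions in sorted(binned.items()):
    print(dog_chr, bin_start, bin_start + BIN_SIZE, regions)
```

Grouping by (chromosome, bin start) keeps the bin boundaries explicit, so the output has the same shape as the intermediate list shown above.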

Through this process, I reduced the 3,700,000 data pairs to just over 42,700. I chose a bin size of 100kb, since such a bin would cover about 0.0075 degrees (roughly 27 seconds of arc) if the entire dog genome were represented along half of the circle. Thus, if the circle image had a radius of about 8,000 pixels, a 100kb bin would occupy one pixel. This seemed like the right ballpark for the bin size, although the final figure does not change significantly if the bin size is somewhat increased.
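
The pixel estimate can be checked with a few lines of arithmetic. This sketch assumes a dog genome size of 2.4 Gb (the figure quoted later in this article) laid out along half of a circle of radius 8,000 pixels; the variable names are illustrative.

```python
import math

DOG_GENOME_BP = 2.4e9   # dog genome size quoted in this article
BIN_BP = 100_000        # bin size used for the cover image
RADIUS_PX = 8_000       # circle radius considered in the text

# The dog genome spans half of the circle, i.e. pi * r pixels of arc.
half_circumference_px = math.pi * RADIUS_PX
bins_on_half_circle = DOG_GENOME_BP / BIN_BP
pixels_per_bin = half_circumference_px / bins_on_half_circle

print(f"{bins_on_half_circle:.0f} bins over {half_circumference_px:.0f} px")
print(f"{pixels_per_bin:.2f} px per bin")  # close to one pixel per bin
```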

The next step was to find the syntenic bundles. I did this by associating adjacent regions of similarity (adjacent on both the dog and human genomes) together. I allowed up to 500kb of gap between regions. This was done to give better illustration of bundles of synteny on a larger scale.
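
One way to express this bundling is a greedy pass over sorted links. The sketch below is an assumption about how such a step could work, not the code used for the figure: links are sorted by chromosome pair and dog position, and a link joins the current bundle when it lies within 500kb of the previous link on both genomes. The function names are hypothetical; the example links are the cf1 rows from the sample data shown earlier.

```python
MAX_GAP = 500_000  # up to 500kb of gap between regions, as in the text


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def bundle_links(links, max_gap=MAX_GAP):
    """Greedily merge neighbouring links into bundles.

    Each link is (dog_chr, dog_start, dog_end,
                  human_chr, human_start, human_end).
    """
    ordered = sorted(links, key=lambda link: (link[0], link[3], link[1]))
    bundles = []
    for link in ordered:
        if bundles:
            prev = bundles[-1][-1]
            same_pair = prev[0] == link[0] and prev[3] == link[3]
            near_dog = interval_gap(prev[1], prev[2], link[1], link[2]) <= max_gap
            near_human = interval_gap(prev[4], prev[5], link[4], link[5]) <= max_gap
            if same_pair and near_dog and near_human:
                bundles[-1].append(link)
                continue
        bundles.append([link])
    return bundles


links = [
    ("cf1", 112324823, 112331694, "hs19", 51713100, 51718054),
    ("cf1", 112328159, 112330938, "hs19", 51715464, 51716592),
    ("cf1", 112328700, 112329092, "hs3", 198134679, 198134817),
    ("cf1", 112418235, 112463291, "hs19", 51320175, 51443212),
    ("cf1", 112582829, 112601354, "hs19", 51115819, 51121833),
    ("cf1", 112852364, 112853418, "hs19", 50799219, 50801066),
    ("cf1", 113508037, 113509480, "hs19", 49956706, 49958137),
    ("cf1", 113638063, 113642450, "hs2", 184178863, 184181447),
    ("cf1", 113900245, 113901596, "hs19", 49504154, 49505831),
]
bundles = bundle_links(links)
for b in bundles:
    print(b[0][0], b[0][3], len(b), "links")
```

Because links are grouped by chromosome pair before the gap test, an isolated hit to another chromosome (such as the hs2 row) does not split a cf1-hs19 bundle.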

Once the bundle structure was computed, I went back to the binned data and for each binned data pair checked which bundle it overlapped with. At this point only data pairs that overlapped with the largest bundle for the region were accepted for drawing in the figure. This limited the number of small, isolated regions of synteny within larger runs of regions that linked the same dog-human regions. Links corresponding to regions belonging to smaller bundles were drawn behind (and in lighter grey tone) links associated with larger bundles.
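
The filtering step could look something like the following. This is a sketch under stated assumptions: each bundle is summarized by its span on both genomes, "largest" is measured by extent on the dog genome, and the function name `in_largest_bundle` is hypothetical. The two bundle spans are derived from the cf1 sample rows shown earlier, purely for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def in_largest_bundle(pair, bundles):
    """True if the pair falls in the largest bundle covering its dog region.

    pair and each bundle: (dog_chr, dog_start, dog_end,
                           human_chr, human_start, human_end)
    Assumption: bundle size is its extent on the dog genome.
    """
    candidates = [b for b in bundles
                  if b[0] == pair[0] and overlaps(pair[1], pair[2], b[1], b[2])]
    if not candidates:
        return False
    largest = max(candidates, key=lambda b: b[2] - b[1])
    return (largest[3] == pair[3]
            and overlaps(pair[4], pair[5], largest[4], largest[5]))


bundles = [
    ("cf1", 112324823, 112853418, "hs19", 50799219, 51718054),
    ("cf1", 112328700, 112329092, "hs3", 198134679, 198134817),
]
hs19_pair = ("cf1", 112418235, 112463291, "hs19", 51320175, 51443212)
hs3_pair = ("cf1", 112328700, 112329092, "hs3", 198134679, 198134817)

print(in_largest_bundle(hs19_pair, bundles))  # part of the large hs19 bundle
print(in_largest_bundle(hs3_pair, bundles))   # isolated hit inside that bundle
```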

Cover Image

Final American Scientist cover image. Regions of similarity between human (top, blue [A]) and dog (bottom, orange [C]) chromosomes. One dimensional similarity mapping between human [B] and dog [D] chromosomes. This mapping provides the chromosome color coding associated with grey ribbons [F]. These grey ribbons are composed of binned synteny regions that fall in the same bundle (see above). The level of grey is proportional to the size of the syntenic regions. Synteny on chromosome 15 is highlighted with colored ribbons [E]. Ribbons that twist such as [F2] indicate inversions, whereas those that don't [F1] indicate regions of synteny on the same strand.

The final image is shown here. For the cover image, I selected a subset of dog and human chromosomes to limit the visual complexity of the figure.

Starting with dog chromosomes 1,2,3,4 (largest - seemed like a good start) and 15 (Elaine's favourite), I began constructing the figure by adding human chromosomes that formed ribbon connections. By adding more dog and human chromosomes to the set, I obtained a figure in which the ribbons provide near total coverage of all the chromosomes around the circle. As an added bonus, the selected dog chromosomes occupy about 1/2 of the circle (the physical length scale is the same for each ideogram).

The circular composition of the ideograms allows for rapid exploration of the data, at least on a large scale such as this. Notice that large-scale inversions are easy to spot because the ribbons appear to twist (ribbons like F2). Ribbons that connect dog and human chromosomes without twisting (ribbons like F1) indicate similarity between the same strands.
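
Whether a bundle would be drawn as a twisted ribbon can be guessed from the coordinates alone. The sketch below assumes that a bundle is inverted when human coordinates mostly decrease as dog coordinates increase; this heuristic and the name `is_inverted` are illustrative, not taken from Circos. The first example uses start coordinates from the cf1-hs19 sample rows earlier in the article, and the second uses made-up coordinates.

```python
def is_inverted(links):
    """Guess whether a bundle is inverted (its ribbon would twist).

    links: list of (dog_start, human_start) for one bundle.
    Assumption: inverted when human coordinates mostly decrease
    as dog coordinates increase.
    """
    ordered = sorted(links)
    steps = [b[1] - a[1] for a, b in zip(ordered, ordered[1:])]
    decreasing = sum(1 for s in steps if s < 0)
    increasing = sum(1 for s in steps if s > 0)
    return decreasing > increasing


# Start coordinates from the cf1-hs19 sample rows.
cf1_hs19 = [
    (112324823, 51713100),
    (112418235, 51320175),
    (112582829, 51115819),
    (112852364, 50799219),
]
# Made-up bundle in which both coordinates increase together.
same_strand = [(1_000_000, 5_000_000), (1_200_000, 5_150_000), (1_400_000, 5_300_000)]

print(is_inverted(cf1_hs19))     # human coordinates fall as dog coordinates rise
print(is_inverted(same_strand))
```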

Furthermore, existing color schemes can be easily integrated into Circos' approach to visualizing data. The color scheme used here is the standard chromosome color palette that relates a fixed color to each chromosome for consistent display. Browsers that are founded on a linear representation use this color scheme to indicate the mapping between two regions. With a circular representation the mapping is made explicit by lines, or bundles of lines.

Extended Cover Image

Cover image with all chromosomes and additional data elements, showing conservation, breed and morphology QTL data.

The balance of complexity and visual appeal seemed right in the figure above. This was the version used for the cover image.

I created a supplemental image which included all human and dog chromosomes (human Y was removed — with apologies to all males, myself included — this is a stupendously boring chromosome). The image with all ideograms is shown on the right.

Not satisfied, I added conservation information [G] as well as dog breed marker data [H] (courtesy of Heidi Parker, see PMID 15155949) and morphology QTL [I] (courtesy of Kevin Chase, see PMID 9987902, PMID 16934357 and PMID 12114542).

The conservation data is shown as two histograms. The blue histogram shows the degree of conservation over bins of 3Mb between dog and human for a given region of the human chromosome. The orange histogram indicates average conservation between human and other vertebrate species. The bins across which conservation was computed are very large (3Mb), and much detail is lost. However, the data track illustrates effective integration of standard plot types (here the histogram), into a Circos image.
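
A histogram track of this kind amounts to averaging scores within fixed-width bins. Here is a minimal sketch, assuming the input is a list of (position, score) values on one chromosome; the input format, the example values and the name `mean_score_per_bin` are assumptions made for illustration.

```python
from collections import defaultdict

BIN_BP = 3_000_000  # 3Mb bins, as in the text


def mean_score_per_bin(scores, bin_bp=BIN_BP):
    """Average per-position scores into fixed-width bins.

    scores: iterable of (position, score) on one chromosome.
    Returns {bin_start: mean score}, ordered by bin start.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for position, score in scores:
        bin_start = (position // bin_bp) * bin_bp
        totals[bin_start] += score
        counts[bin_start] += 1
    return {b: totals[b] / counts[b] for b in sorted(totals)}


# Made-up conservation scores at three positions.
scores = [(500_000, 0.25), (2_500_000, 0.75), (3_100_000, 1.0)]
histogram = mean_score_per_bin(scores)
print(histogram)
```

With bins this wide, the two scores below 3Mb collapse into a single value, which is the loss of detail the paragraph above describes.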

Breed Marker and Morphology QTL Data

In the Science publication Genetic structure of the purebred domestic dog by Parker et al., data is presented which groups 85 domestic dog breeds into four clusters of breeds. The author graciously shared her data with me to include in the figure.

Breed cluster data [H] and morphology QTL location [I] are shown as data tracks around the dog genome. Each QTL was associated with a principal component, encoded by the level of grey in [I] glyphs. Format of breed data is described below.

Organization of breed marker data. Markers (a,b,c,d) contain multiple alleles, which are grouped by breed cluster frequency (A,B,C,D). For a given allele (e.g. eA1), frequency in each breed cluster is shown by stacked rectangles. The rectangle color encodes the breed cluster.

The figure shows the result of clustering sequence information derived from markers (a,b,c,...) across different breeds. The clustering resulted in four distinct groups (yellow, blue, green, red - same color scheme as in the Parker et al. publication). Each marker (a,b,c,...) was associated with multiple alleles and each allele had a breed cluster frequency (f1...f4). For a given marker (e.g. e), I separated the alleles which had the largest frequency component for the yellow breed cluster (A, ancient breeds, e.g. eA1 eA2 eA3 eA4) from those that had the largest component for the blue cluster (B, bulldog/mastiff types, eB1 eB2 eB3), green cluster (C, wolfhound/collie types, eC1 eC2) and red cluster (D, terriers, eD1). Each allele is represented by a stack of rectangles which represent the frequencies of that allele in each breed cluster. The rectangles are ordered by frequency (e.g. allele fC1 has a large green cluster frequency f1 followed by a smaller blue cluster frequency f2, yellow f3 and finally red f4).
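
The grouping and ordering described above can be illustrated with a small sketch. The allele names and frequencies below are invented for the example, and the function names are hypothetical; only the four cluster colors and the ordering rule come from the text.

```python
CLUSTERS = ("yellow", "blue", "green", "red")  # clusters A, B, C, D in the text


def dominant_cluster(freqs):
    """Cluster with the largest frequency for one allele."""
    return max(freqs, key=freqs.get)


def stack_order(freqs):
    """Clusters by decreasing frequency (order of the stacked rectangles)."""
    return sorted(freqs, key=freqs.get, reverse=True)


def group_alleles(alleles):
    """Group allele names by their dominant cluster.

    alleles: {allele_name: {cluster: frequency}}
    """
    groups = {c: [] for c in CLUSTERS}
    for name, freqs in alleles.items():
        groups[dominant_cluster(freqs)].append(name)
    return groups


# Made-up frequencies for two alleles of a hypothetical marker.
alleles = {
    "allele_1": {"yellow": 0.60, "blue": 0.20, "green": 0.15, "red": 0.05},
    "allele_2": {"yellow": 0.15, "blue": 0.30, "green": 0.50, "red": 0.05},
}

print(group_alleles(alleles))
print(stack_order(alleles["allele_2"]))  # green first, then blue, yellow, red
```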

The morphological variation of dogs is one of their most curious qualities. Who can resist chuckling at the notion of a Chihuahua riding on a Great Dane? Regions of the dog genome have been associated with morphological traits (limb or skeleton size, for example). These regions affect morphology in a complex way, but can be grouped into four principal components, as described in the publication Genetic basis for systems of skeletal quantitative traits: principal component analysis of the canid skeleton by Chase et al. The authors kindly shared the published QTL locations with me, and I encoded them in the figure as ticks near the dog ideograms. The ticks associate each QTL with a principal component (PC1 very dark grey, PC2 dark grey, PC3 grey, PC4 light grey).

dog vs human - one chromosome at a time


single dog chromosomes vs human genome

Below you'll find one image for each of the dog chromosomes, showing regions of synteny to the human genome. The scale here is finer than in the cover image, since only one dog chromosome is shown at a time. The length scale in each image is adjusted so that the dog chromosome occupies half of the circle.

cf1 zoom cf2 zoom cf3 zoom cf4 zoom cf5 zoom cf6 zoom cf7 zoom cf8 zoom cf9 zoom cf10 zoom cf11 zoom cf12 zoom cf13 zoom cf14 zoom cf15 zoom cf16 zoom cf17 zoom cf18 zoom cf19 zoom cf20 zoom cf21 zoom cf22 zoom cf23 zoom cf24 zoom cf25 zoom cf26 zoom cf27 zoom cf28 zoom cf29 zoom cf30 zoom cf31 zoom cf32 zoom cf33 zoom cf34 zoom cf35 zoom cf36 zoom cf37 zoom cf38 zoom cfX zoom

dog genome vs human genome - by dog chromosome

The dog genome is smaller than the human (2.4 Gb vs 3.1 Gb) and is scaled by a factor of 1.2 in the image to subtend half of the circle.

cf1 zoom cf2 zoom cf3 zoom cf4 zoom cf5 zoom cf6 zoom cf7 zoom cf8 zoom cf9 zoom cf10 zoom cf11 zoom cf12 zoom cf13 zoom cf14 zoom cf15 zoom cf16 zoom cf17 zoom cf18 zoom cf19 zoom cf20 zoom cf21 zoom cf22 zoom cf23 zoom cf24 zoom cf25 zoom cf26 zoom cf27 zoom cf28 zoom cf29 zoom cf30 zoom cf31 zoom cf32 zoom cf33 zoom cf34 zoom cf35 zoom cf36 zoom cf37 zoom cf38 zoom cfX zoom

dog genome vs human genome - by human chromosome

The dog genome is smaller than the human (2.4 Gb vs 3.1 Gb) and is scaled by a factor of 1.2 in the image to subtend half of the circle.

hs1 zoom hs2 zoom hs3 zoom hs4 zoom hs5 zoom hs6 zoom hs7 zoom hs8 zoom hs9 zoom hs10 zoom hs11 zoom hs12 zoom hs13 zoom hs14 zoom hs15 zoom hs16 zoom hs17 zoom hs18 zoom hs19 zoom hs20 zoom hs21 zoom hs22 zoom hsX zoom