
All images in this article were created with Circos (v0.49) and the tableviewer utility tool.

Tables Store Data, not Present It


Tables are natural containers for data. Whenever information is presented, chances are excellent that it is communicated by means of a table. In many cases, however, when this information is complex (and the table, therefore, is large) a tabular presentation is difficult to parse visually and patterns in the tabulated data remain opaque.

In other words - a useful container isn't automatically a useful presenter. The table presents individual data points very well and patterns that they compose very poorly.

Figure If your data are eggs, then the table is the egg crate, which keeps data ordered, separated and easily accessible. But, eggs aren't served out of egg crates ... perhaps data shouldn't be served out of a table either.

This article discusses an approach in which tabular data sets can be visually presented in a quantitative and informative manner. Obviously, there are a great many ways in which data, tabular or not, can be visualized. My goal with the approach I outline here is to establish a visual representation of the table that (a) captures all of the data in the table, (b) is sufficiently quantitative to visually extract patterns and descriptive statistics, (c) makes no assumptions about what kind of patterns may exist in the data and (d) is more appealing than a table.

The method is an application of Circos, which was created to visualize differences (or similarities, and in general any relationship) between genomes. By connecting related regions of the genome (or multiple genomes) using curves within the circular layout of ideograms, a visual profile of relatedness can be constructed. Given that a table relates a column to a row by means of the corresponding cell, it seemed natural to try to apply the circular layout to a table.

Tables are visual obstacles

Consider the five tables below. As you examine each table in turn, at some point you will reach a table that is too large for you to comfortably evaluate the data, identify patterns and formulate conclusions. For everyone this limit is different - for me the 4x4 table is too large.

A  3

A  9 14
B  2 18

A  2  4  9
B 15  4 14
C  4 13  1

A 15  5  1 16
B  5 16 16 11
C 14 15  6 10
D 12  3  9 13

A 18  3  7 17 15
B 15  6  7 15  4
C  6 13  2 13 19
D 14 14 10  0 13
E  0  3 14  9  8

Figure There is a limit to the size of a table that can be easily visually inspected. Unfortunately, that limit is much smaller than most tables.

Unfortunately, most interesting data sets correspond to tables larger than 3x3 - usually much larger. Our inability to glean information from such large tables limits the usefulness of direct presentation of such tables. To mitigate this, statistics and corresponding cues can be added to a table, such as column and row totals and identifiers for largest elements in a row (underlined in the table below) and largest elements in a column (bold in the table below).

total  32  37  68  25  42
   55  14   8  15   1  17
   22   2   1   9   6   4
   34   1   8  11   3  11
   50  13   6  17  10   4
   43   2  14  16   5   6

Figure Adding row and column statistics to the table helps, but even simple patterns can still remain undetected.
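The row and column statistics in the marked-up table can be reproduced with a few lines of Python. This is only an illustrative sketch: the cell values are copied from the table above, and the variable names are my own.

```python
# Cell values from the marked-up table above (rows top to bottom).
table = [
    [14, 8, 15, 1, 17],
    [2, 1, 9, 6, 4],
    [1, 8, 11, 3, 11],
    [13, 6, 17, 10, 4],
    [2, 14, 16, 5, 6],
]

# zip(*table) transposes the table, so each item is one column.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Largest value in each row and in each column, i.e. the cells that the
# marked-up table highlights.
row_max = [max(row) for row in table]
col_max = [max(col) for col in zip(*table)]

print("row totals:   ", row_totals)
print("column totals:", col_totals)
print("row maxima:   ", row_max)
print("column maxima:", col_max)
```

Even with these numbers in hand, the reader still has to hold them in mind while scanning the cells, which is the limitation the rest of the article tries to address.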

Despite marking it up in this manner, the table remains a visual obstacle. Your eye passes from scanning a row to scanning a column and must continually travel back and forth as you reach the edge of the table.

Examples of Uninterpretable Tables

Too many publications, reports and articles (prestigious journals are not exempt from this list!) present tables that are completely unparsable by humans. Such tables are not only uninformative but tie up the reader in attempting to visually derive patterns from a static numerical presentation.

Other than confirming that data was, in fact, collected, analyzed, and sufficiently complicated to warrant publication, tables such as these do little to enhance the communication of the results. Such tables can be easily replaced by a well-crafted sentence that captures the essence of the data, or a visualization derived from the table that summarizes the results.

I've collected a handful of these visual travesties. Note that these examples have nothing to do with the overall quality of the publication - they were chosen on the basis of the table only.

These tables should be presented visually - identifying cutoffs in the progression of values is impossible. Tabular form should be relegated to supplementary materials.
Suffers from: visual noise, lost transition to significance, burden of significance, obscured statistic.
Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-64 (2002).

This table is extremely difficult to parse because of its awkward formatting. The eye is required to travel across a row, but its natural path is down a column because the column spacing is larger than the row spacing.
Suffers from: misguided sightlines, incidental formatting, burden of significance.
Basu, A., Chaudhuri, P. & Majumder, P. P. Identification of polymorphic motifs using probabilistic search algorithms. Genome Res 15, 67-77 (2005).

What's the point? This table fails to communicate its own conclusions.
Suffers from: visual noise, lack of significance, burden of significance.
Faux, N. G. et al. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res 15, 537-51 (2005).

These tables suffer from both too little and too much information. On the left half of the first table, there is a marked lack of information. On the other hand, the right half of the first table, and the second table, both suffer from an over-abundance of information.
Suffers from: visual noise, unparsable content, burden of significance.
Horvath, J. E. et al. Development and application of a phylogenomic toolkit: resolving the evolutionary history of Madagascar's lemurs. Genome Res 18, 489-99 (2008).

At the risk of providing an incomplete account, the authors present too much information in these tables. The presentation becomes overwhelming.
Suffers from: visual noise, unparsable content, burden of significance.
Lin, S., Chakravarti, A. & Cutler, D. J. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet 36, 1181-8 (2004).

A table is a poor container for quantities such as mean, median, min and max. These have geometric equivalents and are best served with a geometric representation, such as a whisker plot.
Suffers from: obscured statistic.
Tishkoff, S. A. & Kidd, K. K. Implications of biogeography of human populations for 'race' and medicine. Nat Genet 36, S21-7 (2004).

Precision for precision's sake leaves the viewer tired and frustrated. The number of significant figures in a table should be as small as possible while still allowing the necessary conclusions to be drawn.
Suffers from: hidden patterns, overspecified non-significant statistic.
Sweet-Cordero, A. et al. An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis. Nat Genet 37, 48-55 (2005).

You may wonder - why are these tables so poorly designed? It's likely that the real answer in each case is different, though publication deadlines and author fatigue likely play a role. One large motivating factor for tables is that they lull both the authors and the audience into a false sense of data security: provide all the information and the rest will follow. The reality is that by showing too much, you leave the audience with too little.

Visualization of Tabular Data

Does the method presented here provide a means to solve every problem in the above exemplars of poor tabularization? No. It does provide, however, a way to capture the essence of the table and present it quantitatively and attractively.

Representation of relationships

In a Circos figure, elements in your data set are composited circularly and links join elements that are related. The relationship can be one of similarity, difference, flow, distance - any quantity that associates two elements together.

Figure In the general case, relationships between elements in your data set are indicated by links. Links can indicate a simple relationship (A-B), a relationship that has positional information (A-C), or a unidirectional relationship (A-D). In each case, the link is formatted differently.

If the relationship has an associated quantity (e.g. degree of similarity, traffic between elements, etc), this quantity can be represented by the thickness of the link.

Figure Links with variable thickness can represent the extent of the relationship between elements.

By coloring the links based on one of the elements, following relationships to/from an element is made easier. For example, when the links relate a cell for a given row and column, the color of the link can be that of the row or column segment.

Figure When links are colored based on the elements that they relate, spotting patterns is easier. In particular, when relationships have a direction, links can be colored by source or target element.

Visualizing ratios

As I suggested above, the table's visual representation can be simplified by using a ribbon to encode two cells. Rather than creating one ribbon for (A,B) and another for (B,A), a single ribbon can show both.

Figure Transpositive cells (e.g. (A,B) and (B,A)) are here shown by a single ribbon (right). The ribbon now ends directly at both row and column segments and its ends are of variable thickness, which is the cell value for which the end's segment is a row. For example, if (A,B)=2 and (B,A)=10 then the ribbon's end touching A is thickness 2 and the end touching B is thickness 10.
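To make the encoding concrete, here is a minimal Python sketch of how a square table could be turned into two-ended ribbons. The function name and the tuple layout are hypothetical; this is not how tableviewer is implemented, only an illustration of the rule that the end touching a segment has the thickness of the cell for which that segment is the row.

```python
from itertools import combinations_with_replacement

def ratio_ribbons(labels, table):
    """Return one ribbon per unordered pair of labels.

    table maps (row, col) to a cell value. Each ribbon is a tuple
    (a, thickness_at_a, b, thickness_at_b), where the thickness at a
    segment is the value of the cell whose row is that segment.
    Pairs whose two cells are both zero produce no ribbon.
    """
    ribbons = []
    for a, b in combinations_with_replacement(labels, 2):
        at_a = table.get((a, b), 0)  # cell with row a
        at_b = table.get((b, a), 0)  # cell with row b
        if at_a or at_b:
            ribbons.append((a, at_a, b, at_b))
    return ribbons

# The example from the caption: (A,B)=2 and (B,A)=10.
example = {("A", "B"): 2, ("B", "A"): 10}
print(ratio_ribbons(["A", "B"], example))
```

For the caption's example this yields a single ribbon whose end at A has thickness 2 and whose end at B has thickness 10.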

Visualizing tables

The visual scheme of representing relationships can be applied to a table, given the observation that a table cell is a relationship (with a value) between a row and column. By representing the rows and columns as segments along the circle, the information in the corresponding cell can be encoded as a link between the segments.

The value in the cell controls the thickness of the link, which starts to look more like a ribbon.

Figure Rows and columns are represented by segments along the circumference of the circle. Cells are represented by ribbons that join the corresponding row and column segments.

In general, the cell represents a unidirectional relationship (e.g. row->column) - in this relationship the role of the segments is not interchangeable (e.g. (row,col) and (col,row) are different cells). To identify the role of the segment, as a row or column, the ribbon is made to terminate at the row segment but slightly away from the column segment. In this way, for a given ribbon, it is easy to identify which segment is the row and which is the column.

Figure Ribbons touch the row segments, but terminate a short distance before reaching the column segment.
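The mapping from cells to ribbons can be sketched in Python as follows. The function names and the dictionary layout are assumptions made for illustration, and the column labels C and D are invented (the values are those of the 2x2 table shown earlier). The sketch takes the size of a segment to be the sum of the thicknesses of its ribbons.

```python
from collections import defaultdict

def table_to_ribbons(row_labels, col_labels, values):
    """Turn each non-zero cell into a ribbon from its row segment
    to its column segment; the cell value is the ribbon thickness."""
    ribbons = []
    for r, row in zip(row_labels, values):
        for c, v in zip(col_labels, row):
            if v > 0:
                ribbons.append({"row": r, "col": c, "thickness": v})
    return ribbons

def segment_sizes(ribbons):
    """Sum ribbon thicknesses per segment. Keys carry the role
    ("row" or "col") so a row and a column may share a label."""
    sizes = defaultdict(int)
    for rb in ribbons:
        sizes[("row", rb["row"])] += rb["thickness"]
        sizes[("col", rb["col"])] += rb["thickness"]
    return dict(sizes)

# Hypothetical labels; values from the 2x2 table shown earlier.
ribbons = table_to_ribbons(["A", "B"], ["C", "D"], [[9, 14], [2, 18]])
sizes = segment_sizes(ribbons)
print(ribbons)
print(sizes)
```

Keeping the role in the ribbon record is what lets a drawing routine decide which end touches its segment (the row) and which end stops short (the column).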

variety of tabular visualization layouts

For a given table, a large number of variations exist in the way it can be visualized using this layout.

For an all-purpose visualization, one could order the row and column segments alphabetically, based on the label. For a given segment, the ribbons could be ordered clockwise in decreasing size. This allows quick identification of the largest/smallest cell value for a given segment.

By adding a scale and tick marks to the image it becomes possible to determine precisely the thickness of each ribbon, and therefore the value in the cell. The scale has the additional benefit of indicating the total size of each segment, which corresponds to the sum of its cell values.

Figure Segments are ordered (clockwise) by their label value. Ribbons are ordered (clockwise) by decreasing size. Tick marks provide an absolute scale that indicates ribbon thickness, which corresponds to the cell value, and segment size, which corresponds to the total in the row or column.

If the contribution from each row and column to the table is important, the layout can be altered to order the segments by their size. In the example below, the row segments are shown first, ordered by decreasing size, followed by the column segments, ordered similarly.

This segment order scheme, when combined with the ribbon order scheme based on size, is very helpful in locating the table's largest-valued cells and their row/column location. For example, it is easy to see from the image below that the largest value in row C is also the largest value in col F, and both C and F are the largest row and column of the table.

Figure Segments are ordered in decreasing size, with row segments placed first, conveying information about contribution from each row or column to table's total.
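The two size-based ordering schemes described above (segments by decreasing total with rows first, and ribbons within a segment by decreasing value) amount to a pair of sorts. The sketch below uses the values of the marked-up table from earlier in the article with made-up labels r1..r5 and c1..c5; the function name is hypothetical.

```python
row_labels = ["r1", "r2", "r3", "r4", "r5"]
col_labels = ["c1", "c2", "c3", "c4", "c5"]
values = [
    [14, 8, 15, 1, 17],
    [2, 1, 9, 6, 4],
    [1, 8, 11, 3, 11],
    [13, 6, 17, 10, 4],
    [2, 14, 16, 5, 6],
]

def ordered_layout(row_labels, col_labels, values):
    row_tot = {r: sum(row) for r, row in zip(row_labels, values)}
    col_tot = {c: sum(col) for c, col in zip(col_labels, zip(*values))}
    # Row segments first, then column segments, each by decreasing total.
    segments = (sorted(row_labels, key=lambda r: -row_tot[r])
                + sorted(col_labels, key=lambda c: -col_tot[c]))
    # Within each segment, list (value, partner) pairs by decreasing value.
    ribbons = {}
    for i, r in enumerate(row_labels):
        ribbons[r] = sorted(
            ((values[i][j], c) for j, c in enumerate(col_labels)),
            reverse=True)
    for j, c in enumerate(col_labels):
        ribbons[c] = sorted(
            ((values[i][j], r) for i, r in enumerate(row_labels)),
            reverse=True)
    return segments, ribbons

segments, ribbons = ordered_layout(row_labels, col_labels, values)
print(segments)
print(ribbons["r1"][0], ribbons["c3"][0])
```

With this ordering the first ribbon on each segment is its largest cell, which is what makes the largest values easy to locate in the figure.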

If ordering the ribbons based on size is not required, the ribbon placement on the segments can be adjusted to reduce the number of crossing ribbons. By ordering the ribbons based on the relative order of the corresponding segments, the figure's visual complexity is reduced. In this case, however, it is harder to determine the relative contribution of each ribbon to a segment.

Figure When ribbon order is controlled by the order of segments, a less cluttered representation is obtained.

If the distinction between row and column segments is not important, the role (row or column) of the segment can be ignored when segment order is determined. In the example below, segments are ordered by size without regard to whether they correspond to a row or column. Given that a ribbon touches its row segment but not its column segment, it is still possible to determine whether a segment is a row or column.

Figure Segment order can be made to be independent of its role as row or column.

When the distinction of row and column is important, distinct representations are created when the ribbon color scheme is changed from row-based to column-based. Specifically, if the table represents some kind of flow of information (e.g. internet traffic, flow of money, number of traveler trips from row to column), adjusting the color scheme can be very useful in demonstrating the contribution of flow to/from a given segment from/to other segments.

Figure Depending on the data, changing the source for the color of the ribbons from rows to column can create a more informative figure.

Ribbons do a good job in visually representing where the bulk of the table's data are. However, sometimes absolute cell values are not as important as their relative values, normalized to some reasonable quantity. This can be achieved by remapping cell values, such as normalizing cells by the sum (or average) of their rows (or columns). Alternatively, by making all the segments equally sized, such as in the example below, you can visualize cell values relative to the size of the corresponding row or column.

Figure Here each segment is normalized to be the same size in the image. Ribbon sizes are effectively shown relative to the size of their corresponding segments.
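One of the remappings mentioned above, normalizing each cell by the sum of its row, can be sketched as below. The function name is hypothetical and the input is the 2x2 table shown earlier; normalizing by column would be the same operation applied to the transposed table.

```python
def normalize_rows(values):
    """Divide every cell by the sum of its row, so each row sums to 1.
    Rows whose sum is zero are returned as all zeros."""
    normalized = []
    for row in values:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

norm = normalize_rows([[9, 14], [2, 18]])
for row in norm:
    print([round(v, 3) for v in row])
```

After this remapping every row segment has the same size, so ribbon thickness reflects a cell's share of its row rather than its absolute value.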

example - a small synthetic data set - preference for hair color in relationships

Let's start with a small data set that illustrates how this kind of tabular visualization can work in practise (or at least, in theory, since the data in this example is synthetic).

Defining partner hair color transition probabilities

For this example, I will use synthetic data that relates to preference for hair color in relationships. Originally, I thought of creating a survey and collecting this information, but I got very lazy and decided that fake data would be just as good - at least, for instructional purposes.

Specifically, let's consider the probability of transitioning from a partner with hair color A (e.g. black) to a partner with hair color B (e.g. blond). The figure below illustrates this probability.

Figure Do people prefer to stay with partners with the same hair color (blondes or nothing!) or change (black reminds me of my ex, let's try something different!)? Collecting (or in this case simulating) relationship histories can help answer this.

Data simulation

I simulated relationship histories of 10,000 males, in which each history was composed of 5 transitions (i.e. 6 relationships). I thought about some of the stereotypes relating to hair color and built the following rules into the simulation:

  • after dating brown, most want to try blond with other colors sampled equally - brown is like a holding-pattern
  • after dating black, most stay with black and very few want to try red - black is a strong preference, with a dislike of red
  • after dating blond, most stay with blond, but some also try red - blond is a strong preference, with a moderate attraction towards red
  • after dating red, nearly all stay with red - red is an extremely strong preference, with a dislike towards all others
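A simulation of this kind can be sketched as a simple Markov chain. The code below is an illustration only, not the script used for this article: the transition probabilities are hypothetical values chosen to follow the four rules above, and the first partner's hair color is assumed to be drawn uniformly.

```python
import random
from collections import Counter

COLORS = ["black", "brown", "blond", "red"]

# Hypothetical probabilities of the next partner's hair color, given
# the current one (weights listed in the order of COLORS).
P = {
    "black": [0.4, 0.3, 0.2, 0.1],  # mostly stay with black, rarely red
    "brown": [0.2, 0.2, 0.4, 0.2],  # mostly try blond, others equally
    "blond": [0.1, 0.1, 0.5, 0.3],  # mostly stay with blond, some red
    "red":   [0.1, 0.1, 0.1, 0.7],  # nearly all stay with red
}

def simulate(n_people=10000, n_transitions=5, seed=0):
    """Count (from, to) transitions over all simulated histories."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_people):
        current = rng.choice(COLORS)  # assumption: uniform first partner
        for _ in range(n_transitions):
            nxt = rng.choices(COLORS, weights=P[current])[0]
            counts[(current, nxt)] += 1
            current = nxt
    return counts

counts = simulate()
for src in COLORS:
    print(src, [counts[(src, dst)] for dst in COLORS])
```

Each printed line is one row of a transition frequency table, with the row giving the hair color that was left and the columns the hair color that followed.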

The transition frequencies I obtained were

       black brown  blond    red
black 11,975 8,916  5,871  2,868
brown  8,010 8,090 16,145  8,045
blond  1,951 2,060 10,048  6,171
red    1,013   940    990  6,907

First look at transition probabilities

Let's apply the visualization approach described above to see whether we can figure out what's going on (in this case I know what's going on because I simulated the data, but that's rare - most data is real and dirty).

Figure The transition frequencies between black, brown, blond and red haired partners sampled from relationship histories of 10,000 males, each having 6 relationships.

The figure may look complex at first, but it offers a very large amount of information about the underlying data. Let's take a look at some of the highlights.

It is clear from (A) that the vast majority of males that break up with a red-head immediately date another red-head.

More people broke up with brown-heads than hooked up with them. Ribbons (B) show that about 65% of the transitions involving brown-heads are breakups, with the largest outflow being to blondes (C). Also notice that ribbon (C) is the only ribbon which is the largest outflow from a color that does not terminate at the same color. Therefore, while those that date black-, blond- and red-heads tend to stay with that color in their next relationship, the largest outflow from brown-heads is to another color! This may well reflect the hypothetical truth that blonds are perceived as much more exciting than brown-heads, relative to other outflow colors.

In contrast, the majority of transitions (D), about 60%, that involve blondes correspond to a start of a relationship, rather than a breakup.

Let's now look at some figures that focus on the ratio of to:from transitions for a hair color. Transitions to a hair color indicate a start of a relationship. Transitions from a hair color indicate a breakup.

In the figure below, the left panel orders (clockwise, starting at top) the segments by the ratio of start:end transitions. Segments that come first correspond to hair colors that have the largest flux in (A) of males, relative to the flux out (B). Red heads are best at attracting males - 70% of transitions involving red-heads correspond to starts of relationships. Red heads, as a group, are also pretty good at retaining their partners, as can be seen by the small proportion of red hair transitions that correspond to a departure to a new color (these are the last three thin ribbons from the red hair segment).

The right panel shows the data that illustrates the fraction of breakups for a color. Segments are ordered by their from:to transition ratio (C:D), with brown hair leading the way. Brown heads have a difficult time retaining their mates, with about 70% of transitions being breakups (some of those correspond to the start of a relationship with another brown head).

Figure When segments are ordered by their column:row (A:B) ratio (left), or row:column ratio (C:D) (right), a picture emerges that illustrates which hair colors are attracting (or repulsing) males.
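The start and breakup fractions quoted above can be checked directly from the transition frequency table. In this sketch the row total of a color counts transitions from it (breakups) and the column total counts transitions to it (starts); the variable names are mine.

```python
colors = ["black", "brown", "blond", "red"]
# Transition frequencies from the table above (rows: from, columns: to).
freq = [
    [11975, 8916, 5871, 2868],
    [8010, 8090, 16145, 8045],
    [1951, 2060, 10048, 6171],
    [1013, 940, 990, 6907],
]

from_total = {c: sum(row) for c, row in zip(colors, freq)}
to_total = {c: sum(col) for c, col in zip(colors, zip(*freq))}

# Fraction of a color's transitions that are starts of a relationship.
start_fraction = {
    c: to_total[c] / (to_total[c] + from_total[c]) for c in colors
}

# Segment order for the left panel: largest start fraction first.
order = sorted(colors, key=lambda c: start_fraction[c], reverse=True)

for c in order:
    print(f"{c:6s} starts {start_fraction[c]:.2f} "
          f"breakups {1 - start_fraction[c]:.2f}")
```

Note that a same-color transition (a diagonal cell) is counted both as a breakup and as a start, which is consistent with the remark above that some brown breakups lead to another brown-haired partner.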

Normalized layout


Ratio layout

The ratio layout is ideal for square tables which have the same row and column labels (i.e. for every (A,B) there is a (B,A)).

Figure In the ratio layout, a ribbon represents two cells, except for cells on the diagonal which are represented by a single ribbon. By following the size of the ribbon from one segment to another, you can visually estimate the ratio of (A,B):(B,A).

The ratio layout has a higher data-to-ink ratio than the layout with a distinct ribbon for each cell. This layout is also quantitatively more informative because proportions and imbalances in contributions to color transitions can be easily evaluated.

Figure The ratio layout is extremely informative and reveals a great deal about the data. For this layout, ordering the ribbons by their size is most appropriate.

OK, enough about dating and hair color. Let's move to a real-world (and much larger) data set.

example - a large real data set - reactivity of chemical elements in minerals

Definition of reactivity

To obtain a much larger data set than the synthetic hair color transitions, I looked to the catalogue of minerals (basically rubble, rocks and dirt). From this list I created a table that stored, for each element pair, the ratios of their occurrence in minerals.

For example, Bismite is Bi2O3 and the ratio of Bi:O is 2:3. This mineral would therefore contribute +2 to the (row=Bi,col=O) cell and +3 to the (row=O,col=Bi) cell.

For minerals that are composed of more than two unique elements, all pair-wise combinations were stored in the table. For example, Zabuyelite is Li2CO3 and would therefore contribute +2 (Li,C), +2 (Li,O), +1 (C,Li), +1 (C,O), +3 (O,Li), +3 (O,C).

Thus, each cell at row=e1 and col=e2 (e1, e2 are elements) has the value

$$\sum_{m \,:\, n(m,e_2) > 0} n(m,e_1)$$

where m is the index over all minerals, and n(m,e) is the number of atoms of element e in mineral m. The sum therefore runs over all minerals that contain e2.

This ratio is indicative of the relative affinity (or reactivity, a term I use here loosely) of the elements towards one another in minerals. These data are not normalized (all element pairs from a mineral are counted) and should therefore be interpreted accordingly.
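The counting rule can be illustrated with the two minerals used as examples above. This Python sketch is not the code used for the article; it assumes simple formulae with integer subscripts and no parentheses, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from itertools import permutations

def atom_counts(formula):
    """Count atoms in a simple formula such as 'Li2CO3'
    (no parentheses, integer subscripts only)."""
    counts = defaultdict(int)
    for symbol, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(n) if n else 1
    return dict(counts)

def ratio_table(formulae):
    """For every ordered pair (e1, e2) of distinct elements in a
    mineral, add the number of e1 atoms to the cell (e1, e2)."""
    table = defaultdict(int)
    for formula in formulae:
        counts = atom_counts(formula)
        for e1, e2 in permutations(counts, 2):
            table[(e1, e2)] += counts[e1]
    return dict(table)

print(ratio_table(["Bi2O3"]))   # Bismite
print(ratio_table(["Li2CO3"]))  # Zabuyelite
```

Running it on Bismite gives +2 for (Bi,O) and +3 for (O,Bi), and on Zabuyelite it gives the six contributions listed above.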

Data set download and preparation

I downloaded a list of all minerals and their chemical formulae. From this list, I obtained 4,422 chemical formulae for minerals. The list was exotic, to say the least, and included imaginatively named entries such as

  • Barstowite Pb4Cl6CO3.H2O
  • Bartelkeite PbFe++Ge3O8
  • Bartonite K3Fe10S14
  • Barylite BaBe2Si2O7

Mineral series (e.g. Carrollite, Cu(Co,Ni)2S4) were parsed by distributing the total quantity (2) of the series items (Co,Ni) uniformly (e.g. 1 Co, 1 Ni). This is a simplification, but a reasonable one.

Any waters associated with the mineral (e.g. Cobaltkieserite, Co(SO4).H2O) were removed before parsing the formula. Any ionic species such as V4+ in Doloresite (sounds very sad, H8V64+O16) were considered equivalent to their uncharged counterparts.
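Two of these cleaning steps, removing waters and distributing a series uniformly, can be sketched in Python. The actual cleaning for this article relied on a Perl module, so the functions below are hypothetical stand-ins; the water pattern assumes the water is written as a trailing `.H2O` or `.nH2O`.

```python
import re

def strip_waters(formula):
    """Remove a trailing water of hydration such as '.H2O' or '.2H2O'."""
    return re.sub(r"\.\d*H2O$", "", formula)

def distribute_series(members, total):
    """Split the total quantity of a series uniformly among its members,
    e.g. (Co,Ni)2 becomes 1 Co and 1 Ni."""
    share = total / len(members)
    return {member: share for member in members}

print(strip_waters("Co(SO4).H2O"))         # Cobaltkieserite
print(strip_waters("Pb4Cl6CO3.H2O"))       # Barstowite
print(distribute_series(["Co", "Ni"], 2))  # series in Carrollite
```

A series with a number of members that does not divide its total evenly yields fractional counts, which is one reason the table cells are not always integers.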

Some formulae contained unknown absolute quantities of elements, had complex formulae which I did not make any effort to parse, or contained variable elements (e.g. REE - rare earth element). These entries were ignored.

  • Glauconite (unknown absolute quantities) - (K,Na)x+y(Mg,Fe2+)x(Al,Fe3+)2-x[]Si4-y(Al,Fe3+)yO10(OH)2
  • Evenkite (complex representation) - CxH2x+2 (x=19 to 28)
  • Johnsenite-(Ce) (contains generic M code) - Na12(Ce,La,Sr,Ca,M)3Ca6Mn3Zr3W(Si25O73)(CO3)(OH,Cl)2

I was able to clean and parse 4,125 of the formulae from the list (with the help of the Chemistry::File::Formula Perl module). You can download the cleaned formulae, but keep in mind that in order for the module to parse the formulae, I had to adjust some of the subscripts for series so that they were integers. Any subscript >1000 should be divided by 1,000,000 when determining ratios.

Tables of element ratios

You can download the ratio table for elements, the ratio table for elements with transition elements mapped to groups, and the ratio table for element classifications. In each table, the cell (Erow,Ecol)=x is interpreted as the sum of atoms of element Erow in minerals in which element Ecol is found. The remapped element table is a simplified version of the table, in which all transition elements are indicated by their group (e.g. Ib = Cu,Ag,Au). The entries in the cells are not always integers because of fractional contributions from series (see above).

These tables are very complex. No chance of interpreting anything visually, with the exception of the fact that oxygen is abundant. Below, fractional values are shown as <1, and values >100 (or >1000) are shown to two or one significant figure in units of 100 (or 1,000) and colored orange (or red). Zero values are greyed out. For example 576=5.7 (orange), 2355=2.3 (red), 12102=12 (red).
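The display rule for cell values can be written as a small function. This is a sketch under two assumptions that I infer from the examples above rather than from any stated rule: digits are truncated rather than rounded (2355 is shown as 2.3), and values from 1 to 100 are shown as plain integers. The function name and the grey/orange/red return values are illustrative.

```python
def format_cell(value):
    """Return (text, color) for a cell value; color is None for plain text."""
    if value == 0:
        return "0", "grey"
    if value < 1:
        return "<1", None
    if value <= 100:
        return str(int(value)), None
    # Values above 100 are shown in units of 100 (orange),
    # values above 1000 in units of 1,000 (red).
    unit, color = (1000, "red") if value > 1000 else (100, "orange")
    tenths = int(value) * 10 // unit  # scaled value, in tenths, truncated
    if tenths >= 100:                 # two integer digits: drop the decimal
        return str(tenths // 10), color
    return f"{tenths // 10}.{tenths % 10}", color

for v in (0, 0.5, 42, 576, 2355, 12102):
    print(v, format_cell(v))
```

The three examples given above (576, 2355 and 12102) come out as 5.7 (orange), 2.3 (red) and 12 (red).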

Ratio Table for Elements

To illustrate the data in the table, consider the cell pair (C,H) and (H,C) (H hydrogen, C carbon) in the table above. These cells (C,H)=1.2k (number of carbons in all minerals that have hydrogen) and (H,C)=2.7k (number of hydrogens in all minerals that have carbon) suggest that hydrogen mixes with carbon in a 2.7:1.2 or about 2:1 ratio.

Once transition elements are combined into their corresponding groups (Ib ... VIIIb), the table shrinks but remains opaque to visual inspection.

Ratio Table for Elements (transition element remapped to group)

Of all the transition metal groups, VIIIb has the largest affinity towards oxygen, with 4.7k VIIIb elements found in minerals that have oxygen, ahead of group VIIb which mixed 2.6k elements with oxygen.

Finally, by remapping all elements to their classification, the data set can be significantly reduced in complexity.

Ratio Table for Remapped Elements (all elements remapped to classification)

The headers for this table are

  • AM alkali metals
  • AE alkaline earths
  • T transition (includes Lanthanides and Actinides)
  • M metals
  • MD metalloids
  • NM non-metals
  • H halogens
  • N noble gases

Let us now see how the circular visualization approach works for these tables.

Reactivity visualization - first look

To start this example off, I've used the last table in the previous section as the data source. In this table, all elements have been remapped to their classifications, of which there are 8.

Figure Relative abundance and largest contributor pairs to the reactivity table (NM:T, NM:H, NM:AE, NM:MD) can be immediately discerned.

When the element category segments are drawn on an absolute scale, as in the figure above, the relative size of the ends of the ribbon (A and B in the figure above) indicates the proportion in which the element categories mix.

Figure Data patterns are easily discerned from the size, position, and color of the ribbons.

By normalizing the element category segments to be the same size, a picture of relative reactivity emerges. In this kind of image, shown below, the ends of the ribbons are relative to the abundance of the category.

Figure Segment size normalization removes the effect of abundance and focuses on the proportions in which element categories mix, relative to their respective abundance. For example, about 30% of metal combinations involve non-metals.

By coloring ribbons based on their percentile rank (when a ratio layout is used, one needs to decide which end of the ribbon is used to determine its rank), major contributors to the table can be identified.

Figure In the left two figures, ribbon color is inherited from the segment at the larger (or smaller) end of the ribbon. Transparency is determined by the percentile rank of the ribbon. On the right, ribbon color is determined by a false-color map that spans the percentile range.
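The percentile rank that drives the coloring can be computed as in the sketch below. The definition of percentile rank used here (percentage of ribbons with a value less than or equal to the ribbon's value) is one common convention and is an assumption, as are the function names and the sample ribbon values.

```python
def ribbon_value(ab, ba, end="larger"):
    """In a ratio layout a ribbon has two ends; pick which one ranks it."""
    return max(ab, ba) if end == "larger" else min(ab, ba)

def percentile_ranks(values):
    """Percentage of values that are less than or equal to each value."""
    n = len(values)
    return [100.0 * sum(1 for w in values if w <= v) / n for v in values]

# Hypothetical ribbons, each given as the pair of cells (A,B) and (B,A).
ribbon_ends = [(2, 10), (5, 5), (30, 1), (7, 8)]
values = [ribbon_value(ab, ba) for ab, ba in ribbon_ends]
ranks = percentile_ranks(values)
print(values)
print(ranks)
```

The rank can then be mapped to transparency or to a position in a false-color map, or compared against a cutoff so that only the highest-ranked ribbons are colored.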

Let's now turn to creating images from the largest of the reactivity tables - one in which every element is individually represented.

Reactivity visualization - individual elements

When classifications (e.g. NM non-metals) are broken down into their individual elements (e.g. C, N, P, O, S, Se), the visualization becomes much more complex. The figure below shows all the data and, although it is much more approachable than the table itself, would benefit from a little design work.

Figure Relative reactivity of elements found in minerals. Each ribbon encodes the average ratio of a pair of elements.

Clearly oxygen (O) occupies a large fraction of the figure - about half - followed by hydrogen (H) and then silicon (Si). The remaining elements are relatively much less abundant. In particular, transition elements individually contribute little to the image, since their abundance is low. In the image below, I have collected transition elements into their periodic table groups.

Figure A simplified relative reactivity visualization, with transition elements combined into their respective groups. Transition elements are not as abundant as others, and when individually represented pollute the image with a large number of thin ribbons.

Reactivity visualization - looking for patterns

These figures have a great deal of structure, which can be elucidated by adjusting the ribbon layout. First, by ordering the segments in decreasing size and the ribbons in a way that reduces crossing, we can get a sense of the absolute ratios of elements (the ratio is absolute in the sense that a ratio 2:1 has a ribbon with ends of size 2 and 1, whereas a ratio 2000:1000 has a ribbon with ends size 2000 and 1000).

Figure By coloring the ribbons based on the segment at the larger (left) or smaller (right) end, the more reactive element in a pair can be identified.

When segments and ribbons are ordered independently, patterns in the data can emerge and appear as consistently ordered ribbons. Moreover, any break in this order indicates a break in the pattern of the data and an outlier data point. For example, in the image below the segments are ordered based on element abundance and within the segment, the ribbons are ordered by decreasing size of reactivity of that element. Oxygen (O) gives rise to ribbons that order themselves on the O segment in more-or-less the same order as abundance - this can be seen by the fact that most ribbons flow horizontally in the figure. However, there are several ribbons from O that break this pattern - these are ribbons to S and Ca.

Figure Data patterns can be identified by selecting appropriate segment and ribbon ordering. Any ribbons that appear out of place indicate an outlier.

Reactivity visualization - removing effect of abundance

Reactivity for oxygen can be isolated in the figure by coloring only oxygen-related ribbons. This is done in the figure below, in addition to normalization of all segments to equal size (except oxygen, shown at 20x).

Figure Reactivity for oxygen is emphasized in the figure. Since the segments are ordered by abundance, and ribbons by reactivity, the layout of the ribbons reveals any patterns between these two variables.

In most of the figures above, the sizes of the segments were proportional to the row and column total for a given element. This total was proportional to the number of minerals with the element, the number of other elements in these minerals and the number of atoms of the element in each mineral.

By normalizing all values in the table by this total, a picture of relative reactivity emerges. In the figure below, I color only those ribbons that correspond to values above the 94th percentile.

Figure Relative affinity of elements in minerals is shown here. Element pairs showing the largest relative affinity to one another are connected by colored ribbons.

Many of the elements in the figure above that have colored ribbons are the elements that are least abundant. For example, gallium (Ga) only appears in 3 minerals and in those minerals the other elements are oxygen (2/3), hydrogen (2/3), sulphur (1/3) and copper (1/3).

  • CuGaS2
  • Ga(OH)3
  • GaO(OH)

Excluding oxygen (the figure does not show the oxygen segment), the element with highest affinity for gallium is hydrogen - with 4 atoms of hydrogen in these 3 minerals.
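The gallium tallies can be verified with a small formula parser that handles parentheses. This is a sketch written for these three formulae, assuming integer subscripts and properly nested groups; it is not the parser used to build the tables.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a formula with parentheses, e.g. 'Ga(OH)3'."""
    tokens = re.findall(r"[A-Z][a-z]?|\d+|\(|\)", formula)
    stack = [Counter()]
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append(Counter())  # start counting a new group
            i += 1
            continue
        # An integer directly after an element or ')' is its multiplier.
        has_mult = i + 1 < len(tokens) and tokens[i + 1].isdigit()
        mult = int(tokens[i + 1]) if has_mult else 1
        if tok == ")":
            group = stack.pop()
            for element, n in group.items():
                stack[-1][element] += n * mult
        else:
            stack[-1][tok] += mult
        i += 2 if has_mult else 1
    return stack[0]

minerals = ["CuGaS2", "Ga(OH)3", "GaO(OH)"]
presence = Counter()  # number of minerals in which a partner occurs
atoms = Counter()     # total atoms of a partner across the minerals
for formula in minerals:
    for element, n in atom_counts(formula).items():
        if element != "Ga":
            presence[element] += 1
            atoms[element] += n

print(dict(presence))
print(dict(atoms))
```

The output shows oxygen and hydrogen each present in 2 of the 3 minerals, sulphur and copper in 1, and a total of 4 hydrogen atoms, matching the figures quoted above.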

example - visualizing cross tabs from categorical data

In this final, and small, example, I would like to illustrate how categorical data can be visualized, such as the kind that may be the output of a survey.

Consider a survey in which the following information is collected: gender, hair color, eye color and height (cm). The data might look something like this

0       female  f       red     3       green   2       165
1       female  f       blond   0       blue    1       156
2       female  f       brown   1       grey    3       157
3       male    m       brown   1       green   2       165
4       female  f       black   2       blue    1       164
5       female  f       brown   1       green   2       158
996     male    m       red     3       grey    3       179
997     female  f       red     3       green   2       163
998     female  f       black   2       brown   0       161
999     female  f       brown   1       brown   0       160

where the middle three column pairs correspond to the answer (e.g. male) and code (e.g. m) for a question.

These data are already a table, but not the kind that can be input into the tableviewer utility. In order to visualize the data, we need to create a crosstab, which represents the joint distribution of two variables.

           black     blond      brown     red
    green              |
     blue              |
     grey ------- n(blond+grey)
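Building the crosstab is a matter of counting how often each pair of answers occurs together. The sketch below uses only the ten survey records shown above (with the id and code columns dropped), so the counts are for illustration; the variable names are mine.

```python
from collections import Counter

# (gender, hair color, eye color, height in cm) for the records shown above.
records = [
    ("female", "red", "green", 165),
    ("female", "blond", "blue", 156),
    ("female", "brown", "grey", 157),
    ("male", "brown", "green", 165),
    ("female", "black", "blue", 164),
    ("female", "brown", "green", 158),
    ("male", "red", "grey", 179),
    ("female", "red", "green", 163),
    ("female", "black", "brown", 161),
    ("female", "brown", "brown", 160),
]

hair_colors = ["black", "blond", "brown", "red"]
eye_colors = ["green", "blue", "grey", "brown"]

# Cell (eye, hair) holds the number of respondents with that combination.
crosstab = Counter((eye, hair) for _, hair, eye, _ in records)

print(" " * 8 + "".join(f"{h:>7s}" for h in hair_colors))
for eye in eye_colors:
    cells = "".join(f"{crosstab[(eye, h)]:7d}" for h in hair_colors)
    print(f"{eye:>8s}{cells}")
```

Adding gender to the key, for example (gender, eye, hair), gives one count per gender for each combination, which corresponds to drawing a separate ribbon for each gender.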

Figure Joint distribution of hair and eye color is displayed here. For a given hair and eye color combination, each gender has its own ribbon. The gender ribbons are of identical size and suggest no gender dependence on hair and eye color.

Another dimension (height) can be added by mapping the radial position of the ribbons to be a function of this variable. In the figure below, males and females clearly bifurcate based on height.

Figure Representation of five-dimensional data: hair color, eye color, gender, height and count.

Of course, as more information is stacked into the image, the figure's quantitative appeal turns into a purely artistic representation. Take care!