Credit: IBM
Cleansing, processing, and visualizing a data set, Part 3
Someone wise once said, “A picture is worth a thousand words.” Certainly,
there’s truth in that: A picture or graph can reveal a tremendous amount
of information in a short amount of time compared to reading through a
large table of data. In this tutorial, I explore some of the methods you
can use to visualize data, including the R programming environment,
gnuplot, and Graphviz.
The previous tutorial in this series explored machine learning algorithms
for data clustering. This third part of the series focuses on visualizing
data (see Figure 1).
Figure 1. The data processing
pipeline
In Part 2, I used two algorithms to cluster a cleansed data set:
vector quantization and adaptive resonance theory. You can visualize the
resulting clustering data in many ways, but there’s a tremendous amount of
raw data that you can also visualize, such as the algorithmic process that
led up to the clustering.
Data visualization is half art, half science. It’s about more than plot
style or color scheme; it’s about bringing data and art together to
communicate information or insight. The options for
visualizing data range from the standard line, bar, pie, and area charts
to the more complex radar, polar, chord, and tree maps. Regardless of the
visualization type you choose, however, the data that you want to
visualize must match the chosen method so that your viewers understand or
gain insight. In other words, ask a question, and then let the
visualization answer that question for you.
It’s also important to consider the audience for your visualization. This
consideration should go beyond just the technical background of your
audience to include any limitations that audience might have. For example,
worldwide, approximately 1 in 12 men and 1 in 200 women have some form of
color blindness. Color is useful and widespread, but it’s not for
everyone.
Visualizing the original data
set
In Part 1 and Part 2 of this series, I used the zoo data set as an example for
data cleansing and clustering with machine learning algorithms. The zoo
data set consists of 101 feature vectors, each of which contains 18
features (the animal name, the class, a numeric leg count, and 15
binary-valued features). This data set is tiny compared with production
data sets, yet even it can’t be visualized easily. Figure 2 (plotted using Microsoft® Excel) shows
the entire data set in two formats: stacked area and stacked bar plots.
The horizontal axis shows the observations (per animal), and the vertical
axis represents the features. Other than the fact that animals have a
diversity of features, you can’t glean much from this visualization.
Figure 2. Visualizing the entire zoo data
set
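If you’d rather not use Excel, you can produce a similar stacked-bar view directly in R. The following is a minimal sketch (the object names are mine, and it assumes network access to the UCI repository); the column positions follow the layout shown later in Listing 1.
zoo <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data", sep=",")
features <- t(as.matrix(zoo[, 2:17]))   # 16 features (rows) by 101 animals (columns)
barplot(features, border=NA, xlab="Observation", ylab="Stacked feature values")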
A common approach to visualizing highly dimensional data is to reduce the
data set’s dimensionality. One way to do this is principal
component analysis (PCA), a statistical procedure that applies an
orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated
variables called principal components. Because the components
are ordered by how much of the data’s variance they capture, you can keep
just the first two or three to create a lower-dimensional visualization.
For this example, I use the R
programming language, which is an environment for statistical
computing and visualization. As Listing 1 shows, I read the data frame from its source
and indicate that it’s a comma-separated values data set. Typing the data
frame’s name shows the data set that was read and parsed. I extract the
classes from the last column of the data set by using the factor
command, and then create a features frame that consists only of the
features for each observation (which is then displayed, showing that the
name and class columns are omitted).
Listing 1. Using R with PCA to
reduce
dimensionality
> zooDataset <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data", sep="," )
> zooDataset
           V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1    aardvark  1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1   1
2    antelope  1  0  0  1  0  0  0  1   1   1   0   0   4   1   0   1   1
3        bass  0  0  1  0  0  1  1  1   1   0   0   1   0   1   0   0   4
4        bear  1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1   1
...
99       wolf  1  0  0  1  0  0  1  1   1   1   0   0   4   1   0   1   1
100      worm  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0   7
101      wren  0  1  1  0  1  0  0  0   1   1   0   0   2   1   0   0   2
> zooClasses <- factor(zooDataset$V18)
> zooFeatures <- zooDataset[c(2:17)]
> zooFeatures
    V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1    1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1
2    1  0  0  1  0  0  0  1   1   1   0   0   4   1   0   1
3    0  0  1  0  0  1  1  1   1   0   0   1   0   1   0   0
4    1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1
...
99   1  0  0  1  0  0  1  1   1   1   0   0   4   1   0   1
100  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0
101  0  1  1  0  1  0  0  0   1   1   0   0   2   1   0   0
> zooPCA <- prcomp( scale( zooFeatures[,-1] ) )
> plot(zooPCA$x[,1:2], col=zooClasses)
>
The final two steps in Listing 1 are
the meat of the PCA. I use the prcomp
command to perform the PCA on the features data frame, which I first
standardize by using the scale
command. I then plot the first two principal components (the first two
columns of zooPCA$x) and color the points by using the classes that the
zooClasses factor defines. The resulting two-dimensional (2D) plot is
shown in Figure 3.
Figure 3. Plotting the zoo data set in two
dimensions by using PCA
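To see what prcomp is doing under the hood, here is a minimal sketch that uses a small, made-up binary matrix standing in for the feature frame and computes the principal components directly from the eigenvectors of the covariance matrix. Up to sign, the resulting scores match what prcomp returns.
X <- matrix(c(1,0,1,
              0,1,1,
              1,1,0,
              0,0,1,
              1,1,1), ncol=3, byrow=TRUE)   # hypothetical 5 x 3 feature matrix
Xs <- scale(X)                               # center and scale each feature
decomp <- eigen(cov(Xs))                     # orthogonal directions of maximum variance
scores <- Xs %*% decomp$vectors[, 1:2]       # 2D coordinates for each observation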
Note that data loss occurs when you reduce dimensionality with PCA, so you
couldn’t use the resulting plot for clustering. But, the plot does retain
the most important variance in the data and is therefore useful for this
purpose. Other methods for this reduction include kernel PCA and linear
discriminant analysis.
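If you want to check how much variance the first two components actually retain, you can query the prcomp object from Listing 1. A quick sketch, reusing the zooPCA object:
summary(zooPCA)                             # proportion of variance per component
cumsum(zooPCA$sdev^2) / sum(zooPCA$sdev^2)  # cumulative proportion through each component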
Visualizing the algorithmic
process
Viewing the metadata of a machine learning algorithm can be instructive.
For example, you can view the algorithm’s selection of classes for each
observation over iterations of the algorithm. The stacked line plot in
Figure 4 shows the
observations (horizontal axis) and the class (vertical axis) over three
iterations of the ART1 clustering process.
Figure 4. ART1 clustering metadata (class
membership over iterations)
Recall from the last article that ART1 begins with no clusters and allows
the creation of clusters as new observations are found that don’t fit into
the existing clusters. In the first iteration (see the top plot in Figure 4), you can see that the first
observation is placed into cluster 0; then, the next four are placed in
cluster 1. The vigilance parameter, which serves as a barrier to new
cluster creation, holds the number of clusters to two for the first nine
observations. As the first iteration continues, you can see that the
barrier to new cluster creation is slowly overcome, with eight clusters
having been created before the 70th observation. In the last line plot of
Figure 4, by contrast, you can see that the observations have moved to new
clusters because four exist in this case (having been created in the prior
iteration).
The stacked plot in Figure 4 draws your
eye to the differences between each iteration. In the first iteration, you
see the build-out of the clusters as observations are encountered. In the
next iteration (the middle plot), all observations are evaluated against
the prototype vectors, and significant movement occurs. By the last
iteration, only a few observations settle into new classes.
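One simple way to quantify that movement is to compare the class assignments between iterations. The sketch below assumes the t1.dat, t2.dat, and t3.dat files used for Figure 4 and described with Listing 2 (one class index per line):
t1 <- scan("t1.dat")
t2 <- scan("t2.dat")
t3 <- scan("t3.dat")
sum(t1 != t2)   # observations that changed cluster between iterations 1 and 2
sum(t2 != t3)   # observations that changed cluster between iterations 2 and 3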
I created Figure 4 in gnuplot, a portable,
command-line-driven graphing utility. The example in Listing 2 shows that a
multiplot
has been requested (that is, multiple plots in a
single image), with three plots of type linespoints
(that is,
points with lines between them). The pause
command at the end
simply holds the image on the screen until I dismiss it. Each data file
(t*.dat) contains one sample per line, where the sample is the class
assigned to the corresponding observation.
Listing 2. Stacked line plots
of ART clustering
progress
set multiplot layout 3,1 title "ART1 Clustering"
set yrange [-1:9]
plot 't1.dat' with linespoints title 't1'
plot 't2.dat' with linespoints title 't2'
plot 't3.dat' with linespoints title 't3'
unset multiplot
pause -1
Another source of interesting metadata in the ART1 algorithm is the
prototype vectors themselves. Recall that a prototype vector is
the “centroid” for the cluster, and an observation is part of the cluster
if it meets the ART1 membership criteria. Because ART1 operates solely on
binary data, each prototype vector consists of 21 binary-valued features,
where a feature is set if it is relevant to the cluster. Figure 5 provides a
visualization of these prototype vectors using a multiplot
radar plot with gnuplot. For the nine existing clusters, the radar plots
show the features that make up the prototype vectors. Without specifically
naming the feature vectors for each cluster, this plot illustrates the
diversity of the features that the clusters use (in some cases, the same
feature is used by multiple clusters; in others, the feature is used by
only one cluster).
Figure 5. Radar (or spider) plot of the ART1
prototype vectors
The gnuplot script shown in Listing 3 generates the plot in Figure 5. This plot uses multiplot
to
generate nine plots together with polar coordinates. Each data file
contains the prototype feature vector for the given cluster, one line per
feature: the first column is the angle in degrees, in increments of 17.14
(360 degrees divided across the 21 features); the second column is the
feature’s binary value.
Listing 3. Radar plot script
in
gnuplot
unset border
set polar
set angles degrees
set style line 10 lt 1 lc 0 lw 0.3
set grid polar 17.14
set grid ls 10
set xtics axis format "" scale 0
set ytics axis format ""
set size square
set style line 11 lt 1 lw 2 pt 2 ps 2
set multiplot layout 3,3
plot 'cluster1.txt' using 1:2 t "" w lp ls 11
plot 'cluster2.txt' using 1:2 t "" w lp ls 11
plot 'cluster3.txt' using 1:2 t "" w lp ls 11
plot 'cluster4.txt' using 1:2 t "" w lp ls 11
plot 'cluster5.txt' using 1:2 t "" w lp ls 11
plot 'cluster6.txt' using 1:2 t "" w lp ls 11
plot 'cluster7.txt' using 1:2 t "" w lp ls 11
plot 'cluster8.txt' using 1:2 t "" w lp ls 11
plot 'cluster9.txt' using 1:2 t "" w lp ls 11
unset multiplot
pause -1
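In case it’s useful, here is one way to generate a clusterN.txt file in the format this script expects; the prototype values below are purely hypothetical:
proto <- c(1,0,1,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,0,1)      # hypothetical 21-feature prototype
degrees <- (seq_along(proto) - 1) * (360 / length(proto))   # 0, 17.14, 34.29, ...
write.table(data.frame(degrees, proto), "cluster1.txt",
            row.names=FALSE, col.names=FALSE)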
Sometimes, however, simpler is better. A plainer visualization that
presents the clusters side by side for comparison can convey much more
information. Consider Figure 6: It shows the same prototype vectors as
Figure 5, but drawn together so that they are easy to compare. You can see
certain features that are never used (airborne, aquatic, fins, 5 legs) and
features that are commonly used (eggs and breathes).
Figure 6. Prototype vectors for
ART1
On the right side of the image is the number of observations in each
cluster. In one case, a single observation — the “scorpion” — makes up the
entire cluster. Its feature vector is sufficiently distinctive that ART1
isolated it in its own cluster (which is an error, given that a scorpion is
an invertebrate and therefore belongs with the nine other observations of
that class). Cluster 5 represents the mammals, though five observations
were classified elsewhere.
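If you prefer to build this kind of combined view in R rather than gnuplot, a simple heatmap of the prototype matrix gives a comparable picture. The sketch below uses random stand-in values for the 9 x 21 prototype matrix, because the real vectors come from the ART1 run:
protos <- matrix(rbinom(9 * 21, 1, 0.3), nrow=9)   # stand-in for the 9 x 21 prototype vectors
image(x=1:21, y=1:9, z=t(protos),
      col=c("white", "steelblue"),                 # unset vs. set features
      xlab="Feature", ylab="Cluster")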
Visualizing the clusters
In the final visualization, I look at a nontraditional method of
constructing a representation of the data. In this visualization, I use
the Graphviz package to generate
a diagram that represents the clusters that the vector quantization
algorithm produces (see Figure 7).
Figure 7. Visualizing clusters with
Graphviz
In Figure 7, you see a hub (the VQ algorithm) with edges to the seven
clusters that the algorithm formed. Each cluster is defined as a record
that contains the cluster number and the animals that make up that
cluster. Because Graphviz describes images in a graph language, you can see
the construction of the image directly in graph semantics: VQ is defined as
a node, and the -> operator draws an edge between the VQ node and the
Cluster_0 node. Each cluster is constructed from a record, which allows the
grouping of multiple elements. Listing 4 shows the code.
Listing 4. Graphviz dot file
to construct the diagram in Figure 7
(vq.dot)
digraph G {
  size="16,16";
  overlap=scale;
  fontsize=8;
  VQ [shape=box center=true];
  node [shape=record];
  VQ -> Cluster_0
  Cluster_0 [ label = "Cluster0 | { flea, gnat, honeybee, housefly, ladybird, moth, slug, termite, wasp, worm } " ]
  VQ -> Cluster_1
  Cluster_1 [ label = "Cluster1 | { { chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich } | { parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren } }" ]
  VQ -> Cluster_2
  Cluster_2 [ label = "Cluster2 | { { bass, carp, catfish, chub, dogfish, dolphin, haddock, herring, pike } | { piranha, porpoise, seahorse, sea snake, sole, stingray, tuna } }" ]
  VQ -> Cluster_3
  Cluster_3 [ label = "Cluster3 | frog, frog, newt, pit viper, slowworm, toad, tortoise, tuatara" ]
  VQ -> Cluster_4
  Cluster_4 [ label = "Cluster4 | { { aardvark, antelope, bear, boar, buffalo, calf, cheetah, deer, elephant, giraffe} | { girl, goat, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus} | { polecat, pony, puma, pussycat, raccoon, reindeer, seal, sea lion, wolf } }" ]
  VQ -> Cluster_5
  Cluster_5 [ label = "Cluster5 | clam, crab, crayfish, lobster, octopus, scorpion, sea wasp, starfish" ]
  VQ -> Cluster_6
  Cluster_6 [ label = "Cluster6 | cavy, fruit bat, gorilla, hamster, hare, squirrel, vampire, vole, wallaby" ]
}
To generate the image in Figure 7, I invoke the following Graphviz
command:
$ neato -Tgif -o out.gif vq.dot
The dot
command generates hierarchical layouts of directed
graphs, but neato
is better suited to undirected graphs that
use spring model layouts (where edges correspond to springs and vertices
correspond to equally charged bodies). Both of these utilities are part of
the Graphviz package.
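Finally, if your cluster assignments live in R, you can generate a dot file like Listing 4 programmatically. This is a minimal sketch with a hypothetical, abbreviated clusters list:
clusters <- list(c("flea", "gnat", "honeybee"),      # hypothetical, abbreviated memberships
                 c("chicken", "crow", "dove"))
lines <- c("digraph G {", "  VQ [shape=box];", "  node [shape=record];")
for (i in seq_along(clusters)) {
  node  <- sprintf("Cluster_%d", i - 1)
  label <- sprintf("Cluster%d | { %s }", i - 1, paste(clusters[[i]], collapse=", "))
  lines <- c(lines, sprintf("  VQ -> %s", node),
                    sprintf("  %s [ label = \"%s\" ]", node, label))
}
writeLines(c(lines, "}"), "vq.dot")                  # then: neato -Tgif -o out.gif vq.dot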
Going further
This tutorial explored some of the more useful applications for visualizing
data and a few of the approaches that you can use to generate that
visualization. The R programming language is a popular and expansive
solution for processing and visualizing data, and it’s a common element in
the data scientist’s toolbox. Gnuplot is another popular visualization
application that offers a variety of plotting options. Finally, Graphviz
is an ideal solution for visualizing data in the form of graphs.