Cleansing, processing, and visualizing a data set, Part 3: Visualizing data

January 14, 2019

Credit: IBM




This content is part 3 of 3 in the series: Cleansing, processing, and visualizing a data set.

Someone wise once said, “A picture is worth a thousand words.” Certainly,
there’s truth in that: A picture or graph can reveal a tremendous amount
of information in a short amount of time compared to reading through a
large table of data. In this tutorial, I explore some of the methods you
can use to visualize data, including the R programming environment,
gnuplot, and Graphviz.

The previous tutorial in this series explored machine learning algorithms
for data clustering. This third part of the series focuses on visualizing
data (see Figure 1).

Figure 1. The data processing pipeline

Three blocks in a row labeled data cleansing, machine learning, and data visualization

In Part 2, I used two algorithms to cluster a cleansed data set:
vector quantization and adaptive resonance theory. You can visualize the
resulting clustering data in many ways, but there’s a tremendous amount of
raw data that you can also visualize, such as the algorithmic process that
led up to the clustering.

Data visualization is half art, half science. It's about more than plot style or color scheme: it's about bringing data together with art to communicate information or insight. The options for visualizing data range from the standard line, bar, pie, and area charts to the more complex radar, polar, chord, and tree maps. Regardless of the visualization type you choose, however, the data that you want to visualize must match the chosen method so that your viewers understand it or gain insight from it. In other words, ask a question, and then let the visualization answer that question for you.

It’s also important to consider the audience for your visualization. This
consideration should go beyond just the technical background of your
audience to include any limitations that audience might have. For example,
worldwide, approximately 1 in 12 men and 1 in 200 women have some form of
color blindness. Color is useful and widespread, but it’s not for
everyone.

Visualizing the original data set

In Part 1 and Part 2 of this series, I used the zoo data set as an example for data cleansing and clustering with machine learning algorithms. The zoo data set consists of 101 feature vectors, each of which contains 18 features (animal name, class, a numeric leg count, and 15 binary-valued features). This data set is tiny compared with production data sets, but even so it can't be visualized easily. Figure 2 (plotted using Microsoft® Excel) shows the entire data set in two formats: stacked area and stacked bar plots. The horizontal axis shows the observations (one per animal), and the vertical axis represents the features. Other than the fact that animals have a diversity of features, you can't glean much from this visualization.

Figure 2. Visualizing the entire zoo data set

Visual observation of the entire data set
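
If you're following along in R rather than Excel, a rough analogue of the stacked bar view is a one-liner. This is a hedged sketch, assuming the zooFeatures data frame that's built in Listing 1 below:

# Each bar stacks one animal's feature values (an approximation of Figure 2)
barplot(t(as.matrix(zooFeatures)), border=NA,
        xlab="Observation", ylab="Stacked feature values")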

A common approach to visualizing highly dimensional data is to reduce the data set's dimensionality. One way to do this is called principal component analysis (PCA). PCA is a statistical process that converts a set of observations of correlated variables into a set of linearly uncorrelated values (called principal components) by applying an orthogonal transformation to the data so that a lower-dimensional visualization can be created (in two or three dimensions).

For this example, I use the R programming language, which is an environment for statistical computing and visualization. As Listing 1 shows, I read the data frame from its source and indicate that it's a comma-separated values data set. Typing the data frame's name shows the data set that was read and parsed. I extract the classes from the data set by using the factor command (the last column of the data set), and then create a features frame that consists only of the features for each observation (which is then displayed, showing the omission of the name and class).

Listing 1. Using R with PCA to reduce dimensionality
> zooDataset <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data", sep="," )
> zooDataset
          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1   aardvark  1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1   1
2   antelope  1  0  0  1  0  0  0  1   1   1   0   0   4   1   0   1   1
3       bass  0  0  1  0  0  1  1  1   1   0   0   1   0   1   0   0   4
4       bear  1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1   1
...
99      wolf  1  0  0  1  0  0  1  1   1   1   0   0   4   1   0   1   1
100     worm  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0   7
101     wren  0  1  1  0  1  0  0  0   1   1   0   0   2   1   0   0   2
> zooClasses <- factor(zooDataset$V18)
> zooFeatures <- zooDataset[c(2:17)]
> zooFeatures
    V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1    1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1
2    1  0  0  1  0  0  0  1   1   1   0   0   4   1   0   1
3    0  0  1  0  0  1  1  1   1   0   0   1   0   1   0   0
4    1  0  0  1  0  0  1  1   1   1   0   0   4   0   0   1
...
99   1  0  0  1  0  0  1  1   1   1   0   0   4   1   0   1
100  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0
101  0  1  1  0  1  0  0  0   1   1   0   0   2   1   0   0
> zooPCA <- prcomp( scale( zooFeatures[,-1] ) )
> plot(zooPCA$x[,1:2], col=zooClasses)
>

The final two steps in Listing 1 are the meat of the PCA. I use the prcomp command to perform the PCA on the given data, which I scale from my features data frame by using the scale command. I then plot the first two principal components from the PCA object, using the classes that the zooClasses frame defines for colorizing. The resulting two-dimensional (2D) plot is shown in Figure 3.
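
One sanity check worth adding (my suggestion; it isn't in the original listing) is to ask how much of the data's variance the first two components actually retain before trusting the 2D view:

summary(zooPCA)                              # importance of each component
cumsum(zooPCA$sdev^2) / sum(zooPCA$sdev^2)   # cumulative proportion of variance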

Figure 3. Plotting the zoo data set in two dimensions by using PCA

2D plot of the zoo data set

Note that data loss occurs when you reduce dimensionality with PCA, so you
couldn’t use the resulting plot for clustering. But, the plot does retain
the most important variance in the data and is therefore useful for this
purpose. Other methods for this reduction include kernel PCA and linear
discriminant analysis.
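
As a hedged sketch of one such alternative, the snippet below applies linear discriminant analysis through R's MASS package. It assumes the zooFeatures and zooClasses objects from Listing 1, and on a small binary data set like this one, lda() may warn about collinear or constant variables:

library(MASS)
zooLDA <- lda(zooFeatures, grouping=zooClasses)  # supervised reduction
zooProj <- predict(zooLDA)$x                     # discriminant scores
plot(zooProj[,1:2], col=zooClasses)              # 2D view, colored by class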

Visualizing the algorithmic process

Viewing the metadata of a machine learning algorithm can be instructive.
For example, you can view the algorithm’s selection of classes for each
observation over iterations of the algorithm. The stacked line plot in
Figure 4 shows the
observations (horizontal axis) and the class (vertical axis) over three
iterations of the ART1 clustering process.

Figure 4. ART1 clustering metadata (class membership over iterations)

Plots for each of 3 iterations of class membership

Recall from the previous tutorial that ART1 begins with no clusters and creates clusters as it encounters new observations that don't fit into the existing ones. In the first iteration (see the top plot in Figure 4), you can see that the first observation is placed into cluster 0; then, the next four are placed in cluster 1. The vigilance parameter, which serves as a barrier to new cluster creation, holds the number of clusters to two for nine observations. As the first iteration continues, the barrier to new cluster creation is slowly overcome, with eight clusters having been created before the 70th observation. In the later line plots of Figure 4, you can see that observations have moved to new clusters, which exist in those passes because they were created in the prior iterations.
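
To make the role of the vigilance parameter concrete, here is a simplified sketch in R of the membership test alone. This is my own illustration, not the series' Part 2 implementation, and it omits ART1's choice test and prototype update:

# Vigilance test for one observation against one cluster prototype.
# x and p are binary feature vectors; rho is the vigilance parameter.
passesVigilance <- function(x, p, rho) {
  sum(x & p) / sum(x) >= rho   # share of x's set features matched by p
}
passesVigilance(c(1,0,1,1), c(1,0,0,1), rho=0.6)   # TRUE: 2 of 3 set features match

A higher rho makes membership harder to attain, so more observations fail the test and more clusters get created.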

The stacked plot in Figure 4 draws your
eye to the differences between each iteration. In the first iteration, you
see the build-out of the clusters as observations are encountered. In the
next iteration (the middle plot), all observations are evaluated against
the prototype vectors, and significant movement occurs. By the last
iteration, only a few observations settle into new classes.

I created Figure 4 in gnuplot, a portable, command-line-driven graphing utility. The example in Listing 2 shows that a multiplot has been requested (that is, multiple plots in a single image), with three plots of type linespoints (that is, points with lines between them). The pause command at the end simply holds the image on the screen until I dismiss it. Each data file (t*.dat) contains one sample per line, where each sample represents the class for an observation.

Listing 2. Stacked line plots of ART clustering progress
set multiplot layout 3,1 title "ART1 Clustering"
set yrange [-1:9]
plot 't1.dat' with linespoints title 't1'
plot 't2.dat' with linespoints title 't2'
plot 't3.dat' with linespoints title 't3'
unset multiplot
pause -1

Another source of interesting metadata in the ART1 algorithm is the prototype vectors themselves. Recall that a prototype vector is the “centroid” for a cluster, and an observation is part of the cluster if it meets the ART1 membership criteria. Because ART1 operates solely on binary data, each prototype vector consists of 21 binary-valued features, where a feature is set if it is relevant for the cluster. Figure 5 provides a visualization of these prototype vectors using a multiplot radar plot with gnuplot. For the nine existing clusters, the radar plots show the features that make up the prototype vectors. Without specifically naming the feature vectors for each cluster, this plot illustrates the diversity of the features that the clusters use (in some cases, the same feature is used by multiple clusters; in others, a feature is used by only one cluster).

Figure 5. Radar (or spider) plot of the ART1 prototype vectors

9 circular plots, one for each prototype vector

The gnuplot script shown in Listing 3 generates the plot in Figure 5. This script uses multiplot to generate nine plots together with polar coordinates. Each data file contains the prototype feature vector for the given cluster, one line per feature (the first column is the angle in degrees, in increments of 17.14, that is, 360/21; the second is the feature's binary value). A sketch of one way to produce these files appears after the listing.

Listing 3. Radar plot script in gnuplot
unset border
set polar
set angles degrees

set style line 10 lt 1 lc 0 lw 0.3

set grid polar 17.14
set grid ls 10

set xtics axis format "" scale 0
set ytics axis format ""

set size square 

set style line 11 lt 1 lw 2 pt 2 ps 2

set multiplot layout 3,3

plot 'cluster1.txt' using 1:2 t "" w lp ls 11
plot 'cluster2.txt' using 1:2 t "" w lp ls 11
plot 'cluster3.txt' using 1:2 t "" w lp ls 11
plot 'cluster4.txt' using 1:2 t "" w lp ls 11
plot 'cluster5.txt' using 1:2 t "" w lp ls 11
plot 'cluster6.txt' using 1:2 t "" w lp ls 11
plot 'cluster7.txt' using 1:2 t "" w lp ls 11
plot 'cluster8.txt' using 1:2 t "" w lp ls 11
plot 'cluster9.txt' using 1:2 t "" w lp ls 11

unset multiplot

pause -1
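
As promised above, here is a hypothetical R snippet (not part of the original series code) that writes one 21-feature prototype vector in the two-column angle/value format that this script plots:

proto <- c(1,0,1,1,0,0,1,0,1,1,0,0,0,1,0,0,1,0,0,1,0)    # placeholder prototype
angles <- (seq_along(proto) - 1) * (360 / length(proto))  # ~17.14-degree steps
write.table(data.frame(angles, proto), "cluster1.txt",
            row.names=FALSE, col.names=FALSE)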

Sometimes, however, simpler is better. Constructing a plainer visualization that lets you compare the clusters directly can yield much more information. Consider the visualization in Figure 6: It shows the prototype vectors from Figure 5 but visualized together, which permits an easier comparison. You can see certain features that are never used (airborne, aquatic, fins, 5 legs) and features that are commonly used (eggs and breathes).

Figure 6. Prototype vectors for ART1

Vectors for 8 ART1 clusters

On the right side of the image is the number of observations for each cluster. In one case, a single observation (the “scorpion”) makes up the cluster. Its feature vector is sufficiently unique that ART1 isolated it in its own cluster (which is an error, given that a scorpion is an invertebrate and so belongs with the nine other observations of that class). Cluster 5 represents the mammals, though five observations were classified elsewhere.
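
One way to build this kind of combined view in R is to render the prototype vectors as rows of a binary image. This is a hypothetical sketch; protos stands in for the real prototype matrix, which the article doesn't reproduce:

protos <- matrix(sample(0:1, 9 * 21, replace=TRUE), nrow=9)  # placeholder data
image(1:21, 1:9, t(protos), col=c("white", "black"),         # one row per cluster
      xlab="Feature", ylab="Cluster")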

Visualizing the clusters

In the final visualization, I look at a nontraditional method of
constructing a representation of the data. In this visualization, I use
the Graphviz package to generate
a diagram that represents the clusters that the vector quantization
algorithm produces (see Figure 7).

Figure 7. Visualizing clusters with Graphviz

Hub-and-spoke–type visualization

In Figure 7, you see a hub (the algorithm, VQ), with edges from the algorithm to the seven clusters that it formed. Each cluster is defined as a record that contains the cluster number and the animals that make up that cluster. Because Graphviz describes diagrams in a graph language, you can see the construction of the image directly in graph semantics: VQ is defined as a node, and the -> operator draws an edge between the VQ node and the Cluster_0 node. Each cluster is constructed from records, which allow the grouping of multiple elements. Listing 4 shows the code.

Listing 4. Graphviz dot file to construct the diagram in Figure 7 (vq.dot)
digraph G {
size="16,16";
overlap=scale;
fontsize=8;
VQ [shape=box center=true];
node [shape=record];
VQ -> Cluster_0
Cluster_0 [ label = "Cluster0 | { flea, gnat, honeybee, housefly, ladybird,
                                  moth, slug, termite, wasp, worm } " ]
VQ -> Cluster_1
Cluster_1 [ label = "Cluster1 | { { chicken, crow, dove, duck, flamingo,
                                    gull, hawk, kiwi, lark, ostrich } |                                
                                  { parakeet, penguin, pheasant, rhea,
                                    skimmer, skua, sparrow, swan, vulture,
                                    wren } }" ]
VQ -> Cluster_2
Cluster_2 [ label = "Cluster2 | { { bass, carp, catfish, chub, dogfish,
                                    dolphin, haddock, herring, pike } |
                                  { piranha, porpoise, seahorse, sea snake,
                                    sole, stingray, tuna } }" ]
VQ -> Cluster_3
Cluster_3 [ label = "Cluster3 | frog, frog, newt, pit viper, slowworm, toad,
                                tortoise, tuatara" ]
VQ -> Cluster_4
Cluster_4 [ label = "Cluster4 | { { aardvark, antelope, bear, boar, buffalo,
                                    calf, cheetah, deer, elephant, giraffe} |     
                                  { girl, goat, leopard, lion, lynx, mink,
                                    mole, mongoose, opossum, oryx, platypus} |                                 
                                  { polecat, pony, puma, pussycat, raccoon,
                                    reindeer, seal, sea lion, wolf } }" ]
VQ -> Cluster_5
Cluster_5 [ label = "Cluster5 | clam, crab, crayfish, lobster, octopus,
                                scorpion, sea wasp, starfish" ]
VQ -> Cluster_6
Cluster_6 [ label = "Cluster6 | cavy, fruit bat, gorilla, hamster, hare,
                                squirrel, vampire, vole, wallaby" ]
}

To generate the image in Figure 7, invoke Graphviz's neato command as follows:

$ neato -Tgif -o out.gif vq.dot

The dot command generates hierarchical layouts of directed
graphs, but neato is better suited to undirected graphs that
use spring model layouts (where edges correspond to springs and vertices
correspond to equally charged bodies). Both of these utilities are part of
the Graphviz package.
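
If you prefer a hierarchical, top-down rendering of the same file, dot accepts the same output flags:

$ dot -Tgif -o out.gif vq.dot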

Going further

This tutorial explored some of the more useful applications for visualizing data and a few of the approaches that you can use to generate those visualizations. The R programming language is a popular and expansive environment for processing and visualizing data, and it's a common element in the data scientist's toolbox. Gnuplot is another popular visualization application that offers a variety of plotting options. Finally, Graphviz is an ideal solution for rendering data in the form of graphs.

