Dimensional models are conventionally based on parameters such as business process,
grain, dimensions, and facts or measurements. But there are business scenarios where
valuable unstructured information is stored around the grains or at different levels
of hierarchy within the data model.
Using the linguistic algorithms of natural language processing (NLP), IBM Watson
Explorer can understand all of the words in a question, not just the keywords. In fact, it can analyze grammar, context, jargon, idioms, and even mode-of-thought. It can
then draw statistical inferences and make recommendations or provide advice within a
given confidence level.
This tutorial describes how to analyze a hierarchical data structure by using IBM Watson
Explorer within the context of content analytics while retaining the hierarchical
structure or grains of the underlying data model. This strategy gives you
the flexibility and capability of analyzing aspects of unstructured content and
structured dimensions together. I demonstrate step-by-step how to
overcome the issue of multiplicity while performing content analysis with
IBM Watson Explorer and IBM Watson Content Analytics Miner.
In preparing this tutorial, I used the following software components:
- IBM Watson Explorer Content Analytics Version 11.0.2
- IBM Content Analytics Studio Version 11
- Eclipse Java EE IDE for Web Developers
- IBM dashDB
- JDBC Driver for DB2
- Windows Server 2012 R2
Problem description
I have constructed this tutorial using a simple example of an incident management
system. Although this is a very simple example, the concepts it illustrates can be
easily extended to more complex use case scenarios, either in incident management
or in other systems such as quality management, case management, workflow
management, or content management.
Figure 1 through Figure 3 show the entity relation diagram, the fact table, and the
hierarchical representation of the example. Consider the data structure used for
incident reporting shown in Figure 1.
Figure 1. Entity relation diagram: Incident management system
The system illustrated here is an example of an Operational Source System. (For more
information about Operational Source Systems, see The Data Warehouse
Toolkit.) This system consists of a
transaction header row that is associated with multiple transaction lines. With
header/line schemas (also known as parent-child schemas), all of the header-level
dimension foreign keys and “degenerate dimensions” should be included on the line-level fact table
with differing granularity at each level. For this tutorial,
I used INCIDENT_CODE, INCIDENT_LINE_ITEM, INVESTIGATION_CODE, RESOLUTION_CODE, and
CLOSURE_CODE to create the fact table in Figure 2, which was
generated from the data model shown in Figure 1. I added a few rows of example
data to the fact table to build and demonstrate the solution.
Figure 2. Fact table structure & example data
By virtue of the denormalized structure of the derived fact table, only the leaf
level is unique.
As shown in Figure 3, a symmetrical hierarchical data structure with 5 levels and
3 children at each node produces 81 rows for each root-level entry. The number of
times each structured or unstructured field value is repeated grows from 3 to 81 as
you move from the level just above the leaves up to the root; the short sketch after
Figure 3 makes this arithmetic concrete. (Note: Figure 3 represents a fragment of the
hierarchical structure representation. For more details, see the sample data file in
GitHub.)
Figure 3. Hierarchical view of example data
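To make the arithmetic concrete, the following minimal Java sketch (an illustration I
added for this discussion, not part of the tutorial's code) computes how often a value
at each level is repeated in the flattened fact table, assuming a symmetrical tree with
3 children per node and 5 levels:

```java
public class MultiplicityDemo {
    public static void main(String[] args) {
        final int children = 3; // children per node
        final int levels = 5;   // depth of the hierarchy

        // A value stored at level k is copied into every leaf row beneath it,
        // so it appears children^(levels - k) times in the flattened table.
        for (int level = 1; level <= levels; level++) {
            long repeats = (long) Math.pow(children, levels - level);
            System.out.printf("Level %d value repeats %d times%n", level, repeats);
        }
        // Output: 81, 27, 9, 3, 1 -- only the leaf level (5) is unique.
    }
}
```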
Everything that IBM Watson Explorer crawls is considered a document or a single
entity (that is, each row of the database table is treated as a separate document),
and that document is the only grain available to the index. Unfortunately, these
multiple occurrences of unstructured content produce misleading statistics when
analyzed with IBM Watson Explorer. For example, a given text in the
INCIDENT_DESCRIPTION field at level 2, as shown in Figure 1 through Figure 3, is
counted nine times (once per row at the child levels) for a single
INVESTIGATION_CODE value.
This tutorial gives you a strategy to overcome this problem of multiplicity. I
demonstrate how to create a custom database crawler plug-in to crawl the derived
fact table in a specific way so that the dimensions of all levels of the hierarchy
are retained but multiplicity is eliminated.
What you’ll need to build your application
To build your own application, you’ll need the following skills and tools:
- Working knowledge of IBM Watson Explorer (Advanced Edition)
- Working knowledge of Content Analytics Studio and an understanding of Apache UIMA content analytics concepts
- IBM Watson Explorer Content Analytics Version 11.0.2
- IBM Watson Explorer Content Analytics Studio Version 11
- IBM Cloud account (optional)
- IBM dashDB (or any standard DBMS)
- Understanding of DBMS concepts
- Java development skills
Set up the content analytics collection
To create the content analytics collection:
- Note the following settings and options for use by the collection.

  | Option | Value |
  | --- | --- |
  | **General options** | |
  | Collection name | ATutorialCollection |
  | Collection type | Content analytics collection |
  | Solution package | Do not apply a solution package |
  | Document cache | Enable the document cache |
  | Thumbnail generation (required for showing thumbnail images in the search results) | Do not enable thumbnail generation |
  | **Advanced options** | |
  | Description | Collection for demonstrating the concept of Contextual Content Analytics of Hierarchical Data – 02/04/2018 |
  | Collection security | Do not enable security for the collection |
  | Overlay index | Do not enable overlay index |
  | N-gram segmentation | Do not enable N-gram segmentation |

  Note: For more details, see “Administering Watson Explorer Content Analytics” in the IBM Knowledge Center.
- Create the fact table.
  A sample fact table, shown in Figure 2, is created in IBM dashDB and contains the
  sample data. Figure 4 shows the sample data in the dashDB console. Detailed steps
  for creating the table are outside the scope of this tutorial; you can use any
  standard relational database.

  Figure 4. Sample data in dashDB
- Create the JDBC crawler.

  Follow the standard process for creating a JDBC crawler, as described under
  “Crawler administration” in the IBM Knowledge Center. The following table lists
  the JDBC connection details for reference.

  | Parameter | Value |
  | --- | --- |
  | JDBC driver name | com.ibm.db2.jcc.DB2Driver |
  | JDBC driver class path | F:\DevW\Tutorial\jars |
  | Database URL | jdbc:db2://awh-yp-small02.services.dal.bluemix.net:50000/BLUDB |

  The JDBC driver used to connect to dashDB is the same as the DB2 driver and can
  be downloaded from the IBM Support site. (A small stand-alone connectivity check
  follows.)
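  Before configuring the crawler, it can help to confirm that the driver and URL work
  outside of Watson Explorer. Here is a minimal Java sketch I wrote for that check;
  the user name, password, and table name are placeholders that you would replace
  with your own dashDB credentials and fact table name:

  ```java
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class DashDBConnectionCheck {
      public static void main(String[] args) throws Exception {
          // Same driver and URL as configured for the JDBC crawler
          Class.forName("com.ibm.db2.jcc.DB2Driver");
          String url = "jdbc:db2://awh-yp-small02.services.dal.bluemix.net:50000/BLUDB";

          // Placeholder credentials -- substitute your own dashDB user/password
          try (Connection conn = DriverManager.getConnection(url, "dbUser", "dbPassword");
               Statement stmt = conn.createStatement();
               // FACT_TABLE is a placeholder for the fact table you created
               ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM FACT_TABLE")) {
              rs.next();
              System.out.println("Rows visible to the crawler: " + rs.getLong(1));
          }
      }
  }
  ```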
- Populate the index field and facet mapping.

  While creating the crawler, populate the index field mapping and the facet tree
  according to the following table. The index field names are important because
  these fields are used within the custom crawler plug-in for the custom
  de-duplication crawling process.

  | Field name | Returnable | Faceted search | Free text search | In summary | Fielded search | Parametric search | Analyzable | Unique identifier | Facet mapping | Group |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | ID | | | | | | | | Yes | | |
  | INCIDENT_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Incident Code | Incident Header |
  | INCIDENT_TYPE_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Incident Type Code | Incident Header |
  | REQUESTED_BY | Yes | Yes | Yes | Yes | Yes | No | No | | Raised By | Incident Header |
  | AFFECTED_ITEM | Yes | Yes | Yes | Yes | Yes | No | No | | Affected Product | Incident Header |
  | HELP_DESK_ID | Yes | Yes | Yes | Yes | Yes | No | No | | Agent Code | Incident Header |
  | INCIDENT_LINE_ITEM | Yes | Yes | Yes | Yes | Yes | No | No | | Incident Line Item | Incident Details |
  | ASSIGNED_DEPT | Yes | Yes | Yes | Yes | Yes | No | No | | Assigned to Dept. | Incident Details |
  | INCIDENT_PART_NO | Yes | Yes | Yes | Yes | Yes | No | No | | Affected Part(s) | Incident Details |
  | INCIDENT_DESCRIPTION | Yes | No | Yes | Yes | No | No | Yes | | | |
  | INVESTIGATION_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Investigation Code | Investigation Details |
  | DIAGNOSIS_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Diagnosis Code | Investigation Details |
  | DEFECTIVE_PARTS | Yes | Yes | Yes | Yes | Yes | No | No | | Defective Part(s) | Investigation Details |
  | DIAGNOSIS_DESCRIPTION | Yes | No | Yes | Yes | No | No | Yes | | | |
  | RESOLUTION_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Resolution Code | Resolution Details |
  | R_DEPT_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Resolution Dept. Code | Resolution Details |
  | R_WORK_CENTER | Yes | Yes | Yes | Yes | Yes | No | No | | Resolution Work Center | Resolution Details |
  | RESOLUTION_DESC | Yes | No | Yes | Yes | No | No | Yes | | | |
  | CLOSURE_CODE | Yes | Yes | Yes | Yes | Yes | No | No | | Closure Code | Final Disposition Logs |
  | CLOSING_COMMENT | Yes | No | Yes | Yes | No | No | Yes | | | |

  Figure 5 shows the facet tree hierarchy that is displayed in IBM Watson Explorer
  Content Analytics Miner. After the facet tree is created, the table is crawled,
  initially with the above configuration.

  Figure 5. Facet tree structure
- Create a defect dictionary and deploy it in the content analytics server.

  The following table lists the keywords of the defect dictionary along with their
  parts of speech and inflections.

  | Defect (keyword) | Part of speech | Inflection(s) |
  | --- | --- | --- |
  | damage | Noun | damages |
  | damage | Verb | damaged, damages, damaging |
  | foul | Verb | fouled, fouling, fouls |
  | foul | Noun | fouls |
  | crack | Noun | cracks |
  | crack | Verb | cracked, cracking, cracks |
  | clip | Verb | clipped, clipping, clips |
  | clip | Noun | clips |
  | break | Noun | breaks |
  | break | Verb | breaking, breaks, broke, broken |
  | rip | Verb | ripped, ripping, rips |
  | rip | Noun | rips |
  | abrasion | Noun | abrasions |
  | mark | Noun | marks |
  | mark | Verb | marked, marking, marks |
  | seal | Verb | sealed, sealing, seals |
  | seal | Noun | seals |
  | scrap | Noun | scraps |
  | scrap | Verb | scrapped, scrapping, scraps |

  The defect dictionary is generated from these keywords and is used to analyze the
  unstructured text content in the fact table. For a more detailed understanding,
  see the sample data file in GitHub. (A toy sketch of this kind of dictionary
  matching follows this list.)

  Figure 6 shows how the dictionary is organized within IBM Watson Content
  Analytics Studio. For more information, see “Content Analytics Studio for advance
  text analytics” in the IBM Knowledge Center.

  Figure 6. Defect dictionary in Content Analytics Studio

  After the defect dictionary is created, the configured UIMA pipeline is exported
  to the content analytics server. (Detailed steps for creating the dictionary and
  exporting the UIMA pipeline are beyond the scope of this tutorial. This section
  is for reference only, to maintain the continuity of the tutorial.)
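For intuition only, here is a small Java sketch of the kind of keyword-plus-inflection
matching the dictionary performs. This is not the UIMA annotator that Content
Analytics Studio generates; it is a simplified stand-in I wrote, and the dictionary
shown is abbreviated to three of the keywords from the table above:

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefectDictionarySketch {
    public static void main(String[] args) {
        // Abbreviated dictionary: base keyword -> known inflections
        Map<String, List<String>> dictionary = Map.of(
                "foul", List.of("foul", "fouled", "fouling", "fouls"),
                "crack", List.of("crack", "cracked", "cracking", "cracks"),
                "clip", List.of("clip", "clipped", "clipping", "clips"));

        // Sample text from the fact table's INCIDENT_DESCRIPTION field
        String text = "PART-01 is fouling and cracked or clipped";

        dictionary.forEach((keyword, inflections) -> {
            // One alternation per keyword, matched on word boundaries
            Pattern p = Pattern.compile("\\b(" + String.join("|", inflections) + ")\\b",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(text);
            while (m.find()) {
                System.out.println("Defect keyword '" + keyword + "' matched: " + m.group(1));
            }
        });
    }
}
```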
Generate initial output without the custom crawler plug-in
- Get the initial statistics in IBM Watson Explorer Content Analytics Miner.

  After the collection has been crawled, indexed, and made ready for analysis, use
  IBM Watson Explorer Content Analytics Miner to verify the initial statistics.
  Note that in Figure 7, 57 occurrences of the “fouling” keyword and 50 occurrences
  of the “scrap” keyword are found in the table, followed by the other keywords.

  Figure 7. Initial statistics in IBM Watson Explorer Content Analytics Miner
- Now query the documents (a single row of the table) in IBM Watson Explorer
  Content Analytics Miner by selecting the keyword “fouling” in the following
  query:

  ```
  [Query Text : (*:*) AND (keyword::/"Explore"/"Defect Type(s)"/"fouling")]
  ```

  If you look at the sample fact table, you’ll see that the INCIDENT_DESCRIPTION
  field contains the same value repeated 20 times because of the join with the
  INVESTIGATION_DETAILS table. This is misleading and is an example of the
  multiplicity problem that is inherent in the nature of relational joins. In the
  next section, I show you how to create a custom plug-in to overcome this issue.
  (A plain-Java simulation of the desired de-duplication follows this table.)

  In the table below, INCIDENT_CODE comes from INCIDENT_HDR; INCIDENT_LINE_ITEM
  and INCIDENT_DESCRIPTION come from INCIDENT_DETAILS.

  | ID | INCIDENT_CODE | INCIDENT_LINE_ITEM | INCIDENT_DESCRIPTION |
  | --- | --- | --- | --- |
  | 1 | D1 | D1.1 | PART-01 is fouling and cracked or clipped |
  | 2 | D1 | D1.1 | PART-01 is fouling and cracked or clipped |
  | 3 | D1 | D1.1 | PART-01 is fouling and cracked or clipped |
  | … | … | … | (rows 4 through 20 repeat the same values) |
  | 21 | D1 | D1.2 | PART-02 is broken or damaged |
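Before building the actual plug-in, the effect we want can be simulated in plain Java.
The following sketch is my own illustration, not tutorial code: it counts the keyword
“fouling” across the 21 rows above, first naively and then keeping only the first
description per INCIDENT_LINE_ITEM, which is the essence of the de-duplication the
plug-in performs:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSimulation {
    public static void main(String[] args) {
        // 21 joined rows: {INCIDENT_LINE_ITEM, INCIDENT_DESCRIPTION}
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            rows.add(new String[] {"D1.1", "PART-01 is fouling and cracked or clipped"});
        }
        rows.add(new String[] {"D1.2", "PART-02 is broken or damaged"});

        int naive = 0;
        int deduped = 0;
        Set<String> seenLineItems = new HashSet<>();
        for (String[] row : rows) {
            boolean hasFouling = row[1].contains("fouling");
            if (hasFouling) naive++;
            // Count the description only the first time its line item appears
            if (seenLineItems.add(row[0]) && hasFouling) deduped++;
        }
        System.out.println("Naive count of 'fouling': " + naive);   // 20
        System.out.println("De-duplicated count:      " + deduped); // 1
    }
}
```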
Create a custom Java crawler plug-in
This section describes the custom crawler plug-in and shows how to use the
contextual views of IBM Watson Explorer Content Analytics Miner to obtain
statistically correct insights.
Crawler plug-ins for IBM Watson Explorer Content Analytics are divided into two
categories based on data sources:
| Type A data sources | Type B data sources |
| --- | --- |
| Content Integrator | SharePoint crawler |
| Content Manager | |
| DB2 | |
| JDBC database | |
| Notes | |
| Quickr for Domino | |
| Seed list | |
| File system | |
In this procedure, I use a JDBC database, which is a Type A data source. For Type A
data sources, you can create a Java class to programmatically update the value of
security tokens, metadata, and the document content.
The following code is the Java plug-in. It is followed by a description of the
deployment process and an analysis of the code’s logic. PluginLogger logs the issues
related to the crawler.
Custom Java crawler plug-in

```java
package com.ibm.es.crawler.dbplugins;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData;
import com.ibm.es.crawler.plugin.CrawlerPluginException;
import com.ibm.es.crawler.plugin.FieldMetadata;
import com.ibm.es.crawler.plugin.logging.PluginLogger;

public class TutorialDBCrawlerPlugin extends AbstractCrawlerPlugin {

    /** Logger: logs issues related to the crawler. */
    private static final PluginLogger logger;
    static {
        PluginLogger.init(PluginLogger.LOGTYPE_OSS, PluginLogger.LOGLEVEL_INFO);
        logger = PluginLogger.getInstance();
    }

    /** Index field names, as configured in the index field mapping. */
    private static final String INCIDENT_DESCRIPTION = "INCIDENT_DESCRIPTION";
    private static final String DIAGNOSIS_DESCRIPTION = "DIAGNOSIS_DESCRIPTION";
    private static final String RESOLUTION_DESC = "RESOLUTION_DESC";
    private static final String INVESTIGATION_CODE = "INVESTIGATION_CODE";
    private static final String RESOLUTION_CODE = "RESOLUTION_CODE";
    private static final String CLOSURE_CODE = "CLOSURE_CODE";

    public TutorialDBCrawlerPlugin() {
        super();
    }

    public void init() throws CrawlerPluginException {
    }

    public boolean isMetadataUsed() {
        return true;
    }

    public boolean isContentUsed() {
        return true;
    }

    public void term() throws CrawlerPluginException {
        return;
    }

    public CrawledData updateDocument(CrawledData crawledData) throws CrawlerPluginException {
        List<FieldMetadata> metadataList = crawledData.getMetadataList();
        if (metadataList == null) {
            logger.error("metadatalist is empty");
            return crawledData;
        }
        crawledData.setMetadataList(setIndexContext(metadataList));
        return crawledData;
    }

    public List<FieldMetadata> setIndexContext(List<FieldMetadata> metaList) {
        // Collapse the metadata list into a field-name -> value map for this row.
        Map<String, String> oneRow = metaList.stream().collect(
                Collectors.toMap(FieldMetadata::getFieldName, FieldMetadata::getValue));

        String closureCode = getLeafValue(oneRow.get(CLOSURE_CODE));
        String resolutionCode = getLeafValue(oneRow.get(RESOLUTION_CODE));
        String investigationCode = getLeafValue(oneRow.get(INVESTIGATION_CODE));
        logger.warn("Modified Map Details :" + oneRow.toString());

        List<FieldMetadata> returnMetaList = metaList;
        for (int i = 0; i < returnMetaList.size(); i++) {
            FieldMetadata field = returnMetaList.get(i);
            // Keep INCIDENT_DESCRIPTION only on the first row of its subtree:
            // the row whose investigation, resolution, and closure codes all
            // end in ".1". Blank it on every other (duplicate) row.
            if (field.getFieldName().contentEquals(INCIDENT_DESCRIPTION)
                    && !(closureCode.equals("1") && resolutionCode.equals("1")
                            && investigationCode.equals("1"))) {
                field.setValue("");
            }
            // DIAGNOSIS_DESCRIPTION repeats once per resolution/closure pair.
            if (field.getFieldName().contentEquals(DIAGNOSIS_DESCRIPTION)
                    && !(closureCode.equals("1") && resolutionCode.equals("1"))) {
                field.setValue("");
            }
            // RESOLUTION_DESC repeats once per closure row.
            if (field.getFieldName().contentEquals(RESOLUTION_DESC)
                    && !closureCode.equals("1")) {
                field.setValue("");
            }
        }
        return returnMetaList;
    }

    public String getLeafValue(String keyField) {
        // Split a hierarchical code such as "D1.1.2.1" on "." and return the leaf.
        String[] hierarchy = keyField.split("\\.");
        return hierarchy[hierarchy.length - 1];
    }
}
```
Custom crawler plug-in methods description
To create a Java class for use as a crawler plug-in with content-related functions
for Type A data sources, extend com.ibm.es.crawler.plugin.AbstractCrawlerPlugin and
implement the following methods:
- init()
- isMetadataUsed()
- isContentUsed()
- activate()
- deactivate()
- term()
- updateDocument()
The AbstractCrawlerPlugin class is an abstract class. The init, activate,
deactivate, and term methods are implemented to do nothing. The isMetadataUsed and
isContentUsed methods are implemented to return false by default. The
updateDocument method is an abstract method, so you must implement it.
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
When the crawler session starts, the plug-in process is forked. An
AbstractCrawlerPlugin object is instantiated with the default constructor, and the
init, isMetadataUsed, and isContentUsed methods are called one time. During the
crawler session, the activate method is called when the crawler starts its crawling
and the deactivate method is called when the crawler finishes its crawling. When
the crawler session ends, the term method is called and the object is destroyed. If
the crawler scheduler is enabled, the activate method is called when the crawling
is scheduled to start and the deactivate method is called when the crawling is
scheduled to end. Because a single crawler session runs continuously when the
crawler scheduler is enabled, the term method is not called to destroy the object.
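To observe this lifecycle during development, you could override the hooks with
simple logging. This is a minimal sketch of my own, assuming activate() and
deactivate() share the no-argument, CrawlerPluginException-throwing shape of init()
and term(); I use logger.warn because warn and error are the only PluginLogger calls
that appear in the tutorial's code:

```java
package com.ibm.es.crawler.dbplugins;

import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData;
import com.ibm.es.crawler.plugin.CrawlerPluginException;
import com.ibm.es.crawler.plugin.logging.PluginLogger;

public class LifecycleLoggingPlugin extends AbstractCrawlerPlugin {

    private static final PluginLogger logger;
    static {
        PluginLogger.init(PluginLogger.LOGTYPE_OSS, PluginLogger.LOGLEVEL_INFO);
        logger = PluginLogger.getInstance();
    }

    public void init() throws CrawlerPluginException {
        logger.warn("init: called once when the plug-in process starts");
    }

    // Assumed signature: called when crawling starts (and on each scheduled start)
    public void activate() throws CrawlerPluginException {
        logger.warn("activate: crawling is starting");
    }

    // Assumed signature: called when crawling finishes (and on each scheduled end)
    public void deactivate() throws CrawlerPluginException {
        logger.warn("deactivate: crawling has finished");
    }

    public void term() throws CrawlerPluginException {
        logger.warn("term: session ended; object about to be destroyed");
    }

    public boolean isMetadataUsed() { return false; }

    public boolean isContentUsed() { return false; }

    public CrawledData updateDocument(CrawledData crawledData) throws CrawlerPluginException {
        return crawledData; // pass documents through unchanged
    }
}
```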
Custom crawler plug-in compilation and deployment
- Compile the implemented code and make a JAR file for it. Add the
  ES_INSTALL_ROOT/lib/dscrawler.jar file to the class path when you compile.
- In the administration console, follow these steps:
  - Edit the appropriate collection.
  - Select the Crawl page and edit the crawler properties for the crawler that
    will use the custom Java class.
  - Specify the following items:
    - The fully qualified class name of the implemented Java class: for example,
      com.ibm.es.crawler.dbplugins.TutorialDBCrawlerPlugin.
    - The name of the JAR file in your path declaration: for example,
      F:\DevW\Tutorial\Plugins\DBCrawlerPlugin\TutorialDBCrawlerPlugin.jar.
- On the Crawl page, click Monitor. Then click Stop and Start to restart the
  session for the crawler that you edited. Click Details and start a full crawl.
Other than the methods extended from the AbstractCrawlerPlugin abstract class, two
key methods have been implemented: getLeafValue() and setIndexContext(). These
methods eliminate the duplicates from the collection index.
- The getLeafValue() method is called from setIndexContext() with a key field
  value and returns the value of the leaf, assuming “.” is the separator. For
  example, if D1.1.1.1.1 is passed as an argument, 1 is returned. (A small
  stand-alone check of the leaf extraction follows this list.)
- The setIndexContext() method is called from the updateDocument() method with an
  argument of a list of FieldMetadata (com.ibm.es.crawler.plugin.FieldMetadata).
  This method works on the FieldMetadata list based on the given logic and returns
  the de-duplicated FieldMetadata, which is eventually set into the crawled data
  and used by the crawler process.
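As a quick sanity check of the leaf extraction, the following stand-alone snippet
(my own, mirroring getLeafValue()) shows why the split pattern must be the escaped
"\\." in Java: String.split() takes a regular expression, and an unescaped "."
would match any character.

```java
public class LeafValueCheck {
    // Mirrors the plug-in's getLeafValue(): last "."-separated segment
    static String getLeafValue(String keyField) {
        String[] hierarchy = keyField.split("\\.");
        return hierarchy[hierarchy.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(getLeafValue("D1.1.1.1.1")); // prints: 1
        System.out.println(getLeafValue("D1.2"));       // prints: 2
        System.out.println(getLeafValue("D1"));         // prints: D1
    }
}
```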
After the code is ready, it can be deployed as explained in the procedure in this
section (steps 1 through 3). For more information, see “Creating a crawler plug-in
for Type A data sources” in the IBM Knowledge Center.
Contextual views
A unique feature of IBM Watson Explorer is the contextual view. You can explore the
data’s contents in a specific context by grouping analyzable fields. For example,
if the data is configured to analyze Incident Description and Resolution
Description grouped together or as individual contextual views, users can limit the
results to the parts of documents that match those contexts.
Contextual view creation
- Create a contextual view in the administrative console by first selecting
  Configure parse and index settings.

  Figure 8. Configure parse and index setting, Contextual view option
- Select Contextual views and then create the contextual views for textual
  content based on the business requirements. The contextual views created for
  this example are listed in the following table.

  | Context fields | Contextual view name |
  | --- | --- |
  | INCIDENT_DESCRIPTION | Incident Details |
  | DIAGNOSIS_DESCRIPTION | Investigation |
  | INCIDENT_DESCRIPTION & DIAGNOSIS_DESCRIPTION | Incident & Investigation |
  | RESOLUTION_DESC | Resolution |
  | CLOSING_COMMENT | Final Disposition |
  | RESOLUTION_DESC & CLOSING_COMMENT | Resolution & Final Disposition |

  Figure 9 shows the list of contextual views created for this tutorial as given
  in the previous table.

  Figure 9. Contextual view details
Business users can now derive insights based on the context of the hierarchical
data and the facet hierarchy, as described in the following section.
Contextual analysis using IBM Watson Explorer Content Analytics Miner
The following outputs from IBM Watson Explorer Content Analytics Miner are
representative samples of defect-type occurrences based on the context selected by
the user; within each context, they show the exact frequency of the occurrences.
Incident Details
Figure 10 shows the IBM Content Analytics Miner output when the Incident Details
contextual view is selected. The figure shows two occurrences of the keyword
“fouling” in the unstructured field INCIDENT_DESCRIPTION, and similar counts for
the other keywords. The output can also be viewed for the other configured
contextual views.
Figure 10. Incident details contextual view in IBM Content Analytics Miner
Resolution & Final Disposition
In Figure 11, you see the “fouling” keyword in 24 fields, including the
RESOLUTION_DESC and CLOSING_COMMENT fields. Because CLOSING_COMMENT is at the leaf
level of the hierarchy, it is unique for every row, so its occurrence counts are
higher.
Figure 11. Resolution & Final Disposition contextual view in IBM Content Analytics Miner
Use case analysis
Now let’s consider a use case scenario where business users want to analyze the
occurrences of a keyword and find its relationship to the facets derived from the
dimensions at different levels of the hierarchy.
Assume that you see that the “fouling” keyword has unusually high occurrence
compared with other defect types in the context of the Resolution & Final
Disposition contextual view. You want to investigate the issue in more detail and
analyze the relationship of the “reworking” and “fouling” components referred to in
those incidents. Here are the steps you would follow.
- Add the “fouling” keyword to the query, which refines the analysis to 24 results
  out of 150 within the context of Resolution & Final Disposition:

  ```
  ((*:*) AND (keyword::/"Explore"/"Defect Type(s)"/"fouling")) IN resolution_final_disposition
  ```
- Out of all the incidents, incidents D2 and D1 have the most occurrences of the
  “fouling” keyword in Resolution and Closing Comment. The user adds these two
  incident codes to the query, which narrows the results further to 16.

  Figure 12. Frequency of incident based on the criteria

  ```
  (((*:*) AND (keyword::/"Explore"/"Defect Type(s)"/"fouling")) IN resolution_final_disposition) AND (keyword::/"Facet Hierarchy"/"Incident Header"/"Incident Code"/"D2" OR keyword::/"Facet Hierarchy"/"Incident Header"/"Incident Code"/"D1")
  ```
- Now the free-form text “rework and use” is added to the query, which further
  refines the scope and retrieves five specific records. (Instead of free-form
  text, it would be possible to use phrase annotation to retrieve the different
  writing patterns: for example, “rework and reinspect”, “rwrk & use.”, and
  “rework & re-use”.) Here is the final query syntax:

  ```
  ((("rework and use") AND (keyword::/"Explore"/"Defect Type(s)"/"fouling")) IN resolution_final_disposition) AND (keyword::/"Facet Hierarchy"/"Incident Header"/"Incident Code"/"D2" OR keyword::/"Facet Hierarchy"/"Incident Header"/"Incident Code"/"D1")
  ```
Now you can find the trend by choosing different facets from the facet tree. Figure
13 shows the trend charts in the IBM Watson Explorer Content Analytics Miner
dashboard for Affected Product vs. Affected Part(s), based on the text analytics
query derived above.
Figure 13. Final output in IBM Watson Content Analytics Miner dashboard
You can see that out of six different parts affected in incidents D1 and D2, part-04,
part-13, and part-01 are affected in this use case scenario.
Conclusion
This tutorial has described a strategy for analyzing a hierarchical data structure
in IBM Watson Explorer within the context of content analytics while retaining the
hierarchical structure, or grains, of the dimensional model. The dataset used in
this tutorial is representative of real-world scenarios, but it is not very large
or complex. In an actual business scenario, this kind of analysis would give you
the combined power of both content analytics and dimensional analysis, and would
provide you with machine-assisted decision making, automation, and business
optimization.
Such scenarios are common in the following domains:
- Incident management
- Case management
- Content management
- Quality management system
- Enterprise information management
The concepts described here can be implemented for analyzing data inconsistency in
a source system, automated data profiling, big data analytics, data governance, and
data ontology creation.
Using these concepts, you can overcome the problem of data multiplicity and analyze
the unstructured text content at each hierarchical level (separately or together).
You can thereby derive insights that can be made accessible to business users.