The power of a big data platform: Purpose-built for data security requirements
Note from the editor: This article reflects changes resulting from the announcement that the SonarG solution from jSonar is now available from IBM as part of the IBM Guardium product portfolio under the name Guardium Big Data Intelligence.
Organizations that use IBM Security Guardium activity monitoring for data security and compliance struggle with the quantity of collected audit data, especially if they have 10 or more Guardium collectors. IBM
Security Guardium Big Data Intelligence provides the power of a big data
platform that is purpose-built for data security requirements. It helps
augment existing Guardium deployments with the ability to quickly create
an optimized security data lake that can retain large quantities of
historical data over long time horizons.
This article describes the Guardium Big Data Intelligence solution, how it
fits into a Guardium enterprise deployment, and how to integrate the Big
Data Intelligence solution with the Guardium data and file activity
monitoring solution.
How much data does
Guardium’s activity monitoring generate?
IBM Security Guardium provides a comprehensive data protection solution to
help secure data and files. Capabilities include vulnerability
assessments, activity monitoring, and at-rest encryption. Guardium
actively monitors all access to sensitive data by using a host-based
probe. Data events are then collected and processed at the Guardium
collector (see Figure 1).
Figure 1: S-TAP-to-collector data flow
Because so much data is being sent at high velocity to a collector, the
solution scales horizontally to allow many collectors to handle the
workload. In a typical Guardium deployment, there is a hierarchy of
components as shown in Figure 2. S-TAPs intercept database activity and send it to collectors. Collectors analyze the activity and log it to local storage on the collector. Collectors then dump the data and copy it to aggregators, where the data is merged for reporting across the enterprise.
Finally, a Central Manager manages the entire federated system.
Figure 2: Appliances in an enterprise Guardium
deployment
Guardium collects a vast number of data-related events such as session
data, query data, exceptions, authorization failures, and other
violations. A typical Guardium system will capture many billions of
complex records. Unlike other security events, database activity includes
complex data, such as objects that are being operated on, what
is being done, the context of the operation within the session, and much
more. Because of the quantity and velocity of data, Guardium is a prime
candidate to take advantage of big data technologies.
Guardium Big Data Intelligence
As shown in Figure 3, Guardium Big Data Intelligence is a big data solution
for consolidating, storing, and managing all of the complex data that is
captured by Guardium activity monitoring, and also providing the means to
enrich that data for expanded context. By doing so, it optimizes the value
and capabilities of Guardium activity monitoring and paves the way to new
functionality and benefits. It takes advantage of big data technologies to
streamline the collection and analysis of the large, growing pools of
Guardium activity-monitoring data.
Figure 3: Guardium Big Data Intelligence provides
a big data solution for data security insights
Guardium Big Data Intelligence augments the activity monitoring platform in
four primary areas:
- By simplifying data collection and reducing the cost and time of aggregation and reporting.
- By enabling long-term, online retention of Guardium data in a low-cost and efficient manner, which lets organizations easily and rapidly interact with several years' worth of data. As an example, the following extract from an internal log shows a complex query that found 19 matches in a data set of 27 billion full SQLs (representing data from 100 collectors over a period of 12 months). While matching both session-level information (such as client IPs and DB user names) and a regular expression on the full SQL itself, this query took less than 5 minutes.

Listing 1: Sample data from a Guardium Big Data Intelligence log

command gdm3.$cmd command: {"aggregate":"full_sql","pipeline":[
  {"$match":{"Full Sql":/GUARD_TABLE73/i,
    "Timestamp":{"$gte":ISODate('2014-01-01T05:00:00.000Z'),
                 "$lte":ISODate('2014-01-10T05:00:00.000Z')}}},
  {"$join":{"$joined":"session","$as":"session",
    "$match":{"Session Id":"$session.$Session Id"},
    "$project":{"Client IP":"$session.$Analyzed Client IP",
      "Server IP":"$session.$Server IP",
      "Server Type":"$session.$Server Type",
      "Service Name":"$session.$Service Name",
      "Source Program":"$session.$Source Program",
      "DB User":"$session.$DB User...
ntoreturn:0 keyUpdates:0 locks(micros) w:0 reslen:130 Execution Time = 245501ms

- By creating faster and broader access to valuable activity data through self-service access for different groups in the organization. This self-service capability reduces the need for Guardium administrators to be involved in every report that the business requires and empowers multiple stakeholders to better leverage the high-value data that Guardium captures.
- By enabling high-performance analytics across expanded data sets. For example, Figure 4 shows a graph of daily session throughput for a period of 6 months from a large collector environment, and Figure 5 shows analysis of all exceptions over the same period, clearly highlighting several intervals of high exception activity. Because the analytics run on all collector data over a long period, they are very precise in identifying outliers; Figure 5, for example, visualizes unusual user connections over time via the Outliers model.
Guardium Big Data Intelligence
delivers increased functionality, fast reporting, and advanced
analytics. At the same time, it simplifies Guardium deployment and
reduces the total cost of the solution through hardware reduction and
operational efficiencies.
Figure 4: Session throughput for a period of 6
months

Figure 5: Exception analytics for a period of 6
months
From an architectural perspective, Guardium Big Data Intelligence
integrates with the traditional Guardium architecture as an alternative to
the aggregation tier. Collectors are configured to create hourly data
extracts of various activities/data by using the Guardium data mart
mechanism. These data mart extracts are copied by the collectors to the
Big Data Intelligence solution over SCP where they are consumed and merged
into a single big data store of all Guardium data, even when the extracts come from hundreds of collectors. We'll cover that in more detail in the
next section.
Impact of Big Data Intelligence
on a Guardium enterprise deployment
As mentioned previously, Guardium enterprise deployments use collectors and
aggregators. Collectors get feeds from S-TAPs (the host-based probes) and
store the data locally first. Every 24 hours each collector will then move
its daily data to an aggregator where data is aggregated, as shown in
Figure 6. When enterprise-level queries are needed, organizations either use distributed reports that run across these aggregators (by using a reporting server) or, in older deployments, use a second tier of aggregation. A typical enterprise Guardium deployment has more than one aggregator, since the recommended ratio is one aggregator for every 8 to 10 collectors.
Figure 6: Guardium infrastructure
architecture
With a Guardium Big Data Intelligence architecture, collectors communicate
directly to the big data lake, as shown in Figure 7. This communication
greatly simplifies the data collection mechanics and facilitates much more
efficient collection of larger data sets while also using less hardware
infrastructure.
Figure 7: Guardium Big Data Intelligence can
simplify a Guardium Deployment
The collectors push data to the Guardium Big Data Intelligence data lake on
an hourly basis where the data is merged with previous data, avoiding the
24-48 hour data lag that is common in previous architectures.
Guardium Big Data Intelligence can even consolidate activity data across
multiple Central Manager domains. This is especially important for larger
enterprise deployments that typically use multiple Central Managers to
consolidate data, since it provides a true enterprise-wide view of activity
data.
The interface between Guardium collectors and the Big Data Intelligence
warehouse is based on the Guardium data mart mechanism. Data marts were
introduced in Guardium V9 and are a powerful mechanism for enabling query
reports to be generated on an ongoing basis without the need for
aggregation processes and with very low collector processor usage. Data
marts also provide the most efficient way to extract data from Guardium.
Guardium includes several data marts to feed the Big Data Intelligence data lake. These data marts run on an hourly schedule: scheduled jobs generate extract files and copy them to the data lake over an encrypted channel (by using SCP). The files are then processed by an ETL process and put into a form that makes reporting and analytics fast and simple, as shown in Figure 8.
Figure 8: Guardium collectors generate data that
is processed by ETL
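To make the transfer mechanics concrete, the following is a minimal, hypothetical sketch of what each hourly push amounts to. The user, host placeholder, and landing path are taken from the grdapi examples later in this article; the file naming scheme and the extract-generation step are assumptions, since the appliance performs the real extract and copy internally once the data marts are configured.

#!/bin/bash
# Illustrative sketch only: approximates the hourly data mart push that a
# Guardium collector performs internally once the integration is configured.
DEST_USER="sonargd"                        # landing user from the grdapi examples
DEST_HOST="yourhosthere"                   # Big Data Intelligence node (placeholder)
DEST_PATH="/local/raid0/sonargd/incoming"  # landing directory from the grdapi examples

STAMP=$(date +%Y%m%d%H)                    # hour being exported
EXTRACT="/tmp/EXP_session_log_${STAMP}.csv.gz"   # hypothetical extract file name

# Stand-in for the data mart job that generates the hourly extract file.
gzip -c session_log_extract.csv > "$EXTRACT"

# Copy the extract over an encrypted channel (SCP), as the collector does.
scp "$EXTRACT" "${DEST_USER}@${DEST_HOST}:${DEST_PATH}/"
rm -f "$EXTRACT"   # the data lake's ETL consumes files from the incoming directory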
Advantages of the big data architecture with the Guardium Big Data Intelligence solution
From an infrastructure perspective, there are many advantages to this
approach, including the following potential benefits:
- Reduced hardware and operational costs of aggregation.
- Reduced latency of data consolidation, down from 24 hours to 1 hour.
- Reduced collector storage footprint by purging data more frequently from the collectors.
Importantly, both lab testing and production deployments indicate that reports run anywhere from 10 to 100 times faster on the big data platform than on aggregators, making the system even more valuable for
both compliance reporting and security analytics. The system uses various
compression techniques for keeping the cost of long-term retention down
while it also keeps data in a form that allows reporting over extended
periods of time without having to restore archive files.
Guardium Big Data Intelligence
Requirements
Guardium Big Data Intelligence requires Linux® to run and is typically
installed on top of the organization’s standard Linux build. It can run on
various Linux flavors but is most often installed over Red Hat Enterprise
Linux.
A typical Guardium Big Data Intelligence node has 2 CPUs, 64-128 GB of memory (depending on load and concurrency requirements), and 2-30 TB of disk (depending on the number of collectors, retention needs, and Guardium policy). The system can be physical or virtual. Virtual systems are easier to deploy, but physical systems with local disks tend to be less expensive due to access to lower-cost local storage.
Guardium Big Data Intelligence allows the use of inexpensive commodity spinning disks and is specifically optimized to leverage new
cloud and object store technologies. These allow enterprises to easily
construct large scale security data lakes at a fraction of the cost of
traditional enterprise storage.
Data marts: Integrating the Big Data
Intelligence data lake with Guardium collectors and Central Managers
Integration between Guardium activity collection and the big data lake
relies on the use of data marts on both Guardium collectors and Central
Managers in order to organize and package the appropriate data to be
exported on a recurring basis from Guardium into the Big Data Intelligence
data lake.
These data-extract files are typically pushed hourly, although the frequency varies depending on the specific data set. For example, operational data such as
S-TAP health is published every 5 minutes in order to further reduce the
latency of information and improve the ability to respond to issues as
they arise. Classification and Vulnerability Assessment (VA) results are
published on a daily schedule.
The data mart push mechanism is highly resilient. If data is not pushed on
time due to a network outage or for any other reason, it is pushed during
a future cycle. All data is counted and cross-referenced by using an
additional, independent data-mart extract that allows Guardium Big Data
Intelligence to validate all data transfers and confirm completeness of
the data transfer. If data does not arrive or there is an inconsistency,
the operator is immediately alerted with details.
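Conceptually, this completeness check can be pictured as reconciling declared record counts against ingested record counts per extract. The sketch below is an illustration of the idea only, not the product's actual mechanism; the two input files and their layout (extract name followed by a record count on each line) are invented for the example.

# Conceptual illustration only -- not the actual validation code.
# declared.txt: "<extract-name> <record-count>" per the extraction-log DM (assumed layout)
# ingested.txt: "<extract-name> <record-count>" actually loaded into the lake (assumed layout)
join <(sort declared.txt) <(sort ingested.txt) |
while read extract declared ingested; do
    if [ "$declared" -ne "$ingested" ]; then
        echo "ALERT: $extract incomplete ($ingested of $declared records)"
    fi
done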
The data marts that are used in the integration are associated with the types of data shown in Table 1. These data marts are available for Guardium V9 and V10, but not all are available in both versions. For example, outlier data looks different in V9 versus V10.1.2, so the versions use different data marts: use DMs 49 and 50 for V10.1.2 and up, and DMs 27 and 28 for versions preceding V10.1.2 (but do not use both pairs). The DMs available as of the end of 2016 are shown in Table 1.
Important: The precise set of DMs are typically chosen based on the
implementation scope and the Guardium versions.
Table 1: Available data marts for Guardium Big
Data Intelligence
| Data mart name | Report title | Unit type | Guardium version | Data mart ID |
|---|---|---|---|---|
| Export: Access log | Export: Access log | Collector | All | 22 |
| Export: Session log | Export: Session log | Collector | All | 23 |
| Export: Session log ended | Export: Session log | Collector | All | 24 |
| Export: Exception log | Export: Exception log | Any | All | 25 |
| Export: Full SQL | Export: Full SQL | Collector | All | 26 |
| Export: Outliers list | Analytic outliers list | Any | Versions preceding V10.1.2 | 27 |
| Export: Outliers summary by hour | Analytic outliers summary by date | Any | Versions preceding V10.1.2 | 28 |
| Export: Export extraction log | User-defined extraction log | Any | All | 31 |
| Export: Group members | Export: Group members | Any | All | 29 |
| Export: Policy violations | Export: Policy violations | Collector | All | 32 |
| Export: Buff usage monitor | Buff usage monitor | Any | All | 33 |
| Export: VA results | Security assessment export | Any | All | 34 |
| Export: Policy violations – detailed | Export: Policy violations | Collector | All | 38 |
| Export: Access log – detailed | Export: Access log | Collector | All | 39 |
| Export: Discovered instances | Discovered instances | Any | All | 40 |
| Export: Databases discovered | Databases discovered | Any | All | 41 |
| Export: Classifier results | Classifier results | Any | All | 42 |
| Export: Datasources | Data-sources | Central Manager, stand-alone | All | 43 |
| Export: S-TAP status | S-TAP status monitor | Collector | All | 44 |
| Export: Installed patches | Installed patches | Any | All | 45 |
| Export: System info | Installed patches | Any | All | 46 |
| Export: User – role | User – role | Central Manager, stand-alone | All | 47 |
| Export: Classification process log | Classification process log | Any | All | 48 |
| Export: Outliers list – enhanced | Analytic outliers list – enhanced | Any | V10.1.2 and up | 49 |
| Export: Outliers summary by hour – enhanced | Analytic outliers summary by date – enhanced | Any | V10.1.2 and up | 50 |
These data marts are usually bundled into the latest GPUs but are also provided as separate individual patches for customers that have not yet applied the GPUs. Depending on your patch level, install the appropriate patches:
- V9:
- V10:
  - V10.1 (p120): p172, p174, and p175
  - V10.1.2 (p200): p175
- Releases above V10.1.2 do not have specific dependencies as of the publication of this article. Check for the appropriate prerequisites for the release of Guardium Big Data Intelligence that you use.
Configuring the
Guardium appliances
There are three primary steps that you execute on the Guardium appliances to enable integration with the Big Data Intelligence solution:
- Ensure that you are on the right patch level.
- Enable and schedule data-mart extraction by using the GuardAPI (grdapi) commands that are described in the following section.
- Adjust purge schedules on collectors to reduce the storage footprint (optional).
Enabling and scheduling the various data-mart extracts also involves three primary steps, which are described below:
- Enable the appropriate DMs and point their output to the Guardium Big Data Intelligence system.
- Schedule the extract.
- Determine the extract data start date (optional).
The following sample grdapi command string enables session data to be passed to the Guardium Big Data Intelligence solution. This grdapi command tells the Guardium collectors where to copy the data mart data via an SCP process. Configuring any of the other data marts simply requires that the Name field change.
grdapi datamart_update_copy_file_info destinationHost="yourhosthere" destinationPassword="yourpwdhere" destinationPath="/local/raid0/sonargd/incoming" destinationUser="sonargd" Name="Export:Session Log" transferMethod="SCP"
Executing the data mart configuration commands is done only once per CM,
since all collectors can then receive this information from the CM.
Replace the hostname, password, and data path to reflect the details for
your Guardium Big Data installation.
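Because only the Name field changes between data marts, the configuration commands can be generated mechanically. The following sketch prints one command per data mart for pasting into a CLI session (or for feeding into an automation tool such as SonarCLI, covered later). The Name values shown are assumptions; confirm the exact strings against the data mart names on your appliance.

# Emit one datamart_update_copy_file_info command per data mart.
# Host, password, and path are placeholders, as in the example above.
for DM in "Export:Session Log" "Export:Full SQL" "Export:Exception Log"; do
    echo "grdapi datamart_update_copy_file_info destinationHost=\"yourhosthere\" destinationPassword=\"yourpwdhere\" destinationPath=\"/local/raid0/sonargd/incoming\" destinationUser=\"sonargd\" Name=\"$DM\" transferMethod=\"SCP\""
done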
Once the data mart is enabled, you need to schedule the extracts. Since Guardium schedulers are local to the appliance, you need to run the grdapi scheduling command on each appliance from which data is extracted. For example, for each collector that needs to send session data you would run:
grdapi schedule_job jobType=dataMartExtraction cronString="0 45 0/1 ? * 1,2,3,4,5,6,7" objectName="Export:Session Log"
Here is an example of how to delete the schedule for the session data mart:
grdapi delete_schedule deleteJob=true jobGroup="DataMartExtractionJobGroup" jobName="DataMartExtractionJob_23"
The job name is "DataMartExtractionJob_"
concatenated with the
ID shown in Table 1.
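For example, because the Full SQL data mart has ID 26 in Table 1, deleting its schedule would look like this:

grdapi delete_schedule deleteJob=true jobGroup="DataMartExtractionJobGroup" jobName="DataMartExtractionJob_26"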
Because there are many grdapi calls to issue, use the SonarCLI Expect script to automate the process and reduce the work (see "Use SonarCLI to automate data mart setup" later in this article).
The recommended schedules when you enable all data marts are shown in Table
2.
Table 2: Recommended data mart
schedules
| Data mart | Cron string | Effective schedule |
|---|---|---|
| Export: Access log | 0 40 0/1 ? * 1,2,3,4,5,6,7 | 00:40 |
| Export: Session log | 0 45 0/1 ? * 1,2,3,4,5,6,7 | 00:45 |
| Export: Session log ended | 0 46 0/1 ? * 1,2,3,4,5,6,7 | 00:46 |
| Export: Exception log | 0 25 0/1 ? * 1,2,3,4,5,6,7 | 00:25 |
| Export: Full SQL | 0 30 0/1 ? * 1,2,3,4,5,6,7 | 00:30 |
| Export: Outliers list | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
| Export: Outliers summary by hour | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
| Export: Export extraction log | 0 50 0/1 ? * 1,2,3,4,5,6,7 | 00:50 |
| Export: Group members | 0 15 0/1 ? * 1,2,3,4,5,6,7 | 00:15 |
| Export: Policy violations | 0 5 0/1 ? * 1,2,3,4,5,6,7 | 00:05 |
| Export: Buff usage monitor | 0 12 0/1 ? * 1,2,3,4,5,6,7 | 00:12 |
| Export: VA results | 0 0 2 ? * 1,2,3,4,5,6,7 | Daily at 2 AM |
| Export: Policy violations – detailed | 0 5 0/1 ? * 1,2,3,4,5,6,7 | 00:05 |
| Export: Access log – detailed | 0 40 0/1 ? * 1,2,3,4,5,6,7 | 00:40 |
| Export: Discovered instances | 0 20 0/1 ? * 1,2,3,4,5,6,7 | 00:20 |
| Export: Databases discovered | 0 20 0/1 ? * 1,2,3,4,5,6,7 | 00:20 |
| Export: Classifier results | 0 20 0/1 ? * 1,2,3,4,5,6,7 | 00:20 |
| Export: Data sources | 0 0 7 ? * 1,2,3,4,5,6,7 | Daily at 7 AM |
| Export: S-TAP status | 0 0/5 0/1 ? * 1,2,3,4,5,6,7 | Every 5 minutes |
| Export: Installed patches | 0 0 5 ? * 1,2,3,4,5,6,7 | Daily at 5 AM |
| Export: System info | 0 0 5 ? * 1,2,3,4,5,6,7 | Daily at 5 AM |
| Export: User – role | 0 5 0/1 ? * 1,2,3,4,5,6,7 | 00:05 |
| Export: Classification process log | 0 25 0/1 ? * 1,2,3,4,5,6,7 | 00:25 |
| Export: Outliers list – enhanced | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
| Export: Outliers summary by hour – enhanced | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
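These cron strings follow a Quartz-style format: six fields covering seconds, minutes, hours, day of month, month, and day of week. As a reading aid (the breakdown below is an annotation, not additional configuration), the hourly session log entry decodes as follows:

# "0 45 0/1 ? * 1,2,3,4,5,6,7" field by field:
#   0             -> at second 0
#   45            -> at minute 45
#   0/1           -> every hour, starting at hour 0
#   ?             -> no specific day of month
#   *             -> every month
#   1,2,3,4,5,6,7 -> every day of the week
# Net effect: once per hour at hh:45 every day -- the "00:45" schedule
# shown for Export: Session log.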
In most cases when you schedule a data mart, it starts to export data "from now" into the future. If you already have data on the collector (for example, from the past 10 days) and you want that data moved to the big data lake as well, you can set the start date for the data mart to a date in the past, as shown in Figure 9. Edit the data mart in the CM GUI and set the desired start date before you issue the grdapi schedule commands. If you have GPU 200 (V10.1.2 p200) or later, you can set the start date by using a grdapi command instead of the GUI:
grdapi update_datamart Name="Export:User - Role" initial_start="2016-12-01 00:00:00"
Figure 9: Enabling a start date in the
past
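Putting the pieces together, a hypothetical backfill of the session log data mart from December 1 would first set the start date and then schedule the hourly extract (the date and data mart name here are illustrative):

grdapi update_datamart Name="Export:Session Log" initial_start="2016-12-01 00:00:00"
grdapi schedule_job jobType=dataMartExtraction cronString="0 45 0/1 ? * 1,2,3,4,5,6,7" objectName="Export:Session Log"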
Optional:
Reduce collector storage footprint
Once you enable the integration between Guardium and the Guardium Big Data Intelligence solution, data is moved off the collector more frequently than with aggregation: hourly versus daily. This creates the opportunity to purge the collectors more aggressively and reduce the collector storage footprint. For example, rather than allocate 300 GB or 600 GB per collector, you can allocate 100 GB per collector.
Note: This can be done only if you adjust your retention per collector appropriately (that is, keep less data on each collector).
The simplest path to this storage reduction is to build new collector VMs with the smaller storage footprint and ensure that the purge schedule is defined to keep only three days of data on each new collector. Redirect the S-TAPs to point at the new collectors, which in turn point to the Big Data Intelligence system. After a one-day period in which both the old and new collectors point to the Big Data Intelligence system concurrently, the old collectors can be backed up and decommissioned, at which point the transition to the new collectors is complete. This method can also be used to simplify and accelerate Guardium upgrades, since you do not have to worry about data management on the collectors.
Use SonarCLI
to automate data mart setup
SonarCLI is a utility that combines a customer-provided list of Guardium appliances with a set of predefined data marts and then communicates with all Guardium appliances to execute and validate the grdapi commands necessary to establish this communication (see Figure 10). Script execution takes minutes, and once it completes, the big data lake begins receiving data mart data. Note that SonarCLI is a general-purpose script execution framework and can also be used to automate grdapi executions that are unrelated to the big data solution.
Figure 10: SonarCLI scripting
To use SonarCLI, you set up a configuration file that tells the system which scripts to run on collectors and CMs. The utility then opens a CLI session per appliance, runs the appropriate script as defined by the config file, stores all output in a log file, and creates a summary log. Once it finishes, you review the summary to confirm that everything ran to completion.
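SonarCLI itself is an Expect-based utility, but the pattern it automates can be pictured as a simple loop: for each appliance in the list, open a CLI session, run the configured commands, and log the results. The following simplified bash sketch is a hypothetical equivalent, not the actual tool; the file names, the cli user, and non-interactive authentication are assumptions.

# Hypothetical sketch of the SonarCLI pattern -- not the actual utility.
# appliances.txt: one Guardium appliance host name per line (assumed format)
# commands.txt:   the grdapi commands to run on each appliance (assumed format)
while read HOST; do
    echo "=== $HOST ===" >> summary.log
    # A real implementation (like SonarCLI's Expect script) must also drive
    # the interactive CLI login; this sketch assumes key-based access.
    ssh cli@"$HOST" < commands.txt >> "$HOST.log" 2>&1
    echo "$HOST exit status: $?" >> summary.log
done < appliances.txt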
For more information, visit ibm-gbdi.jsonar.com
Creating custom data
marts
In addition to the prebuilt data marts that are used by the data lake, you
can push additional data to the data lake from any report built in your
Guardium environment. Any report that is executed on the Guardium
collector or central manager can be converted into a data mart and its
results piped directly into the data lake by using the standard data
transfer process. Figures 11 and 12 show how to convert a query into a
data mart from within the query builder. As shown in Figure 11, click on
the Data Mart button for the query that you want to use for the data mart.
Figure 11: Converting a query to a data
mart
Figure 12 shows the definition of a file name with the prefix EXP for a data mart that is created on an hourly basis. The EXP prefix informs the appliance that this data mart is being created for delivery to the Big Data Intelligence application. The data mart name must begin with EXPORT, and the EXP prefix must appear at the start of the file name in order for the transfer to the Big Data Intelligence solution to complete successfully.
Figure 12: Scheduling data mart delivery from
Guardium to Big Data Intelligence
As with the standard data marts, grdapi commands must be executed to configure the SCP transfer of the file to the data lake and to schedule this transfer on an hourly basis. Define the SCP transfer configuration by using:
grdapi datamart_update_copy_file_info destinationHost="yourhosthere" destinationPassword="yourpwdhere" destinationPath="/local/raid0/sonargd/incoming" destinationUser="sonargd" Name="Export:GUARD_USER_ACTIVITY" transferMethod="SCP"
Schedule the extract/push by using:
grdapi schedule_job jobType=dataMartExtraction cronString="0 40 0/1 ? * 1,2,3,4,5,6,7" objectName="Export:GUARD_USER_ACTIVITY"
Conclusion
IBM Guardium Big Data Intelligence allows you to optimize your Guardium
environment by using a true big data solution for managing and accessing
Guardium data. When you use data marts, you can move data off Guardium appliances more efficiently than ever before, reduce your hardware footprint and costs, enable fast reporting and long-term online retention of data, and run advanced analytics. Augmenting Guardium with a purpose-built big data solution creates a powerful platform for expanding the use cases and benefits of IBM Security Guardium's data protection solutions.