Analyze crime data with Apache Spark and Hive ETL, Part 1
This content is part 1 of 2 in the series: Analyze crime data with Apache Spark and Hive ETL
http://www.ibm.com/developerworks/library/?series_title_by=analyze+crime+data+apache+hive
Stay tuned for additional content in this series.
This tutorial describes the analysis of United Kingdom crime data from inception to
final results. Follow along to learn about data download, data transformation,
loading into a distributed data warehouse, Apache Hive, and subsequent
analysis using Apache Spark. Part 1 of this series describes the
Extract, Transform, and Load (ETL) activities, and Part 2 explores the analyses performed.
Spark and Hive as alternatives to
traditional ETL tools
Many ETL tools exist, but they often require programmers to be familiar with
proprietary architectures and languages. Apache Hive is a distributed data
warehouse built on Hadoop that offers SQL-based tools to transform structured and
semi-structured data into a schema-based data warehouse. Both Hive
and Spark work with data within the Hadoop ecosystem and the Hadoop
Distributed File System (HDFS).
Spark and its SparkSQL component are well suited to fast transformations
of data within the data warehouse. Because SparkSQL performs far better than
MapReduce, this approach also provides a speed improvement over
traditional ETL tools.
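To give a flavor of what this looks like, the following is a minimal sketch of an in-warehouse transformation driven by SparkSQL. It assumes a Hive table such as the stop_and_search table defined later in this tutorial; the cleaned table name is illustrative only.

import org.apache.spark.sql.SparkSession;

public class SparkEtlSketch {
    public static void main(String[] args) {
        // Start a Spark session that can see the Hive metastore
        SparkSession spark = SparkSession.builder()
                .appName("UK crime ETL sketch")
                .enableHiveSupport()
                .getOrCreate();

        // A typical in-warehouse transformation: derive a cleaned table
        // directly from the raw Hive table with a single HiveQL statement,
        // dropping rows that have no usable coordinates.
        spark.sql("CREATE TABLE IF NOT EXISTS stop_and_search_clean AS "
                + "SELECT * FROM stop_and_search "
                + "WHERE latitude IS NOT NULL AND longitude IS NOT NULL");

        spark.stop();
    }
}

Because the transformation runs where the data already lives, there is no need to export the data to an external ETL engine and reload the results.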
Figure 1. U.K. crime data download
The U.K. crime database was chosen to illustrate the speed and ease of use of
Spark analytics with the Hive data warehouse. The program code and
scripts for this tutorial are on GitHub.
There is more than one method to retrieve data from the U.K. crime database.
The API download is suitable for analysis of local areas only. For this
tutorial, use the CSV file
download to analyze national data. This
also lets you take advantage of a nice Hive feature during data
load.
The figure below shows the data download from the U.K. crime dataset. I
illustrate the process by selecting just a couple of forces, but you may choose as
many as you wish. I selected the Metropolitan Police Service (London) and Cambridgeshire Constabulary, covering September 2014 to August 2017.
On the custom download tab:
- Select the date range of September 2014 through August 2017.
- Select the forces Cambridgeshire Constabulary and Metropolitan Police. You can choose as many forces as you'd like, but for illustration purposes in this tutorial, only two are used.
- Scroll down and select the Include stop and search checkbox.
- Click the Generate file option. It takes a few seconds to generate the file.
- When complete, click Download now to download the data file.
Repeat this process for the street crime checkbox and the outcomes
checkbox. This results in three files, containing the data for
street crimes, outcomes, and stop and search, respectively. The figure below shows the three dataset checkboxes available
for download.
Figure 2. U.K. crime data download
After the download completes, you can find the file in your Download
folder. Double-click the file to open the standard unzip utility on your
computer. Create a folder where you want the unzipped files to live and name
it UK-Crime.
The next step is cleaning the data and subsequently loading the data into
Hive. Create a folder and name it staging. This folder is for the cleaned
data to load into Apache Hive.
Data cleaning
The initial data download, when unzipped, results in a directory hierarchy
with files stored in numerous subdirectories. Hive has a useful
data load feature that ingests every file in a given directory when each
file has the same structure (for example, the same column types). However,
the data files provided by the U.K. Police use several file types
in a directory hierarchy, by month and year, and by county. Therefore, the
first step is to create a staging folder for each of the different file types.
To simplify this process, I created a Java™ utility program that
traverses the folder hierarchy of the unzipped data files and moves each
file of a given type into the staging area folder created for that file
type. The program copes with minor changes in the directory hierarchy that
the U.K. police have introduced over time. All files in a staging area
folder load into the Apache Hive warehouse in one step.
import java.io.File;

public class MoveFilesUtility {

    public static void main(String[] args) {
        // Point this at the unzipped folder for one file type, for example:
        // String mypath = "/home/hduser/zdata/uk-crime/crime-stop-search";
        String mypath = "/home/hduser/zdata/uk-crime/crime-street-crime";
        MoveFilesUtility o1 = new MoveFilesUtility();
        o1.listFilesAndFilesSubDirectories(mypath);
    }

    /**
     * Recursively walk the directory hierarchy and move every file found
     * into the staging area folder used for the Hive data load.
     * @param directoryName root of the unzipped data files
     */
    public void listFilesAndFilesSubDirectories(String directoryName) {
        String dest_dir = "/home/hduser/zdata/uk-crime/spark-lake/staging/";
        File directory = new File(directoryName);
        // Get all the files and subdirectories from the directory
        File[] fList = directory.listFiles();
        if (fList == null) {
            return;
        }
        for (File file : fList) {
            if (file.isFile()) {
                System.out.println(file.getAbsolutePath());
                moveUkCrimeFilesToDir(dest_dir, file.getAbsolutePath());
            } else if (file.isDirectory()) {
                listFilesAndFilesSubDirectories(file.getAbsolutePath());
            }
        }
    }

    /**
     * Move a U.K. crime file into the staging directory (for the Hive data load).
     * @param directoryNameDestination the staging area folder
     * @param mypath the full path of the file to move
     */
    public void moveUkCrimeFilesToDir(String directoryNameDestination, String mypath) {
        try {
            File afile = new File(mypath);
            if (!afile.isFile()) {
                return;
            }
            if (afile.renameTo(new File(directoryNameDestination + afile.getName()))) {
                System.out.println("File move successful!");
            } else {
                System.out.println("File failed to move! " + afile.getName());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You can copy and paste the Java code from the listing above or pull the
code from GitHub.
Apache Hive and ETL
Apache Hive is a distributed data warehouse system built to work on Hadoop.
It is used to query and manage large datasets that reside in HDFS storage. Hive provides a mechanism to project structure onto
the data in Hadoop and HDFS and to query that data using a SQL-like
language called HiveQL (HQL). Because Hive provides SQL-based tools to
enable easy data extract, transform, and load, it makes sense to use
HQL scripts to load data into Hive.
Loading data into Apache Hive
With Hive, you can load all files in a given directory as long as
they have the same data structure. For the street crime data, create a
subdirectory named staging_area. Make a note of the path on your computer
and use that path in the Hive script.
Defining Hive tables
The Hive HQL file below creates the Hive data table and loads all the
data files in the staging_area directory into that table.
For subsequent analysis, it helps to understand the nature of the police
crime dataset and how the data is collected. Files made available each
month include:
- Crime and Anti-Social Behavior (ASB) or Street Crime File — Contains street crimes, such as robberies, assaults, and criminal damage.
- Police Outcomes File — Investigations often last months, and this file contains data and updates on events months, and sometimes years, past.
- Court Results — Police data submitted to the government is matched against Ministry of Justice court hearing outcomes.

Some crimes can be reclassified as investigations proceed; for others, latitude and longitude data is not
necessarily precise, such as where a victim is unable to say exactly
where a crime took place. The exact location of a crime is also approximated to
protect victims' identities. Usually, the published location is chosen from a list of 750,000
reference points closest to the crime.
Typical data integration issues
There is no unique identifier for crimes that runs from the
police service through the prosecution service to the courts
themselves. The police use a fuzzy matching process to link these records. Uploaded
data is a snapshot in time; crimes are often reclassified as
different types or confirmed as false reports after investigation.
The following tables describe the output
files from the system and their formats.
Table 1. Street Crime File
Fields | Description |
---|---|
Reported by | The force that provided data |
Falls within | The force that provided data |
Latitude | Latitude of crime |
Longitude | Longitude of crime |
LSOA code | Lower Layer Super Output Area code |
LSOA name | Lower Layer Super Output Area name |
Crime type | Standard Crime classification |
Last_Outcome_category | A reference to whichever of the outcomes associated with the crime occurred most recently |
Table 2. Stop and Search File
Fields | Description |
---|---|
Type | Type of search (person, vehicle, or person and vehicle) |
Date | Date and time of the search |
Part of a policing operation | Whether the search was part of a wider policing operation |
Latitude | Latitude of the search |
Longitude | Longitude of the search |
Gender | Gender of the person searched |
Age range | Age range of the person searched |
Self-defined ethnicity | Ethnicity as stated by the person searched |
Officer-defined ethnicity | Ethnicity as recorded by the officer |
Legislation | Legislation under which the search was conducted |
Object of search | What the officer was searching for |
Outcome | Outcome of the search |
Outcome linked to object of search | Whether the outcome was linked to the object of the search |
Removal of more than just outer clothing | Whether clothing beyond an outer layer was removed |
The outcomes file contains data for outcomes reported within the past month. The
crimes themselves may have occurred many months prior.
Table 3. Outcomes File
Fields | Description |
---|---|
Offense Reference | ReferenceID |
Outcome Date | Date of outcome |
Outcome Category | Category of outcome |
Run Hive SQL scripts and display results
A Hive SQL (HQL) script is just a series of Hive query language commands, the
same ones you would use in the Hive shell. The example below places the
Hive commands in an external HQL script file that you run from the command
line.
load_master_stop_and_search2.sql:

create table if not exists stop_and_search (
  searchType STRING,
  searchDate STRING,
  part_police_operation STRING,
  police_operation STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  gender STRING,
  age_range STRING,
  self_define_ethnicity STRING,
  officer_defined_ethnicity STRING,
  legislation STRING,
  object_of_search STRING,
  outcome STRING,
  outcome_linked_to_object STRING,
  removal_clothing STRING
)
row format delimited
fields terminated by ','
stored as textfile;

load data local inpath '/home/hduser/zdata/uk-crime/spark-lake/staging/' into table stop_and_search;
It is efficient to enter a set of Hive commands into a file. This permits
reuse of common commands, which are often lengthy definitions. This file
contains the Hive table definition, followed by an instruction to load the
data files into the Hive table. You invoke the Hive SQL script file from
the bash command line by typing the following command at the bash
prompt:

$ hive -f load_master_stop_and_search2.sql
The above script loaded 1,601 files into Hive. This file is available on
GitHub. A similar script is available on GitHub for the outcomes data and
for the street crime data. Next, we use SparkSQL to organize and catalog
the data and sort the resulting set from largest to smallest. The SparkSQL program is written
in the Java programming language using the Spark Java API. The code is
built using Maven. The Maven Project Object Model (POM) file is available
of GitHub. The program is invoked from the bash command line with
spark-submit:
$ mvn package
$ /usr/local/spark/bin/spark-submit --class "DemoHive" --master local[4] target/demo-hive-1.0.jar
The first command (mvn package) builds the Java program and integrates the
Spark and Hive API functions automatically. This results in a Java JAR
file as output. The second command runs the program, which interrogates the
Hive data warehouse, performs calculations, and displays the results. An
example output run is shown below.
Figure 3. Java SparkAPI program output
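The full DemoHive program is available on GitHub. As a rough sketch of the approach, and not the exact code of the real program, a SparkSQL query in Java that groups the Hive data and sorts the result set from largest to smallest might look like the following; the query and column choice are illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DemoHive {
    public static void main(String[] args) {
        // Sketch only: the actual DemoHive program on GitHub may differ.
        // Connect Spark to the Hive metastore so the tables loaded earlier are visible.
        SparkSession spark = SparkSession.builder()
                .appName("UK crime analysis")
                .enableHiveSupport()
                .getOrCreate();

        // Count stop and search records by the object of the search,
        // sorted from largest to smallest.
        Dataset<Row> results = spark.sql(
                "SELECT object_of_search, COUNT(*) AS total "
              + "FROM stop_and_search "
              + "GROUP BY object_of_search "
              + "ORDER BY total DESC");

        // Print the top rows without truncating long values
        results.show(20, false);
        spark.stop();
    }
}

Part 2 of this series explores these analytical queries in more detail.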
Summary
In this tutorial, we showed the data loading, followed by transformations
within the warehouse to integrate disparate data sources. This provides
greater traceability of data changes, which remains a key factor in
ensuring data quality. Without data quality and traceability, you cannot
trust the accuracy of subsequent analyses.
The design of the database tables for the analysis was decided after loading the
different CSV files from their different sources, and the decisions were
taken after checks of the data from the initial data loads. These checks
were performed in an iterative manner within the data warehouse itself.
Credit: IBM