Learn about Extract, Transform, and Load (ETL)

January 15, 2019

Credit: IBM

Analyze crime data with Apache Spark and Hive ETL, Part 1


Content series:

This content is part # of 2 in the series: Analyze crime data with Apache Spark and Hive ETL, Part 1

http://www.ibm.com/developerworks/library/?series_title_by=analyze+crime+data+apache+hive

Stay tuned for additional content in this series.


This tutorial describes the analysis of United Kingdom crime data from inception to
final results. Follow along to learn about downloading the data, transforming it,
loading it into a distributed data warehouse (Apache Hive), and analyzing it
with Apache Spark. Part 1 of this series describes the
Extract, Transform, and Load (ETL) activities, and Part 2 explores the analyses performed.

Spark and Hive as alternatives to traditional ETL tools

Many ETL tools exist, but they often require programmers to be familiar with
proprietary architectures and languages. Apache Hive is a data warehouse
built on Hadoop that offers SQL-based tools to transform structured and
semi-structured data into a schema-based warehouse. Both Hive
and Spark work with data within the Hadoop ecosystem and the Hadoop
Distributed File System (HDFS).
Spark and its SparkSQL component are well suited to fast transformations
of data within the data warehouse. Given the superior performance of
SparkSQL compared with MapReduce, this provides a speed improvement over
traditional ETL tools.
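
To make the idea of SQL-based, in-warehouse transformation concrete, here is a
minimal HQL sketch that derives a summary table from a raw table in a single
statement. The table and column names are hypothetical, used only for
illustration; the real table definitions for this tutorial appear later.

-- Minimal illustrative HQL transformation (hypothetical table and
-- column names): build a summary table inside the warehouse with a
-- single CREATE TABLE ... AS SELECT statement
create table if not exists crime_counts as
select crime_type, count(*) as total
from raw_crimes
group by crime_type;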

Figure 1. U.K. crime data download


The U.K. crime database was chosen to illustrate the speed and ease of use of
Spark analytics with the Hive data warehouse. The program code and
scripts for this tutorial are on GitHub.

There is more than one method to retrieve data from the U.K. crime database.
The API download is suitable only for analysis of local areas. For this
tutorial, use the CSV file download to analyze national data. This also
lets you take advantage of a convenient Hive feature during data load.

The figure below shows the data download from the U.K. crime dataset. To
illustrate the data route, I selected the Cambridgeshire Constabulary and the
Metropolitan Police Service (London) from September 2014 to August 2017, but
you may choose as many forces as you wish.

On the custom download tab:

  • Select the date range September 2014 through August 2017.
  • Select the forces Cambridgeshire Constabulary and Metropolitan Police
    Service. You can choose as many forces as you’d like, but for
    illustration purposes in this tutorial, only two are used.
  • Scroll down and select the Include stop and search checkbox.
  • Click Generate file. It takes a few seconds to generate the file.
  • When complete, click Download now to download the data file.

Repeat this process for the street crime checkbox and the outcomes
checkbox. This results in three files, each containing all the data for
street crimes, outcomes, and stop and search. The figure below shows the
three dataset checkboxes available for download.

Figure 2. U.K. crime data download


After the download completes, you can find the file in your Downloads
folder. Double-click the file to open the standard unzip utility on your
computer. Create a folder named UK-Crime where you want the unzipped files
to live.

The next step is cleaning the data and subsequently loading it into
Hive. Create a second folder named staging; this folder holds the cleaned
data that will be loaded into Apache Hive.

Data cleaning

The initial data download, when unzipped, results in a directory hierarchy
with files stored in numerous subdirectories. Hive has a useful
data load feature that ingests every file in a given directory when each
file has the same structure (for example, the same column types). However,
the data files provided by the U.K. Police use several file types
in a directory hierarchy, by month and year, and by county. Therefore, the
first step is to create a staging folder for each of the different file types.

To simplify this process, I created a Java™ program that parses the
directory hierarchy and moves each file of a given type to the custom
staging area for that file type. This simple Java utility program
traverses the folder hierarchy of the unzipped data files and moves all
the files of a given type into the staging area folder I created.
This program copes with minor changes in the directory hierarchy that have
been implemented by the U.K. police over time. All files in a staging area
folder load into the Apache Hive warehouse in one step.

import java.io.File;

public class MoveFilesUtility {

    public static void main(String[] args) {
        // String mypath = "/home/hduser/zdata/uk-crime/crime-stop-search";
        String mypath = "/home/hduser/zdata/uk-crime/crime-street-crime";

        MoveFilesUtility o1 = new MoveFilesUtility();
        o1.listFilesAndFilesSubDirectories(mypath);
    }

    /**
     * Recursively traverse the directory hierarchy, moving every file
     * found into the staging area folder.
     * @param directoryName the root of the hierarchy to traverse
     */
    public void listFilesAndFilesSubDirectories(String directoryName) {
        String destDir = "/home/hduser/zdata/uk-crime/spark-lake/staging/";
        File directory = new File(directoryName);
        // Get all the files and subdirectories in this directory
        File[] fList = directory.listFiles();
        if (fList == null) return; // not a directory, or an I/O error occurred
        for (File file : fList) {
            if (file.isFile()) {
                System.out.println(file.getAbsolutePath());
                moveUkCrimeFilesToDir(destDir, file.getAbsolutePath());
            } else if (file.isDirectory()) {
                listFilesAndFilesSubDirectories(file.getAbsolutePath());
            }
        }
    }

    /**
     * Move a U.K. crime data file into the staging directory (for the Hive data load).
     * @param directoryNameDestination the staging directory to move the file into
     * @param mypath the absolute path of the file to move
     */
    public void moveUkCrimeFilesToDir(String directoryNameDestination, String mypath) {
        try {
            File afile = new File(mypath);
            if (!afile.isFile()) return;
            if (afile.renameTo(new File(directoryNameDestination + afile.getName()))) {
                System.out.println("File move successful!");
            } else {
                System.out.println("File failed to move! " + afile.getName());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

You can copy and paste the Java code from the listing above or pull the
code from GitHub.
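
Assuming you save the listing as MoveFilesUtility.java (the class declares no
package), a minimal way to compile and run it with the standard JDK tools
looks like this. Edit the hard-coded source and staging paths in the program
to match your own folders first.

$javac MoveFilesUtility.java
$java MoveFilesUtility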

Apache Hive and ETL

Apache Hive is a distributed data warehouse system built to work on Hadoop.
It is used to query and manage large datasets that reside in HDFS storage. Hive provides a mechanism to project structure onto
the data in Hadoop and HDFS and to query that data using a SQL-like
language called HiveQL (HQL). Because Hive provides SQL-based tools to
enable easy data extract, transform, and load, it makes sense to use
HQL scripts to load data into Hive.

Loading data into Apache Hive

With Hive, you can load all files in a given directory as long as
they have the same data structure. For the street crime data, create a
subdirectory named staging_area. Make a note of the path on your computer
and use that path in the Hive script.

Defining Hive tables

The Hive HQL file below creates the Hive data table and loads all the
data files in the directory staging_area into the table.

For subsequent analysis, it helps to understand the nature of the police
crime dataset and how the data is collected. Files made available each
month include:

  • Crime and Anti-Social Behavior (ASB) or Street Crime File
    — Contains street crimes, such as
    robberies, assaults, and criminal damage.
  • Police Outcomes File — Investigations often last months, and this file
    contains data and updates on events from months, and sometimes years,
    in the past.
  • Court Results — Police data submitted to the
    government is matched against
    Ministry of Justice court hearing outcomes. Some crimes can be reclassified
    as investigations proceed; for others, latitude and longitude data is not
    necessarily precise, such as where a victim is unable to say exactly
    where a crime took place. The exact location of a crime is approximated to
    protect victims’ identities; usually, it is chosen from a list of 750,000
    reference points closest to the crime.

Typical data integration issues

There is no unique identifier for crimes that run from
police service through criminal prosecution service to the courts
themselves. The police use a fuzzy matching process to match these. Data
uploaded is a snapshot of time; crimes are often reclassified as
different types or confirmed as a false report after investigation.
Following are the output
files from the system and file formats.

Table 1. Street Crime File

  • Reported by: The force that provided the data
  • Falls within: The force that provided the data
  • Latitude: Latitude of the crime
  • Longitude: Longitude of the crime
  • LSOA code: Lower Layer Super Output Area code
  • LSOA name: Lower Layer Super Output Area name
  • Crime type: Standard crime classification
  • Last outcome category: A reference to whichever of the outcomes
    associated with the crime occurred most recently

Table 2. Stop and Search File

  • Type: Type of search (person, vehicle, or both)
  • Date: Date of the stop and search
  • Part of a policing operation: Whether the stop was part of a policing operation
  • Policing operation: The name of the policing operation, if any
  • Latitude: Latitude of the stop
  • Longitude: Longitude of the stop
  • Gender: Gender of the person stopped
  • Age range: Age range of the person stopped
  • Self-defined ethnicity: Ethnicity as stated by the person stopped
  • Officer-defined ethnicity: Ethnicity as recorded by the officer
  • Legislation: The legislation under which the search was carried out
  • Object of search: What the officer was searching for
  • Outcome: Outcome of the stop
  • Outcome linked to object of search: Whether the outcome was linked to
    the object of the search
  • Removal of more than outer clothing: Whether more than outer clothing
    was removed during the search

The outcomes file contains data for outcomes reported within the past month;
the crimes themselves may have occurred many months prior.

Table 3. Outcomes File

  • Offense Reference: Reference ID
  • Outcome Date: Date of the outcome
  • Outcome Category: Category of the outcome

Run Hive SQL scripts and display results

A Hive SQL (HQL) script is just a series of Hive query language commands,
the same ones you would use in the Hive shell. Collecting them in an
external HQL script file lets you run them as a unit and reuse them.

load_master_stop_and_search2.sql:

-- Define the stop-and-search table to match the columns of the CSV files
create table if not exists stop_and_search (
    searchType STRING, searchDate STRING, part_police_operation STRING,
    police_operation STRING, latitude DOUBLE, longitude DOUBLE,
    gender STRING, age_range STRING, self_define_ethnicity STRING,
    officer_defined_ethnicity STRING, legislation STRING,
    object_of_search STRING, outcome STRING,
    outcome_linked_to_object STRING, removal_clothing STRING
)
row format delimited fields terminated by ','
stored as textfile;

-- Load every file in the staging directory into the table in one step
load data local inpath '/home/hduser/zdata/uk-crime/spark-lake/staging/' into table stop_and_search;

It is efficient to enter a set of Hive commands into a file. This permits
reuse of common commands, which are often lengthy definitions. This file
contains the Hive table definition, followed by an instruction to load the
data files into the Hive table. You invoke the Hive SQL script file from
the bash command line by typing the following command at the bash
prompt:
$hive -f load_master_stop_and_search2.sql

The above script loaded 1,601 files into Hive. This file is available on
GitHub, and similar scripts are available there for the outcomes data and
the street crime data. Next, we use SparkSQL to organize and catalog
the data and sort the resulting set from largest to smallest. The SparkSQL
program is written in the Java programming language using the Spark Java
API. The code is built using Maven, and the Maven Project Object Model
(POM) file is available on GitHub. The program is invoked from the bash
command line with spark-submit:

$mvn package
$/usr/local/spark/bin/spark-submit --class "DemoHive" --master local[4] target/demo-hive-1.0.jar

The first command (mvn package) builds the Java program and automatically
pulls in the Spark and Hive API dependencies, producing a Java JAR
file as output. The second command runs the program, which interrogates the
Hive data warehouse, performs calculations, and displays the results. An
example output run is shown in Figure 3.
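
The full program code is on GitHub. As a rough, minimal sketch of the shape
such a program takes (not the actual DemoHive source), the class below opens
a Hive-enabled Spark session and runs an aggregate query over the
stop_and_search table defined earlier, sorting the counts from largest to
smallest; the particular query is an illustrative stand-in.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DemoHive {
    public static void main(String[] args) {
        // Open a Spark session that can read tables from the Hive warehouse
        SparkSession spark = SparkSession.builder()
                .appName("DemoHive")
                .enableHiveSupport()
                .getOrCreate();

        // Count stop-and-search records by object of search,
        // sorted from largest to smallest
        Dataset<Row> results = spark.sql(
                "SELECT object_of_search, COUNT(*) AS total "
              + "FROM stop_and_search "
              + "GROUP BY object_of_search "
              + "ORDER BY total DESC");

        results.show(20, false);
        spark.stop();
    }
}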

Figure 3. Java SparkAPI program output


Summary

In this tutorial, we showed how to load the data and then transform it
within the warehouse to integrate disparate data sources. This approach
provides greater traceability of data changes, which remains a key factor in
ensuring data quality. Without data quality and traceability, you cannot
trust the accuracy of subsequent analyses.

The design of the database tables for the analysis was settled after the
different CSV files from different sources had been loaded, and the design
decisions followed checks of the data from those initial loads. These checks
were performed in an iterative manner within the data warehouse itself.


