Saturday, February 27, 2021
NikolaNews

NLP – Text Pre-Processing Reusable Helper Function | by Rakesh Thoppaen | Aug, 2020

August 16, 2020
in Neural Networks

This article covers most of the essential pre-processing steps we may need to apply to text data. Each step is written as a separate function, and all of them are chained in a single “dataCleaningProcess” function, which can be used directly by passing your text data as a parameter.

It is assumed that the readers are aware of the need for text cleaning and pre-processing.

The code below performs the following steps:

  1. Remove punctuation
  2. Clean null records in a data frame
  3. Convert all the text to lower-case
  4. Remove duplicate items in the data frame
  5. Remove single character from a sentence
  6. Remove stop words
  7. Apply Stemmer
# Import the regular-expression module (used for punctuation removal) and the Pandas library
import re
import pandas as pd

# Import some basic text pre-processing libraries
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

import nltk
nltk.download('stopwords')
nltk.download('punkt')

Before we pre-process any text data, it is better to have it converted to a pandas data frame.
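As a minimal sketch of that conversion, assuming the raw text is already available as a Python list, the snippet below builds a one-column data frame. The sample sentences are made up for illustration; the column name 'text_body' is chosen to match the column used in the cleaning function in this article.

```python
import pandas as pd

# Hypothetical raw documents; in practice these would come from files or an API.
documents = [
    "COVID-19 spreads through respiratory droplets.",
    "Masks reduce transmission!",
    None,  # a missing record, kept to show how nulls appear in the frame
]

# One row per document, stored in a single text column.
df = pd.DataFrame({"text_body": documents})

print(df.shape)                         # (3, 1)
print(df["text_body"].isnull().sum())   # 1
```

Keeping the text in a named column makes it possible to apply each cleaning step column-wise with "apply".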

## set English language
stop_words = set(stopwords.words('english'))

## declaration of Porter stemmer.
porter=PorterStemmer()

## Clean Null Record in dataframe
def cleanEmptyData(columnName,df):
    return df[df[columnName].notnull()]

## Remove Punctuation (keep only letters and whitespace)
def remove_punctuation(columnName,df):
    return df.loc[:,columnName].apply(lambda x: re.sub(r'[^a-zA-Z\s]','',x))

## Convert To Lower Case
def lower_case(input_str):
    input_str = input_str.lower()
    return input_str

## Remove duplicate item in the dataframe (modifies df in place)
def removeDuplicate(df,columns):
    df.drop_duplicates(subset=columns, inplace=True)

## Remove nlp stop words (splits each text into a list of word tokens)
def remove_stop_words(columnName,df):
    return df.loc[:,columnName].apply(lambda x: [word for word in x.split() if word not in stop_words])

## Remove single character tokens from the sentence (expects a list of tokens)
def remove_one_character_word(columnName,df):
    return df.loc[:,columnName].apply(lambda x: [i for i in x if len(i) > 1])

## Join the tokens into a single text with separator
def join_seperator(columnName,df):
    separator = ', '
    return df.loc[:,columnName].apply(lambda x: separator.join(x))

## apply stemmer to data frame fields (expects a list of tokens)
def apply_stemmer(columnName,df):
    return df.loc[:,columnName].apply(lambda x: [porter.stem(word) for word in x])
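The character class passed to re.sub decides what survives punctuation removal, so it is worth checking on a small input. The sketch below uses a made-up sample string and only the standard library; it compares a strict class (letters and whitespace) with a looser one in which the range runs from "A" to lower-case "z" and whitespace is not escaped.

```python
import re

sample = "Hello World_1, again!"

# Strict pattern: keep ASCII letters and whitespace only.
strict = re.sub(r"[^a-zA-Z\s]", "", sample)

# Loose pattern: the range A-z also spans [ \ ] ^ _ ` and a bare "s" is
# just the letter s, so underscores survive and spaces are stripped.
loose = re.sub(r"[^a-zA-zs]", "", sample)

print(strict)  # Hello World again
print(loose)   # HelloWorld_again

# Tokenize the strict result and drop single-character tokens.
tokens = [t for t in strict.lower().split() if len(t) > 1]
print(tokens)  # ['hello', 'world', 'again']
```

Keeping the whitespace matters here because the stop-word step relies on split() to separate words.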

The following function uses the pre-processing functions above and returns the clean data.

## Data Cleaning Process function | Customize col1_name and col2_name with your data column names; if there are more than 2 columns whose text needs to be cleaned, include them as a third attribute
def dataCleaningProcess(dataFrame):
    ## remove duplicate records
    removeDuplicate(dataFrame,['col1_name', 'col2_name'])

    ## clean null value records (copy so that new columns are added to a separate frame)
    clean_data = cleanEmptyData('col1_name',dataFrame).copy()
    clean_data.loc[:,'text_body_clean'] = clean_data.loc[:,'text_body'].apply(lambda x: lower_case(x))

    ## removing punctuation
    clean_data.loc[:,'text_body_clean'] = remove_punctuation('text_body_clean',clean_data)

    ## remove stop words
    clean_data.loc[:,'text_body_clean'] = remove_stop_words('text_body_clean',clean_data)

    ## apply stemmer to each token
    clean_data.loc[:,'text_body_clean'] = apply_stemmer('text_body_clean',clean_data)

    ## removing single character words in the sentence
    clean_data.loc[:,'text_body_clean'] = remove_one_character_word('text_body_clean',clean_data)

    ## join the word tokens into a single text
    clean_data.loc[:,'text_body_clean'] = join_seperator('text_body_clean',clean_data)

    ## remove commas after join
    clean_data.loc[:,'text_body_clean'] = remove_punctuation('text_body_clean',clean_data)

    return clean_data
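One detail worth illustrating is the step that filters out null records and then adds a new column. A minimal sketch, assuming a small made-up frame with a 'text_body' column: taking an explicit copy after filtering keeps the new column on a separate frame and, in pandas versions that emit SettingWithCopyWarning, avoids that warning.

```python
import pandas as pd

# Hypothetical input with one missing record.
raw = pd.DataFrame({"text_body": ["Hello World", None, "Second Row"]})

# Filter out null rows, then copy so later column assignments
# clearly target a new frame rather than a slice of `raw`.
clean = raw[raw["text_body"].notnull()].copy()
clean.loc[:, "text_body_clean"] = clean["text_body"].str.lower()

print(clean["text_body_clean"].tolist())   # ['hello world', 'second row']
print("text_body_clean" in raw.columns)    # False
```

The original frame keeps its shape and columns, which makes it easier to compare raw and cleaned text side by side.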

After writing the pre-processing function above, we read the uncleaned data using pandas and load it into an object named “data”.

data = pd.read_csv('/PATH/FILE_NAME.CSV',dtype='str')
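To see what reading every column as a string does, the sketch below reads a small in-memory CSV instead of a file. The CSV content and the column names 'id' and 'text_body' are made up for illustration. Leading zeros are preserved, and empty fields still arrive as null values, which is why the null-cleaning step remains necessary.

```python
import io
import pandas as pd

# Hypothetical CSV content; the second record has an empty text field.
csv_text = "id,text_body\n001,First record\n002,\n"

# dtype=str reads every column as text instead of inferring numbers.
data = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(data["id"].tolist())                   # ['001', '002']
print(data["text_body"].isnull().tolist())   # [False, True]
```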

Pass the data object to the ‘dataCleaningProcess’ function as a parameter.

clean_data =  dataCleaningProcess(data)

The reusable pre-processing helper function above was written for the Kaggle COVID-19 challenge that I took part in along with Soubam Kohinoor.

To learn more about the co-author, Kohinoor Soubam: https://www.linkedin.com/in/soubam-kohinoor-singh-1a397327/

Credit: BecomingHuman By: Rakesh Thoppaen
