This article covers most of the essential pre-processing steps you may need to apply to your text data. Each step is wrapped in a separate function, so you can clean your data in one shot by passing it to the “dataCleaningProcess” function that we are going to write.
It is assumed that the readers are aware of the need for text cleaning and pre-processing.
The code below performs the following activities:
- Remove punctuation
- Clean null records in a data frame
- Convert all the text to lower-case
- Remove duplicate items in the data frame
- Remove single-character words from a sentence
- Remove stop words
- Apply Stemmer
# Import Pandas library
import pandas as pd

# Import basic text pre-processing libraries
import re
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
Before we pre-process any text data, it is better to convert it to a pandas data frame.
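As a minimal illustration (with hypothetical sample strings), raw text can be loaded into a data frame like this before any cleaning:

```python
import pandas as pd

# Hypothetical raw documents; in practice these come from your CSV file.
raw_texts = [
    "The QUICK brown fox, jumped!",
    "Running runners ran quickly.",
]

# One row per document, under the column name the cleaning functions expect.
df = pd.DataFrame({"text_body": raw_texts})
print(df.shape)  # (2, 1)
```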
## Set the English stop-word list
stop_words = set(stopwords.words('english'))

## Declare the Porter stemmer
porter = PorterStemmer()

## Clean null records in the data frame
def cleanEmptyData(columnName, df):
    return df[df[columnName].notnull()]

## Remove punctuation (keep only letters and whitespace)
def remove_punctuation(columnName, df):
    return df.loc[:, columnName].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))

## Convert text to lower case
def lower_case(input_str):
    return input_str.lower()

## Remove duplicate items in the data frame
def removeDuplicate(df, columns):
    df.drop_duplicates(columns, inplace=True)

## Remove NLP stop words
def remove_stop_words(columnName, df):
    return df.loc[:, columnName].apply(lambda x: [word for word in x.split() if word not in stop_words])

## Remove single-character words from the sentence
def remove_one_character_word(columnName, df):
    return df.loc[:, columnName].apply(lambda x: [i for i in x if len(i) > 1])

## Join the tokens back into a single text with a separator
def join_seperator(columnName, df):
    seperator = ', '
    return df.loc[:, columnName].apply(lambda x: seperator.join(x))

## Apply the stemmer to data frame fields
def apply_stemmer(columnName, df):
    return df.loc[:, columnName].apply(lambda x: [porter.stem(word) for word in x])
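A quick sanity check of the punctuation-stripping pattern (using the corrected regex `[^a-zA-Z\s]`, which keeps only letters and whitespace) on a hypothetical sample string:

```python
import re

sample = "Hello, World! It's 2020."
# Strip everything that is not a letter or whitespace: commas, apostrophes,
# digits, and the trailing period are all removed.
cleaned = re.sub(r"[^a-zA-Z\s]", "", sample)
print(repr(cleaned))  # 'Hello World Its '
```

Note that the digits and their surrounding punctuation vanish entirely; if your data contains meaningful numbers, you may want to widen the character class.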
Here is the function that utilizes the above pre-processing functions and returns the clean data.
## Data cleaning process function | Replace 'col1_name' and 'col2_name' with your own
## data column names; if more than two columns of text need to be cleaned, add them to the list.
def dataCleaningProcess(dataFrame):
    ## remove duplicate records
    removeDuplicate(dataFrame, ['col1_name', 'col2_name'])
    ## clean null-value records
    clean_data = cleanEmptyData('col1_name', dataFrame)
    ## convert to lower case
    clean_data.loc[:, 'text_body_clean'] = clean_data.loc[:, 'text_body'].apply(lambda x: lower_case(x))
    ## remove punctuation
    clean_data.loc[:, 'text_body_clean'] = remove_punctuation('text_body_clean', clean_data)
    ## remove stop words
    clean_data.loc[:, 'text_body_clean'] = remove_stop_words('text_body_clean', clean_data)
    ## apply the stemmer to each token
    clean_data.loc[:, 'text_body_clean'] = apply_stemmer('text_body_clean', clean_data)
    ## remove single-character words from the sentence
    clean_data.loc[:, 'text_body_clean'] = remove_one_character_word('text_body_clean', clean_data)
    ## join the word tokens back into a single text
    clean_data.loc[:, 'text_body_clean'] = join_seperator('text_body_clean', clean_data)
    ## remove the commas introduced by the join
    clean_data.loc[:, 'text_body_clean'] = remove_punctuation('text_body_clean', clean_data)
    return clean_data
After writing the above pre-processing functions, we read the uncleaned data using pandas and load it into an object “data”.
data = pd.read_csv('/PATH/FILE_NAME.CSV',dtype='str')
Pass the data object to the ‘dataCleaningProcess’ function as a parameter.
clean_data = dataCleaningProcess(data)
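For readers who want to try the pipeline without a CSV file, the duplicate-removal and null-cleaning steps can be exercised on a small in-memory data frame (hypothetical column names and values, mirroring the `drop_duplicates` and `notnull` calls used by the helpers above):

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are duplicates, row 2 has a null text field.
demo_df = pd.DataFrame({
    "col1_name": ["good movie", "good movie", None],
    "col2_name": ["a", "a", "b"],
})

# Same operations as removeDuplicate and cleanEmptyData.
demo_df = demo_df.drop_duplicates(["col1_name", "col2_name"])
demo_df = demo_df[demo_df["col1_name"].notnull()]
print(len(demo_df))  # 1
```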
The reusable pre-processing helper functions above are part of the Kaggle COVID-19 challenge that I took part in along with Soubam Kohinoor.
To know more about the co-author Kohinoor Soubam→ https://www.linkedin.com/in/soubam-kohinoor-singh-1a397327/
Credit: BecomingHuman By: Rakesh Thoppaen