Introduction
Deforestation is the clearing, destruction, or removal of trees. Its main drivers are farming (mostly cattle ranching), logging for materials and development, mining, and drilling, which together are responsible for more than half of all deforestation. Over the past 50 years, nearly 17 percent of the Amazon rainforest has been lost, and losses have recently been on the rise.
Exploring the domain
While exploring and analyzing the data, I noticed that most deforestation happens close to places where it has occurred before. It is mostly connected with farms that need more space for cattle, with logging to produce more paper or wood materials, and with other work.
Check my web app for more visual analysis & a little demonstration.
Dataset
I took the dataset from terrabrasilis.dpi.inpe.br/en. It records deforested areas larger than 6.25 hectares from 2008 to 2018, discretized by year.
The dataset includes the following columns:
gid: unique identifier of each feature;
origin_id: unique identifier for traceability of the feature in the origin for geo data;
geo data: feature composed of one or more polygons, with geometry obtained by visual interpretation of satellite images;
uf: state abbreviation;
pathrow: scene code formed by the path and row of the satellite (the land is divided into squares, as in a 2D grid);
mainclass: name of the main class assigned to the feature;
class_name: name of the specific class assigned to the feature;
dsfnv: indicates whether there was cloud cover over the feature in the previous year;
julday: Julian day;
view_date: date of the scene used to obtain the feature;
year: year of deforestation, used to facilitate queries to the areakm database;
areakm: area calculated for the feature in km²;
scene_id: identifier of the scene in the database, used for publish_year queries;
publish_year: used to allow the publication of data on the GeoServer with a temporal dimension.
Wrangling and Cleaning the Data
I used pandas profiling to look closely at the data and its features: their distributions, cardinality, zeros, and nulls. Most of the features did not seem useful for predicting the location of a deforested area, except for the state and view_date. I also took into consideration the area in square kilometres (areakm_squared). Some states have more deforested areas than others. The geo data gave me the longitude and latitude of the centroid of each deforested area, which I chose as the targets for my predictions.
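To make the centroid step concrete, here is a minimal sketch of how the centroid of a single polygon can be computed with the shoelace formula. It is an illustration rather than the code used for this project: the function name `polygon_centroid` and the example rectangle are hypothetical, the polygon is assumed to be a simple ring of (longitude, latitude) vertices, and the coordinates are treated as flat x/y values. A feature made of several polygons would additionally need its per-polygon centroids weighted by area.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def polygon_centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon (shoelace formula).

    Longitude and latitude are treated as flat x/y coordinates.
    """
    area2 = 0.0  # twice the signed area of the polygon
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3.0 * area2), cy / (3.0 * area2)


# Hypothetical rectangular patch (not taken from the dataset).
patch = [(-52.0, -4.0), (-51.0, -4.0), (-51.0, -3.0), (-52.0, -3.0)]
lon, lat = polygon_centroid(patch)
print(lon, lat)
```

In practice a GIS library would normally do this directly from the geometry column; the sketch only shows what the two target values represent.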
Visualizing the data
The Mapbox plot below shows deforested areas spread across all the Amazon states. Most of the deforestation occurs in Para state and the least in Amapa.
The Process
The aim is to predict the location (centroid) of an area where deforestation is most likely to occur. I decided to try different models for my predictions and see which performs best. Because this is spatial data, for the sake of simplicity I treated each location as a point in 2D space rather than on a sphere, and predicted latitude and longitude as two separate values using two different models.
I made a time-based split of the data into train, validation, and test sets: 2008–2015 for training, 2016 for validation, and 2017–2018 for testing. After cleaning and feature engineering, I ended up with only five features: ['areakm_squared', 'day', 'month', 'year', 'states'].
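A minimal sketch of the split described above, written in plain Python so it runs without dependencies. The rows are made up, and only the column names follow the article. It also assumes that view_date is an ISO "YYYY-MM-DD" string and that day and month are derived from it, which may differ from the actual preprocessing. The helper names `add_date_parts` and `time_split` are hypothetical.

```python
from datetime import date

# Hypothetical rows; only the column names follow the article.
rows = [
    {"view_date": "2008-08-02", "year": 2008, "states": "PA", "areakm_squared": 0.31},
    {"view_date": "2015-07-19", "year": 2015, "states": "MT", "areakm_squared": 0.12},
    {"view_date": "2016-06-30", "year": 2016, "states": "RO", "areakm_squared": 0.09},
    {"view_date": "2017-08-11", "year": 2017, "states": "AM", "areakm_squared": 0.45},
    {"view_date": "2018-07-25", "year": 2018, "states": "PA", "areakm_squared": 0.27},
]


def add_date_parts(row):
    """Return a copy of the row with day and month taken from view_date."""
    d = date.fromisoformat(row["view_date"])  # assumes ISO-formatted dates
    return {**row, "day": d.day, "month": d.month}


def time_split(rows):
    """Split chronologically so that no model trains on later years."""
    train = [r for r in rows if 2008 <= r["year"] <= 2015]
    val = [r for r in rows if r["year"] == 2016]
    test = [r for r in rows if 2017 <= r["year"] <= 2018]
    return train, val, test


train, val, test = time_split([add_date_parts(r) for r in rows])
print(len(train), len(val), len(test))
```

With pandas, the same split is a boolean mask on the year column. The point of splitting by time rather than at random is that validation and test data always lie in the future relative to the training data.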
Ridge Model
The first model I tried was Ridge regression, which of course is not well suited to predicting coordinates, but I at least wanted to see where I stood. For the encoding I chose TargetEncoder; I also scaled the values with StandardScaler and used SelectKBest to select features.
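The sketch below shows one way the "two separate models" idea could look with scikit-learn: one scale, select, and Ridge pipeline per coordinate. It is an illustration under stated assumptions, not the project's code. The data is synthetic, the state centres are invented, `k=3` and `alpha=1.0` are arbitrary, and the helper `target_encode` is a simplified stand-in (a plain per-state mean of the target) for a real TargetEncoder class.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: four numeric features plus a state label per row.
n = 200
states = rng.choice(["PA", "MT", "RO"], size=n)
X_num = np.column_stack([
    rng.uniform(0.0625, 5.0, n),   # areakm_squared
    rng.integers(1, 29, n),        # day
    rng.integers(1, 13, n),        # month
    rng.integers(2008, 2016, n),   # year
]).astype(float)
state_lat = {"PA": -4.0, "MT": -12.0, "RO": -10.0}   # invented centres
state_lon = {"PA": -52.0, "MT": -56.0, "RO": -63.0}  # invented centres
lat = np.array([state_lat[s] for s in states]) + rng.normal(0, 1, n)
lon = np.array([state_lon[s] for s in states]) + rng.normal(0, 1, n)


def target_encode(categories, target):
    """Simplified target encoding: each category becomes its mean target."""
    means = {c: target[categories == c].mean() for c in np.unique(categories)}
    return np.array([means[c] for c in categories]), means


def fit_coordinate_model(X_num, categories, target):
    """Fit StandardScaler -> SelectKBest -> Ridge for a single coordinate."""
    encoded, means = target_encode(categories, target)
    X = np.column_stack([X_num, encoded])
    model = make_pipeline(
        StandardScaler(),
        SelectKBest(f_regression, k=3),
        Ridge(alpha=1.0),
    )
    model.fit(X, target)
    return model, means


def predict_coordinate(model, means, X_num, categories):
    """Encode states with the training means, then predict the coordinate."""
    fallback = float(np.mean(list(means.values())))  # used for unseen states
    encoded = np.array([means.get(c, fallback) for c in categories])
    return model.predict(np.column_stack([X_num, encoded]))


# Two independent models: one for latitude, one for longitude.
lat_model, lat_means = fit_coordinate_model(X_num, states, lat)
lon_model, lon_means = fit_coordinate_model(X_num, states, lon)
pred_lat = predict_coordinate(lat_model, lat_means, X_num, states)
pred_lon = predict_coordinate(lon_model, lon_means, X_num, states)
```

For brevity the sketch predicts on the rows it was trained on, which only demonstrates the mechanics. A real evaluation would fit on the training years and predict on the held-out validation year.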
For metrics on the validation set, I chose MAE (mean absolute error), RMSE (root mean squared error), and R² score (R-squared).
Ridge model validation MAE: 1.3207 lat
Ridge model validation MAE: 1.8468 lon
Ridge model validation RMSE: 2.8329 lat
Ridge model validation RMSE: 5.4658 lon
Ridge model validation R^2 coefficient: 0.7949 lat
Ridge model validation R^2 coefficient: 0.8801 lon
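For reference, the three metrics reported above can be written out in a few lines of plain Python. This is a sketch of the standard definitions, with made-up latitude values for illustration. The function names are hypothetical, and in practice the equivalent functions from a library such as scikit-learn would normally be used.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """R-squared: 1 minus residual variation over total variation."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up latitudes in degrees, not values from the dataset.
y_true = [-3.0, -5.0, -9.0, -11.0]
y_pred = [-4.0, -5.0, -8.0, -13.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

RMSE is never smaller than MAE, and a wide gap between the two, as in the longitude figures, usually points to a small number of very large errors.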
The plot below shows the predictions on the validation set. As I expected, the model performed poorly, even though the metrics look good.
Credit: BecomingHuman By: Iuliia Stanina