In this article we will walk through how my team and I developed a web application to predict the probability of an electricity outage in Bangalore. The results can be reproduced for any city, provided sufficient data can be collected.
We found publicly available unscheduled power-outage data on BESCOM’s website. Because the site loads all 50,000+ rows at once, a simple Ctrl+A, Ctrl+C and Ctrl+V into an Excel sheet, though not an elegant solution, saves us from writing an elaborate scraping script.
Loading this into our Python code is fairly straightforward. Our file is saved as BESCOM-electricity-data-outrage.xlsx.
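The loading step might look like the following sketch. Since the spreadsheet itself is not bundled with this article, a tiny synthetic frame with a few of the sheet's 12 columns stands in for illustration; in practice `pd.read_excel("BESCOM-electricity-data-outrage.xlsx")` (which needs the openpyxl engine installed) does the work.

```python
import pandas as pd

# In the article the sheet is loaded with:
#   df = pd.read_excel("BESCOM-electricity-data-outrage.xlsx")
# A small synthetic frame stands in here, mirroring a few of the 12 columns.
df = pd.DataFrame({
    "Sl. No.": [1.0, 2.0],
    "Division": ["HSR Division", "Koramangala Division"],
    "From": pd.to_datetime(["2019-11-06 21:07:00", "2019-11-06 14:00:00"]),
    "LC Duration": pd.to_timedelta(["1 hours", "2 hours"]),
    "Updated By": ["Sushma", "Sushma"],
})
print(df.head())
```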
Sl. No. ... Updated By
0 1.0 ... Sushma
1 2.0 ... Sushma
2 3.0 ... Sushma
3 4.0 ... SUSHMA
4 5.0 ... Devaraj R
[5 rows x 12 columns]
To see all the individual column names, we can print `df.columns.values`, which gives:
array(['Sl. No.', 'Circle', 'Division', 'Subdivision', 'Substation',
'From', 'To', 'LC Duration', 'Feeders', 'Areas Affected',
'Reason For LC', 'Updated By'], dtype=object)
Since most of these attributes add little information to our understanding, we will keep only the columns Sl. No., Division, From and LC Duration to make the task simpler and easier to visualize.
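Selecting the subset is a one-liner in pandas; here is a sketch over a hypothetical frame with a couple of the extra columns:

```python
import pandas as pd

# Hypothetical frame with extra columns, reduced to the four we keep.
df = pd.DataFrame({
    "Sl. No.": [1.0],
    "Circle": ["South"],
    "Division": ["HSR Division"],
    "From": pd.to_datetime(["2019-11-06 21:07:00"]),
    "LC Duration": pd.to_timedelta(["1 hours"]),
    "Updated By": ["Sushma"],
})
df = df[["Sl. No.", "Division", "From", "LC Duration"]]
print(df.head())
```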
Division From LC Duration
0 HSR Division 2019-11-06 21:07:00 0 days 1 hours
1 HSR Division 2019-11-06 16:00:00 0 days 2 hours
2 Koramangala Division 2019-11-06 14:00:00 0 days 2 hours
3 HSR Division 2019-11-06 14:29:00 0 days 1 hours
4 Indiranagar 2019-11-06 13:00:00 0 days 2 hours
Next we need to add data about whether a particular day was a holiday or not. The intuition is that electricity consumption increases on a holiday, so there is a higher chance of an outage. For this we first find the earliest and latest dates contained in our dataset.
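Finding the date range is just a min/max over the From column; a sketch with synthetic timestamps (the real column spans several years):

```python
import pandas as pd

# Synthetic "From" timestamps standing in for the real outage data.
df = pd.DataFrame({"From": pd.to_datetime(
    ["2019-11-06 21:07:00", "2014-01-01 08:00:00", "2016-05-10 12:30:00"])})

# Earliest and latest dates bound the holiday lookup range.
start, end = df["From"].min().date(), df["From"].max().date()
print(start, end)
```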
Now we find all the public holidays within this range and add them to a separate Excel sheet, holidayData.xlsx.
The next step is to map these holidays into our existing dataset. For this we create another column called Holiday in our dataset and initialize all the rows to zero. Going through the dates, we update the column to ‘1’ if the date appears in holidayData.xlsx.
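A sketch of that mapping, with an in-memory set of dates standing in for the ones read from holidayData.xlsx (2019-10-02, Gandhi Jayanti, is used as a sample holiday):

```python
import pandas as pd

df = pd.DataFrame({"From": pd.to_datetime(
    ["2019-11-06 21:07:00", "2019-10-02 10:00:00"])})
df["Date"] = df["From"].dt.date

# Stand-in for the dates read from holidayData.xlsx.
holidays = set(pd.to_datetime(["2019-10-02"]).date)

# Default everything to 0, then flag the rows whose date is a holiday.
df["Holiday"] = 0
df.loc[df["Date"].isin(holidays), "Holiday"] = 1
print(df[["Date", "Holiday"]])
```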
Counting the rows grouped by the new Holiday column shows that roughly 1,100 of the outages fell on holidays:
Holiday Division From LC Duration Date
0.0 16136 16179 16179 16179
1.0 1116 1117 1117 1117
Next we will map the population of the various Divisions, the theory being that divisions with higher populations have a higher chance of suffering a power outage.
First we find the list of unique Divisions with `df['Division'].unique()`:
array(['HSR Division', 'Koramangala Division', 'Indiranagar',
'Shivajinagar', 'Hebbal', 'Whitefield', 'Malleshwaram',
'Rajaji Nagara Division', 'Jayanagar', 'Jalahalli',
'Kengeri Division', 'R R NAGAR', 'Vidhanasoudha',
'Peenya Division', nan], dtype=object)
We now find the estimated population for each of these regions. The accuracy of this step depends on whether such data is publicly available; in our case the last census was conducted in 2011, so we have to extrapolate the figures to reflect the current total population of Bangalore. As there are only 14 regions, instead of putting this data into another Excel file we enter the values in a dictionary. We then create another column in the dataset called Population, assign it the same values as Division, and use pandas' replace function to replace the keys in Population with the values in div_pop_dict.
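A minimal sketch of that replace step; the population figures below are illustrative placeholders, not the article's extrapolated census values:

```python
import pandas as pd

df = pd.DataFrame({"Division": ["HSR Division", "Indiranagar", "HSR Division"]})

# Illustrative figures only; the real values were extrapolated
# from the 2011 census.
div_pop_dict = {"HSR Division": 880000, "Indiranagar": 650000}

# Copy the Division names, then swap each name for its population.
df["Population"] = df["Division"].replace(div_pop_dict)
print(df)
```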
Name: Population, dtype: float64
Moving on, we have collected the daily minimum and maximum temperatures for the date range mentioned earlier. This data is stored in weather_data_max_min.xlsx. We load the dataset and convert the datetime values into plain dates.
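The load-and-convert step can be sketched as follows; a synthetic frame stands in for the contents of weather_data_max_min.xlsx, which would normally be read with `pd.read_excel`:

```python
import pandas as pd

# Stand-in for pd.read_excel("weather_data_max_min.xlsx").
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2014-01-01 00:00:00", "2014-01-02 00:00:00"]),
    "Max Temp": [18.7, 27.3],
    "Min Temp": [14.8, 15.4],
})
weather["Date"] = weather["Date"].dt.date  # keep only the calendar date
print(weather)
```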
Date Max Temp Min Temp
0 2014-01-01 18.7 14.8
1 2014-01-02 27.3 15.4
2 2014-01-03 27.7 13.1
3 2014-01-04 29.5 12.0
4 2014-01-05 29.6 14.1
Now we follow the same process as with the population data: we create dictionaries keyed by date and use them to attach the min/max temperatures to each outage date.
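A sketch of that lookup, using `Series.map` (equivalent in effect to the replace step described above); the temperature values are the sample ones shown in the output below:

```python
import pandas as pd

# Outage rows keyed by date; temperatures come from the weather sheet.
df = pd.DataFrame({"Date": list(pd.to_datetime(
    ["2019-11-06", "2019-11-06"]).date)})

max_temp = {pd.Timestamp("2019-11-06").date(): 34.3}
min_temp = {pd.Timestamp("2019-11-06").date(): 15.3}

df["Max Temp"] = df["Date"].map(max_temp)
df["Min Temp"] = df["Date"].map(min_temp)
print(df)
```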
Max Temp Min Temp
0 34.3 15.3
1 34.3 15.3
2 34.3 15.3
3 34.3 15.3
4 34.3 15.3
These are all the features that we will be working with for this example.
Now we need to create labels for classification. At the moment we only have the feature combinations that produced an outage. However, the model also needs examples of conditions that did not cause an outage in order to learn the distinction and formulate rules accordingly.
Our approach to this problem is to take each row recording when and where an outage occurred and label it ‘1’ for outage. For every other time slot with the exact same combination of features, we generate a row labelled ‘0’ for no outage. To illustrate this point, let us take an example:
Say the HSR Division row with From = 2019-11-06 21:07:00 is selected. It is labelled ‘1’ for outage. Notice that this particular outage occurred roughly in the 2100–2200 slot. Hence we can assume that there might have been no outage during 0000–2100 and 2200–0000, so we reproduce the row for those time slots and label them ‘0’ for no outage.
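The positive-plus-negatives expansion can be sketched like this, using hourly slots (the exact slot granularity is an assumption for illustration):

```python
import pandas as pd

# One outage row: HSR Division, 2019-11-06 21:07. The 2100-2200 slot gets
# label 1; every other hourly slot of that day becomes a label-0 row.
division = "HSR Division"
outage_time = pd.Timestamp("2019-11-06 21:07:00")

rows = [{"Division": division, "Hour": hour,
         "Outage": 1 if hour == outage_time.hour else 0}
        for hour in range(24)]
labels = pd.DataFrame(rows)
print(labels["Outage"].value_counts())
```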
As the final step before training, we need to one-hot encode the Division values. The Division names (Indiranagar, Koramangala, HSR Layout…) as such mean nothing to a computer, so we form a separate column for each Division, holding 0 or 1 in each row. For example, in the column formed for the Division of Indiranagar, a value of 0 in a row means that the row does not have Indiranagar as its Division, and a value of 1 indicates that it does. This lets the computer understand the data in its own terms and form better patterns. For a more in-depth understanding of One-Hot Encoding you can read this article.
This gives us the following columns, where in each row exactly one column has the value ‘1’, denoting that row's division.
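The one-hot step itself is a single call to `pd.get_dummies`; a sketch over a toy frame (the `Division_` prefix is a naming choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Division": ["HSR Division", "Indiranagar", "HSR Division"]})

# pd.get_dummies creates one 0/1 column per Division value.
one_hot = pd.get_dummies(df["Division"], prefix="Division").astype(int)
df = pd.concat([df.drop(columns="Division"), one_hot], axis=1)
print(df)
```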