Written by: Chua Chiah Soon, Li Zhaochen, Lin Min Htoo, Quah Jia Yong, all NTU students from Singapore.
This project is a submission for the Deep Learning Datathon jointly organised by Nanyang Technological University Singapore & ai4impact. We are greatly honoured to have won first prize in this challenge. This extensive article summarises our project. For any queries, do reach out to us.
Here is an outline of our article:
- Objective & Metrics of success
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Research methodology
- Convert 1d time to 2d time
- XGBoost SHAP Feature Importance Values
- Results & Discussion: Features
- Results & Discussion: Model
- Our model’s strengths & weaknesses
- A note on interpretable machine learning
Managing electrical energy consumption is crucial, simply because of one fact: electricity cannot be stored unless converted to other forms. It is best for produced electricity to be instantly consumed; otherwise, additional resources and costs are incurred to convert and store the excess energy. Energy-efficient buildings provide both economic and environmental benefits, maximising profits and social welfare. Conversely, underestimating energy consumption could be fatal, with excess demand overloading the supply line and even causing blackouts, leading to operational downtime. Clearly, there are tangible benefits in closely monitoring the energy consumption of buildings — be they office, commercial or residential.
With the advent of machine learning and data science, accurately predicting future energy consumption becomes increasingly possible. This provides two-fold benefits: firstly, managers gain key insights into the factors affecting their building’s energy demand, providing opportunities to address them and improve energy efficiency. Secondly, forecasts provide a benchmark to single out anomalously high/low energy consumption and alert managers to faults within the building. A key assumption behind time-series forecasting is that energy consumption follows recurring trends: an office building might have similar daily energy demand patterns across working days. By exploiting these cyclical trends or ‘seasonality’, educated predictions can be made about future energy consumption on a multitude of scales, from 1 hour ahead to 1 day ahead.
However, the difficulty lies in the nonlinearity and volatility of real-time energy usage, which is highly susceptible to changes in external factors. For instance, ambient temperature is known to significantly influence a building’s energy demand via heating and air-conditioning. Furthermore, there can be unexpected surges and drops in energy consumption due to equipment failure, supply failure, or simply random fluctuations that are difficult to explain.
Our task was to predict a building’s energy consumption 1 day ahead of time based on 2-year historical energy demand data provided in 15-minute intervals, from July 2014 to May 2016. In addition, we were given temperature data from 4 locations of varying (undisclosed) distances from the building, in the order wx1 (nearest), wx2, wx3 and wx4 (farthest). We used a conventional Artificial Neural Network as it is capable of capturing complex, non-linear relationships between diverse numerical data, and relatively fast to build and train, compared to more sophisticated architectures like Long Short-Term Memory networks (LSTMs).
We used two metrics to evaluate our model: Mean Squared Error (MSE) (noting that reducing MSE translates into reducing Root Mean Squared Error or RMSE) and Lag. Mean Squared Error measures the average of the squares of errors (actual values — predicted values):
As for our model’s loss function, we used Euclidean loss, mathematically analogous to MSE. The strength of MSE is that it punishes the model more for larger errors due to its squared nature, reducing our model’s likelihood of making extreme predictions which would be costly or even dangerous. Minimally, our model must achieve a lower MSE than persistence, a trivial benchmark forecast where the “predicted future value 1 day ahead = observed present value”. Persistence is a good benchmark to start with because of the highly periodic nature of energy consumption.
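As an illustrative sketch (not our exact code), the persistence benchmark takes only a few lines, since the forecast for T+96 is simply the value observed 96 timesteps (1 day) earlier:

```python
import numpy as np

# Persistence benchmark: predict that energy at T+96 equals energy at T:0.
# `energy` is assumed to be a 1-D array at 15-minute resolution.
def persistence_mse(energy: np.ndarray, horizon: int = 96) -> float:
    pred = energy[:-horizon]    # value observed one day earlier
    actual = energy[horizon:]   # value actually realised one day later
    return float(np.mean((actual - pred) ** 2))
```

Any model worth deploying must at least beat this trivial baseline.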
As for lag, our goal is a peak lag of 0 between our predictions and actual energy consumption values, where our model, on average, would not be delayed in its predictions and can capture rising/decreasing energy consumption on time.
Our workflow to this problem involved:
- Extensive literature review, data visualisation and analysis
- Pre-processing, data cleaning and feature engineering in Python
- Exporting the data into AutoCaffe to train and evaluate the neural network.
As the dataset given is anonymised with minimal context, we first scrutinised it to gain a comprehensive intuition for effective feature engineering. Arguably, this is the most important step of any machine learning project, and we spent close to an entire week (out of ~2 weeks) on this, as firm believers of ‘garbage in, garbage out’. We discovered that both the energy and temperature data contain a non-trivial amount of missing values, necessitating an effective method of filling them in. Further, wx4 has very sparse data (only containing data from 2016), so we were unlikely to make much use of it.
Firstly, we plotted the energy data in 2015, the year with the most complete data (unlike 2014 and 2016). Mean monthly values were superimposed to offer a clearer overview of trends across months.
As seen in the graph, temperature around the building ranges from sub-zero to 30 °C; given that the cold months are December to February and the warm months June to August, the building should be in the Northern Hemisphere at a latitude above 30°. Interestingly, two local maxima of energy consumption exist, occurring at the two tail ends of temperature: once during the coldest months, and again during the hottest month (July), suggesting that air-conditioning and heating are significant drivers of energy demand. Across the year, we identified 3 different energy-temperature regimes:
Winter, December to February: Frequent and large fluctuations in energy consumption, with relatively large mean energy consumption. Temperature is generally below 10 °C.
Summer, June to August: Frequent but smaller variations in energy consumption compared to winter. Energy consumption steadily increases with temperature. Temperature is generally above 20 °C.
Transition, March to May & September to November: Relatively constant and stable energy consumption pattern, with small fluctuations. Temperature ranges from 10 to 20 °C.
This analysis inspired two dummy variables (values either 1 for True or 0 for False): 1) is_season_winter & 2) is_season_transition to facilitate better learning by the neural network. Note that an is_season_summer column would have value 1 (True) exactly when both is_season_winter & is_season_transition are 0 (False); thus, we dropped the is_season_summer column to avoid the Dummy Variable Trap, where one variable can be straightforwardly inferred from one or more other variables, leading to multicollinearity issues.
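In pandas, these seasonal dummies take only a few lines (a hypothetical sketch; the DataFrame and date range are ours for illustration):

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex at 15-minute resolution.
df = pd.DataFrame(index=pd.date_range("2015-01-01", "2015-12-31 23:45", freq="15min"))

month = df.index.month
df["is_season_winter"] = month.isin([12, 1, 2]).astype(int)
df["is_season_transition"] = month.isin([3, 4, 5, 9, 10, 11]).astype(int)
# is_season_summer is deliberately omitted: it is implied whenever
# both columns above are 0, avoiding the Dummy Variable Trap.
```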
Moving on, we plotted the time series of energy consumption over the entire time frame available.
We realised that energy consumption for July-Oct 2014 was anomalously low. There could be a variety of reasons: the building could be newly built and slowly ramping up operations (hence not full load) or undergoing maintenance. While discarding data is normally discouraged, we decided to do so here as such anomalous data would hurt more than help our model which relies on historical data to make predictions. Thus, our energy data begins from 29 October 2014. With a train/test split of 70/30, our test data begins on 7 December 2015.
We then visualised energy consumption across different days of the week. To do so, we calculated the mean, max and min consumption for each day of the week for the entire year, excluding public holidays first.
We observed that, on the whole, energy consumption was significantly lower during the weekends, implying the building is likely an office building — busy on weekdays, empty on weekends, rather than a shopping mall or a library. To exploit this pattern, we created a dummy variable on whether the day being predicted for was a weekend, called is_weekend.
Next, we plotted the distribution of energy consumption for each month, categorised into weekdays, weekends and public holidays.
We were able to make two observations from the figure above. Firstly, energy consumption on weekdays was clearly higher than on weekends and public holidays in general. Secondly, while there are significant counts of anomalously high energy demand on weekends, the general distributions of energy consumption for weekends and public holidays are very similar. This implies that we should not expect weekdays that are also public holidays to have similar energy consumption patterns to other, normal weekdays. Hence, interpolating a weekday that is also a public holiday with other non-public-holiday weekday values would most likely overestimate the energy consumption. Instead, a value from the nearest previous weekend or public holiday would offer a more reliable proxy to fill the data.
Next, we scrutinized daily consumption patterns.
Generally, on weekdays, energy consumption picks up sharply at 7 am and drops off sharply after 6 pm, most likely the standard working hours of that building. Note that some of the plots look strangely shaped or have strange axis labels because of missing values, which further illustrates the need to fill these gaps. Zooming into a plot of average energy demand by hour for higher resolution, we found that on average, there is a noticeable drop in energy consumption around 12 pm, which we attribute to office lunchtime hours.
As such, we decided to introduce the dummy variables is_lunchtime (hour = 12 on weekdays that are not public holidays) and is_working_hours (between 7 am and 6 pm on weekdays that are not public holidays), to further assist the neural network in identifying recurring trends.
Moving on, we plotted an autocorrelation plot of energy consumption to identify cyclical patterns backed by statistical analysis rather than ‘eye-balling’.
As the data was given in 15-minute intervals, 24 hours apart corresponds to 96 timesteps, and 12 hours to 48 timesteps etc. Energy consumption for a particular hour each day was most strongly correlated to the same hour of the day before. This relationship weakens as the number of days increases but peaks again at 672 timesteps or 1 week apart, which in fact has stronger correlation than 1 day apart. On the other hand, autocorrelation was the weakest 12 hours apart. This hinted to us that strong predictive features may include T:-576 (6 days ago from current time, but 1 week ago from time being predicted for), T:0 (1 day ago from time being predicted) and T:-96 (2 days ago from time being predicted).
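This kind of check can be reproduced with pandas’ built-in `autocorr` — here a sketch on a synthetic series with made-up daily (96-step) and weekly (672-step) cycles, not our actual energy data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(96 * 60)  # 60 days of 15-minute steps

# Synthetic energy series: daily cycle + weaker weekly cycle + noise.
energy = pd.Series(
    np.sin(2 * np.pi * t / 96)
    + 0.5 * np.sin(2 * np.pi * t / 672)
    + 0.1 * rng.standard_normal(t.size)
)

for lag in (48, 96, 672):  # 12 h, 1 day, 1 week
    print(f"lag {lag:>3}: autocorr = {energy.autocorr(lag):+.3f}")
```

On such a series, the 1-week lag correlates more strongly than the 1-day lag, and the 12-hour lag is anti-correlated — the same shape we observed in our plot.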
Next, we delved into the relationship between energy consumption and temperature. We plotted energy consumption against temperature and attempted to fit a polynomial trend, inspired by Valor et al. 2001.
While the best fit lines are clearly not ideal, the scatterplots still revealed useful insights. At the tail ends of temperature (too hot or too cold), energy consumption tends to rise, most likely due to increased air conditioning or heating respectively. Moreover, the relationship is unlikely to be purely linear, showing hints of a quadratic one with a ‘most comfortable’ temperature at about 19 °C. To explore this further, we crafted a correlation heatmap using the Python Seaborn library, dividing the data into winter, summer and transition months.
Firstly, we observed that wx3 has noticeably higher absolute correlation values than wx1 and wx2 across the full period (0.29 vs 0.24), winter (0.051 vs 0.0079 & 0.016) and the transition months (0.17 vs 0.11). In summer, it was slightly lower (0.43 vs 0.46) than wx1 and wx2. This was also confirmed by preliminary investigations with feature importance values in XGBoost, which consistently ranked wx3 at T+96 higher than wx1 or wx2. Thus, we focused on creating windowed temperature features mostly off wx3.
Next, a quadratic relationship seemed to slightly outperform the linear one, with higher absolute correlation values in summer (for all 3 temperatures) and winter (for wx3). Therefore, on top of the raw energy values, the squared value of wx3 at T+96 (the time being predicted) might be a useful feature to consider.
Lastly, in winter, both raw and squared temperature have very poor correlation with energy. This might be due to greater and more extreme fluctuations in temperature, while the building’s heating systems may be running continuously with less ‘sensitivity’ to temperature.
We conducted data pre-processing in Python instead of AutoCaffe, mainly because our team is more proficient in various Python libraries than Smojo, and we also had greater control in creating more specific features like dummy variables. A general outline of the pre-processing pipeline involved aligning temperature data to 15-minute intervals, interpolation and normalisation.
There were 9238 missing energy values in our selected period (2014-10-29 00:00:00 to 2016-05-26 20:15:00), approximately 17% of the entire timeline. We first implemented linear interpolation, replacing missing values with the mean of the values just before and just after the gap. However, we realised two critical mistakes. First, and most importantly, this led to data leakage, as calculating the mean involves data from the future (after the missing timestamp), which is not available in the real world. Secondly, we observed that missing values usually occur in long stretches of up to two days, leaving complete blanks of 96, 192 or even more consecutive timestamps (24 h = 96 × 15 min). Linear interpolation therefore fails to capture the inherent seasonality of energy consumption driven by factors like temperature and working hours. We then tried simple forward filling, where missing data is replaced by the data exactly 24 hours before. However, this did not reflect the weekly trend well, because of the difference between weekdays and weekends/public holidays.
Therefore, we implemented a customised filling method that considers the type of day (public holiday/weekend/weekday). If the missing-value day is not a public holiday, the missing value is replaced with the value exactly one week before (provided that value is not missing too, which luckily never happens in this dataset), as our data analysis had suggested a strong correlation at a lag of one week. Otherwise, if the missing-value day is a public holiday, it is replaced with the nearest previous weekend value, as supported by our data analysis. For example, if the energy consumption value on Monday 11:00 am is missing and that Monday happened to be a public holiday, it would be replaced with the energy value of Sunday 11:00 am. All in all, we ensured that the type of day (weekend/public holiday/weekday) being brought forward is the same as that of the missing-value day, and that no NaN values are being forwarded.
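A sketch of this customised filling logic (hypothetical helper; it assumes the proxy value is itself present, as was luckily the case in our dataset):

```python
import pandas as pd

def fill_missing(energy: pd.Series, holidays: set) -> pd.Series:
    """Fill NaNs: public holidays take the nearest previous weekend/holiday
    value; all other days take the value exactly one week earlier."""
    filled = energy.copy()
    for ts in filled.index[filled.isna()]:
        if ts.date() in holidays:
            # Walk back day by day until we hit a weekend or another holiday.
            proxy = ts - pd.Timedelta(days=1)
            while not (proxy.weekday() >= 5 or proxy.date() in holidays):
                proxy -= pd.Timedelta(days=1)
        else:
            proxy = ts - pd.Timedelta(weeks=1)  # same day type, one week before
        filled.loc[ts] = filled.loc[proxy]      # assumes the proxy is not NaN
    return filled
```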
This method more accurately reflects the seasonality and is relatively easy to implement as opposed to more sophisticated methods. In fact, we did consider two improvements:
- Instead of exactly copying the energy values from the past, add a random ‘jitter’ to the values being brought forward, by multiplying all the energy values from that day by a small, random factor between, say, 0.8 and 1.2 that varies for each day. This reduces the chance of our model overfitting to the historical data and makes it more robust, as long as a suitable range for the random factor is chosen. It is important to note that this random jitter would only be applied to the training dataset, not the test set. However, as we did not face a massive overfitting problem, we did not implement this idea.
- Fit a neural network or a time-series forecasting algorithm that also considers temperature data to impute the missing values as it might give even more realistic results. However, we decided this overcomplicates the task at hand given the limited timeframe.
As for interpolating temperature where we have access to future data in the next 24 hours from weather forecasts, we chose linear interpolation for two reasons. Firstly, unlike energy consumption values, missing temperature values do not occur for a long period of time. Therefore, linearly interpolated data could still capture most of the trend. Secondly, as we are given future data, data leakage will not be an issue (as confirmed by Arnold). Interpolation was not done for wx4, due to the sparsity of the dataset.
After interpolating all the missing values, we normalized all the values in the energy and temperature datasets using “MinMax scaling”, which refers to scaling all the values in the data into the range [0,1]. This standardisation for all feature inputs is critical for neural networks to ensure that any differences in feature importance is solely due to the feature itself and not its numerical magnitude. We also took care to take the min/max values from the training data to prevent data leakage from the test set.
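The leakage-safe scaling can be sketched as follows (hypothetical helpers; the key point is that the min/max come from the training split only):

```python
import numpy as np

def minmax_fit(train: np.ndarray):
    # Statistics come from the training split ONLY, to avoid leakage.
    return train.min(axis=0), train.max(axis=0)

def minmax_transform(x: np.ndarray, lo, hi) -> np.ndarray:
    return (x - lo) / (hi - lo)

train = np.array([[10.0, 0.5], [30.0, 1.5], [20.0, 1.0]])
test = np.array([[40.0, 0.75]])

lo, hi = minmax_fit(train)
print(minmax_transform(train, lo, hi))  # all values land in [0, 1]
print(minmax_transform(test, lo, hi))   # test values may fall outside [0, 1]
```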
During our literature review, we discovered a creative time data manipulation method: Moon et al 2019 transformed calendar time into a 2d continuous format. While calendar data like month, day, hour and minute have periodic properties, they are represented as sequential data, which loses some of that periodicity. For example, 0000hrs follows right after 2359hrs, but this periodicity is not captured by the 1d representation of sequential calendar time at all. To reflect such periodic properties, Moon et al utilised the following equations (EoM = end of month, equivalent to the number of days in that month, e.g. February’s EoM in a non-leap year is 28):
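A sketch of this transformation as we understood it (hypothetical helper `encode_cyclic`; each calendar component is mapped to a point on the unit circle, with EoM as the period for the day-of-month component):

```python
import numpy as np
import pandas as pd

def encode_cyclic(values, period):
    """Map a periodic calendar value to (x, y) on the unit circle."""
    angle = 2 * np.pi * values / period
    return np.cos(angle), np.sin(angle)

ts = pd.Series(pd.to_datetime(["2015-12-31 23:45", "2016-01-01 00:00"]))

hourx, houry = encode_cyclic(ts.dt.hour + ts.dt.minute / 60, 24)
monthx, monthy = encode_cyclic(ts.dt.month - 1, 12)
# For dayx/dayy the period is EoM, the number of days in that month.
dayx, dayy = encode_cyclic(ts.dt.day - 1, ts.dt.days_in_month)
```

Note how 23:45 and 00:00, which are far apart in 1d sequential time, land right next to each other on the circle.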
Using the above equations, we observed that our test loss improved by 2.5% to 4% when we used the 2d representation of time, across multiple feature combinations.
Another interesting feature we wanted to explore was the concept of “public holiday/weekend inertia”, proposed by Valor et al. 2001. Their data analysis suggested that energy consumption in office buildings was systematically low on working days after weekends (i.e. Mondays) or after public holidays, because of inertia caused by the reduction in economic activity on the non-working day. Features to exploit this phenomenon could be ‘days since last public holiday’ and ‘days since last weekend’. However, careful analysis suggested that this “inertia” effect was not present in our dataset, and we did not pursue it further.
Our approach to generating the best possible model involved training the model in AutoCaffe and adding/dropping feature by feature based on the test score and lag achieved, a methodology akin to one-at-a-time sensitivity analysis. However, while AutoCaffe and the compute resources provided allowed for fast training, a limitation was that we could not ‘automate’ the permutation of features and had to do it manually. To minimise time spent tediously permuting features, we relied on our data analysis, domain knowledge, extensive feature engineering & XGBoost SHAP feature importance values to cut down the feature combinations to experiment with.
One big assumption made in adopting the one-at-a-time approach is that features are independent of one another, with minimal feature interaction. Given that this assumption is unlikely to always hold, we also ran certain combinations of features together based on our intuition and domain knowledge gained from the scientific literature. We were also careful to conduct sufficient repeats and consider the variance in final test losses due to the random Xavier initialisation of weights.
For preliminary investigations, we prepared a Pandas dataframe containing the raw values of energy, temperature (wx1 to wx3), datetime features (like month, day) and windowed features (like min, max, mean, range, first-order differences, mean of first-order differences, second order differences and so on).
Good windowing is crucial to help our ANN cluster the data better. Our choice of windows was guided by our understanding of the cyclical pattern of energy consumption:
- Small windows of 1 hour (e.g. T:0:-4 mean) to capture recent fluctuations in energy
- Slightly larger windows of 5 hours to capture larger changes throughout the day
- Larger windows of 12 hours to capture cyclic day & night patterns
- Largest windows of 1 day to 1 week to capture seasonal transitions
We considered using windows larger than 1 week, but did not ultimately do so, because it would shorten the already limited data we had, and from the autocorrelation plot, since energy values past 1 week ago are less strongly correlated with energy at T:0, we felt that they may only serve to introduce more noise into our data.
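Such windowed features are straightforward to generate with pandas rolling/shift operations — a sketch with hypothetical column names, using the window sizes above in 15-minute steps (1 h = 4, 12 h = 48, 1 day = 96, 1 week = 672):

```python
import pandas as pd

def window_features(energy: pd.Series) -> pd.DataFrame:
    """Windowed features over a 15-minute energy series."""
    feats = pd.DataFrame(index=energy.index)
    feats["mean_1h"] = energy.rolling(4).mean()    # recent fluctuations
    feats["diff_5h"] = energy - energy.shift(20)   # larger intraday swings
    feats["min_12h"] = energy.rolling(48).min()    # day/night cycle
    feats["mean_1d"] = energy.rolling(96).mean()   # daily pattern
    feats["max_1w"] = energy.rolling(672).max()    # seasonal transitions
    return feats
```

Each rolling window leaves NaNs at the start of the series, which is one reason larger windows shorten the usable data, as noted above.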
With the windowed features of energy and wx3 temperature, dummy variables and 2d time features, we conducted preliminary experiments on AutoCaffe to eliminate unhelpful features, such as ‘range’, ‘skew’ and ‘kurtosis’. We had hoped that ‘skew’ and ‘kurtosis’ could signal to our model the recent presence of extreme values (e.g. a short, sudden heatwave with higher than normal temperatures against a background of normal temperatures) that might increase its robustness in anticipating unexpected events. Unfortunately, these features did not improve our test loss and lag despite repeated experiments. Regarding wx4, we did try our best to utilise it, such as by having a ‘previous month’s average temperature’ calculated across all 4 sensors, but such features unfortunately did not improve our results.
After about 150 experiments, we generated a refined list of ~130 features.
At this stage, to provide rigorous justification to our feature selection process, we tapped on the Python XGBoost library, a fast and user-friendly implementation of the gradient-boosting decision trees algorithm. We chose decision trees as they are better at handling high-dimensional datasets (>100 columns of features) than deep neural networks, which are more prone to drawing poor decision boundaries due to the curse of dimensionality and unimportant inputs.
We fed the ~130 features into an XGBoost regressor model to predict the difference between T:0 and T+96 energy values (mimicking the ‘difference’ neural network in AutoCaffe). Interestingly, this model with the following rather standard hyperparameters achieved a test MSE of 0.010853 (after the factor of 0.5), which already beats the persistence value of 0.019377 by 44%, although we did not visualise its lag correlation or scatterplot.
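A minimal sketch of this ‘difference’ regression setup on synthetic data (scikit-learn’s GradientBoostingRegressor stands in here for xgboost.XGBRegressor, which exposes the same fit/predict interface; the features and hyperparameters are illustrative, not our exact ones):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))  # stand-in for the ~130 feature columns
# Target is the *difference* E(T+96) - E(T:0), here a noisy linear signal.
target_diff = 0.5 * X[:, 0] + 0.1 * rng.standard_normal(1000)

split = 700  # chronological split: no shuffling for time series
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X[:split], target_diff[:split])

pred_diff = model.predict(X[split:])
mse = 0.5 * np.mean((pred_diff - target_diff[split:]) ** 2)  # Euclidean-loss factor of 0.5
```

The final forecast is then the predicted difference added back to the energy at T:0.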
Using the Python SHAP library [4, 5], we could easily visualise the contribution of various features to the XGBoost model’s outputs. SHAP was chosen for its consistency and accuracy across models, which many feature importance schemes lack, including XGBoost’s/scikit-learn’s built-in versions. We ranked the top features out of the ~130 by SHAP values (higher = more important), and focused on permuting these top features during an additional round of experimentation on AutoCaffe. These top features are very likely to facilitate better clustering of the data, which should be transferable to neural networks. Of course, we understood that inherent differences exist between the algorithms of gradient-boosted trees and neural networks; the SHAP feature importance values are not the be-all and end-all, and we did include other features occasionally.
To clarify our syntax, Emean_diff96–96:-195 is the average (to reduce noise) of the following: 1) E:-96 minus E:-192 2) E:-97 minus E:-193 … 4) E:-99 minus E:-195. Similarly, Emean_diff48–0:-51 is the average of: 1) E:0 minus E:-48 2) E:-1 minus E:-49 … 4) E:-3 minus E:-51.
For 1week_meandiffdiff: we first calculate seven 1st-order differentials Emean_diff96–0:-99, Emean_diff96–96:-195, … Emean_diff96–576:-675. We then calculate the successive differences between these values (to get six 2nd-order differentials) and take their average to get a single value of 1week_meandiffdiff. We could equally have used the six 2nd-order differentials without averaging, but we wanted to minimise the number of features and avoid introducing too many ‘unimportant inputs’.
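In code, our syntax translates roughly as follows (hypothetical helpers; `week_meandiffdiff` corresponds to 1week_meandiffdiff, renamed since Python identifiers cannot start with a digit):

```python
import numpy as np
import pandas as pd

def mean_diff96(E: pd.Series, offset: int) -> float:
    """Emean_diff96 at `offset` from T:0: mean of E[T:offset-k] - E[T:offset-96-k], k = 0..3."""
    now = len(E) - 1  # positional index of T:0, the latest observation
    vals = [E.iloc[now + offset - k] - E.iloc[now + offset - 96 - k] for k in range(4)]
    return float(np.mean(vals))

def week_meandiffdiff(E: pd.Series) -> float:
    """Mean of the six 2nd-order differentials of Emean_diff96 at offsets 0, -96, ..., -576."""
    firsts = [mean_diff96(E, -96 * d) for d in range(7)]  # seven 1st-order differentials
    return float(np.mean(np.diff(firsts)))
```

For example, `mean_diff96(E, -96)` is Emean_diff96–96:-195, the average of E:-96 − E:-192 through E:-99 − E:-195.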
Our best neural network achieved a test loss of 0.00907207 (with 50 repeats) and test lag of 0, beating our persistence of 0.01937720 by 53%.
The 36 features we used for our difference network were (many of them from figure 14):
1. Time features:
- 2d time: hourx, houry, dayx, dayy, monthx, monthy
- Day of the week (0 = Monday, 6 = Sunday, scaled by MinMax to [0,1])
2. Dummy variables (as inspired by our data analysis):
- is_pubhol, is_weekday, is_working_hour, is_season_transition, is_season_winter, is_lunchtime
- Note that we dropped is_weekend & is_season_summer to avoid multicollinearity problems, where, across the same row, the values of one column in our data can be inferred from the values of other columns
3. Energy features:
- Energy at T:-1, first-order difference between T:0 & T:-1, min & mean of T:0:-4, mean of first-order differences from T:0:-4 (1 hour window)
- First-order difference between T:0 and T:-20 (5 hour window)
- Min of T:0:-48, first-order difference between T:0 & T:-48 (12 hour window)
- Mean of T:0:-96 (1 day window)
- First-order difference between T:-96 & T:-192 (1 day window)
- Energy at T:-576, max of T:0:-576 (1 week window from T+96)
4. Temperature features:
- wx3 at T+96, max & min of T+96:+92, first-order difference between T+96 & T+95 (1 hour window)
- Min, max & mean of T+96:0 (1 day window)
- Mean & max of T+96:-576 (1 week window)
5. Nonlinear features:
- Square of wx3 at T+96
- Product of energy at T:0 & wx3 at T+96
After seeing improvements from using the square of wx3 at T+96, we experimented with a variety of other nonlinear features: the cube of wx3 at T+96, ln(wx3 at T+96 in Kelvin), the square root of wx3 at T+96, etc. The idea for the product of energy at T:0 & wx3 at T+96 was a natural extension from thinking about the quadratic expansion (a+b)² = a² + 2ab + b².
Our best model:
- A simple difference network that predicts the difference in energy consumption between T+96 and T:0, with the model’s prediction added to the energy at T:0
- 4 layers with 32 perceptrons in the first layer with layer-by-layer shrinking ratio of 2/3 (i.e. 32, 21, 14, 9 perceptrons)
- Dropout probability of 0, tanh activation, 10,000 iterations with early stopping & Adam optimiser
For most experiments, the number of layers was kept at 3 or 4, screening first-layer perceptron counts of 32, 64, 128 and occasionally 256, with ReLU activation for fast training. We quickly found that perceptron counts of 32 to 64 gave the best results, and that tanh activation enabled superior test loss, albeit at the cost of slower training. The intuition behind why tanh gives better results than ReLU might be its additional non-linearity, which could be crucial for mapping complex relationships in energy/temperature data. tanh also avoids some issues faced by ReLU, like the ‘dying ReLU’ problem, where a neuron whose pre-activation stays negative outputs zero, receives no gradient, and may never recover.
2 layers were insufficient for optimal learning, while 5 or 6 layers quickly overfitted. Control experiments with the SGD optimiser gave dismal results. Trials with square perceptrons, autoencoders/scaling and force/momentum losses did not improve our results. Still, we think there remains an opportunity to harness autoencoders on more columns of windowed energy/wx3 values to reduce dimensionality whilst extracting the most useful information from historical data.
- Comfortable gap between the main lag peak at T=0 and secondary lag peak at T=+96. Zero lag means no time delay in our forecasting, which provides building managers with a reliable, on-time ‘benchmark’ to compare their building’s actual energy consumption where any jarring differences may indicate, for example, otherwise unnoticed faults in the building’s heating/cooling systems.
- Fairly good predictions during the early winter and the spring months. Our model is able to predict the recurring weekly trends well, anticipating the rise during weekdays/daytime and dips during weekends/night-time.
- Overall, minimal jarringly anomalous predictions throughout the test period.
- During the late winter-early spring transition period, our model struggles to predict extreme energy consumption values, as shown in the green boxes on the test prediction graph and reflected by the two green ovals in the test scatterplot. That being said, there are ~16,000 points in the test dataset, and most of them lie close to the expected 1:1 relationship between prediction and actual. The anomalies constitute a relatively small percentage of the total.
- Unfortunately, despite our best attempts, we were unable to further lower our test loss and overcome this issue. This period is characterised by large fluctuations in temperature and energy consumption, which may be contributing to our model’s difficulties.
Below are zoomed in graphs of predictions vs actuals, which again highlight the strength of our model in closely predicting the seasonal patterns in early winter and the spring months (green boxes), and its weakness in failing to predict some extreme values during February to March 2016 (red boxes). We note that in Dec-Jan & April-May particularly, the green boxes showed a good fit between actual & prediction graphs.
After scrutinising these graphs, we thought that perhaps our model may not always be at fault. It may well be possible that the peaks/dips our model failed to predict accurately were genuine outliers of unexpected energy consumption behaviour. This makes some sense if we closely re-examine the red boxes in Fig 16c: the first red box shows sustained high energy consumption over a weekend, which is unusual. Furthermore, the second peak just after the second box falls on a Saturday, yet another weekend. Perhaps certain overtime weekend operations were running in those late winter months that are not reflected in our training data.
That being said, we recognise that the second red box spans Wednesday to Friday, which are weekdays, and here our model did fail to match the actual peaks and underpredicted. This warrants further analysis; perhaps there was a deviation in the correlation between wx3 temperature and actual energy. From our research, weather factors other than temperature, such as wind speed and humidity, affect the ‘feels-like’ temperature and thus the use of heating/cooling. Having more data might help us judge whether the actual trends were ‘forecast-able’ or truly outliers.
The SHAP feature values are an extremely rich source of information about the multiple relationships that exist in our dataset. In contrast, it would be more difficult to delve into the ‘blackbox’ of neural networks to understand how weights and biases at each layer tell a story about different features. After all, it would be more beneficial for the energy forecasting community if we could also gain insights into how certain features shape the model’s output, rather than blindly hunting for the lowest test loss, where the resultant model may not be transferable to different contexts. Therefore, we made it a point to visualise the SHAP feature importance graphs again on a ‘difference’ XGBoost regressor fitted just on our best 36 features.
For working hours, it is no surprise that a clear separation exists, where a high value (i.e. =1) at the time being predicted for tends to increase energy consumption, while a low value (i.e. =0) decreases it. Still, the presence of a range of predictions hints that is_working_hours is interacting with other features. Similarly, for dayofweek (Monday = 0 to Sunday = 6 minmax scaled to [0,1]), high values (Saturday & Sunday) tend to drive down the model’s output, while lower values increase it.
More interestingly, the graph suggests that high values of energy consumption 1 day and 15 minutes ago (ET:-1) are more likely to result in decreased energy consumption right now, with low values having the opposite, though much weaker, effect. Additionally, for high values of ET:-1, we see hints of feature interaction from the variance in model output, which ranges from -0.25 to ~0. Also, high maximum energy consumption over the past week from T+96 (Emax_0to576) tends to slightly increase the model’s output, while low values have minimal effect. As for temperature, the graph implies that low values of the moving weekly average temperature (wx3mean_-0to-672) can both increase and decrease the model’s output, while high values appear to have no effect. The reasons for these are not immediately clear, and follow-up studies can be conducted to examine them further. That being said, we must note that while SHAP values can indicate high correlation between features & model output, they do not imply causation.
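For readers unfamiliar with these feature names: at 15-minute resolution there are 96 steps per day and 672 per week, so a lag/rolling-window sketch of the features above might look like the following (the synthetic data and exact window offsets are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic 15-min series standing in for energy (E) and temperature (wx3)
idx = pd.date_range("2021-01-01", periods=96 * 14, freq="15min")  # two weeks
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "E": rng.uniform(50, 150, len(idx)),
    "wx3": rng.uniform(-5, 15, len(idx)),
}, index=idx)

# ET:-1 -- energy 1 day and 15 minutes ago: 96 + 1 = 97 steps back
df["ET_-1"] = df["E"].shift(97)

# wx3mean_-0to-672 -- moving weekly (672-step) average temperature
df["wx3mean_-0to-672"] = df["wx3"].rolling(672).mean()

# Emax_0to576 -- approximated here as a trailing weekly maximum of energy
# relative to the T+96 prediction time (hence 672 - 96 = 576 steps);
# the exact offsets in our pipeline may differ.
df["Emax_0to576"] = df["E"].rolling(576).max()
```

Note that the rolling features are NaN until a full window of history has accumulated, which is one reason lagged features eat into the usable length of a dataset.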
All in all, we found it extremely ‘cool’ to visually represent the XGBoost trees’ decision-making process and better understand how certain features influence the model’s prediction.
In conclusion, we have built a relatively accurate neural network model to predict energy consumption for a building 1 day ahead, trained on about a year of historical energy data as well as temperature forecasts up to a day ahead.
The strengths of ANNs include the sheer performance bump they offer over other machine learning methods, especially on conventionally difficult problems. The trade-off for this performance bump is that large amounts of data are required (the curse of dimensionality), so computational costs can become heavy in both time and money. The inner workings of how an ANN learns also remain a “black box”, creating a significant need for manual, evidence-based and statistical feature selection to explain the resulting model; hence our time was most heavily invested in data analysis, feature engineering and selection.
It must be noted that the reported test loss of this project is effectively a “validation” loss, since we used test loss values to choose feature combinations and improve our model. Given the limited data we had, we did not have the privilege of a separate validation set and relied on the test loss as a proxy. Having conducted a few hundred experiments, we may have subtly overfitted to this test set. It would therefore be ideal to re-evaluate our model on unseen energy data from the same building for a more unbiased estimate of its predictive power.
While we hope that our findings are applicable to other contexts, the type of building and its climate should always be considered. The same trends may not apply for a residential building or a shopping mall, or an office building located in an equatorial climate like Singapore’s, which lacks distinct seasons. We should also be mindful of climate change, which may lead to temperatures (and patterns of temperature change) rarely seen historically, and may pose a problem for models that heavily rely on historical data.
Overall, we have truly learnt a tonne from this end-to-end experience, from exercising our object-oriented Python programming skills in building the pre-processing, feature engineering and windowing pipelines, sharpening our data and statistical intuition with extensive visualisations and literature reviews, to understanding the caveats behind different machine learning approaches and making cautious decisions based on algorithms’ results. We brainstormed so many ideas during the competition (including having separate models for winter, transition season and summer) but had only so much time (and data) to try them all.
To end off, we thank ai4impact and NTU CAO for organising such a fun and valuable opportunity!
1. Valor, E., Meneu, V., & Caselles, V. (2001). Daily air temperature and electricity load in Spain. Journal of Applied Meteorology, 40(8), 1413–1421. https://doi.org/10.1175/1520-0450(2001)040<1413:DATAEL>2.0.CO;2
2. Moon, J., Park, S., Rho, S., & Hwang, E. (2019). A comparative analysis of artificial neural network architectures for building energy consumption forecasting. International Journal of Distributed Sensor Networks, 15(9), 1550147719877616. https://doi.org/10.1177/1550147719877616
4. Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., … & Lee, S. I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9
5. Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., … & Lee, S. I. (2018). Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10), 749–760. https://doi.org/10.1038/s41551-018-0304-0