Flood Prediction Using Artificial Neural Networks: A Case Study in Temerloh, Pahang

Floods are natural disasters that damage property and sometimes cause loss of life. Floods occur in Malaysia every year, especially on the East Coast of Peninsular Malaysia, where the Northeast Monsoon and climate change can bring heavy rainfall towards the end of the year. Temerloh is one of the districts in Pahang that frequently experiences floods, especially between November and January. Despite multiple flood mitigation and preparation efforts, floods still cause damage to residents and property that costs thousands of ringgit every year, in addition to the time taken to clean up afterwards. Research on flood prediction in the state using machine learning techniques is therefore still needed. This research explored the hydrological and meteorological factors that cause floods in Temerloh and developed a machine learning model capable of predicting flood occurrence. The study used hydrological data from the National Hydrological Network Management System (SPRHiN) and meteorological data for the location from Weather Underground. The correlation analysis found that stream flow and water level are highly correlated with floods, with correlation coefficients (r values) of 0.83 and 0.76, respectively, while temperature is inversely related to floods, with a correlation value of -0.28. Lower temperatures are associated with a higher chance of rain and subsequent flooding. The results show that the artificial neural network (ANN) model achieved an accuracy of 0.9909 and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.888. The model also shows a low error, with a mean squared error (MSE) of 0.009 and a root-mean-squared error (RMSE) of 0.096. The R² value of 0.768 indicates that the model explains 76.8% of the observed variation, and the F1 value of 0.875 indicates strong precision and recall. Afterwards, a flood monitoring dashboard was created to visualize the data interactively. This research is vital in understanding the flood factors in Pahang and offers academic insight for future flood research. In addition, the flood monitoring dashboard will help governments and authorities focus flood management efforts on areas at high risk of flooding and can be used to aid the state's future development.


Introduction
Flooding is a natural disaster that has been a problem in various parts of the world. A flood can be defined as the submergence of normally dry land by overflowing water due to hydrological and meteorological conditions.
Malaysia is no exception to this problem, largely because it has high precipitation throughout the year, receiving 3297.34 mm of rain in 2021 (Trading Economics, n.d.). Across the East Coast of Peninsular Malaysia, the heaviest rainfall occurs during the Northeast Monsoon season, from November until January. The 2021 flood in the state of Pahang affected 63,394 people from 17,581 families, with 3,500 houses lost as well as casualties (Department of Statistics Malaysia, 2021). This is the worst flood to have hit the state in its history. In addition, businesses were left crippled, at a cost of millions of ringgit. Recovery to a normal state costs a further thousand hours of manpower as well as money. This is not the first time Pahang has faced a flood, but recent years have seen more frequent occurrences and more disastrous impacts. Thus, there is a need to assess the factors that contribute most to these events.
Numerous factors can contribute to flooding in an area. One of them is the terrain, which affects the direction and rate of surface runoff. The possibility of flooding increases when temperatures rise and elevate rainfall (Ramayanti et al., 2022). In addition, population density, land use, geographical location, and geological conditions contribute to flooding in an area (Ighile et al., 2022). Elevation plays a vital role as well, according to Al-Areeq et al. (2022). Despite multiple efforts in flood mitigation, they are not enough to prevent recurring events. Therefore, a new approach is needed to reduce flood severity and enhance flood preparedness.
Traditionally, flood prediction is done using a hydrological rainfall and runoff model. However, this modelling is not very efficient, as it requires precise topographic data and rainfall records collected over a certain period. Recent developments in technology have introduced a few techniques that improve flood prediction. One of them is the physically based model, which is highly effective in simulating multiple possible flood scenarios, but its need for data collected over an extended period and its complex prediction technique mean that many do not prefer it. Thus, researchers have turned to machine learning to improve these efforts and avoid major losses due to floods.
One of the machine learning methods that has been used in modelling floods is the artificial neural network (ANN), as applied by Kia et al. (2010) in the Johor River Basin. With the help of a geographic information system (GIS), that research was able to construct a flood map of the area, with a satisfactory match between the predicted and the actual records. Other machine learning techniques that can be used are logistic regression and support vector machine (SVM). However, modelling using these two methods does not produce results as good as ANN (Kanwar, 2022). Producing a reliable and accurate flood prediction model is important not only for flood prevention but also for preparing for and protecting against the worst outcome. Besides modelling, it is important to investigate the relationship between the variables through correlation analysis in order to know which factors have a significant impact on floods.
Lastly, flood monitoring is easier when the data are visualized on a dashboard, so that government agencies can make decisions faster.
The research aims to acquire, through the National Hydrological Network Management System (SPRHiN), a reliable dataset that gives a better understanding of the factors that affect floods in Temerloh. In addition, the research aims to develop an accurate flood prediction model using artificial neural networks and subsequently to produce a user-friendly, interactive dashboard in Microsoft Power BI that can visualize and analyse the available dataset. The study focuses on physical factors that strongly affect floods, including rainfall, water level, streamflow, and temperature, and the modelling is done using an artificial neural network (ANN). The research is significant because it offers an opportunity to understand the factors that affect floods in Temerloh, Pahang. The research will also benefit the state government and locals in the area, who can take precautions before a flood occurs. The study can also be used as a guide by other parties in planning the development of an area and as part of flood mitigation efforts. In addition, a reliable Power BI dashboard could provide insights for future studies on flood factors and flood risk in other areas of Pahang. Lastly, the study will benefit academicians as a reference on recent technology for predicting floods.

Literature Review
Flooding is a disaster that can significantly affect human beings. There are four categories of floods: flash floods, urban floods, river floods, and coastal floods. In Malaysia, the most common floods are flash floods and monsoon floods.
Several factors can contribute to floods, including slope, altitude, and topography. Besides, floods in Malaysia are caused mainly by prolonged heavy rain and poor urbanisation planning. To reduce the impact of floods on society and property, a functioning flood management system needs to be established. There are four stages of flood management: flood prevention, preparedness, response, and recovery. In assisting flood management, a few technologies are beneficial and efficient, such as the mobile phone short message system (SMS), information and communication technology (ICT), and geographic information systems (GIS).
Machine learning techniques and data mining have been used to produce accurate and reliable flood predictions in preparation for floods. To produce accurate flood predictions using machine learning, only the significant factors should be selected. From the literature review, it is observed that there are gaps in flood analysis and Flood Susceptibility Maps (FSM) for Temerloh, Pahang. Therefore, this paper addresses the gap by conducting a detailed analysis of the area. After reviewing multiple research projects, 8 relevant papers were selected for the study. From this previous research, the best machine learning technique and the relevant flood factors can be identified and used in this research.
Key findings reported in the selected papers include the following. Floods are highly likely to occur in convex and urban areas with lower elevation and a low slope angle. The frequency ratio (FR) method (0.8833) has better accuracy than the combined FR and analytic hierarchy process (FR-AHP) method (0.8562). The most critical factors are distance to river, the modified normalized difference water index (MNDWI), the topographic wetness index (TWI), and land use and land cover (LULC). The LR model produced 0.84 accuracy, 0.91 precision, 0.72 recall, and a 0.80 F1-score.

Material and Methods
To achieve all three objectives of this research, a sound research procedure needs to be established. Figure 1 shows the operational process flow for the research. Collecting data is the first step in the research procedure. It is extremely important to make sure that the data obtained are relevant and appropriate for the research and have high integrity in order to achieve the research objectives. After pre-processing and model development, the model is evaluated using a confusion matrix, the area under the ROC curve (AUC), the mean squared error (MSE), and the root-mean-squared error (RMSE). Finally, an interactive Flood Monitoring Dashboard is generated for data exploration and visualization.

Data Preprocessing
For this research, four types of data (rainfall, streamflow, water level, and temperature) were acquired from two different sources. Rainfall, streamflow, and water level data were requested from the National Hydrological Network Management System (SPRHiN), and the temperature data was extracted from the Weather Underground (wunderground.com) website. Three data pre-processing methods were applied in this research. Firstly, data transformation was done to change the value, format, or structure of the data into a more meaningful and useful form, which includes encoding string data and converting temperatures from Fahrenheit to Celsius. Secondly, data integration was done manually to combine all 10 datasets into one, which eases the process of understanding and evaluating the data. Thirdly, data cleaning was done to deal with missing, incorrect, duplicated, irrelevant, and improperly formatted data. This includes linear interpolation, which estimates a missing value at an intermediate point from the known values on either side of it.
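The transformation and cleaning steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedure actually used in the study; the function names and the sample water-level readings are hypothetical.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def interpolate_missing(values):
    """Fill interior gaps (None) by linear interpolation between known neighbours.

    Leading or trailing gaps have no known value on one side and stay as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant increment per position between two known readings.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical daily water-level readings (metres) with two missing days.
water_level = [28.0, None, None, 31.0, 30.5]
print(interpolate_missing(water_level))
print(fahrenheit_to_celsius(86.0))
```

Gaps at the very start or end of a series are deliberately left unfilled here, since linear interpolation needs a known value on both sides; such cases would need a separate rule, such as null value replacement.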

Model Development
Neural networks are a family of machine learning models, loosely inspired by how the human brain works, and they form the basis of deep learning. An artificial neural network (ANN) simulates how input data is transferred and processed to reach a conclusion at the output. The network works by determining the underlying pattern of the data and subsequently learning to improve the model. For this research, activation functions are applied at the hidden layer and the output layer to pick out the important data, suppress irrelevant information, and pass only relevant information to the next layer. The learning rate is another setting applied to the model; it controls how strongly the weights are adjusted at each training step as the model minimizes the difference between the predicted and actual output.
The performance of the machine learning model developed in this study was evaluated through four evaluations: the confusion matrix, the area under the Receiver Operating Characteristics (ROC) curve (AUC), the mean squared error (MSE), and the root-mean-squared error (RMSE). The performance score needs to be good on four criteria: accuracy, recall, precision, and F1-score. Accuracy is the number of correct predictions compared to the total number of predictions. Recall, or sensitivity, is the model's ability to find all the relevant cases in the dataset, and precision is the proportion of the cases the model flags as relevant that are actually relevant. The F1-score is the harmonic mean of precision and recall.
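To make the roles of the activation functions and the learning rate concrete, the sketch below implements a very small network with one hidden layer. It is an illustration under stated assumptions rather than the model used in this study: the choice of ReLU and sigmoid activations, the omission of bias terms, the cross-entropy loss, the layer sizes, the weights, and the scaled input values are all hypothetical.

```python
import math


def relu(x):
    """Hidden-layer activation: passes positive signals, suppresses negative ones."""
    return max(0.0, x)


def sigmoid(x):
    """Output activation: squashes a score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass: inputs -> ReLU hidden layer -> sigmoid flood probability."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    prob = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, prob


def loss(x, y, w_hidden, w_out):
    """Binary cross-entropy between the label y (0 or 1) and the prediction."""
    _, prob = forward(x, w_hidden, w_out)
    return -(y * math.log(prob) + (1 - y) * math.log(1 - prob))


def train_step(x, y, w_hidden, w_out, learning_rate):
    """One gradient-descent update, in place, for a single training sample."""
    hidden, prob = forward(x, w_hidden, w_out)
    delta_out = prob - y  # gradient of the loss w.r.t. the output pre-activation
    # Update hidden weights first, while w_out still holds its old values.
    for j, row in enumerate(w_hidden):
        if hidden[j] > 0.0:  # ReLU passes gradient only when the unit is active
            delta_hidden = delta_out * w_out[j]
            for i in range(len(row)):
                row[i] -= learning_rate * delta_hidden * x[i]
    for j in range(len(w_out)):
        w_out[j] -= learning_rate * delta_out * hidden[j]


# Hypothetical scaled inputs: rainfall, water level, streamflow, temperature.
sample = [0.6, 0.9, 0.8, 0.2]
label = 1  # 1 = flood, 0 = no flood
w_hidden = [[0.2, 0.4, 0.1, -0.3], [-0.1, 0.3, 0.5, 0.2], [0.3, -0.2, 0.2, 0.1]]
w_out = [0.1, -0.2, 0.3]

print("loss before:", loss(sample, label, w_hidden, w_out))
for _ in range(20):
    train_step(sample, label, w_hidden, w_out, learning_rate=0.1)
print("loss after: ", loss(sample, label, w_hidden, w_out))
```

A larger learning rate makes each update bigger and training faster but less stable, while a smaller one is steadier but slower, which is why it is usually tuned alongside the network size.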

Results and Discussion
Since the research focuses on the recent major flood that hit Pahang in 2021-2022, the data for all four attributes was taken for the period from 1 January 2021 to 31 December 2022. There are 10 separate datasets, each with 13 columns and 32 rows, in which the data is recorded by day and grouped by month. These data are pre-processed through data transformation, data integration, and data cleaning. From the output, it is noticed that there are 15 missing data points in the "Rainfall" column, 34 in the "Water Level" column, and 14 in each of the "Stream Flow," "Weather," and "Flood" columns. Irrelevant rows are removed, and individual missing data points are replaced with new values through linear interpolation and null value replacement.
To find the strongest factors that contribute to floods, correlation analysis is done to evaluate the relationship between the factors and flood occurrences, as observed in Figure 2. Stream flow and water level have a very strong relationship with floods, with values of 0.83 and 0.76, respectively. However, weather (temperature) has an inverse relationship with floods. This is understandable, as lower temperatures tend to coincide with rainy days.
The training and validation loss calculation can be observed in Figure 3, which indicates that the model is fitting well to the training data.
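The correlation values quoted above are Pearson correlation coefficients, which range from -1 (perfect inverse relationship) to 1 (perfect direct relationship). The sketch below shows how such a coefficient can be computed with the Python standard library; the short series are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical daily values; flood is encoded as 1 (flood) or 0 (no flood).
stream_flow = [120.0, 150.0, 400.0, 900.0, 1100.0, 300.0]
temperature = [31.0, 30.5, 27.0, 25.5, 25.0, 29.0]
flood = [0, 0, 0, 1, 1, 0]

print("stream flow vs flood:", pearson_r(stream_flow, flood))
print("temperature vs flood:", pearson_r(temperature, flood))
```

A positive coefficient means the factor tends to rise on flood days, and a negative one means it tends to fall, which is the pattern reported for temperature.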
The final step of the machine learning modelling is the prediction. To evaluate the performance of the model, the result is validated through a few evaluations. The first evaluation is a confusion matrix, which evaluates the accuracy of the predictions. From Figure 4, most of the predictions are accurate, with 210 "No Flood" and 4 "Flood" data points correctly predicted. There are only 5 instances where "Flood" data are predicted as "No Flood," and no instances where "No Flood" data are predicted as "Flood." The accuracy of the model is calculated to be 0.9909, which is very high. Lastly, the error evaluation is done through the mean absolute error (MAE), MSE, and RMSE, which have values of 0.009, 0.009, and 0.096, respectively, indicating that the error in the prediction is very low. The R² value of 0.768 indicates that 76.8% of the observed variation can be explained by the model's inputs. In addition, the F1 value of 0.875 shows that the prediction has strong precision and recall.
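The evaluation measures named above all follow from simple formulas. The sketch below shows one way to compute them with the Python standard library; the confusion-matrix counts and the label lists are hypothetical and chosen for readability, so the printed values are not the scores reported in this study.

```python
import math


def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fn: actual floods predicted as flood / as no flood.
    fp/tn: actual non-floods predicted as flood / as no flood.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def mse_rmse(y_true, y_pred):
    """Mean squared error and its square root for paired observations."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse, math.sqrt(mse)


# Hypothetical counts for a test set of 100 days.
print(classification_scores(tp=8, fp=2, fn=2, tn=88))

# Hypothetical true and predicted labels (1 = flood, 0 = no flood).
print(mse_rmse([0, 1, 1, 0], [0, 1, 0, 0]))
```

Because flood days are rare, accuracy alone can look high even when many floods are missed, which is why recall, precision, and the F1-score are reported alongside it.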
To gain more insight into the data obtained, a Flood Monitoring Dashboard is created for data exploration and visualization.
The dashboard is first configured in Power BI Desktop before it is published to Power BI online. The interactive dashboard consists of four visualizations, one map and three graphs, which can be filtered by year, quarter, month, and day using the slicer.

Conclusion
This research developed a machine learning model using an artificial neural network (ANN) approach that achieved an accuracy of 0.9909. A confusion matrix and the area under the Receiver Operating Characteristics curve (AUC) were produced to validate the accuracy result. The evaluation also found that the prediction has a very low error, with an MSE of 0.009 and an RMSE of 0.096. The R² value of 0.768 indicates that 76.8% of the observed variation is explained by the model's inputs, and the F1 value of 0.875 indicates that the prediction has strong precision and recall. The study was also able to determine the factors that contribute most to floods through correlation analysis, which shows that floods are strongly related to stream flow (0.83) and water level (0.76). Rainfall has a weak relationship with floods (0.12), while temperature has an inverse relationship with floods, which indicates that the lower the temperature, the higher the chance of a flood. A Flood Monitoring Dashboard has also been created for data exploration and visualization.
The research is important in enabling the authorities to take action in areas that are highly prone to flooding, and it can be used as a guide by other parties in future development planning. The machine learning modelling in this research is expected to assist academicians in future studies on flooding in Pahang and worldwide. It is recommended that the research be expanded to other districts and states in Malaysia in order to produce a nationwide Flood Susceptibility Map (FSM).
Limitations of time and data availability have restricted the research scope, but the study acts as a first step towards wider flood mapping. Evaluating multiple models would allow their results to be compared. A wider range of data could be used to train and test the model, which would increase accuracy but take longer to execute. In addition, it is recommended to revisit the modelling every year, as the condition of the location will change over time.

Figure 3. Learning curve of the ANN model fitted to the training data