Weight Prediction for Fishes in Setiu Wetland, Terengganu, using Machine Learning Regression Model

Predicting fish weight has several essential implications in ecology, such as population assessment, trophic interactions within ecosystems, biodiversity studies of fish communities, ecosystem modelling, habitat evaluation for different fish species, climate change research, and support for fisheries management practices. The objective of this study is to analyse the prediction performance of machine learning (ML) regression models by applying different statistical analysis techniques. This study collected biometric measurements (total length and body weight) for 19 fish families from three locations in Setiu Wetland, Terengganu, captured between 2011 and 2012. The study adopts two regression types: Linear Regression (i.e., Multiple Linear, Lasso, and Ridge models) and Tree-based Regression (i.e., Decision Tree, Random Forest, and XGBoost models). Mean absolute error (MAE), root-mean-square error (RMSE), and the coefficient of determination (R²) were used to evaluate performance. The results showed that the proposed ML regression models successfully predicted fish weight in Setiu Wetland, and the Tree-based Regression models provide more accurate predictions than the Linear Regression models. Random Forest is the best predictive model of the six proposed ML regressions, with the highest accuracy at 96.1% and the lowest RMSE and MAE scores at 3.352 and 0.880, respectively.


Introduction
Predicting fish weight is crucial in ecology because it helps estimate fish population size and distribution, which is vital for monitoring and managing fish. This data is essential for assessing population changes and health [1,2], understanding ecosystem trophic interactions, energy flow, and food web structure [3], and conducting biodiversity studies to assess species diversity and abundance [4]. Additionally, fish weight data is vital for ecosystem modelling [3] and for addressing factors such as fishing and climate change's impact on fish populations and habitats, as shifts in fish weight can signal changes in growth rates, productivity, and species composition [5]. Ultimately, it contributes to the sustainable management of fisheries by informing catch limits, effective fishing regulations, and conservation strategies, thus ensuring the long-term viability of fish stocks [6].
Typically, fish weight is measured manually on a per-individual basis. Nevertheless, this procedure is challenging, time-intensive, and stressful for the fish. Weight measurements on fresh or frozen fish can vary significantly due to factors such as fish wetness, environmental conditions, and scale capacity. Multiple scales of different sizes should be used to ensure an accuracy of ±1%. Frozen fish are lighter and shorter than fresh ones [7]. Hence, the development of rapid, precise, cost-effective, and indirect measuring techniques would be significant to the field of ecology, and one such method is machine learning.
Machine learning (ML) refers to the scientific investigation of algorithms and statistical models employed by computer systems to execute a designated task based on an example dataset, without requiring explicit programming [8]. Regression is a form of supervised learning that predicts a dependent variable by estimating its relationship with independent variables [9]. Linear Regression is a widely utilised learning technique owing to its fundamental nature and frequent application in predictive analysis [10]. Multiple Linear Regression indicates that there is more than one independent variable. The Least Absolute Shrinkage and Selection Operator (Lasso) combines variable selection and regularisation to enhance prediction accuracy by reducing the number of variables, a process known as variable shrinkage, within a statistical model [11]. Ridge Regression mitigates the problem of multicollinearity by incorporating a regularisation term into Linear Regression models, particularly in cases where the independent variables exhibit significant correlation [11].
Tree-based Regression models utilise one or more explanatory variables to explain variation in a numeric response variable and systematically divide the training data into progressively smaller groups, each characterised by the mean value of the response variable, the group size, and the defining values of the explanatory variables [12]. The Decision Tree algorithm constructs a tree-based model that utilises nodes to represent tests on input data, branches for possible outcomes, and leaf nodes for final predictions, with the ability to select the best features for data splitting at each node and options for pruning to prevent overfitting [13]. Random Forest combines the predictions of multiple Decision Trees that create a "forest"; each tree is trained on a random subset of the data and features, and their predictions are then aggregated [14]. Extreme Gradient Boosting (XGBoost) combines the predictions of multiple Decision Tree models and incorporates techniques such as gradient boosting, regularisation, handling of missing data, and parallel processing for efficient training [15]. Existing studies applying ML regression models to predict fish weight include: Linear Regression, Random Forest Regression, and Support Vector Regression to estimate the weight of Tilapia with a coefficient of determination (R²) of 0.70 [16]; weight prediction of Rainbow Trout (Oncorhynchus mykiss) using Linear, Power, and Second-Order Polynomial Regression with R² values of 0.98, 0.99, and 0.98, respectively [17]; Symbolic Regression for Perch, Bream, Roach, and Pike weight prediction with accuracy in the range of 0.98 to 0.99 [18]; Linear Regression models for Nile Tilapia achieving R² values between 0.95 and 0.96 for body weight and between 0.92 and 0.95 for carcass weight [19]; and Sparus aurata weight estimation using Linear and Nonlinear Regression, for which coefficients of correlation (R) ranged from 0.95 to 0.98 and 0.96 to 0.97, respectively [20].
In the Setiu Wetland, the growth patterns of fish from 22 families have been studied using the length-weight relationship (LWR) method. In the Power Regression model, denoted as W = aL^b, the length of the fish (L, measured in centimetres) is used as a predictor of the weight of the fish (W, measured in grams). The values of the intercept coefficient (a) and the exponential coefficient (b) vary with the specific fish species and their respective environmental conditions [21].
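The LWR model W = aL^b can be fitted by ordinary least squares after a log transform, since log W = log a + b log L is linear. The sketch below illustrates this on synthetic data (the coefficients a = 0.01 and b = 3 are illustrative assumptions, not values from the Setiu Wetland study):

```python
import numpy as np

def fit_lwr(length_cm, weight_g):
    """Fit W = a * L**b by least squares on log-transformed data."""
    log_l = np.log(length_cm)
    log_w = np.log(weight_g)
    # polyfit returns [slope, intercept]; slope = b, intercept = log(a)
    b, log_a = np.polyfit(log_l, log_w, 1)
    return np.exp(log_a), b

# Synthetic example: a = 0.01, b = 3 (near-isometric growth) with mild noise
rng = np.random.default_rng(0)
lengths = rng.uniform(5, 40, 200)
weights = 0.01 * lengths**3 * np.exp(rng.normal(0, 0.05, 200))
a, b = fit_lwr(lengths, weights)
```

The recovered coefficients should be close to the generating values, confirming that the log-linear fit is a faithful way to estimate a and b.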
In this study, we use fish sampling data to predict the weight of multi-family fish assemblages in the Setiu Wetland. The objective is to analyse the prediction performance of six widely used ML regression models using different statistical analysis techniques. As a result, this study provides a straightforward and inexpensive automated method for estimating fish weight. By using the default model specifications offered by the statistics package, our study explores only the fundamental behaviour of each model rather than attempting to improve performance by changing or adding model parameters.

Data Preparation
The data were pre-processed for analysis using Python in a Jupyter Notebook (Anaconda Navigator). Initially, data cleaning was carried out on the dataset by wrangling, renaming the header, and deleting missing values using the Pandas library. Outlier values in the total length and body weight variables were removed by Interquartile Range (IQR) analysis. Removing outliers helps the model capture the underlying patterns and relationships in the data more effectively, leading to better generalisation and predictive performance, and produces more reliable estimates by reducing the bias that extreme values introduce [23,24].
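The IQR filter described above can be sketched as follows; the column names `total_length` and `body_weight` and the toy values are illustrative assumptions, not the actual Setiu dataset:

```python
import pandas as pd

def remove_iqr_outliers(df, cols, k=1.5):
    """Drop rows falling outside [Q1 - k*IQR, Q3 + k*IQR] in any given column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy frame: the 95 cm length is an obvious outlier
df = pd.DataFrame({"total_length": [10, 12, 11, 13, 95],
                   "body_weight": [20, 25, 22, 27, 30]})
clean = remove_iqr_outliers(df, ["total_length", "body_weight"])
```

The conventional multiplier k = 1.5 flags "mild" outliers; a larger k (e.g., 3) would remove only extreme values.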
Correlation analysis between two variables was used to determine the direction and strength of their linear association. Correlation analysis measures the relationship between two continuous variables, whether linear or nonlinear. Correlation coefficients range from negative (-1) through uncorrelated (0) to positive (+1), and the sign of the coefficient (i.e., positive or negative) defines the orientation of the relationship.
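In Pandas this amounts to a one-line Pearson correlation; the values below are illustrative, not measurements from the dataset:

```python
import pandas as pd

# Hypothetical length (cm) and weight (g) values for illustration only
df = pd.DataFrame({"total_length": [10, 12, 14, 16, 18],
                   "body_weight": [15, 28, 50, 80, 120]})

# Pearson's r: close to +1 here because weight grows monotonically with length
r = df["total_length"].corr(df["body_weight"])
```

Note that Pearson's r captures only the linear component of the association; a strongly nonlinear but monotone length-weight relationship still yields a high, but not perfect, r.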

ML Model Development
In this study, we implemented regression algorithms as a method for prediction.The study adopts two regression types: Linear Regression (i.e., Multiple Linear, Lasso, and Ridge) and Tree-based Regression (i.e., Decision Tree, Random Forest, and XGBoost model).
Two encoding methods in Scikit-Learn were used to convert categorical data into numbers: the OneHotEncoder class for Linear Regression models and the OrdinalEncoder class for Tree-based models. Hyperparameter tuning was performed on the Lasso and Ridge Regression models to determine the optimal parameter values using the GridSearchCV function in Scikit-Learn.
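A minimal sketch of the two encoders and the grid search follows. The `Site` categories, the synthetic numeric data, and the alpha grid are assumptions for illustration; the paper does not specify the actual grids used:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Hypothetical categorical feature (e.g., sampling site)
sites = np.array([["Site A"], ["Site B"], ["Site C"], ["Site A"]])

# One-hot encoding (one binary column per category) for linear models;
# ordinal encoding (one integer column) for tree-based models
onehot = OneHotEncoder().fit_transform(sites).toarray()
ordinal = OrdinalEncoder().fit_transform(sites)

# Grid search over the Lasso penalty strength on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(5, 40, (100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2, 100)
search = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

One-hot encoding avoids imposing a spurious ordering on categories, which matters for linear models; trees split on thresholds, so an ordinal code is sufficient and keeps the feature count low.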
The data were split into training and testing sets with a 70:30 ratio using the train_test_split() function of the Scikit-Learn library. The training set is used to train the model, and the testing set is used to assess its accuracy. The split ratio was chosen following the studies by Picard and Berk [25], Dobbin and Simon [26], Pham et al. [27], and Nguyen et al. [28].
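The 70:30 split and the six-model comparison can be sketched on synthetic length-weight data as follows. Scikit-Learn's `GradientBoostingRegressor` stands in for XGBoost here, since the `xgboost` package is a separate dependency; the data are simulated, not the Setiu measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic cubic length-weight data with additive noise
rng = np.random.default_rng(42)
X = rng.uniform(5, 40, (300, 1))                # total length (cm)
y = 0.01 * X[:, 0]**3 + rng.normal(0, 5, 300)   # body weight (g)

# 70:30 train/test split, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Multiple Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),  # XGBoost stand-in
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

On this nonlinear toy problem the tree-based models should score noticeably higher R² than the linear family, mirroring the pattern reported in the results.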

Models Validation
A k-fold cross-validation (CV) approach was used to validate all developed models. This approach randomly divides the whole dataset into k non-overlapping folds, then uses k-1 folds for model construction and the remaining fold for validation. This step is repeated k times so that each fold serves as the validation set exactly once. The k-fold CV is a robust approach for estimating model accuracy that averages the k outcomes into a single estimate for a model, and it is common practice to replicate the k-fold CV to maintain stability in model performance. A 15-fold CV was adopted, since most model performance estimates are almost unbiased when k is between 10 and 20 [29].
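The 15-fold procedure can be sketched with Scikit-Learn's `cross_val_score`; the Random Forest model and the synthetic data below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic cubic length-weight data
rng = np.random.default_rng(0)
X = rng.uniform(5, 40, (300, 1))
y = 0.01 * X[:, 0]**3 + rng.normal(0, 5, 300)

# 15-fold CV: each fold is the validation set once; the 15 R² scores are averaged
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=15, scoring="r2")
mean_r2 = scores.mean()
```

Averaging over 15 folds yields a more stable accuracy estimate than a single hold-out split, at the cost of fitting the model 15 times.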

Model Performance Evaluation
We evaluated and compared the overall performance of all regression models using three measures: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²). The MAE reports the average magnitude of the forecast errors, whereas the RMSE reflects the average distance between predicted and observed values [30]. R² measures how effectively the predictors account for variation in the response variable; it ranges from 0 to 1, with 1 indicating a perfect model fit. The best model has the lowest MAE and RMSE values and the highest R².
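The three metrics can be computed directly with Scikit-Learn; the observed and predicted weights below are made-up values for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([20.0, 35.0, 50.0, 80.0])  # observed weights (g)
y_pred = np.array([22.0, 33.0, 49.0, 84.0])  # model predictions (g)

mae = mean_absolute_error(y_true, y_pred)           # average |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large errors more
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
```

Because RMSE squares the residuals before averaging, a single large miss inflates RMSE more than MAE; comparing the two therefore hints at whether errors are uniform or dominated by a few outliers.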

Dataset Overview
The data consist of 8662 rows and three variable components, as shown in Fig. 1, with no missing values. Correlation analysis helps select relevant variables for a predictive model. The correlation between variables was examined, and the results in Table 2 show that no pair of variables exhibited perfect multicollinearity (|R| = 1), which would result in unsatisfactory predictions. Therefore, all three variables were fitted with the Linear and Tree-based Regression models.

Outlier Removal
The variables presented in Fig. 2 were screened for outliers, which were then identified and removed using IQR analysis, as depicted in Fig. 3.

Model Performance Result
The best model performance has high R² values and low RMSE and MAE values. The performance evaluations are summarised in Table 3 for the Linear Regression models and in Table 4 for the Tree-based Regression models. The superior performance of Tree-based Regression arises because it is adept at capturing nonlinear patterns and interactions between variables by partitioning the feature space into regions and fitting distinct models within each region, providing flexibility in modelling. Tree-based Regression models can capture complex relationships among variables effectively, and interactions can be detected and modelled automatically without explicit specification [31].
Among the six machine learning regression models considered, the Random Forest model had the highest predictive capability, achieving an accuracy rate of 96.1%. This model also demonstrated the lowest RMSE and MAE values, at 3.352 and 0.880, respectively. The model has limitations that merit consideration: it does not incorporate other independent variables such as fork length, standard length, species, age, and sex. In addition, the modelling method does not account for environmental factors such as water temperature, salinity, pH level, or dissolved oxygen, nor for the condition factor, which indicates the fish's general well-being and health. These factors are known to significantly influence fish growth and weight.
In the future, developing more sophisticated ML algorithms and techniques can contribute to more accurate and robust prediction models. Exploring advanced deep learning and ensemble learning methods, or combining ML algorithms with other techniques such as image segmentation and processing, may yield improved prediction accuracy, generalisation, and model interpretability. By incorporating weight prediction into ecological research and management practices, we can make informed decisions supporting the conservation and sustainable use of fish populations and their habitats.

Fig. 1. Dataset overview: (a) number of rows and features and (b) missing values data

Table 1. Location coordinates for three sample sites

Table 2. Correlation between variables

The Tree-based Regression model achieved a higher R² score than the Linear Regression model, as well as lower RMSE and MAE values. These empirical findings indicate that the Tree-based Regression model outperformed the Linear Regression model.

Table 3. Performance score for the Linear Regression model

Table 4. Performance score for the Tree-based Regression model

Cross-validation is a statistical technique that offers skill estimates to compare and select a prediction model. The results presented in Table 5 indicate that the ML regression models proposed in this study were effective in predicting fish weight, as evidenced by their high accuracy values, ranging from 0.866 to 0.964. Additionally, the models exhibited low RMSE and MAE values, ranging from 3.417 to 6.529 and from 0.864 to 3.962, respectively.

Table 5. The performance value of cross-validation for each proposed ML model

Conclusion, Limitations, and Future Works
This research demonstrates the potential of ML regression algorithms for predicting fish weight in the Setiu Wetland, Terengganu. The motivation behind this methodology is to analyse the prediction performance of ML regression models by applying different statistical analysis techniques. This paper selected two regression types for prediction: Linear Regression (i.e., Multiple Linear, Lasso, and Ridge models) and Tree-based Regression (i.e., XGBoost, Random Forest, and Decision Tree models). Based on the model performance and the k-fold CV results (k=15), the proposed ML models successfully predicted fish weight with low RMSE and MAE values and high R² values. Moreover, the Tree-based Regression models provided more accurate predictions than the Linear Regression models. Among the six proposed ML regressions, Random Forest is the best predictive model, with the highest accuracy value of 96.1% and the lowest RMSE and MAE scores of 3.352 and 0.880, respectively.