Analyzing trend and prediction of precipitation in Germany using non-parametrical tests and machine learning algorithms

This study analyzes temporal changes in precipitation across Germany at the regional level using data from 1881 to 2020. The Mann-Kendall test and the Hamed-Rao modification test were employed to analyze the precipitation trend, while the Pettitt test was used to detect change points in the series. Machine learning methods, namely the k-nearest neighbor (k-NN), support vector machine (SVM), and random forest (RF) algorithms, were applied for prediction. Most regions showed an increasing trend annually and seasonally at the 0.05 significance level, while some negative trends can be seen in summer. Furthermore, based on the Pettitt test, most change points were detected after 1940 in several regions. In the prediction of precipitation, the k-NN algorithm showed better performance in terms of mean absolute error than the SVM and RF algorithms.

Keywords: trend analysis, non-parametric test, change point analysis, machine learning, k-NN, support vector machine, random forests


Introduction
Precipitation is a vital part of the hydrologic cycle and perhaps the most influential element in defining climate [1; 2]. Agriculture, forestry, the energy industry, and natural heritage can be directly affected by changes in its pattern [3]. Variation in precipitation patterns resulting from climate change concerns water resource managers, climatologists, and hydrologists [1]. Srivista et al. and Islam et al. explained that variation in precipitation quantities and frequencies directly affects the pattern of streamflow and its demand, the spatiotemporal allocation of runoff, groundwater reserves, and soil moisture [4] [5]. Consequently, extreme changes in precipitation patterns increase the probability of hazards such as droughts and floods [1]. The 2003 drought across Europe had a significant impact on various sectors, and reduced water availability led to a vast economic loss on the order of 1.5 billion euros [6]. Therefore, extensive research on climate change and, more particularly, on precipitation patterns and frequencies is very much needed. Most importantly, a better understanding of precipitation patterns in a changing environment will support better decision-making and strengthen communities' capacity to adapt to extreme weather events [1].
Several scholars have asserted that the recurrence and intensity of precipitation are increasing due to global warming and are expected to increase further [7] [8] [9]. Various trend analyses of extreme rainfall across Germany provide concrete evidence of significant changes in the frequency and intensity of precipitation [10] [11] [12]. Trend analysis for central-eastern Germany showed increasing precipitation in winter and a decreasing pattern in summer [11]. Research on precipitation trends is therefore advantageous, since it provides essential information about the probability of future changes. Parametric and non-parametric tests [13], such as the regression test [14] [15], the Mann-Kendall test [16] [17], the Kendall rank correlation test [18], and the Spearman rank correlation test [19], are the methods most widely used by researchers for trend analysis. In the present study, the Mann-Kendall test was used to detect the precipitation trend, since it can be applied to independent time series data and is not sensitive to outliers [20]. Hensel et al. and Schonweise et al. also used the Mann-Kendall test to detect the precipitation trend of central-eastern Germany and Europe, respectively [11] [12].
Although trend analysis helps climatologists, meteorologists, and hydrologists understand present and past climate changes, forecasting models tend to be more useful for planners, who must plan with upcoming climate variables in mind [21]. Two main approaches to precipitation forecasting exist in the literature: physically based modeling and pattern recognition [22] [23]. The physically based approach models the precipitation process using the physical laws that govern it. When predictions are generated at scales smaller than 2-3 km, computational problems can occur in the physical model [24]. Physical models are costly in both time and money, although they provide reliable results. Due to the high spatial variability and noise in precipitation data, pattern recognition methods have become more popular among researchers for prediction [25]. Furthermore, previous studies suggest that pattern recognition based on machine learning techniques gives good results in precipitation forecasting [26] [27] [28]. The motivation for applying machine learning methods to prediction comes from the increasing volume and availability of data. A vast amount of literature now uses artificial intelligence methods such as machine learning for prediction, since they do not need an enormous amount of information and can handle large and complicated datasets [29]. Researchers have spent much effort developing precipitation forecasts with numerous machine learning techniques that yield good results.
Huang et al. developed a novel approach to precipitation forecasting based on an improved k-nearest neighbor algorithm, which is robust to different choices of the neighborhood size k, particularly when the class distribution of the precipitation dataset is irregular [3]. Hong implemented a support vector machine (SVM) for rainfall forecasting in northern Taiwan, and the empirical results revealed that this approach yields good forecasting performance [26]. Pham et al. used random forest methods to forecast short-term daily streamflow in rainfall- and snowmelt-driven watersheds [30]. Machine learning approaches have so far seen only minimal use in precipitation forecasting in Germany. In the present study, I implemented the k-NN, SVM, and RF algorithms to predict the precipitation in April, August, and December of every year.
Therefore, based on previous literature and research opportunities, the objectives of the present study are: 1) variability and trend analysis of the overall annual and seasonal precipitation in German regions; 2) change point analysis using the non-parametric Pettitt test; and 3) prediction of precipitation using the k-NN, SVM, and RF algorithms.

Data source and method description
Germany is located in Western and Central Europe and comprises sixteen federal states. Although the whole country has a moderate climate, it has four seasons: summer (June-August), fall (September-November), winter (December-February), and spring (March-May). The data used here are historical monthly regional averages for Germany and its federal states from 1881 to 2020, available on the DWD website [31]. The regional averages result from averaging the 1 km gridded data over the respective areas.
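The Mann-Kendall trend test [16], which is based on the ranks and sequence of a time series, is a non-parametric test that works for all types of distribution. For a given time series $M_i$, $i = 1, 2, \dots, n$, the null hypothesis is that the data are independently distributed (no trend), and the alternative hypothesis is that a trend exists. The test statistic $S$ is calculated as

$$S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \operatorname{sgn}(M_j - M_i) \tag{1}$$

where $M_j$ and $M_i$ are the values of the sequence at positions $j$ and $i$, $n$ is the length of the time series, and

$$\operatorname{sgn}(M_j - M_i) = \begin{cases} 1 & \text{if } (M_j - M_i) > 0 \\ 0 & \text{if } (M_j - M_i) = 0 \\ -1 & \text{if } (M_j - M_i) < 0 \end{cases} \tag{2}$$

The variance of $S$ is

$$\operatorname{Var}(S) = \frac{n(n-1)(2n+5) - \sum_{y=1}^{g} t_y (t_y - 1)(2 t_y + 5)}{18} \tag{3}$$

where $t_y$ is the number of observations in the $y$-th group of tied values and $g$ is the number of such groups. The $Z_{mk}$ value is used to determine whether the time series shows a significant trend. It is computed using equation 4:

$$Z_{mk} = \begin{cases} \dfrac{S - 1}{\sqrt{\operatorname{Var}(S)}} & \text{if } S > 0 \\ 0 & \text{if } S = 0 \\ \dfrac{S + 1}{\sqrt{\operatorname{Var}(S)}} & \text{if } S < 0 \end{cases} \tag{4}$$

The Pettitt test [32] is a common and effective way to identify a single change point in climate time series data. It tests the null hypothesis H0, that the $T$ variables follow one or more distributions that have the same location parameter (no change), against the alternative that a change point exists. The non-parametric statistic is defined as

$$K_T = \max_{1 \le t < T} \left| U_{t,T} \right|, \qquad U_{t,T} = \sum_{i=1}^{t} \sum_{j=t+1}^{T} \operatorname{sgn}(X_i - X_j)$$

where $X_1, \dots, X_T$ is the observed series. The change point of the series is located at $K_T$, provided that the statistic is significant [32]. The significance probability of $K_T$ is approximated for $p \le 0.05$ with

$$p \simeq 2 \exp\left( \frac{-6 K_T^2}{T^3 + T^2} \right)$$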
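The two tests described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the study (which relies on the pymannkendall and pyhomogeneity packages); the function names `sign`, `mann_kendall`, and `pettitt` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def mann_kendall(series):
    """Return (S, Var(S), Z) for the Mann-Kendall trend test."""
    n = len(series)
    # S: sum of the signs of all later-minus-earlier differences.
    s = sum(sign(series[j] - series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    # Variance of S, corrected for groups of tied values.
    tie_term = sum(t * (t - 1) * (2 * t + 5)
                   for t in Counter(series).values() if t > 1)
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18
    # Continuity-corrected standardized statistic; zero when S is zero.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, var_s, z


def pettitt(series):
    """Return (t, K_T, approximate p), where t observations precede the change."""
    T = len(series)
    best_t, k_t = 0, 0
    for t in range(1, T):
        # U_{t,T}: compare every value before the split with every value after it.
        u = sum(sign(series[i] - series[j])
                for i in range(t) for j in range(t, T))
        if abs(u) > k_t:
            best_t, k_t = t, abs(u)
    # Approximate significance probability, capped at 1.
    p = min(1.0, 2 * math.exp(-6 * k_t ** 2 / (T ** 3 + T ** 2)))
    return best_t, k_t, p


# Made-up annual totals in mm, only to show the calls.
annual = [610, 598, 640, 655, 631, 702, 690, 715, 708, 730]
s, var_s, z = mann_kendall(annual)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
print(s, round(z, 3), round(p_two_sided, 4))
print(pettitt(annual))
```

The double loops make both functions quadratic in the series length, which is unproblematic for 140 annual values per region.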

Prediction methodologies
The primary goal of the supervised machine learning approach is to predict an output variable associated with each input item. In this case, given the precipitation of the last three months (input), the task is to predict the precipitation of the next consecutive month (output). Input and output data are divided into two sets: the training set is used to construct the model, and the test set is used to verify it. Data in the training set are excluded from the test set. The training/testing split ratio used here is 80/20. I used the mean absolute error (MAE) to evaluate the models. Figure 1 represents the total workflow for the prediction methodology. I used three algorithms: k-nearest neighbor, support vector machine, and random forest.

In the k-nearest neighbor algorithm, the class label of the query point $x$ is predicted by majority voting among its neighbors:

$$\hat{c} = \arg\max_{c} \sum_{i=1}^{k} \delta\left(c = c_i^{NN}\right)$$

where $c$ is a class label and $c_i^{NN}$ is the class label of the $i$-th nearest neighbor among the $k$ nearest neighbors. The indicator function $\delta(c = c_i^{NN})$ takes the value one if the class $c_i^{NN}$ of the neighbor $x_i^{NN}$ is the same as class $c$, and zero otherwise.

Support vector regression (SVR) builds on the support vector machine, a supervised machine learning algorithm that can be used for classification and regression problems. Its main idea is to plot each data item as a point in $n$-dimensional space (where $n$ is the number of features), with the value of each feature giving the value of a particular coordinate [33]. Classification is then performed by finding the hyperplane that best separates the two classes. The hypothesis function $h$ can be written as

$$h(x) = w \cdot x + b$$

where $w$ is the weight vector that sets the orientation of the hyperplane, $x = (x_1, \dots, x_n)$ is the feature vector, and $b$ is the bias term.

Random forests grow many individual decision trees during training. Each tree makes one prediction, and the predictions of all trees are then pooled to make the final prediction [34]. The random forest is therefore called an ensemble technique, since it uses a collection of results to decide. Assuming that each node has only two child nodes, the importance of a node is calculated as

$$ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)}$$

where $ni_j$ is the importance of node $j$, $w_j$ is the weighted number of samples reaching node $j$, $C_j$ is the impurity value of node $j$, and $left(j)$ and $right(j)$ are the child nodes from the left and right split on node $j$, respectively.
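The majority-vote rule for k-NN can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the function name `knn_predict`, the Euclidean distance, and the toy "dry"/"wet" labels are all hypothetical and not taken from the study.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training point.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances.
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    # Count how often each class label occurs among those neighbours.
    votes = Counter(train_labels[i] for i in nearest)
    # Ties are resolved in favour of the label encountered first.
    return votes.most_common(1)[0][0]


# Toy example: two features per point, two classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["dry", "dry", "dry", "wet", "wet", "wet"]
print(knn_predict(points, labels, (0.2, 0.1), k=3))
```

For a continuous target such as monthly precipitation, the regression variant of k-NN replaces the vote with the mean of the neighbours' target values.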
Mean absolute error is a model evaluation metric used with regression models. The mean absolute error of a model with respect to a test set is the mean of the absolute values of the individual prediction errors over all instances in the test set [35]. Each prediction error is the difference between the true value and the predicted value for the instance:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \lambda(x_i) \right|$$

where $y_i$ is the true target value for test instance $x_i$, $\lambda(x_i)$ is the predicted target value for test instance $x_i$, and $n$ is the number of test instances.
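As a small worked example of the metric, the following sketch computes the MAE for three made-up monthly values. The function name and the numbers are illustrative assumptions, not results from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


observed = [50.0, 80.0, 65.0]   # made-up precipitation in mm
predicted = [55.0, 70.0, 65.0]  # made-up model output in mm
print(mean_absolute_error(observed, predicted))  # (5 + 10 + 0) / 3 = 5.0
```

Because the errors are not squared, the result is in the same unit as the target (here mm) and a single large miss does not dominate the score.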

Variables and data analysis
A Conda environment with Python 3.8.5 was used for data analysis and machine learning. Jupyter Notebook was used for code documentation. The pandas [36], numpy [37], scikit-learn [38], pymannkendall [39], and pyhomogeneity [40] packages were used for data analysis, visualization, and prediction. The data were organized in one dataframe with 19 columns and 2240 entries. The first column contains the year, ranging from 1881 to 2020, and the second column contains the region of Germany. The third to fourteenth columns contain the total precipitation of each month of each year in the different regions. The next four columns give the total precipitation in summer, winter, spring, and fall, and the last column contains the total annual precipitation. After data preparation, trend analysis was carried out with the Mann-Kendall test and the Hamed-Rao modification test. After the change point analysis with the Pettitt test, the data were split into two sets (input and output) for the machine learning algorithms. Each input and output dataset was further split into training and testing sets. The training datasets contain 691 entries and the testing datasets contain 173 entries. After the models were trained on the training datasets, the testing datasets were used for validation. The mean absolute error was calculated to assess the accuracy of the models.
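The reported split sizes can be reproduced with a short sketch. It assumes 864 entries in total (691 + 173) and that the test share is rounded up; the helper `split_indices` and the seed are hypothetical, and the study itself may have used a library routine for this step.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and return (train, test) index lists."""
    # Round the test share up, so 0.2 * 864 = 172.8 becomes 173.
    n_test = math.ceil(test_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(864)
print(len(train_idx), len(test_idx))  # 691 173
```

Fixing the seed keeps the split reproducible, and since each index lands in exactly one list, no training entry appears in the test set.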

Results
I calculated the descriptive statistics of annual and seasonal precipitation from 1881 to 2020 for Germany's sixteen federal states. The results indicate that states in the western part of Germany, i.e., Baden-Wuerttemberg, Bayern, and Saarland, have the highest average rainfall (938.96 mm, 909.89 mm, and 886.08 mm, respectively). On the other hand, the lowest average rainfall was recorded in Sachsen-Anhalt and Brandenburg (559.1 mm and 566.07 mm, respectively). The highest variation (standard deviation) in precipitation is found in Saarland, whereas Mecklenburg-Vorpommern has the lowest standard deviation of all states. The skewness for all federal states ranges from 0. […]

The Mann-Kendall test [16] has some significant drawbacks: autocorrelation within the datasets and nonlinearity can influence the result. Hamed and Rao [41] proposed a modified trend test to overcome these drawbacks and suggested that this modified MK test can effectively detect trends in data with serial correlation.

The Pettitt test [32] is a non-parametric homogeneity test that checks whether two or more sets of data come from the same distribution. The homogeneity test is applied to time series data to detect the breakpoint in the series. Table 2 shows the results of the Pettitt test used to detect the change point in the sixteen federal states. The alternative hypothesis (h) is that a change point exists. Change point selection is based on the performance (p-value) of the test. The h value in several states (e.g., Bayern,

Precipitation prediction
The analyses above show that most regions in Germany experienced a significant increase in precipitation. Prediction of precipitation has therefore become essential for water resource management. Several machine learning algorithms are available for precipitation prediction; I applied k-NN, SVR, and RFR. Precipitation data from 2010 to 2015 were used to train and test the models. In the first step, I divided the data into two sets, input and output. Precipitation in January, February, March, May, June, July, September, October, and November served as input variables, and precipitation in April, August, and December served as output. After that, the input and output data were each split into training and testing sets. Finally, the training set was fed into the algorithms to build the models. The training and testing sets contain 691 and 173 entries, respectively. The testing set was then used to calculate model performance. Figure 3 depicts the model performance, with the test size on the x axis and the precipitation in mm on the y axis, and compares the three simulations with the observed precipitation data. In general, the RFR predictions are more accurate than those of SVR and k-NN. The mean absolute error for RFR is 26.527, whereas k-NN and SVR have mean absolute errors of 28.416 and 31.012, respectively. Several parameters can be tuned to change the performance of these three models. In the k-NN model, the parameter k, the number of neighbors used for prediction, was varied for the sensitivity analysis. Figure 4 in the additional figures shows the sensitivity of the k-NN model to k. The MAE changes slightly as k changes. When k is 5, the MAE is 28.416 (Figure 3 above). When k is 1, 20, and 50, the MAEs are 27.282, 30.661, and 30.443, respectively.
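The input/output layout and the sensitivity analysis over k can be sketched as follows. This is one possible reading of the setup, in which the three months preceding April, August, and December are the inputs for that month. The data are random placeholders, and all names (`TARGET_MONTHS`, `make_samples`, `knn_regress`, `mae`) are hypothetical, so the printed errors say nothing about the real results.

```python
import math
import random

# Assumed layout: each row holds the 12 monthly totals (mm) of one region-year.
TARGET_MONTHS = (3, 7, 11)  # zero-based indices of April, August, December


def make_samples(rows):
    """Turn each 12-month row into three (previous 3 months -> target month) samples."""
    X, y = [], []
    for row in rows:
        for m in TARGET_MONTHS:
            X.append(tuple(row[m - 3:m]))  # the three months before the target
            y.append(row[m])
    return X, y


def knn_regress(X_train, y_train, query, k):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], query))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder data: 96 region-years (16 states x 6 years) of random monthly totals.
rng = random.Random(42)
rows = [[rng.uniform(20.0, 120.0) for _ in range(12)] for _ in range(96)]
X, y = make_samples(rows)

# 80/20 split; the rows are already random, so no extra shuffling is done here.
n_test = math.ceil(0.2 * len(X))
X_test, y_test = X[:n_test], y[:n_test]
X_train, y_train = X[n_test:], y[n_test:]

# Sensitivity analysis: recompute the test MAE for several neighbourhood sizes.
for k in (1, 5, 20, 50):
    preds = [knn_regress(X_train, y_train, q, k) for q in X_test]
    print(f"k={k:2d}  MAE={mae(y_test, preds):.3f} mm")
```

With real data, small k follows local fluctuations while large k smooths the prediction toward the overall mean, which is why the MAE is examined over a range of k values.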
In the SVR model, the regularization parameter, termed C in Python's scikit-learn library, tells the SVR optimization how strongly to avoid misclassifying each training example. For large values of C, the optimization chooses a smaller-margin hyperplane if that hyperplane does a better job of classifying all the training points correctly. Conversely, a very small value of C causes the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points [42]. Figure 6 shows how the SVR model reacted to changes in the C parameter: the MAE does not change significantly across the different C values. In the RFR model, the sensitivity analysis was carried out by changing the number of estimators (n_estimators), i.e., the number of trees used in the forest. Since random forest is an ensemble method that builds multiple decision trees, this parameter controls how many trees are used in the process [43]. Figure 5 shows the effect of n_estimators in the RFR model.

Discussion and conclusion
Although there is scope for selecting a more suitable dataset, the trend and change point analyses gave a good idea of the precipitation pattern over the period 1881-2020. Data with a higher temporal resolution could give more useful results, and spatiotemporal variability also needs to be addressed. The Mann-Kendall test does not consider serial autocorrelation, but the Hamed-Rao modification test considers all significant lags. The way the data were split for prediction can be improved, but it gives a clear idea of the three models used here. The prediction models could be optimized by tuning the parameters of these three models. Moreover, the use of machine learning algorithms in hydrology deserves much more attention and investigation before they can serve as useful forecasting tools.