Application of Data Mining Combined with K-means Clustering Algorithm in Enterprises' Risk Audit

The financial risk management mechanism of enterprises can be made more complete by exploring how data mining technology combined with the K-means clustering algorithm performs in enterprise risk audit. Hence, the K-means clustering algorithm is introduced to study the paperless state of electronic payment in the trading process of e-commerce enterprises. Additionally, a risk audit model of e-commerce enterprises is implemented based on the K-means algorithm combined with Random Forest Light Gradient Boosting Machine (RF-LightGBM). In this model, data preparation, data preprocessing, model construction, model application, and evaluation are carried out to study the payment flow in the transaction process of e-commerce enterprises using big data analysis technology. Finally, the performance of the model is evaluated by simulation. Compared with the models and algorithms proposed by scholars in related fields, the proposed model reaches a classification accuracy of 95.46 %. The data message delivery rate of the model algorithm is basically stable at about 81.54 %, and the data message leakage rate, packet loss rate, and average delay are lower than those of the other models and algorithms. Therefore, while ensuring prediction accuracy, the audit model of e-commerce enterprises also achieves high data transmission security, which provides an experimental basis for the safety improvement and risk control of the audit process in e-commerce enterprises.

feature. It ranks the importance of input variables and has strong adaptive and self-learning abilities. It is well suited to nonlinear modeling and is not affected by multicollinearity [5,6]. Moreover, the RF algorithm can overcome the overfitting problem found in other models and has been widely used in many fields, such as bioinformatics, medicine, and social science. The K-means algorithm performs iterative clustering analysis. In the first stage, K objects are randomly selected as the initial cluster centers. Then, the distance between each object and each seed cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center together with its assigned objects constitutes a cluster. After each sample is assigned, the cluster center is recalculated from the objects currently in the cluster. The process is repeated until the termination condition is met [7]. Applied to enterprise audit, the algorithm can effectively classify transaction processes and data, which is of great significance to enterprise risk audit.
To sum up, as AI algorithms continue to improve and e-commerce companies become widespread, audit work faces both a rare opportunity and a big challenge. The innovation here is to introduce the K-means clustering algorithm into the e-commerce industry. Meanwhile, the Random Forest Light Gradient Boosting Machine (RF-LightGBM) fusion algorithm is designed by integrating the improved RF algorithm from ML. A risk audit model of e-commerce enterprises is then constructed based on the K-means algorithm combined with RF-LightGBM. Finally, its performance is evaluated through simulation to provide an experimental reference for reducing audit risk and improving audit quality.

A. Tendency of the Development of Enterprises' Risk Audit
With the advent of e-commerce, paperless transactions not only affect traditional manufacturing production but also bring new risks to accounting and auditing work. Many scholars have studied the risk audit of enterprises. Shad et al. (2019) combined the implementation of enterprise risk management with sustainable development reports to test the influence of risk audit on the enterprises' economic value added. They also proposed using ordinary least squares (OLS) analysis to obtain information about enterprise risk management practices and sustainability reports [8]. Hanggraeni et al. (2019) reported significant results for risk management factors using an offline questionnaire survey. Through a risk audit assessment of marketing and financial management, they also found that enterprise identification and management activities have a vital influence on business performance [9]. Cheng et al. (2021) proposed a new q-rung orthopair fuzzy weighted averaging operator (q-ROFWAO) to rank and evaluate manufacturing small and medium-sized enterprises (SMEs). The results show that the method is effective in Sustainability Enterprise Risk Management (SERM) of SMEs [10]. Yang et al. (2021) introduced the time dimension to describe the dynamic, sudden, and timely evolution characteristics of enterprise risk events, addressing the static mapping problem in existing enterprise knowledge maps. A ResNet dynamic knowledge reasoning method was also proposed to improve the loss balance function of the multi-network model. Experiments show that the new model can effectively improve the accuracy of entity and relationship prediction [11].

B. Application Status of the DM Technology
As the Internet and information technology advance, the scale of data in all walks of life keeps increasing. As one family of AI algorithms, ML has a wide range of applications, provides strong support for DM on massive data, and has been studied by many researchers. Ping et al. (2019) introduced two ML methods for evaluating the fuel efficiency of driving behavior using natural driving data. The results indicate that the methods can effectively identify the relationship between driving behavior and fuel consumption at the macro and micro levels and can effectively predict the driving behavior of vehicles [12]. Kanagaraj et al. (2021) put forward an enhanced multi-class normalized optimal clustering algorithm and applied it to data object grouping and classification. The results outline the needs of different regions in India in terms of energy consumption and show that the proposed method performs significantly better [16].
The research reviewed above shows that, in an era of massive data generation, the application field of DM technology is gradually becoming more extensive. In the field of risk audit, most enterprises still adopt the audit approach of traditional manufacturing. Although e-commerce is spreading rapidly, there are few studies on risk audit in this field. Therefore, to address the risks in the audit process of e-commerce enterprises, the standard algorithm is optimized with an ML algorithm and a risk audit evaluation model of e-commerce enterprises is constructed, which is of great significance to the safety improvement and risk control of the audit process in e-commerce enterprises.

III. Construction and Analysis of Risk Audit Evaluation of E-Commerce Enterprises Based on DM Technology

A. Requirements Analysis of Data Sources and Risk Audit of E-Commerce Enterprises
In e-commerce enterprises, data sources are more extensive than in financial transactions with a traditional supply chain. E-commerce enterprises can raise the real-time update frequency of data through multi-dimensional BD acquisition, thus improving the effectiveness of data audit. Fig. 1 illustrates the BD sources of e-commerce enterprises. Collecting the multi-dimensional data shown in Fig. 1 can effectively reduce information asymmetry and false information. This helps the BD risk control model at a later stage avoid analysis errors caused by insufficient user information, and so better prevent and control various audit risks. Moreover, a credit evaluation model built on diversified, in-depth, multi-dimensional big data supports more accurate credit evaluation of users.
The audit mode and method for e-commerce business transactions are also constantly updated as BD technology changes. The complexity of data types, the expansion of the scope of data analysis, and the increase in audit projects place higher requirements on the professional quality of internal auditors. The practical application of BD audit mainly depends on the professional knowledge of personnel, the configuration of hardware and software equipment, and sufficient and accurate data [17,18]. First, in terms of professional quality, since BD audit is still developing, some technical personnel in the audit department still adopt the traditional risk-oriented internal audit method, which requires the company to recruit more employees and strengthen personnel training. Second, there are differences in the development of hardware and software equipment. Third, in terms of data preparation, if the business data collected in audit operations are incomplete, the audit results will be affected. Data, as the basis of audit judgment, influence the quality of internal audit to a certain degree. If information is misrecorded or missing, or cross-system extraction fails, the audit results will be misleading. Therefore, DM is indispensable when auditing the risk of data information in e-commerce enterprises. Here, the ML K-means clustering algorithm is combined with RF to audit the risk of transaction BD in e-commerce enterprises, which is of significant practical value for the transaction information security and accurate audit of e-commerce enterprises.

B. Random Forest Algorithm Applied to Big Data Audit Analysis of E-commerce Enterprises
When auditing the transaction data of e-commerce enterprises, classifying the data is the first task to be carried out. RF is an improved algorithm based on the common decision tree (DT) and has advantages over it. It generates training samples independently for each decision tree and then forms a forest. Finally, the results of the multiple decision trees are combined using a given strategy. In this way, a random forest is formed on the basis of the DT algorithm [19]. Fig. 2 demonstrates the DT algorithm applied to BD in e-commerce enterprises. In the general DT algorithm, a node is usually split on the optimal feature attribute chosen from all the sample feature attributes at that node [20]. RF, however, first randomly selects a subset of the feature attributes at the current node and then chooses the optimal attribute within that subset as the basis for splitting the node. In this way, RF further enhances the generalization ability of the model. Compared with the DT method, the RF algorithm is more effective at solving the overfitting problem in DT. Once the RF is established, a new sample is put into the RF, each DT in the RF decides the category of the sample, and each tree has one vote. Following the majority rule, the category with the largest number of DT votes is the final classification result of the sample [21]. Fig. 3 shows the RF algorithm as applied to e-commerce enterprises.

Iterative Dichotomiser 3 (ID3), a greedy algorithm used to construct decision trees, is one of the most basic decision tree algorithms used in RF. The algorithm first calculates the information gain of each attribute, then compares the information gains one by one, and selects the best attribute for node splitting [22]. The best attribute is the one that yields the maximum information gain when the sample set is divided according to it. Information entropy, a basic concept in the algorithm, is used to measure uncertainty.
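The majority-voting step described above can be illustrated with a short sketch. The function name `majority_vote` and the toy labels are hypothetical and only serve the illustration; this is not the paper's implementation.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the category that received the most votes from the trees."""
    counts = Counter(tree_predictions)
    # most_common(1) yields a one-element list holding (label, vote_count)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes of five trees for one transaction record
votes = ["risky", "normal", "risky", "risky", "normal"]
print(majority_vote(votes))
```

Each tree contributes exactly one vote, so the forest's decision is simply the most frequent label among the per-tree predictions.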
Equation (1) indicates the sample set $X$ of the decision tree at node $m$.
The corresponding sample categories can then be expressed, where $p_i$ represents the probability of each category, and the information gain is obtained by dividing the sample set $X$ at node $m$ according to attribute $a$.
In Equation (3), $\mathrm{Info}(X)$ means the information entropy of $X$.
$\mathrm{Info}_a(X)$ stands for the expected information required by $X$ after it is divided according to attribute $a$.
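The equations referred to above are not reproduced in this copy. As a hedged reconstruction, the standard ID3 definitions that match the surrounding symbols are given below. The symbols $c$ (number of categories), $V$ (number of distinct values of attribute $a$), $X_v$ (the subset of $X$ taking the $v$-th value), and the name $\mathrm{Gain}(a)$ are assumptions, and the original equation numbering may differ.

```latex
\begin{align}
\mathrm{Info}(X)   &= -\sum_{i=1}^{c} p_i \log_2 p_i \\
\mathrm{Info}_a(X) &= \sum_{v=1}^{V} \frac{|X_v|}{|X|}\,\mathrm{Info}(X_v) \\
\mathrm{Gain}(a)   &= \mathrm{Info}(X) - \mathrm{Info}_a(X)
\end{align}
```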
The ID3 algorithm selects the attribute with the maximum information gain as the test attribute. However, it cannot handle continuous variables and tends to prefer attributes with more values. Therefore, the ID3 algorithm usually yields a DT that is a local optimum rather than a global optimum. Scholars have studied this issue in depth and proposed the C4.5 algorithm, which is based on the information gain rate. Using the information gain rate avoids the bias in selecting split attributes, making the selection among attributes more equitable when dividing nodes [23]. Equation (6) illustrates the calculation of the information gain rate.
In Equation (6), $\mathrm{splitInfo}_a(X)$ represents the split information, and Equation (7) expresses it as a function.
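A minimal sketch of the difference between information gain (ID3) and the information gain rate (C4.5) follows. It assumes the standard definitions, in which the gain rate is the information gain divided by the split information, and the split information is the entropy of the attribute's own value distribution. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by the attribute value of each sample."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return groups


def info_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in partition(values, labels).values())
    return entropy(labels) - after


def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0


labels = ["ok", "ok", "risk", "risk"]
channel = ["card", "card", "wallet", "wallet"]  # two values, separates the classes
order_id = ["a", "b", "c", "d"]                 # unique per sample, no real meaning

print(info_gain(channel, labels), gain_ratio(channel, labels))
print(info_gain(order_id, labels), gain_ratio(order_id, labels))
```

Both attributes reach the same information gain on this toy data, but the many-valued `order_id` attribute is penalized by its larger split information, which is the bias correction the text attributes to C4.5.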
Compared with the ID3 algorithm, the C4.5 algorithm can discretize the original continuous attribute variables, so it can handle continuous numerical variables, and it is also suitable for missing data [24]. The classification rules generated by the C4.5 algorithm are easy to understand and have high precision; however, the algorithm has no advantage in execution time or storage space. Equation (8) displays the calculation of the Gini minimum impurity criterion.
In Equation (8), $p(j|t)$ refers to the probability of category $j$ at node $t$. When all the samples at node $t$ belong to the same category, the Gini index takes its minimum value, 0, and the sample category is the purest. The Gini index is largest, and the purity lowest, when the samples are spread evenly over different categories; with $J$ equally likely categories it equals $1 - 1/J$, which approaches 1 as $J$ grows. The sample set is divided into $m$ branches, and Equation (9) expresses the Gini index used to split the current node.
In Equation (9), $m$ refers to the number of sub-nodes, $n_i$ to the number of samples at sub-node $i$, and $n$ to the number of samples at the parent node. Moreover, applying the CART algorithm requires calculating the Gini index of each attribute during training. After the attribute with the smallest Gini index is selected to split the current node, the decision tree is constructed recursively until the stopping condition is reached.
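The two Gini quantities described above can be sketched as follows. Since Equations (8) and (9) are not reproduced in this copy, the sketch assumes the standard CART forms, $\mathrm{Gini}(t) = 1 - \sum_j p(j|t)^2$ for a node and $\sum_{i=1}^{m} \frac{n_i}{n}\mathrm{Gini}(i)$ for a split. The function names and sample labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(children):
    """Weighted Gini index of a split, given the label lists of the sub-nodes."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)


pure_node = ["risk", "risk", "risk"]
mixed_node = ["risk", "ok", "risk", "ok"]

print(gini(pure_node))    # a pure node has impurity 0
print(gini(mixed_node))   # two equally likely classes
print(gini_split([["risk", "risk"], ["ok", "ok"]]))
```

CART would evaluate `gini_split` for every candidate attribute and keep the split with the smallest value.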
The Light Gradient Boosting Machine (LightGBM) algorithm, an open-source and efficient distributed gradient boosting tree algorithm released in recent years, is fast, consumes little memory, achieves high accuracy, and is widely used in classification and regression. In the Gradient Boosting Decision Tree (GBDT) iteration, the learner obtained in the previous round is denoted $Z_{t-1}(x)$, and its loss function is given by Equation (10).
Then, the goal of the current round of training is to find a suitable weak learner that minimizes the loss function. Equation (11) defines this loss function.
Then, the negative gradient of the loss function is calculated to approximate the loss of the current round. Equation (12) gives this approximate value.
The squared difference is usually used to fit $h_t(x)$, as shown in Equation (13). The strong learner of this round is defined in Equation (14).
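Equations (10) to (14) are not reproduced in this copy. The textbook GBDT steps that match the description above are sketched below as a hedged reconstruction, not as the authors' exact notation. The symbols $L$ (loss function), $y$ (target), $r_{ti}$ (negative gradient for sample $i$ in round $t$), and the line-by-line correspondence to the original equation numbers are assumptions.

```latex
\begin{align}
L_{t-1} &= L\bigl(y,\, Z_{t-1}(x)\bigr) \\
L_{t}   &= L\bigl(y,\, Z_{t-1}(x) + h_t(x)\bigr) \\
r_{ti}  &= -\left[\frac{\partial L\bigl(y_i,\, Z(x_i)\bigr)}{\partial Z(x_i)}\right]_{Z = Z_{t-1}} \\
h_t     &= \arg\min_{h} \sum_{i} \bigl(r_{ti} - h(x_i)\bigr)^2 \\
Z_t(x)  &= Z_{t-1}(x) + h_t(x)
\end{align}
```

Under this reading, each round fits the weak learner $h_t$ to the negative gradients by least squares and adds it to the previous strong learner.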
Therefore, the LightGBM algorithm is integrated with the RF, namely RF-LightGBM, to reduce the calculation cost of the audit process of e-commerce enterprises, improve the calculation efficiency of the model, and obtain high accuracy while maintaining high calculation efficiency.

C. Application of K-means Clustering Algorithm in Big Data Audit Analysis of E-commerce Enterprises
The K-means algorithm is a centroid-based partitioning technique, that is, the centroid of cluster $C_i$ is used to represent the cluster. When the K-means algorithm is applied to the data analysis of e-commerce enterprises, the centroid of a cluster is defined as the mean of the points in the cluster. In the clustering process, with $k$ as the parameter, $k$ objects are randomly selected from the $n$ objects, and each selected object represents the initial mean of a cluster, so that the objects are divided into $k$ clusters. Each remaining object is assigned to the nearest cluster based on its distance from each cluster center, so that objects within a cluster have high similarity [25,26]. The mean of each cluster then changes, so the means are recalculated, and the process is repeated until the resulting clusters are as independent as possible.
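The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual code: the function names, the stopping rule (stop when assignments no longer change), and the toy points are hypothetical, and `max_iter=120` simply mirrors the iteration limit reported in the experiments.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=120, seed=0):
    """Cluster a list of tuples into k groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k objects drawn at random as initial means
    assignment = None
    for _ in range(max_iter):
        # Assign every object to its nearest cluster center
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the means would not change either
        assignment = new_assignment
        # Recompute each center as the mean of its assigned objects
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(pts, k=2)
print(centers, labels)
```

Because the initial centers are drawn at random, different seeds can give different results on less clearly separated data, which is the initial-value sensitivity usually noted for K-means.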
As Equation (15) indicates, a known set of $n$ data samples is defined as $\Omega$.
In Equation (15), $x_i = (x_{i1}, x_{i2}, \cdots, x_{id})$ refers to a $d$-dimensional vector, $x_{id}$ refers to the $d$-th attribute of the $i$-th data sample, and $n$ represents the sample size. Equation (16) illustrates the clustering centers.
In Equation (16), $c_j = (c_{j1}, c_{j2}, \cdots, c_{jd})$ refers to the center point of the $j$-th cluster. Every $c_j$ has $d$ attributes, and $k$ represents the number of clusters.
Equation (17) expresses the Euclidean distance $\mathrm{dis}(x_i, c_j)$, which is the distance between $x_i$ and $c_j$.
In Equation (17), $x_i = (x_{i1}, x_{i2}, \cdots, x_{id})$, $c_j = (c_{j1}, c_{j2}, \cdots, c_{jd})$, and $k$ refers to the number of clusters. Equation (18) gives the calculation of the center $c_j$ of a cluster. In Equation (18), $N_{\phi_j}$ represents the number of data samples in the same cluster. The criterion function is generally defined by the sum of squared errors, which is expressed as Equation (19).
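Equations (15) to (19) are not reproduced in this copy. The standard K-means definitions that agree with the symbols above are given below as a hedged reconstruction. Reading $\phi_j$ as the set of samples assigned to the $j$-th cluster and using $l$ as the attribute index are assumptions.

```latex
\begin{align}
\Omega &= \{x_1, x_2, \cdots, x_n\} \\
C &= \{c_1, c_2, \cdots, c_k\} \\
\mathrm{dis}(x_i, c_j) &= \sqrt{\sum_{l=1}^{d} \left(x_{il} - c_{jl}\right)^2} \\
c_j &= \frac{1}{N_{\phi_j}} \sum_{x_i \in \phi_j} x_i \\
E &= \sum_{j=1}^{k} \sum_{x_i \in \phi_j} \mathrm{dis}(x_i, c_j)^2
\end{align}
```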
In Equation (19), $E$ refers to the total squared error of all data objects in the e-commerce enterprise audit data set, $x_i$ is a point in the space that represents a given audit data object, $k$ is the number of clusters, and $c_j$ stands for the center (mean) of the $j$-th cluster.

In the risk audit model of e-commerce enterprises, the first step is to collect the audit data required by e-commerce enterprises. The collected data include not only financial data but also data on the business situation of the audited unit and the specific audit rules and regulations. Once collection is complete, problems in the same specific direction are integrated and classified. Second, the audit data are extracted and cleaned, which covers the erroneous, invalid, and abnormal data found during extraction. If the data are not processed, future data are in effect predicted from wrong data, which hides the potential links between the data and points the later development of the enterprise in the wrong direction. The continuous accumulation of erroneous data exposes enterprises to a serious crisis, so data cleaning should not be ignored.

In the data analysis stage, the K-means algorithm is combined with RF. K-means used alone is sensitive to the initial values, so different initial values lead to different clustering results, and the number $k$ of clusters must be given in advance. The combined model mitigates these weaknesses, and it also reduces the calculation cost of the audit process of e-commerce enterprises and improves calculation efficiency while maintaining high accuracy. A total of $n$ audit warning indicators are selected from the shared data center of e-commerce enterprises as the objects of feature selection, denoted $X = \{X_1, X_2, \cdots, X_n\}$ with $X_i = \{X_{i1}, X_{i2}, \cdots, X_{in}\}$, where $X_{in}$ indicates the $n$-th feature of the $i$-th audit warning indicator. Now, $m$ samples are randomly selected from the $N$ samples, and Equation (20) gives the cumulative weight of the audit warning features.
In Equation (20), $j$ indexes the audit warning features and varies from 1 to $N$; $i$ indexes the randomly selected samples; $\mathrm{diff}(\cdot)$ means the distance; $M(x)$ stands for the nearest neighbor samples of a different class, and $H(x)$ denotes the nearest neighbor samples of the same class. Finally, the data are divided into a training set and a test set in an 8:2 ratio.
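Equation (20) is not reproduced in this copy. The description above, with a nearest same-class neighbor $H(x)$ and a nearest different-class neighbor $M(x)$, matches the classic Relief feature-weighting update, so the sketch below assumes that form: for each sampled $x_i$, $W_j \leftarrow W_j - \mathrm{diff}(j, x_i, H(x_i))/m + \mathrm{diff}(j, x_i, M(x_i))/m$. The function names, the range-normalized `diff`, and the toy data are hypothetical and may differ from the paper's definition.

```python
import random


def diff(j, a, b, ranges):
    """Range-normalized absolute difference of feature j between two samples."""
    return abs(a[j] - b[j]) / ranges[j] if ranges[j] else 0.0


def relief(X, y, m, seed=0):
    """Relief-style feature weights from m randomly drawn samples.

    Assumes every class has at least two samples and there are at least two classes.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    ranges = [max(r[j] for r in X) - min(r[j] for r in X) for j in range(n_features)]
    weights = [0.0] * n_features

    def dist(a, b):
        return sum(diff(j, a, b, ranges) for j in range(n_features))

    for _ in range(m):
        i = rng.randrange(len(X))
        hits = [X[t] for t in range(len(X)) if t != i and y[t] == y[i]]
        misses = [X[t] for t in range(len(X)) if y[t] != y[i]]
        near_hit = min(hits, key=lambda s: dist(X[i], s))     # H(x): same class
        near_miss = min(misses, key=lambda s: dist(X[i], s))  # M(x): other class
        for j in range(n_features):
            weights[j] += (diff(j, X[i], near_miss, ranges)
                           - diff(j, X[i], near_hit, ranges)) / m
    return weights


# Feature 0 separates the two classes; feature 1 varies inside each class.
X = [(0.0, 5.0), (0.1, 1.0), (1.0, 4.9), (0.9, 1.1)]
y = [0, 0, 1, 1]
w = relief(X, y, m=4)
print(w)
```

A feature gains weight when it differs across classes and loses weight when it differs within a class, so the informative feature ends up with the larger weight.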
In the simulation analysis, the risk audit model based on the K-means algorithm combined with the RF-LightGBM classification algorithm is compared with models and algorithms proposed by other scholars in related fields, namely RF-LightGBM [27], K-means [28], LightGBM [29], Support Vector Machines (SVM) [30], and Bayesian network (BN) [31]. The comparison covers classification accuracy and data transmission security, the latter measured by data message delivery rate, leakage rate, packet loss rate, and average delay. The model constructed here uses the cluster module of sklearn to implement the K-means clustering algorithm. The parameters are set as follows: n_clusters, the number of clusters k, takes values from 2 to 6, and max_iter, the maximum number of iterations, is set to 120. The simulation environment is described in terms of software and hardware. For software, the operating system is 64-bit Linux, the Python version is 3.6.1, and the development platform is PyCharm. For hardware, the CPU is an Intel Core i7-7700 @ 4.2 GHz 8-core, the memory is Kingston DDR4 2400 MHz 16 GB, and the GPU is an Nvidia GeForce 1060 8 GB.
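The K-means configuration described above might look like the following sketch with scikit-learn's `KMeans`. The synthetic data, the `n_init` and `random_state` values, and the use of inertia (the sum of squared errors) to compare settings are assumptions added for illustration; only the range of k and the `max_iter` value come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for audit features: three well-separated 2-D groups.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

inertia_by_k = {}
for k in range(2, 7):  # k = 2, ..., 6 as in the reported settings
    model = KMeans(n_clusters=k, max_iter=120, n_init=10, random_state=0)
    model.fit(X)
    inertia_by_k[k] = model.inertia_  # sum of squared distances to the centers

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

Comparing the inertia across k is one common way to choose the number of clusters; the text does not state which criterion the authors used.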

A. Comparative Analysis of Classification Accuracy Performance of Each Model and Algorithms
To study the performance of the risk audit model of e-commerce enterprises based on the K-means algorithm combined with RF-LightGBM, the model constructed here is compared in several respects with the algorithms put forward by other scholars. Classification performance is measured by Accuracy, Recall, Precision, and F1 value, and Fig. 7 displays the results. The acceleration (speedup) ratio of each algorithm is then compared to analyze classification efficiency, and Fig. 8 illustrates the results. As the number of nodes increases, the data blocks are classified with a higher degree of parallelism and the speedup rises. However, with more nodes the speedup grows more slowly, because communication between nodes takes up a certain amount of time. Furthermore, the acceleration ratio of the proposed algorithm is significantly superior to that of the other algorithms, which indicates that the model and algorithm constructed here can complete the classification of audit data in e-commerce enterprises more quickly.
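For reference, the four classification measures named above can be computed from the confusion counts of a binary problem, as in the sketch below. The function names and the toy labels are hypothetical. The speedup helper assumes the usual definition (single-node time divided by multi-node time), which the text does not state explicitly.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def speedup(time_single_node, time_multi_node):
    """Acceleration ratio: how many times faster the multi-node run is."""
    return time_single_node / time_multi_node


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(speedup(120.0, 40.0))
```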

B. Analysis of Models' Data Transmission Security Performance under Different Algorithms
To study the performance of the model constructed here, it is compared with RF, LightGBM, K-means, and the other compared algorithms, and the data transmission performance of each algorithm is analyzed further. The results show that, as the amount of transmitted data increases, the mean delivery rate of network audit data shows an upward trend, and the data message delivery rate is not less than 81.54 % (Fig. 9(a)). The average leakage rate of network data shows no obvious change, and the data message leakage rate in this study does not exceed 10.83 % (Fig. 9(b)). In terms of average delay, the delay decreases as the amount of transmitted audit data increases, and the mean delay of the model algorithm in this study is basically stable at about 344.39 ms (Fig. 9(c)). In the packet loss rate analysis, the BN algorithm has a higher packet loss rate, which may be caused by hidden terminal problems. The proposed algorithm's packet loss rate is the lowest, less than 5.29 %, which is due to the balanced processing of the transmitted data (Fig. 9(d)). Therefore, across different amounts of transmitted data, the risk audit model of e-commerce enterprises based on the K-means algorithm combined with RF-LightGBM constructed here has a higher average delivery rate, a lower delay, and the lowest average leakage rate. It therefore performs well in secure data transmission over the Internet and carries a lower data transmission risk.

Xu et al. (2019) proposed detailed methods rooted in remote sensing, ML, and computer vision, and made full use of existing data to combine convolutional neural networks (CNN) with subtle and scientific observation data of the earth [13]. In view of the current situation of flight delays, Gui et al. (2019) designed a normalized model using the LSTM algorithm and the RF algorithm to classify and predict flight conditions. The results show that the proposed model based on random forest can obtain higher prediction accuracy (90.2 % for binary classification) and overcome overfitting [14]. Lv et al. (2020) constructed a cognitive computing model with context-aware data flow by optimizing the decision tree algorithm in ML. The results show that the model algorithm can ensure the accuracy and stability of behavior classification, which is of great significance for operators to analyze user behavior and develop personalized services [15].

Fig. 2. DT algorithm applied to BD in the e-commerce enterprises

Fig. 4. K-means algorithm applied in the BD of e-commerce enterprises

Fig. 5. Risk audit model of e-commerce enterprises based on K-means algorithm combined with RF-LightGBM

Fig. 6. Steps of the model based on K-means algorithm combined with RF-LightGBM

Fig. 7. Curves of influence of iteration on classification accuracy of different algorithms (a. Accuracy; b. Precision; c. Recall; d. F1 value)

Fig. 9. Comparative analysis of data transmission security of audit data of e-commerce enterprises under different algorithms (a. average delivery rate; b. average leakage rate; c. average delay; d. average loss rate)

Qeios, CC-BY 4.0 • Article, March 13, 2024 • Qeios ID: G9G0S3 • https://doi.org/10.32388/G9G0S3

Fig. 10. Comparative analysis of data transmission security of each algorithm under different survival times of audit data messages of e-commerce enterprises (a. average delivery rate; b. average leakage rate; c. average delay)