Abstract:
The aim of this thesis was to develop a comprehensive database from published articles on lipid production from microalgae and then to use this database for knowledge extraction, employing data mining algorithms to estimate the results of experiments that have not been performed. A total of 106 articles were used to construct the database, which contains 5908 instances. The dataset was divided into two groups according to the reported output variable: biomass production (mg/L d) or lipid content (w/w). As a preliminary analysis, the effect of each input variable was investigated by comparing the related articles. Then, for knowledge extraction, prediction, and classification, association rule mining, decision tree, and artificial neural network algorithms were applied to both datasets using libraries and functions of MATLAB and R. Association rule mining was applied to all continuous and categorical variables to examine their effects on the output variable; Chlorella, Chlorococcum, and Nannochloropsis species were found to yield high biomass production and high lipid content. Models were compared and evaluated by their accuracy in classification, and by their standard error, root mean square error, and r-squared values in predictive analysis. Parameter tuning was done by randomly dividing the dataset into a training set and a testing set; the training set was used to construct the model, and the testing set was used to calculate the root mean square error and r-squared values. The optimum models constructed using the decision tree algorithm for classification gave an overall accuracy of 77.8% for biomass production and 62.2% for lipid content. The artificial neural network algorithm was used for predictive modeling. The absolute error, root mean square error, and r-squared values of the optimum model were 50, 80, and 0.7 for biomass production, and 7, 11, and 0.3 for lipid content.
The predictive power of the models constructed for lipid content was weaker than that of the models for biomass production. The input significance analysis showed that nutritional variables were the most influential variables for biomass production, whereas microalgae type was the most influential variable for lipid content.
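
The evaluation procedure summarized above (a random split into training and testing sets, followed by error metrics computed on the testing set) can be illustrated with a minimal sketch. This is not the thesis code, which used MATLAB and R. It is written in Python, uses synthetic data invented for illustration, and substitutes a simple least-squares line for the artificial neural network. The function names `train_test_split`, `fit_line`, and `evaluate`, the 20% test fraction, and the random seeds are all assumptions.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly divide rows into a training set and a testing set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = a + b*x (a stand-in for the ANN, not the thesis model)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def evaluate(y_true, y_pred):
    """Return mean absolute error, root mean square error, and r-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Synthetic (hypothetical) data: one input variable and a noisy linear response.
rng = random.Random(42)
data = [(float(x), 3.0 * x + 10.0 + rng.gauss(0.0, 5.0)) for x in range(100)]

# The training set builds the model; the testing set is used only for the metrics.
train, test = train_test_split(data, test_fraction=0.2, seed=1)
a, b = fit_line(train)
y_true = [y for _, y in test]
y_pred = [a + b * x for x, _ in test]
mae, rmse, r2 = evaluate(y_true, y_pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```

Computing the metrics only on the held-out testing set keeps the reported errors from being flattered by data the model has already seen, which is the purpose of the split described in the abstract.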