GeoStack
A stacking method of machine learning applied to geological geophysical datasets for 3D geological modeling
Introduction
We can use models from scikit-learn, XGBoost, and Keras for stacking. As a feature of our project, all out-of-fold predictions can be saved for further analysis after training. Stacking (stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. Stacking typically yields better performance than any single trained model. It has been used successfully in regression and classification (Breiman, 1996). The basic idea is to train a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.
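The idea of out-of-fold predictions can be illustrated with a minimal sketch. It is not the GeoStack implementation: it uses only scikit-learn, synthetic data from `make_classification` as a stand-in for a real dataset, and two arbitrarily chosen base models with a logistic-regression meta-learner.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

# Synthetic binary data standing in for a real geological dataset (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=407)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=407
)

# Illustrative base models; the project itself uses a different pool.
base_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=407),
    "gbdt": GradientBoostingClassifier(n_estimators=50, random_state=407),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=407)

oof_columns, test_columns = [], []
for name, model in base_models.items():
    # Out-of-fold probabilities: each training row is predicted by a model
    # that did not see that row during fitting.
    oof = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    oof_columns.append(oof)
    # Refit on the full training set to produce the test-set features.
    test_columns.append(model.fit(X_train, y_train).predict_proba(X_test)[:, 1])

stack_train = np.column_stack(oof_columns)
stack_test = np.column_stack(test_columns)

# The meta-learner is trained on the out-of-fold predictions only.
meta = LogisticRegression().fit(stack_train, y_train)
stack_pred = meta.predict_proba(stack_test)[:, 1]
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for rows the base models were fitted on, keeps it from learning to trust overfitted base-model outputs.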
Requirements
- NumPy
- pandas
- XGBoost
- scikit-learn
- TensorFlow 2.0 or later
Usage
To train the GeoStack model and run prediction, just run python stacking/main.py. Note that:
- The training and prediction datasets must be placed under data/input.
- Stacking features derived from the original dataset need to be under data/output/stacking_features.
- The final prediction result, stacking/final_results.csv, is written to the output folder.
- The prediction results can be visualized in GoCAD software.
Detailed Usage
- Set the training dataset with its target data, and the test dataset.

      FILES_LIST_stage1 = {
          'train': (INPUT_PATH + 'train.csv',),
          'target': (INPUT_PATH + 'target.csv',),
          'test': (INPUT_PATH + 'test.csv',),
      }

- Define model classes that inherit from the BaseModel class; these are used in Stage 1 and Stage 2. In our project, we use xgboost, randomforest, svm, and gbdt as Stage 1 models. In Stage 2, we use xgboost again as the final model to predict the final results. The model usage and parameters are as follows:

      # Model usage
      class xgb_stage1(BaseModel):
          def build_model(self):
              return Xgb(**self.params)

      # For Stage 1
      XGB_PARAMS = {
          'colsample_bytree': 0.80,
          'learning_rate': 0.1,
          'eval_metric': 'auc',
          'max_depth': 5,
          'min_child_weight': 1,
          'nthread': 4,
          'objective': 'binary:logistic',
          'seed': 407,
          'silent': 1,
          'subsample': 0.60,
      }

      NN_PARAMS = {
          'batch_size': 32,
          'epoch': 100,
          'verbose': 1,
          'callbacks': [],
          'shuffle': True,
          'class_weight': None,
          'sample_weight': None,
          'normalize': True,
          'categorize_y': True,
      }

      RF_PARAMS = {
          'n_estimators': 500,
          'criterion': 'gini',
          'n_jobs': 8,
          'verbose': 0,
          'random_state': 407,
          'oob_score': True,
      }

      GBDT_PARAMS = {
          'n_estimators': 300,
          'learning_rate': 0.05,
          'subsample': 0.8,
          'max_depth': 5,
          'verbose': 1,
          'max_features': 0.9,
          'random_state': 407,
      }

      SVM_PARAMS = {
          'kernel': 'rbf',
          'C': 100,
          'gamma': 11,
          'probability': True,
      }

      # For Stage 2
      XGB_PARAMS_stage2 = {
          'colsample_bytree': 0.8,
          'learning_rate': 0.1,
          'eval_metric': 'mlogloss',
          'max_depth': 4,
          'seed': 1234,
          'nthread': 8,
          'reg_lambda': 0.01,
          'reg_alpha': 0.01,
          'subsample': 0.80,
          'objective': 'multi:softprob',
          'num_class': output_dim,  # output_dim must be defined beforehand
      }

- Train each model of Stage 1 for stacking.

      m = Model_XGB(name="xgb_stage1",
                    flist=FILES_LIST_stage1,
                    params=XGB_PARAMS,
                    )
      m.run()
      ...

- Train each model of Stage 2 using the predictions of the Stage 1 models.

      FILES_LIST_stage2 = {
          'train': (INPUT_PATH + 'train.csv',
                    FEATURE_PATH + 'xgb_stage1_all_fold.csv',
                    FEATURE_PATH + 'nn_stage1_all_fold.csv',
                    FEATURE_PATH + 'rf_stage1_all_fold.csv',
                    FEATURE_PATH + 'gbdt_stage1_all_fold.csv',
                    FEATURE_PATH + 'svm_stage1_all_fold.csv',
                    ),
          'target': (INPUT_PATH + 'target.csv',),
          'test': (INPUT_PATH + 'test.csv',
                   FEATURE_PATH + 'xgb_stage1_test.csv',
                   FEATURE_PATH + 'nn_stage1_test.csv',
                   FEATURE_PATH + 'rf_stage1_test.csv',
                   FEATURE_PATH + 'gbdt_stage1_test.csv',
                   FEATURE_PATH + 'svm_stage1_test.csv',
                   ),
      }

      m = XGB_stage2(name="xgb_stage2",
                     flist=FILES_LIST_stage2,
                     params=XGB_PARAMS_stage2,
                     )
      m.run()

- The final result is saved as stacking/output/stacking_features/xgb_stage2_test.csv.
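One plausible reading of the Stage 2 file list is that the files named under each key are joined column-wise, so that the Stage 1 predictions become extra features next to the original ones. The sketch below illustrates that reading with pandas. The helper `load_feature_files`, the column names, and the temporary files are hypothetical and are not taken from base.py.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_feature_files(paths):
    """Read each CSV and join the frames column-wise (hypothetical helper)."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, axis=1)


with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(407)

    # Stand-ins for train.csv and two Stage 1 out-of-fold prediction files.
    train_path = os.path.join(tmp, "train.csv")
    xgb_path = os.path.join(tmp, "xgb_stage1_all_fold.csv")
    rf_path = os.path.join(tmp, "rf_stage1_all_fold.csv")

    pd.DataFrame(rng.normal(size=(6, 3)), columns=["f0", "f1", "f2"]).to_csv(
        train_path, index=False
    )
    pd.DataFrame({"xgb_stage1": rng.random(6)}).to_csv(xgb_path, index=False)
    pd.DataFrame({"rf_stage1": rng.random(6)}).to_csv(rf_path, index=False)

    # Same shape as the 'train' entry of a Stage 2 file list.
    files_list_stage2 = {"train": (train_path, xgb_path, rf_path)}
    stage2_train = load_feature_files(files_list_stage2["train"])

print(stage2_train.columns.tolist())
```

For a column-wise join like this to be valid, every file must contain the same rows in the same order.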
Scripts
- stacking/data/input: original train and pred dataset
- stacking/data/output/stacking_features: stage 1 features
- stacking/stacking/base.py: stacking module
- stacking/main.py: train and predict program
Detailed scripts
base.py:
- Base models for stacking are defined here (using sklearn.base.BaseEstimator).
- Some models are defined here. e.g., XGBoost, Keras.
- These models are wrapped to behave like scikit-learn estimators (using sklearn.base.ClassifierMixin, sklearn.base.RegressorMixin).
- That is, each model class provides the methods fit(), predict_proba(), and predict().
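A scikit-learn-like wrapper of the kind described above could look roughly like the following. This is a sketch, not the code in base.py: the class name `WrappedClassifier`, the `_build_backend` method, and the use of `LogisticRegression` as a stand-in backend are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class WrappedClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical scikit-learn-like wrapper around an arbitrary backend."""

    def __init__(self, C=1.0):
        self.C = C

    def _build_backend(self):
        # Stand-in backend (assumption); a real wrapper would build an
        # XGBoost or Keras model here instead.
        return LogisticRegression(C=self.C)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.backend_ = self._build_backend().fit(X, y)
        return self

    def predict_proba(self, X):
        return self.backend_.predict_proba(X)

    def predict(self, X):
        # Pick the class with the highest predicted probability.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X, y = make_classification(n_samples=100, n_features=5, random_state=407)
clf = WrappedClassifier().fit(X, y)
proba = clf.predict_proba(X)
labels = clf.predict(X)
```

Exposing the same three methods on every model lets the stacking code treat tree ensembles, SVMs, and neural networks interchangeably.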
Stacking Framework
- The stacking framework of our research

Stacking model training strategy
- The training strategy of our research

Performance of stacking model
- We use the AUC score and the ROC curve to evaluate the models and to compare the performance of the stacking model with that of the single models.

- According to the model evaluation results, the AUC score of the stacking model is 0.93. Compared with a single model, it achieves better generalization performance.
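A comparison of this kind can be sketched with scikit-learn's metrics. The example below does not reproduce the project's experiment or its 0.93 score: it uses synthetic data, and it swaps in scikit-learn's own `StackingClassifier` with arbitrarily chosen base models instead of the GeoStack pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=407)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=407
)

single = RandomForestClassifier(n_estimators=50, random_state=407)
stacked = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=407)),
        ("svm", SVC(probability=True, random_state=407)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)

scores = {}
for name, model in {"single_rf": single, "stacking": stacked}.items():
    proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
    # fpr and tpr are the points of the ROC curve, ready for plotting.
    fpr, tpr, _ = roc_curve(y_test, proba)
    print(f"{name}: AUC = {scores[name]:.3f}, ROC points = {len(fpr)}")
```

Both models are scored on the same held-out split, so the AUC values are directly comparable. Which model scores higher depends on the data.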
LICENSE
GPL3.0
How to contact us?
- gwwang@cugb.edu.cn
- lvyikai2012@gmail.com
- 2101180102@cugb.edu.cn
