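Before learning PyCaret, set up Jupyter Notebook first; all the code in this post is run in Jupyter.
Installing PyCaret (the default install is the CPU version)
See the PyCaret GitHub repository for the latest usage instructions.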
# create a conda environment
conda create --name pycaret3 python=3.9
# activate conda environment
conda activate pycaret3
# install pycaret with all optional dependencies (no space before the brackets)
pip install "pycaret[full]"
# create a notebook kernel
python -m ipykernel install --user --name pycaret3 --display-name "pycaret3"
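To confirm that the new kernel really sees the installed packages, a quick check can be run in a notebook cell. This is a minimal sketch using only the standard library; the helper name `installed_version` is my own and is not part of PyCaret.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Run this cell with the "pycaret3" kernel selected.
for name in ("pycaret", "lightgbm", "ipykernel"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `pycaret` shows as not installed, the notebook is most likely running on a different kernel than the `pycaret3` environment.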
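If you have a GPU, consider installing PyCaret with GPU support.
The first steps are exactly the same as for the CPU version above. The following then has to be installed manually:
pip3 uninstall lightgbm -y
# downgrade pip first, otherwise the --install-option flag cannot be used
pip3 install pip==22.2.1
pip3 install lightgbm --install-option=--gpu --install-option="--opencl-include-dir=~/CUDA11.8/include/" --install-option="--opencl-library=~/CUDA11.8/lib64/libOpenCL.so"
In the command above, ~/CUDA11.8/ is where CUDA is installed on my machine. Change it to your own CUDA install location, and use an absolute path if ~ is not expanded in your shell.
You also need cuML. Pick the version that matches your setup by following the Installation Guide in the RAPIDS Docs.
cuML is part of RAPIDS.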
PyCaret is a wrapper around several machine learning libraries and frameworks.
These include scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray.
Supervised machine learning
Classification
- Binary classification
- Multiclass classification
pycaret.classification
Official API reference for all classification functions
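Regression
pycaret.regression
Official API reference for all regression functions

Unsupervised machine learning
Anomaly Detection
pycaret.anomaly
Official API reference for anomaly detection

Clustering
pycaret.clustering
Official API reference for clustering

Time Series Forecasting
pycaret.time_series
Official API reference for time series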

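Basic steps of a PyCaret analysis
- Load the data with get_data
- Initialize the experiment with setup, importing the module for the type of analysis
- Train and select models
- Visualize the best model
- Predict on the test set
- Predict on new data
- Save the model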
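The steps above can be sketched as a single function. This is an illustrative outline rather than the canonical workflow: it assumes the classification module and the bundled `hepatitis` dataset with target `Class`. The function name, the `session_id` value, and the output file name are my own choices. PyCaret is imported inside the function, so defining it does not require the package.

```python
def run_basic_workflow(dataset_name="hepatitis", target="Class"):
    """Walk through the seven steps on one of PyCaret's bundled datasets."""
    # Lazy imports: PyCaret is only needed when the function is called.
    from pycaret.datasets import get_data
    from pycaret.classification import (
        setup, compare_models, plot_model,
        predict_model, finalize_model, save_model,
    )

    data = get_data(dataset_name)                     # 1. load the data
    setup(data=data, target=target, session_id=123)   # 2. initialize the experiment
    best = compare_models()                           # 3. train and select models
    plot_model(best, plot="confusion_matrix")         # 4. visualize the best model
    holdout_predictions = predict_model(best)         # 5. predict on the hold-out test set

    # 6. predict on new data; a few rows without the target stand in for
    #    unseen data here (an assumption made purely for illustration).
    final_model = finalize_model(best)
    new_data = data.drop(columns=[target]).head()
    new_predictions = predict_model(final_model, data=new_data)

    save_model(final_model, "best_model_pipeline")    # 7. save the model
    return final_model, holdout_predictions, new_predictions
```

Calling `run_basic_workflow()` in a notebook cell runs all seven steps; for a regression problem the same calls would come from `pycaret.regression` instead.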
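Data preprocessing
Data preprocessing (original article)
Missing values, usually blanks or NaN
Calling the setup function initializes the experiment and automatically imputes missing values.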
# load dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class')


The lower the MAPE, the closer the imputed values are to the true values.
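To make the MAPE comparison concrete, here is a small worked example in plain Python. The numbers are made up for illustration and the two "imputers" are hypothetical; the point is only to show how MAPE (mean absolute percentage error) ranks two sets of imputed values against the known true values.

```python
def mape(true_values, imputed_values):
    """Mean absolute percentage error, in percent. True values must be non-zero."""
    errors = [abs(t - p) / abs(t) for t, p in zip(true_values, imputed_values)]
    return 100.0 * sum(errors) / len(errors)


# Toy data: true values of three cells that were masked out on purpose.
true_values = [10.0, 20.0, 40.0]
fill_a = [12.0, 18.0, 40.0]  # hypothetical imputer A
fill_b = [20.0, 20.0, 20.0]  # hypothetical imputer B (a constant fill)

print(f"MAPE of A: {mape(true_values, fill_a):.1f}%")  # about 10%
print(f"MAPE of B: {mape(true_values, fill_b):.1f}%")  # about 50%
```

Imputer A has the lower MAPE, so its filled values are closer to the truth.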
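Default imputation of missing data
Numeric values
numeric_imputation: int, float, or string, default: mean. The mean is used by default; accepted values:
drop: drop the rows that contain missing values
mean: fill with the mean
median: fill with the median
mode: fill with the most frequent value
knn: fill using the k-nearest neighbors method
int or float: fill with the given number
Categorical values: categorical_imputation: string, default: mode
Accepted values:
drop: drop the rows that contain missing values
mode: fill with the most frequent value
str: fill with the given string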
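What the simple strategies listed above compute can be shown on a single column in plain Python. This is an illustration of the idea only, not PyCaret's implementation: `impute_simple` is a hypothetical helper, missing values are represented by `None`, and `knn` is left out because it needs the other columns.

```python
from statistics import mean, median, mode


def impute_simple(values, strategy):
    """Fill None entries of one column.

    strategy: "drop", "mean", "median", "mode", or a constant (number or string).
    """
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        return observed  # drop the entries (rows) with a missing value
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        fill = strategy  # any other value is used as a constant fill
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 30, 100]
print(impute_simple(ages, "mean"))    # fills with the mean of 20, 30, 30, 100
print(impute_simple(ages, "median"))  # fills with the median
print(impute_simple(["a", None, "b", "a"], "mode"))       # fills with "a"
print(impute_simple(["a", None, "b", "a"], "unknown"))    # constant string fill
```

The example shows why the choice matters: the outlier 100 pulls the mean up to 45, while the median and the mode both stay at 30.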
imputation_type sets the type of imputation
The default is simple
Accepted values: simple, iterative, None
If None, no imputation is performed
Models used for iterative imputation
numeric_iterative_imputer: str or sklearn estimator, default: lightgbm
categorical_iterative_imputer: str or sklearn estimator, default: lightgbm
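The imputation parameters above are all passed to `setup`. The sketch below collects them as keyword-argument dictionaries so the two variants can be compared side by side. The dictionary names and the wrapper function are my own, and `median` is just one of the accepted values chosen for the example.

```python
# Simple imputation: one statistic per column type.
simple_imputation_kwargs = {
    "imputation_type": "simple",
    "numeric_imputation": "median",
    "categorical_imputation": "mode",
}

# Iterative imputation: a model predicts each missing value from the other columns.
iterative_imputation_kwargs = {
    "imputation_type": "iterative",
    "numeric_iterative_imputer": "lightgbm",
    "categorical_iterative_imputer": "lightgbm",
}


def setup_with_imputation(data, target, imputation_kwargs):
    """Call pycaret.classification.setup with the chosen imputation settings."""
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import setup
    return setup(data=data, target=target, **imputation_kwargs)
```

With the `hepatitis` data loaded earlier, a call would look like `setup_with_imputation(hepatitis, "Class", iterative_imputation_kwargs)`.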
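Data types include numeric, categorical, and datetime; PyCaret detects the data types automatically.
If the type PyCaret detects does not match what you expect, you can manually set the column to the correct type.
One-hot encoding: for categorical features in the dataset that contain label values
Ordinal encoding: for categorical features whose values have an inherent natural order, for example (low, medium, high)
Cardinal encoding: for categorical features with many levels (high cardinality)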
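The difference between one-hot and ordinal encoding is easiest to see on a tiny column. The two helpers below are a plain-Python illustration with hypothetical names, not what PyCaret runs internally. In PyCaret itself the order for ordinal columns is, as far as I know, supplied through the `ordinal_features` argument of `setup`.

```python
def one_hot_encode(values):
    """Turn each value into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


def ordinal_encode(values, order):
    """Replace each value by its rank in the explicit `order` list."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "blue", "red"]          # labels with no natural order
levels = ["low", "high", "medium"]       # labels with a natural order

print(one_hot_encode(colors))
print(ordinal_encode(levels, order=["low", "medium", "high"]))
```

One-hot encoding adds one column per category, which is why features with many levels are usually handled with a different encoding.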
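Target imbalance: when the target classes in the training data are unevenly distributed, this can be corrected with the fix_imbalance parameter of setup.
Outlier removal: remove_outliers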
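A quick look at the class distribution helps decide whether `fix_imbalance` is worth switching on. The sketch below uses toy labels, and the 20% threshold is my own rule of thumb, not something PyCaret defines. The resulting dictionary is meant to be passed to `setup` as extra keyword arguments.

```python
from collections import Counter


def class_shares(labels):
    """Return the share of each class among the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def looks_imbalanced(labels, min_share=0.2):
    """Heuristic: flag the target if the rarest class falls below min_share."""
    return min(class_shares(labels).values()) < min_share


labels = ["no"] * 9 + ["yes"] * 1  # toy target column: 90% "no", 10% "yes"

extra_setup_kwargs = {"remove_outliers": True}
if looks_imbalanced(labels):
    extra_setup_kwargs["fix_imbalance"] = True

print(class_shares(labels))
print(extra_setup_kwargs)
```

The dictionary can then be used as `setup(data=..., target=..., **extra_setup_kwargs)`.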
Models available in PyCaret 3
Classification models
| Abbreviation | Model name |
|---|---|
| lr | Logistic Regression |
| knn | K Neighbors Classifier |
| nb | Naive Bayes |
| dt | Decision Tree Classifier |
| svm | SVM - Linear Kernel |
| rbfsvm | SVM - Radial Kernel |
| gpc | Gaussian Process Classifier |
| mlp | MLP Classifier |
| ridge | Ridge Classifier |
| rf | Random Forest Classifier |
| qda | Quadratic Discriminant Analysis |
| ada | Ada Boost Classifier |
| gbc | Gradient Boosting Classifier |
| lda | Linear Discriminant Analysis |
| et | Extra Trees Classifier |
| xgboost | Extreme Gradient Boosting |
| lightgbm | Light Gradient Boosting Machine |
| catboost | CatBoost Classifier |
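The abbreviations in the table are the IDs that PyCaret's functions accept. The sketch below shows two common uses: training a single model by ID with `create_model`, and restricting `compare_models` to a shortlist through its `include` argument. The function names, the shortlist, and the `session_id` value are my own choices for illustration.

```python
# A shortlist of IDs taken from the table above.
CANDIDATE_IDS = ["lr", "rf", "lightgbm"]


def train_by_id(data, target, model_id="rf"):
    """Train one classifier chosen by its abbreviation."""
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import setup, create_model
    setup(data=data, target=target, session_id=123)
    return create_model(model_id)


def compare_candidates(data, target, candidate_ids=CANDIDATE_IDS):
    """Compare only the shortlisted models and return the best one."""
    from pycaret.classification import setup, compare_models
    setup(data=data, target=target, session_id=123)
    return compare_models(include=list(candidate_ids))
```

After `setup` has been called, `models()` from the same module prints the table of IDs available in the installed version, which may differ slightly from the list here.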
Regression models
| Abbreviation | Model name |
|---|---|
| lr | Linear Regression |
| lasso | Lasso Regression |
| ridge | Ridge Regression |
| en | Elastic Net |
| lar | Least Angle Regression |
| llar | Lasso Least Angle Regression |
| omp | Orthogonal Matching Pursuit |
| br | Bayesian Ridge |
| ard | Automatic Relevance Determination |
| par | Passive Aggressive Regressor |
| ransac | Random Sample Consensus |
| tr | TheilSen Regressor |
| huber | Huber Regressor |
| kr | Kernel Ridge |
| svm | Support Vector Regression |
| knn | K Neighbors Regressor |
| dt | Decision Tree Regressor |
| rf | Random Forest Regressor |
| et | Extra Trees Regressor |
| ada | AdaBoost Regressor |
| gbr | Gradient Boosting Regressor |
| mlp | MLP Regressor |
| xgboost | Extreme Gradient Boosting |
| lightgbm | Light Gradient Boosting Machine |
| catboost | CatBoost |
Time series models
| Abbreviation | Model name |
|---|---|
| naive | Naive Forecaster |
| grand_means | Grand Means Forecaster |
| snaive | Seasonal Naive Forecaster (disabled when seasonal_period = 1) |
| polytrend | Polynomial Trend Forecaster |
| arima | ARIMA family of models (ARIMA, SARIMA, SARIMAX) |
| auto_arima | Auto ARIMA |
| exp_smooth | Exponential Smoothing |
| stlf | STL Forecaster |
| croston | Croston Forecaster |
| ets | ETS |
| theta | Theta Forecaster |
| tbats | TBATS |
| bats | BATS |
| prophet | Prophet Forecaster |
| lr_cds_dt | Linear w/ Cond. Deseasonalize & Detrending |
| en_cds_dt | Elastic Net w/ Cond. Deseasonalize & Detrending |
| ridge_cds_dt | Ridge w/ Cond. Deseasonalize & Detrending |
| lasso_cds_dt | Lasso w/ Cond. Deseasonalize & Detrending |
| llar_cds_dt | Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending |
| br_cds_dt | Bayesian Ridge w/ Cond. Deseasonalize & Detrending |
| huber_cds_dt | Huber w/ Cond. Deseasonalize & Detrending |
| omp_cds_dt | Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending |
| knn_cds_dt | K Neighbors w/ Cond. Deseasonalize & Detrending |
| dt_cds_dt | Decision Tree w/ Cond. Deseasonalize & Detrending |
| rf_cds_dt | Random Forest w/ Cond. Deseasonalize & Detrending |
| et_cds_dt | Extra Trees w/ Cond. Deseasonalize & Detrending |
| gbr_cds_dt | Gradient Boosting w/ Cond. Deseasonalize & Detrending |
| ada_cds_dt | AdaBoost w/ Cond. Deseasonalize & Detrending |
| lightgbm_cds_dt | Light Gradient Boosting w/ Cond. Deseasonalize & Detrending |
| catboost_cds_dt | CatBoost w/ Cond. Deseasonalize & Detrending |
Clustering models
| Abbreviation | Model name |
|---|---|
| kmeans | K-Means Clustering |
| ap | Affinity Propagation |
| meanshift | Mean shift Clustering |
| sc | Spectral Clustering |
| hclust | Agglomerative Clustering |
| dbscan | Density-Based Spatial Clustering |
| optics | OPTICS Clustering |
| birch | Birch Clustering |
| kmodes | K-Modes Clustering |
Anomaly detection models
| Abbreviation | Model name |
|---|---|
| abod | Angle-base Outlier Detection |
| cluster | Clustering-Based Local Outlier |
| cof | Connectivity-Based Outlier Factor |
| histogram | Histogram-based Outlier Detection |
| iforest | Isolation Forest |
| knn | k-Nearest Neighbors Detector |
| lof | Local Outlier Factor |
| svm | One-class SVM detector |
| pca | Principal Component Analysis |
| mcd | Minimum Covariance Determinant |
| sod | Subspace Outlier Detection |
| sos | Stochastic Outlier Selection |
