决策树算法详解：从经典ID3到现代梯度提升树

基础概念与原理

什么是决策树？

决策树是一种分类和回归的非参数学习算法。通过一系列问题将样本逐步分割，最终得到叶子节点的预测结果。决策树的核心思想是递归分割，在每个节点选择最优特征进行分割，使得分割后的子集更加”纯净”。

决策树的结构：

                [Root: 年龄]
                /          \
          ≤30岁            >30岁
          /                    \
    [是否学生？]          [收入>5w？]
    /        \                /      \
  是          否            是        否
  |           |             |         |
 批准        [信用评分]     批准    [贷款历史]
             /      \               /        \
           ≥700    <700           好        差
            |        |             |         |
           批准      拒绝          批准      拒绝

决策树的优点

✅ 直观易懂：树形结构易于可视化和理解
✅ 无需数据归一化：对特征尺度无要求
✅ 自动特征选择：自动处理特征重要性
✅ 处理非线性关系：能捕捉复杂的特征交互
✅ 快速预测：时间复杂度为 O(log n)

决策树的缺点

❌ 过拟合风险：容易在训练数据上过度拟合
❌ 不稳定性：小数据变化可能导致完全不同的树结构
❌ 贪心算法：局部最优不保证全局最优
❌ 数据不平衡敏感：对类别不平衡数据较敏感

信息论基础

1. 信息熵（Entropy）

信息熵用于度量样本集合的不确定性。熵越大，不确定性越大；熵越小，样本越纯净。

数学定义：

对于样本集合 $D$，包含 $K$ 个类别，第 $k$ 个类别的样本比例为 $p_k$，信息熵为：

\[H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k\]

示例计算：

假设一个包含10个样本的数据集：

类别A：7个 ($p_A = 0.7$)
类别B：3个 ($p_B = 0.3$)

$H(D) = -0.7 \times \log_2(0.7) - 0.3 \times \log_2(0.3)$ $= -0.7 \times (-0.515) - 0.3 \times (-1.737)$ $= 0.361 + 0.521 = 0.882 \text{ bits}$

2. 信息增益（Information Gain）

信息增益是分割前后信息熵的差值，用于选择最优分割特征。

数学定义：

按特征 $A$ 分割后，样本集合 $D$ 被分为 $n$ 个子集 $D_1, D_2, \ldots, D_n$，信息增益为：

\[Gain(D, A) = H(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)\]

其中：

$H(D)$：分割前的信息熵
$H(D_i)$：第 $i$ 个子集的信息熵
$ D_i / D $：子集的权重

案例演示：

考虑决策一个贷款申请：

原始数据集：

总样本数：10
批准：7个
拒绝：3个
$H(D) = 0.882$ bits

按”年龄>30”分割：

特征值	样本数	批准	拒绝	信息熵
≤30岁	4	1	3	0.811
>30岁	6	6	0	0.000

$Gain_{年龄} = 0.882 - \frac{4}{10} \times 0.811 - \frac{6}{10} \times 0.000$ $= 0.882 - 0.324 = 0.558 \text{ bits}$

按”信用评分>700”分割：

特征值	样本数	批准	拒绝	信息熵
≤700	3	1	2	0.918
>700	7	6	1	0.592

$Gain_{信用评分} = 0.882 - \frac{3}{10} \times 0.918 - \frac{7}{10} \times 0.592$ $= 0.882 - 0.275 - 0.414 = 0.193 \text{ bits}$

结论：由于 $Gain_{年龄} > Gain_{信用评分}$，选择”年龄”作为根节点的分割特征。

3. 信息增益率（Gain Ratio）

为了避免偏向多值特征，引入增益率的概念。特征分裂信息定义为：

\[SplitInfo(D, A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2\left(\frac{|D_i|}{|D|}\right)\]

增益率公式：

\[GainRatio(D, A) = \frac{Gain(D, A)}{SplitInfo(D, A)}\]

4. 基尼指数（Gini Index）

基尼指数也是度量数据纯度的指标，特别是在CART算法中使用。

数学定义：

\[Gini(D) = 1 - \sum_{k=1}^{K} p_k^2\]

基尼增益：

\[\Delta Gini(D, A) = Gini(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} Gini(D_i)\]

对比与示例：

对于前面的例子（7个正例，3个负例）：

\[Gini(D) = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42\]

当按”年龄>30”分割：

子集1（≤30）：1个正，3个负 → $Gini_1 = 1 - (0.25^2 + 0.75^2) = 0.375$
子集2（>30）：6个正，0个负 → $Gini_2 = 0$

\[\Delta Gini = 0.42 - \frac{4}{10} \times 0.375 - \frac{6}{10} \times 0 = 0.42 - 0.15 = 0.27\]

经典决策树算法

1. ID3算法（Iterative Dichotomiser 3）

发表时间：1986年，由Ross Quinlan提出

核心特征：

使用信息增益选择分割特征
只能处理离散特征
倾向于选择多值特征（偏好特征值较多的特征）
无法处理缺失值

算法流程：

INPUT: 样本集 D，特征集 A，阈值 ε
OUTPUT: 决策树 T

如果 D 中所有样本属于同一类别 C，则 T = 叶子节点(C)；返回
如果 A 为空，则 T = 叶子节点(D中数量最多的类别)；返回
如果 D 中所有样本在 A 上取值相同，则 T = 叶子节点(D中数量最多的类别)；返回
计算 D 的信息熵 H(D)
对于 A 中的每个特征 a，计算其信息增益 Gain(D, a)
选择增益最大的特征 a* 作为分割特征
如果 Gain(D, a*) < ε，则 T = 叶子节点(D中数量最多的类别)；返回
否则，对 a* 的每个取值 v，以 D_v = {x ∈ D | a*(x) = v} 为新样本集递归构建子树
返回 T

ID3的局限性：

多值偏好：ID3倾向于选择取值较多的特征，即使这些特征的预测能力较弱
过拟合严重：无剪枝机制，容易过拟合
离散特征限制：无法直接处理连续特征

2. C4.5算法

发表时间：1993年，由Ross Quinlan改进ID3

核心改进：

方面	ID3	C4.5
特征选择准则	信息增益	增益率
特征类型	仅离散	离散 + 连续
缺失值处理	不支持	支持
剪枝	无	有（错误率剪枝）
过拟合控制	差	较好

C4.5的关键改进：

1. 增益率（Gain Ratio）

为解决ID3偏好多值特征的问题，C4.5使用增益率而非信息增益：

\[GainRatio(D, A) = \frac{Gain(D, A)}{SplitInfo(D, A)}\]

这样对于多值特征的信息增益会被分割信息的增大所抵消。

2. 连续特征处理

对于连续特征，C4.5通过二分查找最优分割点：

给定特征 $a$ 和样本集 $D$，按 $a$ 的值递增排序。在相邻两个不同值的中点处尝试分割：

\[threshold = \frac{a_i + a_{i+1}}{2}\]

计算每个阈值的增益率，选择增益率最大的阈值。

3. 剪枝策略

C4.5使用悲观错误率剪枝（Pessimistic Error Rate Pruning）：

对于一个节点，如果剪枝后的错误率较小，则进行剪枝。错误率估计为：

\[error(T) = errors(T) + \frac{k}{2}\]

其中 $k$ 是节点的样本数。

3. CART算法（Classification And Regression Tree）

发表时间：1984年，由Breiman等人提出

核心特征：

使用基尼指数选择分割特征
二叉树结构（每个节点最多分两个分支）
能处理回归问题
支持剪枝

CART vs C4.5：

特征	CART	C4.5
树结构	二叉树	多叉树
分裂准则	基尼指数	增益率
问题类型	分类 + 回归	主要分类
剪枝	代价复杂度剪枝	悲观错误率剪枝
实现复杂度	中等	较高

CART的分裂准则：

\[\Delta Gini(D, A) = Gini(D) - \sum_{i=1}^{2} \frac{|D_i|}{|D|} Gini(D_i)\]

CART的剪枝：代价复杂度剪枝

定义树的代价复杂度：

\[C_{\alpha}(T) = L(T) + \alpha |T|\]

其中：

$L(T)$：树上的误分类数
$ T $：树的叶子数
$\alpha$：复杂度参数

通过递减的 $\alpha$ 序列，得到一个由简到繁的树序列。

CART回归树：

对于回归问题，CART使用平方误差：

\[L(T) = \sum_{t=1}^{|T|} \sum_{x \in D_t} (y_x - \hat{y}_t)^2\]

其中 $\hat{y}_t$ 是第 $t$ 个叶子的预测值。

现代提升算法

1. Bagging与随机森林

随机森林不是单一决策树，而是多棵树的集成方法：

核心思想：

通过自助采样（Bootstrap Sampling）产生多个训练集
对每个训练集独立训练一棵决策树
通过投票（分类）或平均（回归）聚合预测结果

随机森林的数学表达：

对于分类问题，预测结果为：

\[\hat{y} = \text{majority vote}\{T_1(x), T_2(x), \ldots, T_B(x)\}\]

对于回归问题：

\[\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]

优势：

减少方差
提高泛化能力
提供特征重要性评估

2. Boosting：AdaBoost

概念：按顺序学习弱学习器，每次调整样本权重以关注被错分的样本。

AdaBoost的更新规则：

初始权重：$w_i^{(1)} = \frac{1}{N}$

第 $m$ 轮迭代：

用权重 $w_i^{(m)}$ 训练基学习器 $h_m(x)$
计算加权误差率： $\epsilon_m = \sum_{i=1}^{N} w_i^{(m)} \mathbb{1}[h_m(x_i) \neq y_i]$
计算学习器权重： $\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$
更新样本权重： $w_i^{(m+1)} = \frac{w_i^{(m)} \exp(-\alpha_m y_i h_m(x_i))}{Z_m}$

其中 $Z_m$ 是归一化常数。

最终预测：

\[H(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)\]

3. 梯度提升决策树（GBDT）

概念：通过拟合残差逐步改进模型。

GBDT的核心思想：

\[F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\]

其中：

$F_m(x)$：第 $m$ 个模型的预测
$h_m(x)$：新加入的弱学习器（决策树）
$\eta$：学习率（收缩参数）

GBDT的算法流程：

INPUT: 训练集 {(x_i, y_i)}_{i=1}^N，迭代次数 M，学习率 η
OUTPUT: 模型 F(x)

1. 初始化 F_0(x) = argmin_γ Σ L(y_i, γ)
   （通常为常数，如分类问题中的0.5）

2. FOR m = 1 TO M:
   a. 计算伪残差：
      r_im = -∂L(y_i, F_{m-1}(x_i))/∂F_{m-1}(x_i)
   
   b. 拟合一棵回归树 h_m 来预测 {r_im}
   
   c. 计算最优步长：
      γ_m = argmin_γ Σ L(y_i, F_{m-1}(x_i) + γh_m(x_i))
   
   d. 更新模型：
      F_m(x) = F_{m-1}(x) + η·γ_m·h_m(x)

3. 返回 F_M(x)

损失函数：

对于不同的问题，使用不同的损失函数：

平方误差（回归）：$L(y, \hat{y}) = (y - \hat{y})^2$
绝对误差（鲁棒回归）：$L(y, \hat{y}) = y - \hat{y} $
对数损失（二分类）：$L(y, \hat{y}) = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$

4. XGBoost（极端梯度提升）

发表时间：2016年，由陈天奇开发

XGBoost相比GBDT的改进：

正则化项

XGBoost加入了显式的正则化项：

\[L = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(h_m)\]

其中：

\[\Omega(h) = \gamma |T| + \frac{1}{2}\lambda \sum_{j=1}^{|T|} w_j^2\]

$ T $：叶子数
$w_j$：叶子权重
$\gamma$：叶子复杂度惩罚
$\lambda$：权重 L2 正则化系数

二阶泰勒展开

对损失函数进行二阶泰勒展开：

\[L \approx \sum_{i=1}^{n} [L(y_i, \hat{y}_i^{(m-1)}) + g_i h_m(x_i) + \frac{1}{2}h_i h_m(x_i)^2] + \Omega(h_m)\]

其中：

$g_i = \partial L(y_i, \hat{y}_i^{(m-1)})/\partial \hat{y}_i^{(m-1)}$（一阶导数）
$h_i = \partial^2 L(y_i, \hat{y}_i^{(m-1)})/\partial (\hat{y}_i^{(m-1)})^2$（二阶导数）

信息增益最大化

对于二分类分裂，XGBoost最大化的目标函数为：

\[Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma\]

其中：

$G_L, G_R$：左右子树的一阶导数和
$H_L, H_R$：左右子树的二阶导数和

处理缺失值

XGBoost通过学习默认方向处理缺失值：

对每个特征，学习当值缺失时样本应该进入左子树还是右子树。

列采样

训练每棵树时，按列（特征）进行随机采样，减少过拟合。

5. LightGBM（Light Gradient Boosting Machine）

发表时间：2016年底，由微软开发

LightGBM的创新：

叶子生长策略

GBDT：按层生长（Level-wise）
LightGBM：按叶生长（Leaf-wise）

Leaf-wise策略选择最大损失减少的叶子进行分裂：

\[Loss\_reduction = max\_leaf(Gain_{leaf})\]

这通常能用更少的树达到同样的精度。

直方图学习

不是对每个特征值进行分裂，而是先将特征值分组到直方图中：

\[h_k = [min(x), min(x) + \frac{max(x) - min(x)}{bins}, \ldots, max(x)]\]

优势：

内存占用减少 95%
训练速度提升 10x

类别特征原生支持

直接处理分类特征，无需独热编码：

对于分类特征 $a$，枚举所有可能的子集作为分裂点。

单边梯度采样（GOSS）

只保留梯度较大的样本进行训练（梯度大表示误差大）：

\[top\_p \% \text{ 梯度较大的样本} + random(1-p)\% \text{ 梯度较小的样本}\]

减少计算量同时保持精度。

6. CatBoost（Categorical Boosting）

发表时间：2017年，由Yandex开发

CatBoost的特色：

有序编码（Ordered Encoding）

对分类特征进行有序编码，防止目标泄露：

\[x_i = \frac{\sum_{j < i} [x_j = x_i] \cdot y_j + prior}{\sum_{j < i} [x_j = x_i] + 1}\]

排列树生长

使用样本的随机排列来选择分裂点，进一步防止过拟合。

对称树结构

所有分裂使用相同的特征，产生对称的树结构，提高泛化能力。

实际案例应用

案例背景：信用卡欺诈检测

问题描述：

一个银行需要实时检测信用卡交易中的欺诈行为。

数据特征：

交易金额（连续）
交易地点（离上次交易距离，连续）
交易时间（离上次交易时间，连续）
商户类型（分类）
设备特征（分类）
账户特征（年龄、月消费、分类）

数据样本：

总样本数：100,000
正常交易：99,700（99.7%）
欺诈交易：300（0.3%）

类别不平衡问题：

这是典型的极端不平衡分类问题。直接使用准确率会导致模型倾向于预测所有交易都是正常的。

需要使用：

混淆矩阵分析
精确率/召回率 权衡
ROC-AUC 指标
类别权重 调整

数学建模

目标函数（加权对数损失）：

\[L = -\frac{1}{n} \sum_{i=1}^{n} w_i [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]\]

其中：

$w_i = w_{negative}$ 当 $y_i = 0$
$w_i = w_{positive}$ 当 $y_i = 1$
通常设置：$\frac{w_{positive}}{w_{negative}} = \frac{n_{negative}}{n_{positive}}$

对于我们的数据：

\[weight\_ratio = \frac{99,700}{300} \approx 332\]

XGBoost模型构建

模型超参数选择：

max_depth = 6              # 树的最大深度
eta = 0.05                 # 学习率（收缩）
num_rounds = 500           # 迭代次数
min_child_weight = 3       # 叶子最小权重
subsample = 0.8            # 样本采样率
colsample_bytree = 0.8     # 特征采样率
scale_pos_weight = 332     # 正例权重
eval_metric = "auc"        # 评估指标

为什么选择这些参数：

max_depth = 6：平衡偏差和方差，防止过拟合
eta = 0.05：较小的学习率，需要更多迭代但更稳定
scale_pos_weight = 332：处理类别不平衡，加大对欺诈交易的权重
subsample = 0.8：行采样，增加随机性，防止过拟合
colsample_bytree = 0.8：列采样，减少计算量

特征工程

关键特征构造：

交易异常度： $anomaly\_score = \frac{|交易金额 - 账户平均消费|}{账户消费标准差}$
交易频率：
- 最近24小时交易次数
- 最近7天交易次数
地理距离： $distance = \sqrt{(lon_1 - lon_2)^2 + (lat_1 - lat_2)^2}$
时间差异：
- 与上次交易的时间差
- 交易发生的小时
行为统计：
- 该商户类型的平均消费
- 账户历史最大单笔交易
- 交易成功率

完整代码示例

1. 数据生成与预处理

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, 
                             f1_score, roc_auc_score, roc_curve, auc)
import matplotlib.pyplot as plt
import seaborn as sns

# 设置随机种子
np.random.seed(42)

# 生成模拟欺诈检测数据集
n_normal = 9970
n_fraud = 30
n_total = n_normal + n_fraud

# 正常交易特征
normal_data = {
    'amount': np.random.lognormal(3.5, 1.5, n_normal),
    'distance': np.random.exponential(100, n_normal),
    'time_diff': np.random.exponential(3600, n_normal),
    'transaction_count_24h': np.random.poisson(3, n_normal),
    'merchant_type': np.random.choice(['retail', 'online', 'atm'], n_normal, p=[0.5, 0.4, 0.1]),
    'device_type': np.random.choice(['mobile', 'card', 'atm'], n_normal, p=[0.3, 0.6, 0.1])
}

# 欺诈交易特征（异常特征）
fraud_data = {
    'amount': np.random.lognormal(4.5, 1.2, n_fraud),  # 更大金额
    'distance': np.random.exponential(500, n_fraud),    # 更大距离
    'time_diff': np.random.exponential(7200, n_fraud),  # 更大时间差
    'transaction_count_24h': np.random.poisson(8, n_fraud),  # 更多交易
    'merchant_type': np.random.choice(['online', 'retail'], n_fraud, p=[0.8, 0.2]),
    'device_type': np.random.choice(['mobile', 'atm'], n_fraud, p=[0.7, 0.3])
}

# 创建数据框
df_normal = pd.DataFrame(normal_data)
df_normal['is_fraud'] = 0

df_fraud = pd.DataFrame(fraud_data)
df_fraud['is_fraud'] = 1

df = pd.concat([df_normal, df_fraud], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)

print(f"数据集大小: {len(df)}")
print(f"欺诈比例: {df['is_fraud'].mean():.2%}")
print(f"\n特征统计:\n{df.describe()}")

# 特征编码
df_encoded = df.copy()
df_encoded['merchant_type'] = pd.factorize(df_encoded['merchant_type'])[0]
df_encoded['device_type'] = pd.factorize(df_encoded['device_type'])[0]

# 分离特征和标签
X = df_encoded.drop('is_fraud', axis=1)
y = df_encoded['is_fraud']

# 分割数据集（8:2）
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

print(f"\n训练集: {len(X_train)} 样本")
print(f"测试集: {len(X_test)} 样本")
print(f"训练集欺诈比例: {y_train.mean():.2%}")
print(f"测试集欺诈比例: {y_test.mean():.2%}")

2. 决策树基础模型

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report

# 构建单一决策树
dt_model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

dt_model.fit(X_train, y_train)

# 预测
y_pred_dt = dt_model.predict(X_test)
y_pred_proba_dt = dt_model.predict_proba(X_test)[:, 1]

# 评估
print("=== 单一决策树模型 ===")
print("\n混淆矩阵:")
cm = confusion_matrix(y_test, y_pred_dt)
print(cm)

print("\n分类报告:")
print(classification_report(y_test, y_pred_dt, 
                           target_names=['正常', '欺诈']))

auc_dt = roc_auc_score(y_test, y_pred_proba_dt)
print(f"\nROC-AUC: {auc_dt:.4f}")

# 可视化树结构（仅显示前3层）
plt.figure(figsize=(20, 10))
plot_tree(dt_model, max_depth=3, feature_names=X.columns.tolist(),
          class_names=['正常', '欺诈'], filled=True)
plt.title("决策树结构（前3层）", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/decision_tree_structure.png', dpi=150, bbox_inches='tight')
plt.close()

# 特征重要性
feature_importance_dt = pd.DataFrame({
    'feature': X.columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n特征重要性:")
print(feature_importance_dt)

3. 随机森林模型

from sklearn.ensemble import RandomForestClassifier

# 构建随机森林
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

# 预测
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

# 评估
print("\n=== 随机森林模型 ===")
print("\n混淆矩阵:")
cm = confusion_matrix(y_test, y_pred_rf)
print(cm)

print("\n分类报告:")
print(classification_report(y_test, y_pred_rf, 
                           target_names=['正常', '欺诈']))

auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
print(f"\nROC-AUC: {auc_rf:.4f}")

# 特征重要性
feature_importance_rf = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n特征重要性:")
print(feature_importance_rf)

4. XGBoost模型（带参数调优）

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# 计算正例权重以处理类别不平衡
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

# XGBoost参数网格（简化版）
param_grid = {
    'max_depth': [4, 5, 6],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}

# 基础XGBoost模型
base_xgb = xgb.XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    eval_metric='auc',
    random_state=42,
    tree_method='hist'
)

# 网格搜索（注意：这会花费较长时间）
print("\n=== XGBoost模型参数调优 ===")
print("进行网格搜索...")

grid_search = GridSearchCV(
    base_xgb,
    param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# 仅用部分参数进行演示（完整调优需要更长时间）
simplified_param_grid = {
    'max_depth': [5],
    'learning_rate': [0.05],
    'n_estimators': [200],
}

grid_search_simplified = GridSearchCV(
    base_xgb,
    simplified_param_grid,
    scoring='roc_auc',
    cv=3,
    n_jobs=-1,
    verbose=0
)

grid_search_simplified.fit(X_train, y_train)

print(f"最佳参数: {grid_search_simplified.best_params_}")
print(f"最佳CV得分: {grid_search_simplified.best_score_:.4f}")

# 使用最佳模型
xgb_best = grid_search_simplified.best_estimator_

# 预测
y_pred_xgb = xgb_best.predict(X_test)
y_pred_proba_xgb = xgb_best.predict_proba(X_test)[:, 1]

# 评估
print("\n=== XGBoost最优模型结果 ===")
print("\n混淆矩阵:")
cm = confusion_matrix(y_test, y_pred_xgb)
print(cm)
print(f"真正例 (TP): {cm[1,1]}")
print(f"假正例 (FP): {cm[0,1]}")
print(f"真负例 (TN): {cm[0,0]}")
print(f"假负例 (FN): {cm[1,0]}")

print("\n分类报告:")
print(classification_report(y_test, y_pred_xgb, 
                           target_names=['正常', '欺诈']))

auc_xgb = roc_auc_score(y_test, y_pred_proba_xgb)
print(f"\nROC-AUC: {auc_xgb:.4f}")

# 特征重要性
feature_importance_xgb = pd.DataFrame({
    'feature': X.columns,
    'importance': xgb_best.feature_importances_
}).sort_values('importance', ascending=False)

print("\n特征重要性:")
print(feature_importance_xgb)

5. 模型对比与ROC曲线

# 模型性能对比
models_comparison = pd.DataFrame({
    '模型': ['决策树', '随机森林', 'XGBoost'],
    'ROC-AUC': [auc_dt, auc_rf, auc_xgb],
    '精确率': [
        precision_score(y_test, y_pred_dt),
        precision_score(y_test, y_pred_rf),
        precision_score(y_test, y_pred_xgb)
    ],
    '召回率': [
        recall_score(y_test, y_pred_dt),
        recall_score(y_test, y_pred_rf),
        recall_score(y_test, y_pred_xgb)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_dt),
        f1_score(y_test, y_pred_rf),
        f1_score(y_test, y_pred_xgb)
    ]
})

print("\n=== 模型性能对比 ===")
print(models_comparison.to_string(index=False))

# ROC曲线绘制
plt.figure(figsize=(12, 8))

# 决策树
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_proba_dt)
plt.plot(fpr_dt, tpr_dt, label=f'决策树 (AUC = {auc_dt:.4f})', linewidth=2)

# 随机森林
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)
plt.plot(fpr_rf, tpr_rf, label=f'随机森林 (AUC = {auc_rf:.4f})', linewidth=2)

# XGBoost
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_pred_proba_xgb)
plt.plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC = {auc_xgb:.4f})', linewidth=2)

# 随机分类器
plt.plot([0, 1], [0, 1], 'k--', label='随机分类器 (AUC = 0.5000)', linewidth=1)

plt.xlabel('假正例率 (False Positive Rate)', fontsize=12)
plt.ylabel('真正例率 (True Positive Rate)', fontsize=12)
plt.title('模型ROC曲线对比', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/roc_curves_comparison.png', dpi=150, bbox_inches='tight')
plt.close()

print("\nROC曲线已保存")

# 特征重要性对比
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 决策树
feature_importance_dt.plot(x='feature', y='importance', kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('决策树 - 特征重要性', fontsize=12, fontweight='bold')
axes[0].set_xlabel('重要性')

# 随机森林
feature_importance_rf.plot(x='feature', y='importance', kind='barh', ax=axes[1], color='forestgreen')
axes[1].set_title('随机森林 - 特征重要性', fontsize=12, fontweight='bold')
axes[1].set_xlabel('重要性')

# XGBoost
feature_importance_xgb.plot(x='feature', y='importance', kind='barh', ax=axes[2], color='coral')
axes[2].set_title('XGBoost - 特征重要性', fontsize=12, fontweight='bold')
axes[2].set_xlabel('重要性')

plt.tight_layout()
plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/feature_importance_comparison.png', dpi=150, bbox_inches='tight')
plt.close()

print("特征重要性对比图已保存")

6. 混淆矩阵可视化

# 混淆矩阵可视化
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

models = ['决策树', '随机森林', 'XGBoost']
predictions = [y_pred_dt, y_pred_rf, y_pred_xgb]

for idx, (model_name, y_pred) in enumerate(zip(models, predictions)):
    cm = confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=['正常', '欺诈'],
                yticklabels=['正常', '欺诈'])
    
    axes[idx].set_title(f'{model_name}\n混淆矩阵', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel('真实标签')
    axes[idx].set_xlabel('预测标签')

plt.tight_layout()
plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/confusion_matrices.png', dpi=150, bbox_inches='tight')
plt.close()

print("混淆矩阵可视化已保存")

7. 实际预测示例

# 实际预测示例
print("\n=== 实际预测示例 ===")

# 创建示例交易
example_transactions = pd.DataFrame({
    'amount': [50.0, 5000.0],
    'distance': [10.0, 1000.0],
    'time_diff': [3600, 50000],
    'transaction_count_24h': [2, 15],
    'merchant_type': [0, 1],  # retail vs online
    'device_type': [0, 1]     # mobile vs card
})

example_transactions_encoded = example_transactions.copy()

# 预测
proba_dt = dt_model.predict_proba(example_transactions_encoded)[0]
proba_rf = rf_model.predict_proba(example_transactions_encoded)[0]
proba_xgb = xgb_best.predict_proba(example_transactions_encoded)[0]

results = pd.DataFrame({
    '模型': ['决策树', '随机森林', 'XGBoost'],
    '正常概率': [proba_dt[0], proba_rf[0], proba_xgb[0]],
    '欺诈概率': [proba_dt[1], proba_rf[1], proba_xgb[1]],
    '预测': ['正常' if p[1] < 0.5 else '欺诈' for p in [proba_dt, proba_rf, proba_xgb]]
})

print("\n交易 1: 金额$50, 距离10km, 最近24h交易2次 (正常交易)")
print(results)

# 创建可疑交易
example_transactions_fraud = pd.DataFrame({
    'amount': [5000.0],
    'distance': [5000.0],
    'time_diff': [100000],
    'transaction_count_24h': [20],
    'merchant_type': [1],
    'device_type': [1]
})

example_transactions_fraud_encoded = example_transactions_fraud.copy()

proba_dt_fraud = dt_model.predict_proba(example_transactions_fraud_encoded)[0]
proba_rf_fraud = rf_model.predict_proba(example_transactions_fraud_encoded)[0]
proba_xgb_fraud = xgb_best.predict_proba(example_transactions_fraud_encoded)[0]

results_fraud = pd.DataFrame({
    '模型': ['决策树', '随机森林', 'XGBoost'],
    '正常概率': [proba_dt_fraud[0], proba_rf_fraud[0], proba_xgb_fraud[0]],
    '欺诈概率': [proba_dt_fraud[1], proba_rf_fraud[1], proba_xgb_fraud[1]],
    '预测': ['正常' if p[1] < 0.5 else '欺诈' for p in [proba_dt_fraud, proba_rf_fraud, proba_xgb_fraud]]
})

print("\n交易 2: 金额$5000, 距离5000km, 最近24h交易20次 (可疑交易)")
print(results_fraud)

8. 模型解释性分析

import shap

# 计算SHAP值用于XGBoost模型解释
print("\n=== 模型解释性分析（SHAP）===")

# 创建SHAP Explainer
explainer = shap.TreeExplainer(xgb_best)
shap_values = explainer.shap_values(X_test)

# 绘制SHAP汇总图
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
plt.title('SHAP特征重要性', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/shap_summary.png', dpi=150, bbox_inches='tight')
plt.close()

print("SHAP汇总图已保存")

# 绘制SHAP依赖图（最重要特征）
plt.figure(figsize=(12, 8))
shap.dependence_plot(0, shap_values, X_test, feature_names=X.columns.tolist(), show=False)
plt.title('SHAP依赖图 - 最重要特征', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/shap_dependence.png', dpi=150, bbox_inches='tight')
plt.close()

print("SHAP依赖图已保存")

# 选择一个欺诈案例进行详细解释
fraud_indices = y_test[y_test == 1].index
if len(fraud_indices) > 0:
    fraud_idx = fraud_indices[0]
    fraud_idx_in_test = list(X_test.index).index(fraud_idx)
    
    plt.figure(figsize=(12, 8))
    shap.force_plot(explainer.expected_value, 
                    shap_values[fraud_idx_in_test], 
                    X_test.iloc[fraud_idx_in_test],
                    feature_names=X.columns.tolist(),
                    matplotlib=True,
                    show=False)
    plt.title('单个欺诈交易的SHAP力图', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('/Users/tov/Library/Mobile Documents/com~apple~CloudDocs/personal website/DoraemonJack.github.io/_posts/python_scripts/shap_force.png', dpi=150, bbox_inches='tight')
    plt.close()
    
    print("SHAP力图已保存")

print("\n分析完成！")

总结与展望

决策树的发展历程总结

阶段	主要算法	时间	特点
第一阶段	ID3	1986	信息增益，仅离散特征
第二阶段	C4.5	1993	增益率，连续特征，剪枝
第三阶段	CART	1984	基尼指数，二叉树，通用性强
集成阶段	Bagging/随机森林	2001	降低方差
提升阶段	Boosting/AdaBoost/GBDT	2000s	降低偏差
现代阶段	XGBoost/LightGBM/CatBoost	2016+	高效，可扩展，工业级

关键数学概念回顾

熵和信息增益：衡量特征分裂纯度的关键指标
基尼指数：CART和后续算法的基础
泰勒展开：XGBoost利用二阶导数优化的数学基础
正则化：防止过拟合的重要手段
目标函数：损失函数 + 正则化项的平衡

实际应用建议

数据预处理：
- 处理缺失值
- 处理类别不平衡
- 特征工程和特征选择
模型选择：
- 数据量小：使用CART或C4.5
- 数据量中等：使用随机森林
- 数据量大/对精度要求高：使用XGBoost/LightGBM
超参数调优：
- 使用交叉验证
- 网格搜索或随机搜索
- 关注过拟合和欠拟合
模型评估：
- 分类问题：精确率、召回率、F1、ROC-AUC
- 回归问题：MAE、RMSE、R²
- 使用SHAP进行模型解释
生产部署：
- 监控模型性能漂移
- 定期重新训练
- 实现模型版本控制

参考文献

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Quinlan, J. R. (1993). C4. 5: Programs for machine learning. Morgan Kaufmann.
Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Chapman and Hall/CRC.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference.
Ke, G., et al. (2017). LightGBM: A fast, distributed, gradient boosting framework. Advances in Neural Information Processing Systems, 30.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11372.
Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28), 307-317.
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765-4774).

决策树算法详解：从经典ID3到现代梯度提升树

目录

基础概念与原理

什么是决策树？

决策树的优点

决策树的缺点

信息论基础

1. 信息熵（Entropy）

2. 信息增益（Information Gain）

3. 信息增益率（Gain Ratio）

4. 基尼指数（Gini Index）

经典决策树算法

1. ID3算法（Iterative Dichotomiser 3）

2. C4.5算法

3. CART算法（Classification And Regression Tree）

现代提升算法

1. Bagging与随机森林

2. Boosting：AdaBoost

3. 梯度提升决策树（GBDT）

4. XGBoost（极端梯度提升）

正则化项

二阶泰勒展开

信息增益最大化

处理缺失值

列采样

5. LightGBM（Light Gradient Boosting Machine）

叶子生长策略

直方图学习

类别特征原生支持

单边梯度采样（GOSS）

6. CatBoost（Categorical Boosting）

有序编码（Ordered Encoding）

排列树生长

对称树结构

实际案例应用

案例背景：信用卡欺诈检测

数学建模

XGBoost模型构建

特征工程

完整代码示例

1. 数据生成与预处理

2. 决策树基础模型

3. 随机森林模型

4. XGBoost模型（带参数调优）

5. 模型对比与ROC曲线

6. 混淆矩阵可视化

7. 实际预测示例

8. 模型解释性分析

总结与展望

决策树的发展历程总结

关键数学概念回顾

实际应用建议

参考文献

CATALOG

FEATURED TAGS

FRIENDS