Probability Theory and Information Theory: The Mathematical Foundations of Deep Learning
Probability theory is a core mathematical tool for deep learning:
Core concepts of probability theory:
- Random variables: model uncertainty in data
- Probability distributions: model how data is distributed
- Bayes' theorem: inference and learning
- Maximum likelihood estimation: parameter learning

Common distributions at a glance:

| Distribution | Typical use | Parameters |
|---|---|---|
| Gaussian | Modeling continuous data | Mean, variance |
| Bernoulli | Binary classification | Probability p |
| Multinomial | Multi-class classification | Probability vector |
| Poisson | Count data | Rate parameter |
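Each of these distributions is also available in `scipy.stats`. As a quick sanity check, here is a minimal sketch evaluating one density or mass function per table row; the test points and parameters are arbitrary illustrative choices, not values from this article:

```python
from scipy import stats

# One evaluation per table row; inputs are arbitrary examples.
print(stats.norm.pdf(0.5, loc=0, scale=1))                         # Gaussian density at x=0.5
print(stats.bernoulli.pmf(1, 0.3))                                 # P(X=1) with p=0.3
print(stats.multinomial.pmf([2, 1, 1], n=4, p=[0.5, 0.25, 0.25]))  # counts over 3 categories
print(stats.poisson.pmf(2, 3))                                     # P(X=2) with rate 3
```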
Information theory concepts:
- Entropy: a measure of uncertainty
- Cross-entropy: a measure of the difference between two distributions
- KL divergence: relative entropy
- Mutual information: dependence between variables

Reference implementations of these building blocks:

```python
import math

import numpy as np
from scipy import stats


class ProbabilityDistributions:
    @staticmethod
    def gaussian_pdf(x, mu=0, sigma=1):
        # density of N(mu, sigma^2)
        return (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

    @staticmethod
    def bernoulli_pmf(k, p):
        # k must be 0 or 1
        return p ** k * (1 - p) ** (1 - k)

    @staticmethod
    def multinomial_pmf(x, p):
        # x: counts per category, p: category probabilities
        # math.factorial is used because np.math was removed in NumPy 1.25
        n = np.sum(x)
        numerator = math.factorial(n)
        denominator = np.prod([math.factorial(i) for i in x])
        return (numerator / denominator) * np.prod(np.asarray(p, dtype=float) ** x)

    @staticmethod
    def poisson_pmf(k, lam):
        return (lam ** k * np.exp(-lam)) / math.factorial(k)


class DistributionFitting:
    @staticmethod
    def fit_gaussian(data):
        # moment estimates; for the Gaussian these coincide with the MLE
        mu = np.mean(data)
        sigma = np.std(data)
        return mu, sigma

    @staticmethod
    def fit_bernoulli(data):
        # MLE of p is the sample mean of 0/1 data
        p = np.mean(data)
        return p

    @staticmethod
    def fit_multinomial(data):
        # MLE: empirical category frequencies
        counts = np.bincount(data)
        p = counts / np.sum(counts)
        return p


class BayesianInference:
    @staticmethod
    def bayes_theorem(prior, likelihood, evidence):
        # P(H|D) = P(D|H) * P(H) / P(D)
        posterior = (likelihood * prior) / evidence
        return posterior

    @staticmethod
    def update_prior(prior, likelihood):
        # normalize over all hypotheses so the posterior sums to 1
        evidence = np.sum(likelihood * prior)
        posterior = (likelihood * prior) / evidence
        return posterior


class BayesianClassifier:
    """Gaussian naive Bayes: features are treated as conditionally independent."""

    def __init__(self):
        self.class_priors = {}
        self.feature_distributions = {}

    def train(self, X, y):
        classes = np.unique(y)
        for c in classes:
            class_data = X[y == c]
            self.class_priors[c] = len(class_data) / len(y)
            self.feature_distributions[c] = {
                'mean': np.mean(class_data, axis=0),
                'std': np.std(class_data, axis=0),
            }

    def predict(self, x):
        posteriors = {}
        for c in self.class_priors:
            mean = self.feature_distributions[c]['mean']
            std = self.feature_distributions[c]['std']
            likelihood = np.prod(stats.norm.pdf(x, mean, std))
            posteriors[c] = self.class_priors[c] * likelihood  # unnormalized posterior
        return max(posteriors, key=posteriors.get)


class InformationTheory:
    @staticmethod
    def entropy(p):
        # in bits (log base 2); clipping avoids log(0)
        p = np.clip(p, 1e-10, 1 - 1e-10)
        return -np.sum(p * np.log2(p))

    @staticmethod
    def cross_entropy(p, q):
        # in nats (natural log)
        p = np.clip(p, 1e-10, 1)
        q = np.clip(q, 1e-10, 1)
        return -np.sum(p * np.log(q))

    @staticmethod
    def kl_divergence(p, q):
        p = np.clip(p, 1e-10, 1)
        q = np.clip(q, 1e-10, 1)
        return np.sum(p * np.log(p / q))

    @staticmethod
    def mutual_information(x, y):
        # assumes x and y are integer labels 0..K-1, so the bincount
        # marginals line up with the histogram2d joint counts
        p_x = np.bincount(x) / len(x)
        p_y = np.bincount(y) / len(y)
        joint_counts = np.histogram2d(
            x, y, bins=(len(np.unique(x)), len(np.unique(y))))[0]
        p_xy = joint_counts / len(x)
        mi = 0.0
        for i in range(len(p_x)):
            for j in range(len(p_y)):
                if p_xy[i, j] > 0:
                    mi += p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))
        return mi


class MaximumLikelihoodEstimation:
    @staticmethod
    def estimate_gaussian(data):
        mu = np.mean(data)
        sigma = np.std(data)  # ddof=0 is the MLE of sigma
        return mu, sigma

    @staticmethod
    def estimate_bernoulli(data):
        p = np.mean(data)
        return p

    @staticmethod
    def log_likelihood_gaussian(data, mu, sigma):
        return np.sum(stats.norm.logpdf(data, mu, sigma))
```

The main parameter-estimation approaches compare as follows:

| Method | Speed | Accuracy | Suitable data size |
|---|---|---|---|
| Method of moments | Fast | Medium | Small |
| Maximum likelihood | Medium | High | Medium |
| MCMC | Slow | Very high | Large |
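For the Gaussian family the first two rows happen to coincide: the sample mean and the (biased) sample standard deviation are both the moment estimates and the maximum-likelihood estimates. A minimal sketch on synthetic data; the seed, sample size, and true parameters are my own choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                     # arbitrary seed
data = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic Gaussian sample

# sample mean and biased sample std: moment estimates AND the MLE
mu_hat, sigma_hat = np.mean(data), np.std(data)
log_lik = np.sum(stats.norm.logpdf(data, mu_hat, sigma_hat))
print(f"mu={mu_hat:.3f} sigma={sigma_hat:.3f} log-likelihood={log_lik:.1f}")
```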
Vectorized NumPy implementations are much faster than pure-Python loops:

| Operation | Pure Python | NumPy | Speedup |
|---|---|---|---|
| Gaussian PDF (100k points) | 100 ms | 5 ms | 20x |
| Entropy (1000-dim) | 50 ms | 1 ms | 50x |
| KL divergence (1000-dim) | 60 ms | 2 ms | 30x |
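These timings are indicative only and depend on hardware and library versions. A rough micro-benchmark sketch for the entropy row, comparing a pure-Python loop with the vectorized form; the timing harness is my own:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(1000))  # a random 1000-dim probability vector

def entropy_loop(p):
    # element-by-element pure-Python loop
    total = 0.0
    for pi in p:
        if pi > 0:
            total -= pi * np.log2(pi)
    return total

def entropy_vectorized(p):
    # single vectorized expression, as in InformationTheory.entropy
    q = np.clip(p, 1e-10, 1 - 1e-10)
    return -np.sum(q * np.log2(q))

for fn in (entropy_loop, entropy_vectorized):
    start = time.perf_counter()
    for _ in range(100):
        fn(p)
    print(fn.__name__, f"{(time.perf_counter() - start) / 100 * 1e3:.3f} ms/call")
```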
Accuracy and speed of naive Bayes compared with other common classifiers:

| Classifier | Accuracy | Training speed | Inference speed |
|---|---|---|---|
| Naive Bayes | 85% | Fast | Fast |
| Logistic regression | 90% | Medium | Fast |
| SVM | 92% | Slow | Medium |
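The naive Bayes row refers to a Gaussian naive Bayes model like the `BayesianClassifier` defined earlier. Here is a usage sketch on synthetic two-blob data; the class centers, sample sizes, and seed are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(42)
# two Gaussian blobs, one per class (centers chosen arbitrarily)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

clf = BayesianClassifier()  # the class defined earlier in this article
clf.train(X, y)
print(clf.predict(np.array([0.2, -0.1])))  # expected class 0
print(clf.predict(np.array([2.8, 3.1])))   # expected class 1
```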
In practice, a suitable distribution can be chosen from the data itself:

```python
def choose_distribution(data):
    # crude heuristic: 2 distinct values -> Bernoulli,
    # a few distinct values -> multinomial, otherwise Gaussian
    if len(np.unique(data)) == 2:
        return 'bernoulli'
    elif len(np.unique(data)) < 10:
        return 'multinomial'
    else:
        return 'gaussian'


class ProbabilityModelSelector:
    @staticmethod
    def select(data_type):
        # BernoulliDistribution, MultinomialDistribution, and
        # GaussianDistribution are assumed to be defined elsewhere;
        # they are not part of the code in this article
        models = {
            'binary': lambda: BernoulliDistribution(),
            'categorical': lambda: MultinomialDistribution(),
            'continuous': lambda: GaussianDistribution(),
        }
        return models[data_type]()


class InformationTheoryApplications:
    @staticmethod
    def feature_selection(X, y, threshold=0.1):
        # keep features whose mutual information with the label
        # exceeds the threshold
        mi_scores = []
        for i in range(X.shape[1]):
            mi = InformationTheory.mutual_information(X[:, i], y)
            mi_scores.append((i, mi))
        selected = [i for i, mi in mi_scores if mi > threshold]
        return selected

    @staticmethod
    def model_selection(models, X, y):
        # pick the model whose predicted distribution has the lowest
        # cross-entropy against the true labels
        best_model = None
        best_score = float('inf')
        for model in models:
            predictions = model.predict(X)
            ce = InformationTheory.cross_entropy(y, predictions)
            if ce < best_score:
                best_score = ce
                best_model = model
        return best_model
```

Probability theory and information theory are the mathematical foundations of deep learning.
The comparison tables above summarize the practical trade-offs between the methods discussed.
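Finally, a usage sketch of the mutual-information feature selector, on synthetic binary data where one column is informative and one is pure noise; the data-generating choices and threshold are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)                          # binary labels
informative = (y ^ (rng.random(500) < 0.1)).astype(int)   # y with ~10% of bits flipped
noise = rng.integers(0, 2, size=500)                      # independent of y
X = np.column_stack([informative, noise])

# InformationTheoryApplications / InformationTheory as defined earlier
selected = InformationTheoryApplications.feature_selection(X, y, threshold=0.1)
print(selected)  # expected: [0], only the informative column passes
```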