
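[Data Processing] Python: Implementing a Conditional Distribution Function | Mean, Variance, and Covariance | Expected Value of a Function | Probability Theory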

news  Source: original  2024/5/20 15:31:28


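💭 Preface: In this chapter we implement three things by hand in Python: the computation of a conditional distribution, functions for the mean, variance, and covariance, and a function for the expected value of a function. The test code is placed at the end of the article. Required environment: python version >= 3.6, numpy >= 1.15, nltk >= 3.4, tqdm >= 4.24.0, scikit-learn >= 0.22.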

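🔗 Related link: [Probability Theory] Python: Implementing the Joint Distribution Function | Implementing the Marginal Distribution Function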

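📜 Contents of this chapter:

0x00 Implementing the conditional distribution function (Conditional distribution)

0x01 Implementing the mean, variance, and covariance functions (Mean, Variance, Covariance)

0x02 Implementing the expected value of a function (Expected Value of a Function)

0x04 Test cases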

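0x00 Implementing the conditional distribution function (Conditional distribution)

Implement the function conditional_distribution_of_word_counts, which takes Pjoint and Pmarginal and computes the result.

Complete the code below so that it computes the conditional distribution, stores the result in Pcond, and returns it: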

```python
def conditional_distribution_of_word_counts(Pjoint, Pmarginal):
    """
    Parameters:
    Pjoint (numpy array) - Pjoint[m,n] = P(X0=m,X1=n), where
      X0 is the number of times that word0 occurs in a given text,
      X1 is the number of times that word1 occurs in the same text.
    Pmarginal (numpy array) - Pmarginal[m] = P(X0=m)

    Outputs:
    Pcond (numpy array) - Pcond[m,n] = P(X1=n|X0=m)
    """
    raise RuntimeError("You need to write this part!")
    return Pcond
```

🚩 Sample output:

```
Problem3. Conditional distribution:
[[0.97177419 0.02419355 0.00201613 0.         0.00201613]
 [1.         0.         0.         0.         0.        ]
 [       nan        nan        nan        nan        nan]
 [       nan        nan        nan        nan        nan]
 [1.         0.         0.         0.         0.        ]]
```

💭 Hint: the formula for the conditional distribution is:

$P(X_1=x_1 \mid X_0=x_0)=\frac{P(X_0=x_0,X_1=x_1)}{P(X_0=x_0)}$

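💬 Code demo: implementation of conditional_distribution_of_word_counts

```python
import numpy as np


def conditional_distribution_of_word_counts(Pjoint, Pmarginal):
    # Apply the formula directly: divide row m of Pjoint by P(X0=m).
    Pcond = Pjoint / Pmarginal[:, np.newaxis]
    return Pcond
```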
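To make the broadcasting step concrete, here is a small sketch on a made-up 2x3 joint distribution (the numbers are illustrative assumptions, not data from the article). `Pmarginal[:, np.newaxis]` reshapes the marginal into a column, so every row of `Pjoint` is divided by its own P(X0=m):

```python
import numpy as np

# Hypothetical 2x3 joint distribution: rows index X0, columns index X1.
Pjoint = np.array([[0.10, 0.20, 0.10],
                   [0.30, 0.15, 0.15]])
Pmarginal = Pjoint.sum(axis=1)            # P(X0=m), shape (2,)

# Pmarginal[:, np.newaxis] has shape (2, 1), so row m of Pjoint
# is divided elementwise by the single value P(X0=m).
Pcond = Pjoint / Pmarginal[:, np.newaxis]

row_sums = Pcond.sum(axis=1)              # each conditional row should sum to 1
print(Pcond)
print(row_sums)
```

Each row of `Pcond` is itself a probability distribution over X1, which is a quick sanity check for any implementation.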

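Note what happens when some entries of the denominator Pmarginal are zero. NumPy does not raise an exception here; it emits a RuntimeWarning, and each 0/0 division yields NaN (Not a Number). This is where the nan rows in the sample output come from: when P(X0=m) is zero, the conditional probability P(X1=n|X0=m) is not well defined. If NaN values are unwanted, one workaround is to replace every zero entry of Pmarginal with a very small nonzero number, such as 1e-10, before dividing. The affected rows then become all zeros instead of NaN, so the result no longer matches the sample output above.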
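As an alternative to substituting a small constant, the division can be masked so that rows with a zero marginal are never divided at all. The following is a minimal sketch under that assumption; the names `conditional_distribution_safe` and `fill` are hypothetical and not part of the original assignment:

```python
import numpy as np


def conditional_distribution_safe(Pjoint, Pmarginal, fill=np.nan):
    """Hypothetical variant: divide only where P(X0=m) > 0."""
    Pjoint = np.asarray(Pjoint, dtype=float)
    denom = np.asarray(Pmarginal, dtype=float)[:, np.newaxis]
    # Start from an array filled with `fill`; rows whose marginal is zero
    # are skipped by the masked division and keep that value.
    Pcond = np.full_like(Pjoint, fill)
    np.divide(Pjoint, denom, out=Pcond, where=denom > 0)
    return Pcond


# Made-up example: X0=1 never occurs, so its row has no defined conditional.
Pjoint = np.array([[0.5, 0.5],
                   [0.0, 0.0]])
Pmarginal = Pjoint.sum(axis=1)

Pcond_nan = conditional_distribution_safe(Pjoint, Pmarginal)             # undefined row is NaN
Pcond_zero = conditional_distribution_safe(Pjoint, Pmarginal, fill=0.0)  # undefined row is 0
print(Pcond_nan)
print(Pcond_zero)
```

With the default `fill=np.nan` the result matches plain division but without the warning; passing `fill=0.0` gives the all-zero rows that the small-constant trick would produce.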

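0x01 Implementing the mean, variance, and covariance functions (Mean, Variance, Covariance)

We take two of the most frequent words in English text, "a" and "the", and compute their joint distribution (Pathe) and the marginal distribution (Pthe).

Pathe and Pthe are already computed in reader.py, so we do not need to implement them; the full code is listed at the end of the article.

Our task is to write functions that compute the mean, variance, and covariance from a probability distribution:

The functions mean_from_distribution and variance_from_distribution take a probability distribution $P$ (Pthe) and return the mean and the variance of the random variable $X$. Round both to three decimal places.
The function covariance_from_distribution takes a joint distribution $P$ (Pathe) and returns the covariance of the random variables $X_0$ and $X_1$, again rounded to three decimal places.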


```python
def mean_from_distribution(P):
    """
    Parameters:
    P (numpy array) - P[n] = P(X=n)

    Outputs:
    mu (float) - the mean of X
    """
    raise RuntimeError("You need to write this part!")
    return mu


def variance_from_distribution(P):
    """
    Parameters:
    P (numpy array) - P[n] = P(X=n)

    Outputs:
    var (float) - the variance of X
    """
    raise RuntimeError("You need to write this part!")
    return var


def covariance_from_distribution(P):
    """
    Parameters:
    P (numpy array) - P[m,n] = P(X0=m,X1=n)

    Outputs:
    covar (float) - the covariance of X0 and X1
    """
    raise RuntimeError("You need to write this part!")
    return covar
```

🚩 Sample output:

```
Problem4-1. Mean from distribution:
4.432
Problem4-2. Variance from distribution:
41.601
Problem4-3. Covariance from distribution:
9.235
```

💭 Hint: the formulas for the mean, variance, and covariance are:

$\mu =\sum_{x}x\cdot P(X=x)$

$\sigma^2 =\sum_{x}(x-\mu )^2\cdot P(X=x)$

$Cov(X_0,X_1)=\sum_{x_0,x_1}(x_0-\mu_{X_0})(x_1-\mu_{X_1})\cdot P(X_0=x_0,X_1=x_1)$
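The three sums can be checked by hand on tiny distributions. The sketch below uses made-up values (`P` and `J` are illustrative assumptions, not the article's data) and writes each sum out with NumPy so the terms are visible:

```python
import numpy as np

# Hypothetical 1-D distribution: P(X=0)=0.5, P(X=1)=0.25, P(X=2)=0.25
P = np.array([0.5, 0.25, 0.25])
x = np.arange(len(P))

mu = np.sum(x * P)                  # 0*0.5 + 1*0.25 + 2*0.25 = 0.75
var = np.sum((x - mu) ** 2 * P)     # 0.28125 + 0.015625 + 0.390625 = 0.6875

# Hypothetical 2x2 joint distribution: J[m, n] = P(X0=m, X1=n)
J = np.array([[0.4, 0.1],
              [0.1, 0.4]])
x0 = np.arange(J.shape[0])[:, np.newaxis]   # column vector of X0 values
x1 = np.arange(J.shape[1])                  # row vector of X1 values

mu0 = np.sum(x0 * J)                # mean of X0: 0.5
mu1 = np.sum(x1 * J)                # mean of X1: 0.5
cov = np.sum((x0 - mu0) * (x1 - mu1) * J)   # 0.1 - 0.025 - 0.025 + 0.1 = 0.15

print(mu, var, cov)
```

The covariance is positive here because most of the probability mass sits on the diagonal, where X0 and X1 are both low or both high.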

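💬 Code demo:

```python
import numpy as np


def mean_from_distribution(P):
    mu = np.sum(    # Σ
        np.arange(len(P)) * P
    )
    return round(mu, 3)  # round to three decimal places


def variance_from_distribution(P):
    # Note: mu is the already rounded mean returned above.
    mu = mean_from_distribution(P)
    var = np.sum(    # Σ
        (np.arange(len(P)) - mu) ** 2 * P
    )
    return round(var, 3)  # round to three decimal places


def covariance_from_distribution(P):
    m, n = P.shape
    # Marginals: summing over axis 1 gives P(X0=m), over axis 0 gives P(X1=n).
    mu_X0 = mean_from_distribution(np.sum(P, axis=1))
    mu_X1 = mean_from_distribution(np.sum(P, axis=0))
    covar = np.sum(    # Σ
        (np.arange(m)[:, np.newaxis] - mu_X0) * (np.arange(n) - mu_X1) * P
    )
    return round(covar, 3)  # round to three decimal places
```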
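One way to gain confidence in a covariance implementation is to compare it against the identity Cov(X0, X1) = E[X0 X1] - E[X0] E[X1]. The sketch below does this on a made-up 3x3 joint distribution; `covariance_direct` and `covariance_identity` are hypothetical helper names, and neither rounds its result, so they are not drop-in replacements for the functions above:

```python
import numpy as np


def covariance_direct(P):
    """Covariance from the definition: sum of (x0 - mu0)(x1 - mu1) P[x0, x1]."""
    m, n = P.shape
    x0 = np.arange(m)[:, np.newaxis]
    x1 = np.arange(n)
    mu0 = np.sum(x0 * P)
    mu1 = np.sum(x1 * P)
    return np.sum((x0 - mu0) * (x1 - mu1) * P)


def covariance_identity(P):
    """Covariance from the identity E[X0 X1] - E[X0] E[X1]."""
    m, n = P.shape
    x0 = np.arange(m)
    x1 = np.arange(n)
    e_x0 = np.sum(x0 * P.sum(axis=1))        # mean of X0 from its marginal
    e_x1 = np.sum(x1 * P.sum(axis=0))        # mean of X1 from its marginal
    e_x0x1 = np.sum(np.outer(x0, x1) * P)    # E[X0 * X1]
    return e_x0x1 - e_x0 * e_x1


# Hypothetical joint distribution (entries sum to 1).
P = np.array([[0.2, 0.1, 0.0],
              [0.1, 0.2, 0.1],
              [0.0, 0.1, 0.2]])

print(covariance_direct(P), covariance_identity(P))
```

Both routes should agree up to floating-point error; a mismatch usually points to a swapped axis when computing the marginals.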

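0x02 Implementing the expected value of a function (Expected Value of a Function)

Implement the function expectation_of_a_function, which computes $E[f(X_0,X_1)]$ for the random variables $X_0,X_1$.

Here $P$ is the joint distribution, and $f$ is a function that takes two real numbers as input and returns $f(x_0,x_1)$.

The function $f$ is already defined in reader.py; you only need to compute the value of $E[f(X_0,X_1)]$ and return it rounded to three decimal places.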

```python
def expectation_of_a_function(P, f):
    """
    Parameters:
    P (numpy array) - joint distribution, P[m,n] = P(X0=m,X1=n)
    f (function) - f should be a function that takes two
       real-valued inputs, x0 and x1.  The output, z=f(x0,x1),
       must be a real number for all values of (x0,x1)
       such that P(X0=x0,X1=x1) is nonzero.

    Output:
    expected (float) - the expected value, E[f(X0,X1)]
    """
    raise RuntimeError("You need to write this part!")
    return expected
```

🚩 Sample output:

```
Problem5. Expectation of a function:
1.772
```

💬 Code demo: implementation of expectation_of_a_function

```python
def expectation_of_a_function(P, f):
    """
    Parameters:
    P (numpy array) - joint distribution, P[m,n] = P(X0=m,X1=n)
    f (function) - f should be a function that takes two
       real-valued inputs, x0 and x1.  The output, z=f(x0,x1),
       must be a real number for all values of (x0,x1)
       such that P(X0=x0,X1=x1) is nonzero.

    Output:
    expected (float) - the expected value, E[f(X0,X1)]
    """
    m, n = P.shape
    E = 0.0
    # Accumulate f(x0, x1) weighted by its probability over every cell.
    for x0 in range(m):
        for x1 in range(n):
            E += f(x0, x1) * P[x0, x1]
    return round(E, 3)  # round to three decimal places
```
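The double loop can also be written without explicit Python loops. The sketch below is one possible vectorized variant, under the assumption that `f` accepts NumPy arrays and works elementwise (true for the `np.log`-based `f` in reader.py, but not for arbitrary Python functions). The name `expectation_vectorized` and the uniform 2x2 distribution are hypothetical:

```python
import numpy as np


def expectation_vectorized(P, f):
    """Hypothetical vectorized variant of expectation_of_a_function."""
    P = np.asarray(P, dtype=float)
    # np.indices builds index grids: x0[m, n] = m and x1[m, n] = n.
    x0, x1 = np.indices(P.shape)
    # Assumption: f works elementwise on arrays.
    return round(float(np.sum(f(x0, x1) * P)), 3)


def f(x0, x1):
    # Same form as the f defined in reader.py.
    return np.log(x0 + 1) + np.log(x1 + 1)


# Made-up uniform joint distribution over {0, 1} x {0, 1}.
P = np.array([[0.25, 0.25],
              [0.25, 0.25]])

expected = expectation_vectorized(P, f)
print(expected)
```

For this uniform example the four terms are 0, ln 2, ln 2, and 2 ln 2, each weighted by 0.25, so the expectation is ln 2, which rounds to 0.693.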

0x04 Test cases

This is a text-processing project; the test data consists of 500 emails (stored as .txt files):

🔨 Required environment:

- python version >= 3.6
- numpy >= 1.15
- nltk >= 3.4
- tqdm >= 4.24.0
- scikit-learn >= 0.22

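nltk is short for Natural Language Toolkit, a Python library for working with human language data (text). It provides many tools and resources for text processing and NLP. PorterStemmer extracts word stems: it reduces words to their base form, usually by stripping suffixes. RegexpTokenizer is a tokenizer based on regular expressions that splits text into words.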

💬 data_load.py: loads the text data

```python
import os
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from tqdm import tqdm

porter_stemmer = PorterStemmer()
tokenizer = RegexpTokenizer(r"\w+")
bad_words = {"aed", "oed", "eed"}  # these words fail in nltk stemmer algorithm


def loadFile(filename, stemming, lower_case):
    """
    Load a file, and returns a list of words.

    Parameters:
    filename (str): the path of the file to load
    stemming (bool): if True, use NLTK's stemmer to remove suffixes
    lower_case (bool): if True, convert letters to lowercase

    Output:
    x (list): x[n] is the n'th word in the file
    """
    text = []
    with open(filename, "rb") as f:
        for line in f:
            if lower_case:
                line = line.decode(errors="ignore").lower()
                text += tokenizer.tokenize(line)
            else:
                text += tokenizer.tokenize(line.decode(errors="ignore"))
    if stemming:
        for i in range(len(text)):
            if text[i] in bad_words:
                continue
            text[i] = porter_stemmer.stem(text[i])
    return text


def loadDir(dirname, stemming, lower_case, use_tqdm=True):
    """
    Loads the files in the folder and returns a
    list of lists of words from the text in each file.

    Parameters:
    dirname (str): the directory containing the data
    stemming (bool): if True, use NLTK's stemmer to remove suffixes
    lower_case (bool): if True, convert letters to lowercase
    use_tqdm (bool, default:True): if True, use tqdm to show status bar

    Output:
    texts (list of lists): texts[m][n] is the n'th word in the m'th email
    count (int): number of files loaded
    """
    texts = []
    count = 0
    if use_tqdm:
        for f in tqdm(sorted(os.listdir(dirname))):
            texts.append(loadFile(os.path.join(dirname, f), stemming, lower_case))
            count = count + 1
    else:
        for f in sorted(os.listdir(dirname)):
            texts.append(loadFile(os.path.join(dirname, f), stemming, lower_case))
            count = count + 1
    return texts, count
```

💬 reader.py: reads the data and prints the results

```python
import data_load, hw4, importlib
import numpy as np

if __name__ == "__main__":
    texts, count = data_load.loadDir("data", False, False)

    importlib.reload(hw4)
    Pjoint = hw4.joint_distribution_of_word_counts(texts, "mr", "company")
    print("Problem1. Joint distribution:")
    print(Pjoint)
    print("---------------------------------------------")

    P0 = hw4.marginal_distribution_of_word_counts(Pjoint, 0)
    P1 = hw4.marginal_distribution_of_word_counts(Pjoint, 1)
    print("Problem2. Marginal distribution:")
    print("P0:", P0)
    print("P1:", P1)
    print("---------------------------------------------")

    Pcond = hw4.conditional_distribution_of_word_counts(Pjoint, P0)
    print("Problem3. Conditional distribution:")
    print(Pcond)
    print("---------------------------------------------")

    Pathe = hw4.joint_distribution_of_word_counts(texts, "a", "the")
    Pthe = hw4.marginal_distribution_of_word_counts(Pathe, 1)

    mu_the = hw4.mean_from_distribution(Pthe)
    print("Problem4-1. Mean from distribution:")
    print(mu_the)

    var_the = hw4.variance_from_distribution(Pthe)
    print("Problem4-2. Variance from distribution:")
    print(var_the)

    covar_a_the = hw4.covariance_from_distribution(Pathe)
    print("Problem4-3. Covariance from distribution:")
    print(covar_a_the)
    print("---------------------------------------------")

    def f(x0, x1):
        return np.log(x0 + 1) + np.log(x1 + 1)

    expected = hw4.expectation_of_a_function(Pathe, f)
    print("Problem5. Expectation of a function:")
    print(expected)
```
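reader.py also calls hw4.joint_distribution_of_word_counts and hw4.marginal_distribution_of_word_counts, which belong to the related article and are not shown here. To run the script end to end, something like the following could stand in for them. This is only a hypothetical sketch inferred from the docstrings and call sites above (Pjoint[m,n] = P(X0=m,X1=n), index 0 giving P(X0=m)), not the author's implementation, and it assumes `texts` is a non-empty list of word lists:

```python
import numpy as np


def joint_distribution_of_word_counts(texts, word0, word1):
    """Hypothetical sketch: Pjoint[m, n] is the fraction of texts in which
    word0 occurs m times and word1 occurs n times."""
    counts0 = [text.count(word0) for text in texts]
    counts1 = [text.count(word1) for text in texts]
    Pjoint = np.zeros((max(counts0) + 1, max(counts1) + 1))
    for m, n in zip(counts0, counts1):
        Pjoint[m, n] += 1
    return Pjoint / len(texts)


def marginal_distribution_of_word_counts(Pjoint, index):
    """Hypothetical sketch: index=0 returns P(X0=m), index=1 returns P(X1=n)."""
    # Summing out the *other* variable leaves the requested marginal.
    return np.sum(Pjoint, axis=1 - index)


# Tiny made-up corpus of three "emails", each a list of words.
texts = [["a", "the", "the"], ["a", "a"], ["the"]]
Pathe = joint_distribution_of_word_counts(texts, "a", "the")
Pa = marginal_distribution_of_word_counts(Pathe, 0)
Pthe = marginal_distribution_of_word_counts(Pathe, 1)
print(Pathe)
print(Pa, Pthe)
```

Because the table is sized by the largest observed count, a count that never occurs (for example, no text containing word0 exactly twice) leaves a zero in the marginal, which is how nan rows can arise in the conditional distribution.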

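📌 [ Author ]   王亦优
📃 [ Updated ]   2023.11.15
❌ [ Errata ]   /* none yet */
📜 [ Statement ]   Because the author's knowledge is limited, errors and inaccuracies in this article are hard to avoid. I would very much like to know about them, and I sincerely ask readers to point them out.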

