
An Introduction to Supervised Learning in Machine Learning

In this post, we'll explore the concept of supervised learning, what machines need in order to learn, and how they learn and improve their prediction accuracy.

What is Supervised Learning

When it comes to machine learning, there are four main types:

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Semi-Supervised Machine Learning
  • Reinforcement Learning

Supervised machine learning refers to the process of training a machine using labeled data. Labels can be numeric or string values. For example, imagine that you have photos of animals, such as cats and dogs. To train your machine to recognize the animal, you'll need to "label" each image with the name of the animal it shows. The machine will then learn to pick up similar patterns in new photos and predict the appropriate label.
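To make "labeled data" concrete, here is a minimal sketch in plain Python. It does not work on real photos: as an assumption for illustration, each animal is reduced to two hypothetical measurements (weight and ear length), and a new animal simply receives the label of the most similar training example.

```python
import math

# Hypothetical labeled training data: (weight_kg, ear_length_cm) -> label.
training_data = [
    ((4.0, 6.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.5), "dog"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]

print(predict((4.2, 6.8)))    # cat
print(predict((25.0, 12.0)))  # dog
```

Neither of the two animals passed to `predict` appears in the training data, yet each gets a sensible label because it resembles labeled examples the program has already seen.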

Machine Learning

Machine learning refers to the process a machine goes through so that it can produce predictions. As mentioned, a machine can identify a cat in a photo even if it has never seen that particular cat. But how?

Through training, of course, which is an iterative process that improves the accuracy of the output (the prediction). In supervised machine learning, we teach the machine to identify things based on the labeled data we give it.

Today, you see and interact with trained machines everywhere. Netflix, YouTube, TikTok, and many other services use some kind of algorithm that learns from the data collected about you so it can recommend things you are likely to enjoy. That's why you can spend endless hours scrolling.

The more data you give these services, the more they learn about you. Some of them may even know you better than you know yourself.

Supervised vs Unsupervised

Think of it like this:

As humans, we recognize a cat as a cat because our parents and teachers taught us what a cat looks like. They basically "supervised" us and "labeled" our data. However, when we sort good friends from bad ones, nobody hands us the labels; we rely on our own experiences and observations. Similarly, machines can learn through supervised learning, where they are taught to recognize specific images from labeled examples, or through unsupervised learning, where they find structure in the data on their own, without labels.
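The contrast can be sketched in a few lines of plain Python. The weights and labels below are hypothetical, and both methods are deliberately simple stand-ins chosen for illustration (a nearest-class-average classifier and a one-dimensional 2-means clustering), not the only way to do either kind of learning.

```python
from statistics import mean

weights = [3.5, 4.0, 4.5, 20.0, 25.0, 30.0]           # hypothetical animal weights (kg)
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]   # only used in the supervised case

# Supervised: use the labels to learn one average weight per class.
class_means = {
    label: mean(w for w, l in zip(weights, labels) if l == label)
    for label in set(labels)
}

def classify(weight):
    """Predict the label whose average weight is closest."""
    return min(class_means, key=lambda label: abs(class_means[label] - weight))

# Unsupervised: no labels; split the values into two groups (1-D k-means, k = 2).
def two_means(values, iterations=10):
    low, high = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        group_a = [v for v in values if abs(v - low) <= abs(v - high)]
        group_b = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_a), mean(group_b)
    return group_a, group_b

print(classify(5.0))       # cat
print(two_means(weights))  # ([3.5, 4.0, 4.5], [20.0, 25.0, 30.0])
```

The supervised part can name its answer ("cat") because it was given names to learn from. The unsupervised part discovers the same two groups but has no idea what to call them.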

Compared to us, a machine learns differently, although some approaches (such as neural networks) were inspired by our brains. To train computers, we'll mostly use statistical algorithms such as Linear Regression, Decision Trees (DTs), and K-Nearest Neighbors (KNN).

An algorithm is a sequence of operations that a computer follows to find the correct solution to a problem (or to determine that no correct solution exists).
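As a small illustration of one of the algorithms named above, the sketch below fits a Linear Regression with a single feature using the standard least-squares formulas. The house sizes and prices are made-up numbers that lie exactly on the line y = 2x + 1, so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size (x) and price (y), exactly on y = 2x + 1.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)         # 2.0 1.0
print(slope * 5.0 + intercept)  # 11.0 (prediction for an unseen size of 5.0)
```

Real data never falls perfectly on a line; the same formulas then return the line that minimizes the squared prediction error.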

5 things you'll need to train your model

Understand the problem

First, you'll need to understand the problem you're trying to solve. Machine learning can help answer a broad range of questions, such as:

  • Can we accurately predict diseases in patients?
  • Can we predict the price of houses?

It's important to understand the question we're trying to answer. Let's take the first question from the list above:

"Can we accurately predict diseases in patients?"

We can rephrase this question as:

"Is it possible to use historical patient data such as age, gender, blood pressure, cholesterol, and medical conditions to predict the likelihood of a new patient developing a disease?"

The answer is yes. This is known as a classification problem: the input data is used to assign each patient to one of a predetermined list of categories, for example "likely to develop the disease" or "unlikely to develop the disease".

Get and prepare the data

Imagine you buy a textbook for a math class and all the pages are blank. Or worse, imagine the pages contain random information unrelated to the topic, or even unrecognizable characters. Would you be able to learn anything? Of course not. You need organized information. Similarly, we need to prepare our data before we can use it.

The quality of your data will determine the quality of your predictions.

So, the next step is to get the data. It could be located in many places, such as:

  • Hospital internal database (SQL)
  • Publicly available information (Web Scraping)
  • Public health records (JSON)

As you can see, the data could be in multiple locations and come in many shapes and formats. As long as the data is relevant to our problem, we can make use of it.

Data Wrangling is the process of working with raw data and converting it into a usable form.
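A minimal sketch of what wrangling can look like, using only Python's standard library. The JSON records, field names, and the "drop incomplete rows" policy are all assumptions made for illustration; a real project would choose its cleaning rules based on the data at hand.

```python
import json

# Hypothetical raw export: numbers stored as strings, inconsistent casing, a missing value.
raw = '''[
  {"age": "54", "gender": "F", "blood_pressure": "130", "disease": "yes"},
  {"age": "61", "gender": "m", "blood_pressure": null,  "disease": "no"},
  {"age": "47", "gender": "M", "blood_pressure": "118", "disease": "No"}
]'''

def wrangle(records):
    """Drop incomplete rows and convert each field to a consistent type."""
    clean = []
    for row in records:
        if any(value is None for value in row.values()):
            continue  # simplest policy: skip rows with missing values
        clean.append({
            "age": int(row["age"]),
            "gender": row["gender"].upper(),
            "blood_pressure": int(row["blood_pressure"]),
            "disease": row["disease"].strip().lower() == "yes",
        })
    return clean

patients = wrangle(json.loads(raw))
print(len(patients))  # 2
print(patients[1])
```

After this step every remaining record has the same fields, the same types, and the same spelling for each category, which is what the later analysis and training steps expect.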

Explore and analyze the data

Now that we're working with clean data, it's important to take a closer look and perform what is referred to as Exploratory Data Analysis (EDA) to find patterns and summarize the main characteristics. For example, to understand the distribution of our dataset, we can calculate the mean, median, and range of our age variable. We can also examine the relationship between disease and gender by calculating the percentage of patients of each gender who have the disease.
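The two calculations just described can be sketched with Python's standard library. The patient records below are hypothetical and far too few for real conclusions; they only show the mechanics.

```python
from statistics import mean, median

# Hypothetical cleaned patient records.
patients = [
    {"age": 34, "gender": "F", "disease": False},
    {"age": 47, "gender": "M", "disease": True},
    {"age": 52, "gender": "F", "disease": True},
    {"age": 61, "gender": "M", "disease": True},
    {"age": 70, "gender": "F", "disease": False},
    {"age": 45, "gender": "M", "disease": False},
]

ages = [p["age"] for p in patients]
print("mean age:", mean(ages))              # 51.5
print("median age:", median(ages))          # 49.5
print("age range:", max(ages) - min(ages))  # 36

def disease_rate(gender):
    """Percentage of patients of the given gender who have the disease."""
    group = [p for p in patients if p["gender"] == gender]
    return 100 * sum(p["disease"] for p in group) / len(group)

for gender in ("F", "M"):
    print(gender, round(disease_rate(gender), 1))
```

In practice the same summaries are usually computed with a data analysis library and then plotted, but the underlying arithmetic is no more than this.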

The most common programming languages used for EDA and data analysis are Python and R. Popular Python libraries include matplotlib, seaborn, numpy, and others.

We will not go into technical details in this post, but some common tasks during the EDA phase include:

  • Data Distribution
  • Dataset Structure
  • Handle Missing Values and Outliers
  • Determine Correlations
  • Evaluate Assumptions
  • Visualize by Plotting
  • Identify Patterns
  • Understand the Relevancy of External Data

Choose a suitable algorithm

As we saw earlier, we have a classification problem. We can therefore build candidate models using common classification algorithms and then compare their outputs to choose the most accurate one.

For this example, I'm going to use two algorithms that are popular for solving classification problems:

  • Random Forest
  • Support Vector Machine (SVM)

Train, test, and refine

Using Python and scikit-learn (a machine learning library for Python), we can determine the accuracy of both algorithms on our dataset. We'll train each model by giving it a portion of the data.

While we could use all of the data in our dataset to train the model, we'll split the data into two parts. A common choice is an 80/20 split, meaning 80% of the data goes to training and the remaining 20% is held back for testing. Testing on data the model has never seen lets us detect overfitting, where a model memorizes its training data but performs poorly on new data. The topic of overfitting was discussed in this article.
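To show what an 80/20 split does mechanically, here is a small standard-library stand-in. It is not scikit-learn's implementation; `holdout_split` is a hypothetical helper written for this illustration, and the ten integer "rows" are placeholders for real records.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows, then split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # placeholder for ten records
train, test = holdout_split(rows)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters: if the rows happen to be sorted (for example by date or by diagnosis), taking the last 20% without shuffling would give a test set that does not resemble the training set.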

# ... Previous code omitted for brevity
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# ... Previous code omitted for brevity
print("SVM Accuracy:", svm_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

The output is as shown below:

SVM Accuracy: 0.2857142857142857
Random Forest Accuracy: 0.8571428571428571

Looking at the SVM vs. Random Forest results, we'll choose Random Forest since it has an accuracy of about 86% vs. just 29% for SVM. Note that these values equal 6/7 and 2/7, which suggests a very small test set, so the gap should be treated as indicative rather than conclusive.
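Because the post omits most of its code, the following is a self-contained sketch of the same comparison rather than the author's original script. It assumes scikit-learn is installed, uses a synthetic dataset from `make_classification` as a stand-in for the encoded patient data, and leaves both models at their default settings, so its accuracy figures will differ from the ones shown above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded patient features and disease labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

accuracies = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)          # learn from the 80% training split
    predictions = model.predict(X_test)  # predict on the unseen 20%
    accuracies[name] = accuracy_score(y_test, predictions)
    print(f"{name} Accuracy: {accuracies[name]:.3f}")

best = max(accuracies, key=accuracies.get)
print("Best candidate:", best)
```

Keeping the candidates in a dictionary makes it easy to add a third or fourth algorithm later without changing the training and evaluation loop.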

Accuracy is the fraction of samples in the testing data that the model classifies correctly.
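The calculation itself is short enough to write by hand. In the sketch below the labels are hypothetical (1 = disease, 0 = no disease); seven test samples with one mistake were chosen because 6/7 reproduces the Random Forest figure printed earlier.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical test labels for 7 patients and a model's predictions for them.
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1]  # one mistake out of seven

print(accuracy(y_true, y_pred))  # 0.8571428571428571
```

With so few test samples, a single additional mistake moves the accuracy by more than 14 percentage points, which is one reason larger test sets give more trustworthy comparisons.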

Of course, you can try other algorithms until the results satisfy your criteria for the problem you're trying to solve.

The above is essentially what goes into the Supervised Machine Learning process. It's important to highlight that this is an iterative process that does not end after training. We need to deploy the model and gather feedback from stakeholders, which may lead to further refinement of the model based on new data and other factors.

Conclusion

Thanks for reading! In this post, we covered what supervised machine learning is, what machines need in order to learn, how they learn, and how they improve. We also covered important steps such as Data Wrangling and EDA, which are crucial to the prediction accuracy and relevance of your model.
