
Introduction to Machine Learning


1: Introduction To Machine Learning

In data science, we're often trying to understand a process or system using observational data.

Here are a few specific examples:

  • How do the properties of a house affect its market value?
  • How does an applicant's application affect whether they get into graduate school?

These questions are high-level and tough to answer in the abstract. We can start to narrow these questions to the following:

  • How do the size of a house, the number of rooms, its neighborhood crime index, and its age affect its market value?
  • How do an applicant's college GPA and GRE score affect whether they get into graduate school?

We can start to answer these more specific questions by applying machine learning techniques to past data.

In the first problem, we're interested in trying to predict a specific, real-valued number -- the market value of a house in dollars. Whenever we're trying to predict a real-valued number, the process is called regression.

In the second problem, we're interested in trying to predict a binary value -- acceptance or rejection into graduate school. Whenever we're trying to predict a binary value, the process is called classification.

In this mission, we'll focus on a specific regression problem.

2: Introduction To The Data

  • How do the properties of a car impact its fuel efficiency?

To try to answer this question, we'll work with a dataset containing the fuel efficiencies of several cars, compiled by Carnegie Mellon University. The dataset is hosted by the University of California, Irvine on their machine learning repository. As a side note, the UCI Machine Learning Repository contains many small datasets that are useful when getting your hands dirty with machine learning.

You'll notice that the Data Folder contains a few different files. We'll be working with auto-mpg.data, which omits the 8 rows containing missing values for fuel efficiency (the mpg column). Even though the file's extension is .data, it's encoded as a plain text file and you can open it using any text editor. If you open auto-mpg.data in a text editor, you'll notice that the values in each line of the file are separated by a variable number of whitespace characters.


Since the file isn't formatted as a CSV file and instead uses a variable number of whitespace characters to delimit the columns, you can't use read_csv to read it into a DataFrame. You need to instead use the read_table method, setting the delim_whitespace parameter to True so the file is parsed using the whitespace between values:

mpg = pd.read_table("auto-mpg.data", delim_whitespace=True)

Unfortunately, the file doesn't contain the column names, so you'll have to extract them from auto-mpg.names and specify them manually. The column names can be found in the Attribute Information section. Just like auto-mpg.data, auto-mpg.names is a text file that can be opened using a standard text editor.

As specified in auto-mpg.names, the dataset contains 7 numerical features that could have an effect on a car's fuel efficiency:

  • cylinders -- the number of cylinders in the engine.
  • displacement -- the displacement of the engine.
  • horsepower -- the horsepower of the engine.
  • weight -- the weight of the car.
  • acceleration -- the acceleration of the car.
  • model year -- the year that car model was released (e.g. 70 corresponds to 1970).
  • origin -- where the car was manufactured (1 if North America, 2 if Europe, 3 if Asia).

When reading in auto-mpg.data using the read_table method, you can use the names parameter to specify the column names as a list of strings. Let's now read the dataset into a DataFrame so we can explore it further.

Instructions

Read the dataset auto-mpg.data into a DataFrame named cars using the Pandas method read_table.

  • Specify that you want the whitespace between values to be used as the delimiter.
  • Use the column names provided in auto-mpg.names to set the column names for the cars DataFrame.
  • Display the cars DataFrame using a print statement or by checking the variable inspector below the code box.

 

import pandas as pd

# Column names taken from the Attribute Information section of auto-mpg.names
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
# The file is whitespace-delimited and has no header row, so pass the names explicitly
cars = pd.read_table("auto-mpg.data", delim_whitespace=True, names=columns)
print(cars.head(5))

 

3: Exploratory Data Analysis

Using this dataset, we can work on a more narrow problem:

  • How do the number of cylinders, displacement, horsepower, weight, acceleration, and model year affect a car's fuel efficiency?

Let's perform some exploratory data analysis for a couple of the columns to see which one correlates best with fuel efficiency.

Instructions

  • Create a grid of subplots containing 2 rows and 1 column.
  • Generate the following data visualizations:
    • Top chart: Scatter plot with the weight column on the x-axis and the mpg column on the y-axis.
    • Bottom chart: Scatter plot with the acceleration column on the x-axis and the mpg column on the y-axis.

import matplotlib.pyplot as plt

# Create a grid of subplots with 2 rows and 1 column
fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

# Top chart: weight vs. mpg; bottom chart: acceleration vs. mpg
cars.plot("weight", "mpg", kind="scatter", ax=ax1)
cars.plot("acceleration", "mpg", kind="scatter", ax=ax2)
plt.show()

 

4: Linear Relationship

The scatter plots hint that there's a strong negative linear relationship between the weight and mpg columns and a weak, positive linear relationship between the acceleration and mpg columns. Let's now try to quantify the relationship between weight and mpg.

A machine learning model is the equation that represents how the input is mapped to the output. Said another way, machine learning is the process of determining the relationship between the independent variable(s) and the dependent variable. In this case, the dependent variable is the fuel efficiency and the independent variables are the other columns in the dataset.

In this mission and the next few missions, we'll focus on a family of machine learning models known as linear models. These models take the form of:

y = mx + b

The input is represented as x, transformed using the parameters m (slope) and b (intercept), and the output is represented as y. We expect m to be a negative number since the relationship is a negative linear one.
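For example, here's a minimal sketch of how such a model turns a weight value into a predicted mpg value in Python. The slope and intercept below are made-up illustrative numbers, not values fitted to the data:

# Hypothetical parameters for illustration only -- not fitted values
m = -0.007  # slope: change in predicted mpg per additional pound of weight
b = 45.0    # intercept: predicted mpg when weight is zero

def predict_mpg(weight):
    # Evaluate the linear model y = m * x + b
    return m * weight + b

print(predict_mpg(3500))  # predicted mpg for a hypothetical 3,500-pound car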

The process of finding the equation that best fits the data is called fitting. We won't dive into how a model is fit to the data in this mission and will instead focus on interpreting the model. We'll use the Python library scikit-learn to handle fitting the model to the data.

5: Scikit-Learn

To fit the model to the data, we'll use the machine learning library scikit-learn. Scikit-learn is the most popular library for working with machine learning models on small to medium-sized datasets. Even when working with larger datasets that don't fit in memory, scikit-learn is commonly used to prototype and explore machine learning models on a subset of the larger dataset.

Scikit-learn uses an object-oriented style, so each machine learning model must be instantiated before it can be fit to a dataset (similar to creating a figure in Matplotlib before you plot values). We'll be working with the LinearRegression class from sklearn.linear_model:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()

To fit a model to the data, we use the conveniently named fit method:

lr.fit(inputs, output)

where inputs is an n_rows by n_columns matrix and output is an n_rows by 1 matrix. The dataset we're working with contains 398 rows and 9 columns, but since we only want to use the weight column, we need to pass in a matrix containing 398 rows and 1 column. The catch, however, is that if you select the weight column with single brackets and pass it in as the first parameter to the fit method, an error will be returned. This is because scikit-learn converts Series and DataFrame objects to NumPy objects, and the resulting dimensions don't match what fit expects.

You can use the values attribute to see which NumPy object is returned:

cars["weight"].values

A NumPy array with 398 elements will be returned instead of a matrix containing rows and columns. You can confirm this by using the shape attribute:

cars["weight"].values.shape

The value (398,), representing a 1-dimensional array with 398 elements, will be returned. If you instead use double bracket notation:

cars[["weight"]].values

you'll get back a NumPy matrix with 398 rows and 1 column.
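As a quick sanity check, here's a short sketch of the shape difference, assuming the cars DataFrame from earlier has already been loaded:

# Single brackets return a Series, which becomes a 1-D NumPy array
print(cars["weight"].values.shape)    # (398,)

# Double brackets return a DataFrame, which becomes the 2-D matrix that fit expects
print(cars[["weight"]].values.shape)  # (398, 1)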

Instructions

  • Import the LinearRegression class from sklearn.linear_model.
  • Instantiate a LinearRegression instance and assign it to lr.
  • Use the fit method to fit a linear regression model using the weight column as the input and the mpg column as the output.

from sklearn.linear_model import LinearRegression

# Instantiate the model, then fit it using weight as the input and mpg as the output
lr = LinearRegression()
lr.fit(cars[["weight"]], cars[["mpg"]])

 

6: Making Predictions

Now that we have a trained linear regression model, we can use it to make predictions. Recall that this model takes in a weight value, in pounds, and outputs a fuel efficiency value, in miles per gallon. To use a model to make predictions, use the LinearRegression method predict. The predict method has a single required parameter, an n_samples by n_features input matrix, and returns the predicted values as an n_samples by 1 matrix (really just a list).

You may be wondering why we'd want to make predictions for the data we trained the model on, since we already know the true fuel efficiency values. Making predictions on data used for training is the first step in the testing & evaluation process. If the model can't do a good job of even capturing the structure of the training data, then we can't expect it to do a good job on data it wasn't trained on. This is known as underfitting, since the model underperforms on the data it was fit on.

Instructions

  • Use the LinearRegression method predict to make predictions using the values from the weight column.
  • Assign the resulting list of predictions to predictions.
  • Display the first 5 elements in predictions and the first 5 elements in the mpg column to compare the predicted values with the actual values.

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(cars[["weight"]], cars[["mpg"]])

# Predict mpg for the same weight values the model was trained on
predictions = lr.predict(cars[["weight"]])
print(predictions[0:5])
print(cars["mpg"][0:5])

7: Plotting The Model

We can now plot the actual fuel efficiency values for each car alongside the predicted fuel efficiency values to gain a visual understanding of the model's effectiveness.

Instructions

On the same subplot:

  • Generate a scatter plot with weight on the x-axis and the mpg column on the y-axis. Specify that you want the dots in the scatter plot to be red.
  • Generate a scatter plot with weight on the x-axis and the predicted values on the y-axis. Specify that you want the dots in the scatter plot to be blue.

# Actual values in red, predicted values in blue, on the same axes
plt.scatter(cars["weight"], cars["mpg"], c="red")
plt.scatter(cars["weight"], predictions, c="blue")
plt.show()

8: Error Metrics

The plot from the last step gave us a visual idea of how well the linear regression model performs. To obtain a more quantitative understanding, we can calculate the model's error, or the mismatch between a model's predictions and the actual values.

One commonly used error metric for regression is mean squared error, or MSE for short. You calculate MSE by computing the squared error between each predicted value and the actual value:

$(\hat{Y}_i - Y_i)^2$

where $\hat{Y}_i$ is a predicted value for fuel efficiency and $Y_i$ is the actual mpg value. Then, you compute the mean of all of the squared errors:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2$$

Here's the same formula written as a Python-style loop (assuming predictions and actual_values are sequences of equal length):

n = len(predictions)
squared_error_sum = 0
for predicted_value, actual_value in zip(predictions, actual_values):
    diff = predicted_value - actual_value
    squared_error_sum += diff ** 2
mse = squared_error_sum / n
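The same calculation can also be written without an explicit loop. Here's a minimal NumPy sketch, assuming predictions from the previous step and the cars DataFrame are both available:

import numpy as np

# Flatten the predictions to 1-D so they line up with the actual mpg values
predicted = np.asarray(predictions).ravel()
actual = cars["mpg"].values
mse = np.mean((predicted - actual) ** 2)
print(mse)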

We'll use the mean_squared_error function from scikit-learn to calculate MSE. We'll leave it to you to import the function and understand how to use it, so that you become more accustomed to reading documentation.

Instructions

  • Import the mean_squared_error function.
  • Use the mean_squared_error function to calculate the MSE of the predicted values and assign it to mse.
  • Display the MSE value using a print statement or the variables display below the code cell after you run your code.

from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(cars[["weight"]], cars[["mpg"]])
predictions = lr.predict(cars[["weight"]])

# mean_squared_error takes the true values first, then the predicted values
mse = mean_squared_error(cars["mpg"], predictions)
print(mse)

 

9: Root Mean Squared Error

There are many error metrics you can use, each with its own advantages and disadvantages. While the specific properties of the different error metrics are outside the scope of this mission, we'll introduce another error metric here.

Root mean squared error, or RMSE for short, is the square root of the MSE and does a better job of penalizing large error values. In addition, the RMSE is easier to interpret since its units are in the same dimension as the data. When computing the MSE, we square the differences between the predicted and actual values before averaging them, which means the MSE value is in miles per gallon squared while the RMSE value is in miles per gallon.

Instructions

  • Calculate the RMSE of the predicted values and assign it to rmse.
  • Display the RMSE value using a print statement or the variables display below the code cell after you run your code.

 

# RMSE is the square root of the MSE
mse = mean_squared_error(cars["mpg"], predictions)
rmse = mse ** (1/2)
print(rmse)

10: Next Steps

In this mission, we explored the basics of machine learning to better understand how the weight of a car relates to its fuel efficiency. We focused on regression, a class of machine learning techniques where the target value being predicted is continuous.

Next up is a challenge where you can practice the concepts you learned in this mission.

 

 

Reposted from: https://my.oschina.net/Bettyty/blog/751261
