
Regular Expressions and grep in Detail

Source: original, 2024/5/18 21:42:50

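Preface:

When writing a blog post, keep the 5W1H rule in mind: What, Why, When, Where, Who, How.

This post covers:

● Basic regular expressions

● Extended regular expressions

● Differences between extended and basic regular expressions

Commands covered in this post: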

☉ grep

☉ egrep

☉ fgrep

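The three classic Linux text-processing tools:

grep:

A text search tool. It searches the given text for lines that match a "pattern".

sed:

Stream EDitor: a stream editor that works line by line; in essence a text editing tool.

awk:

GNU awk: a text formatting tool and report generator.

Regular expression: REGular EXpression, REGEX

A pattern written with a mix of special characters and ordinary text characters. Some of these characters do not stand for themselves; they express control or wildcard functions instead.

Regular expressions fall into two classes:

basic regular expression: BRE

extended regular expression: ERE

Regular expression engine: the program that uses a regular expression pattern to analyze the given text.

Note: man 7 regex gives an introduction to regular expressions.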

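The grep family:

grep

Supports basic regular expressions

egrep

Supports extended regular expressions

fgrep

Does not support regular expressions. It does not interpret the characters of the pattern and treats it as a plain string, which makes it the fastest of the three.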
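The difference between the three commands is easiest to see on a pattern that contains a metacharacter. The following is a minimal sketch with made-up sample lines (not from the original post). It uses grep -F and grep -E, the option forms equivalent to fgrep and egrep.

```shell
# Hypothetical sample data: one line holds a literal "a.c", the other does not.
sample=$(printf 'a.c\nabc\n')

# Default grep (BRE): "." matches any single character, so both lines match.
bre_out=$(printf '%s\n' "$sample" | grep 'a.c')

# grep -F (fgrep behaviour): "." is a literal dot, so only "a.c" matches.
fixed_out=$(printf '%s\n' "$sample" | grep -F 'a.c')

# grep -E (egrep behaviour): "+" is a quantifier without a backslash.
ere_out=$(printf 'ac\nabbc\n' | grep -E 'ab+c')

printf 'BRE:\n%s\n-F:\n%s\n-E:\n%s\n' "$bre_out" "$fixed_out" "$ere_out"
```

When the search text is a fixed string that happens to contain characters such as "." or "*", -F avoids having to escape them.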

grep

print lines matching a pattern

grep [OPTIONS] PATTERN [FILE...]

grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]

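-E, --extended-regexp: use extended regular expressions (ERE); same as egrep

-F, --fixed-strings, --fixed-regexp: disable regex matching and treat the pattern as fixed strings; same as fgrep

-G, --basic-regexp: use basic regular expressions (BRE); the default

-P, --perl-regexp: use Perl-style regular expressions

-e PATTERN, --regexp=PATTERN: specify the pattern (PATTERN); may be given more than once

-f FILE, --file=FILE: read patterns from FILE, one PATTERN per line

-i, --ignore-case: ignore case

-v, --invert-match: invert the match and print the lines that do not match

-x, --line-regexp: select only matches where PATTERN matches the whole line

-o, --only-matching: print only the matched text rather than the whole line

-q, --quiet, --silent: quiet mode; print nothing and report the result only through the exit status

-A NUM, --after-context=NUM: also print the NUM lines after each matching line

-B NUM, --before-context=NUM: also print the NUM lines before each matching line

-C NUM, -NUM, --context=NUM: also print the NUM lines before and after each matching line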
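To make a few of these options concrete, here is a small sketch run against an invented five-line log. The log text and the variable names are assumptions for illustration only.

```shell
# Hypothetical log used only for this illustration.
log=$(printf 'start\nERROR disk full\ndetail: /dev/sda1\nok\nerror again\n')

# -i: ignore case, so both "ERROR" and "error" lines are selected.
ci=$(printf '%s\n' "$log" | grep -i 'error')

# -v (combined with -i): print the lines that do NOT match.
inv=$(printf '%s\n' "$log" | grep -iv 'error')

# -o: print only the part of the line that matched.
only=$(printf '%s\n' "$log" | grep -o '/dev/[a-z0-9]*')

# -A 1: print each matching line plus one line of trailing context.
after=$(printf '%s\n' "$log" | grep -A 1 'ERROR')

# -x: the pattern must match the whole line.
whole=$(printf '%s\n' "$log" | grep -x 'ok')

# -e may be repeated; a line matching any of the patterns is selected.
multi=$(printf '%s\n' "$log" | grep -e 'start' -e 'ok')

# -q: no output at all; only the exit status says whether a match exists.
if printf '%s\n' "$log" | grep -q 'disk'; then found=yes; else found=no; fi

printf '%s\n' "$ci" "$inv" "$only" "$after" "$whole" "$multi" "$found"
```

The -q form is the usual choice inside if statements in scripts, where only the yes/no answer matters.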

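Basic regular expression metacharacters:

(1) Character matching:

.: matches any single character;

[]: matches any single character in the listed set or range;

[^ ]: matches any single character not in the listed set or range;

Common character classes (see man tr and man grep for more):

Class	Meaning	Class	Meaning
alnum	all letters and digits	space	horizontal or vertical whitespace
digit	digits	blank	horizontal whitespace
punct	punctuation	lower	lowercase letters
alpha	all letters, upper and lower case	upper	uppercase letters
graph	printable characters, not including space	cntrl	control characters
print	printable characters, including space	xdigit	hexadecimal digits, i.e. [0-9a-fA-F]

To use a class from the table, write it as [:name:] inside a bracket expression. For example, [[:lower:]] matches a lowercase letter.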
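As a quick illustration of character classes inside bracket expressions, the sketch below runs a few of them against four invented sample words. The data and variable names are assumptions, not part of the original post.

```shell
# Hypothetical sample words for this illustration.
words=$(printf 'hello\nHello\nh3llo\nh-llo\n')

# Lines that contain an uppercase letter anywhere.
upper=$(printf '%s\n' "$words" | grep '[[:upper:]]')

# "h", then one digit, then "llo".
digit=$(printf '%s\n' "$words" | grep 'h[[:digit:]]llo')

# "h", then one punctuation character, then "llo".
punct=$(printf '%s\n' "$words" | grep 'h[[:punct:]]llo')

# A class can be negated like any other bracket content:
# "h", then one character that is NOT a letter, then "llo".
nonalpha=$(printf '%s\n' "$words" | grep 'h[^[:alpha:]]llo')

printf '%s\n' "$upper" "$digit" "$punct" "$nonalpha"
```

Note the double brackets: the outer pair is the bracket expression, the inner [:name:] is the class placed inside it.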

Examples:

	# match any single character
[root@localhost test]# cat file 
Hello
hollo
hallo
[root@localhost test]# grep "h.llo" file 
hollo
hallo
	# match a single character that is either a or o
[root@localhost test]# grep "h[ao]llo" file 
hollo
hallo
	# match a single character other than a
[root@localhost test]# grep "[^a]llo" file 
Hello
hollo
	# match a lowercase letter
[root@localhost test]# grep "[[:lower:]].llo" file 
hollo
hallo

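(2) Quantifiers:

A quantifier is placed after the character whose repetition count it specifies. Matching is greedy by default.

Greedy mode: match as many characters as possible. For example, grep "x*y" matches xxxxy in xxxxyabc.

*: matches any number of times (zero or more)

.*: matches any characters of any length

\+: matches one or more times

\?: matches zero or one time

\{m\}: matches exactly m times

\{m,n\}: matches m to n times

\{,n\}: matches 0 to n times (a GNU extension; \{0,n\} is the portable form)

\{m,\}: matches m or more times

Note: in basic regular expressions the bare characters +, ?, { and } are ordinary characters, so they must be escaped with \ to act as quantifiers.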

Examples:

	# contents of the test file
[root@localhost test]# cat file 
hllo
hello
heello
hollo
hoollo
hoeollo
	# match the letter e any number of times
[root@localhost test]# grep "he*llo" file
hllo
hello
heello
	# match any characters of any length
[root@localhost test]# grep "h.*llo" file
hllo
hello
heello
hollo
hoollo
hoeollo
	# match one or more of the letter e
[root@localhost test]# grep "he\+llo" file
hello
heello
	# match zero or one of the letter e
[root@localhost test]# grep "he\?llo" file
hllo
hello
	# match exactly 2 of the letter o
[root@localhost test]# grep "ho\{2\}llo" file
hoollo
	# match 0 to 2 of the letter o
[root@localhost test]# grep "ho\{,2\}llo" file
hllo
hollo
hoollo
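Greedy matching is hard to see when grep prints whole lines, so the following sketch uses -o to print only the matched text. The input strings are invented for illustration, except xxxxyabc, which is the example from the greedy-mode description.

```shell
# "x*y" takes as many x characters as it can before the y.
greedy=$(printf 'xxxxyabc\n' | grep -o 'x*y')

# ".*" runs to the LAST "o" on the line, not the first one.
longest=$(printf 'hoeollo\n' | grep -o 'h.*o')

# An interval is greedy too: with three o's available, \{1,2\} takes two.
bounded=$(printf 'hooollo\n' | grep -o 'ho\{1,2\}')

# \{0,2\} is the portable spelling of "at most 2".
portable=$(printf 'hllo\nhollo\nhoollo\nhooollo\n' | grep 'ho\{0,2\}llo')

printf '%s\n' "$greedy" "$longest" "$bounded" "$portable"
```

The last command leaves out hooollo because three o's sit between the h and llo, which is more than the interval allows.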

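(3) Position anchors:

Anchors restrict where in the text the pattern is allowed to match.

^: start-of-line anchor; placed at the far left of PATTERN.

$: end-of-line anchor; placed at the far right of PATTERN.

\< or \b: start of word; placed on the left side of a word

\> or \b: end of word; placed on the right side of a word

\<PATTERN\>: whole-word anchor

Note: a word here means a run of consecutive letters, digits, and underscores.

^$: an empty line

^[[:space:]]\+$: a line that contains only whitespace characters (without the \+ the pattern would match only lines holding exactly one whitespace character)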

Examples:

	# contents of the test file
[root@localhost test]# cat file 
 here is a space.
here is not a space
	there is a tab
there is not a tab.
	# match lines that start with t or h
[root@localhost test]# grep "^[th]" file 
here is not a space
there is not a tab.
	# match lines that end with "."; "." is a regex metacharacter, so it must be escaped with "\"
[root@localhost test]# grep "\.$" file
 here is a space.
there is not a tab.
	# match the word "here"
[root@localhost test]# grep "\<here\>" file 
 here is a space.
here is not a space
	# match words that end with "here"
[root@localhost test]# grep "here\>" file 
 here is a space.
here is not a space
	there is a tab
there is not a tab.
	# match the word "here", this time with \b
[root@localhost test]# grep "\bhere\b" file 
 here is a space.
here is not a space
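A common use of the anchors is telling empty lines from whitespace-only lines. The sketch below counts both in an invented four-line input and also contrasts a plain search with a whole-word search. The data and variable names are assumptions; the \< and \> operators are the GNU forms used throughout this post.

```shell
# Hypothetical input: a text line, an empty line, a line holding
# only a space and a tab, and another text line.
data=$(printf 'text\n\n \t\nmore text\n')

# ^$ : truly empty lines only.
empty=$(printf '%s\n' "$data" | grep -c '^$')

# One or more whitespace characters and nothing else.
blank=$(printf '%s\n' "$data" | grep -c '^[[:space:]]\{1,\}$')

# Zero or more whitespace characters: covers both cases at once.
either=$(printf '%s\n' "$data" | grep -c '^[[:space:]]*$')

# -v with the same pattern keeps only lines that have real content.
content=$(printf '%s\n' "$data" | grep -v '^[[:space:]]*$')

# Without anchors, "here" is also found inside "there" ...
plain=$(printf 'there here\n' | grep -o 'here')
# ... with word anchors only the standalone word is matched.
word=$(printf 'there here\n' | grep -o '\<here\>')

printf '%s\n' "$empty" "$blank" "$either" "$content" "$plain" "$word"
```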

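(4) Grouping and back-references:

\(PATTERN\): grouping; treats PATTERN as a single unit, so that a quantifier or similar operation can apply to the whole group.

\n: back-reference, where n is a number (\1, \2, ...); refers to the text matched by an earlier group.

The text matched by each group is recorded automatically by the regular expression engine in internal variables, which can be referenced as \1, \2, ...

Note: the engine numbers groups by the position of their opening "\(", counting from left to right. For example:

pat1\(pat2\)pat3\(pat4\(pat5\)pat6\)

\1 refers to the text matched by pat2;

\2 refers to the text matched by pat4\(pat5\)pat6;

\3 refers to the text matched by pat5;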
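grep cannot print individual groups, so the numbering rule is easier to check with sed, which accepts the same BRE grouping syntax in its s command. This is a minimal sketch on the invented string abcdef, shaped like the pat1...pat6 example; the variable names are assumptions.

```shell
# Same shape as pat1\(pat2\)pat3\(pat4\(pat5\)pat6\):
#   a \( b \) c \( d \( e \) f \)
# Groups are numbered by their opening parenthesis, left to right.
numbered=$(printf 'abcdef\n' | sed 's/a\(b\)c\(d\(e\)f\)/[\1][\2][\3]/')

# A back-reference inside the pattern must repeat the captured text
# exactly: "\(ab\)\1" needs "abab", so "abba" does not match.
repeated=$(printf 'abab\nabba\n' | grep '\(ab\)\1')

printf '%s\n%s\n' "$numbered" "$repeated"
```

The first command prints [b][def][e]: group 2 is the outer group that opens before "d", and group 3 is the group nested inside it.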

Examples:

	# contents of the test file
[root@localhost test]# cat file 
please say hello with me.HELLO!
please say hello with me.hello!
please say BAY with me.bay!
please say BAY with me.BAY!
OK,good job.
	# match lines where "say" is followed by a lowercase word and the same lowercase word appears again later
[root@localhost test]# grep "say \(\<[[:lower:]]*\>\).*\1" file
please say hello with me.hello!
	# match lines where "say" is followed by an uppercase word and the same uppercase word appears again later
[root@localhost test]# grep "say \(\<[[:upper:]]*\>\).*\1" file
please say BAY with me.BAY!
	# match lines where "say" is followed by a word and the same word appears again later, ignoring case
[root@localhost test]# grep -i "say \(\<[[:upper:]]*\>\).*\1" file
please say hello with me.HELLO!
please say hello with me.hello!
please say BAY with me.bay!
please say BAY with me.BAY!

egrep

egrep [OPTIONS] PATTERN [FILE...]

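Equivalent to grep -E; the options and usage are the same.

Differences between extended and basic regular expressions:

(1) Extended regular expressions add alternation ("or") with "|"

(2) In extended regular expressions, quantifiers and grouping do not need the "\" escape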
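The two differences can be checked side by side. The sketch below runs equivalent BRE and ERE patterns over an invented word list; the data and variable names are assumptions for illustration.

```shell
# Hypothetical word list for this comparison.
words=$(printf 'hllo\nhello\nheello\nhollo\nhoollo\n')

# Interval: escaped braces in BRE, bare braces in ERE.
bre_interval=$(printf '%s\n' "$words" | grep 'he\{1,2\}llo')
ere_interval=$(printf '%s\n' "$words" | grep -E 'he{1,2}llo')

# Grouping plus "one or more": \( \) in BRE, ( ) and + in ERE.
bre_group=$(printf '%s\n' "$words" | grep '\(ho\)\{1,\}llo')
ere_group=$(printf '%s\n' "$words" | grep -E '(ho)+llo')

# Alternation with "|" in ERE.
ere_alt=$(printf '%s\n' "$words" | grep -E '^h(e|o)llo$')

# The flip side: an unescaped "{" is a literal character in BRE.
bre_literal=$(printf 'a{2}\naa\n' | grep 'a{2}')
ere_counted=$(printf 'a{2}\naa\n' | grep -E 'a{2}')

printf '%s\n' "$bre_interval" "$ere_interval" "$bre_group" "$ere_group" \
    "$ere_alt" "$bre_literal" "$ere_counted"
```

The last pair shows why a pattern cannot simply be moved between grep and grep -E: the same text a{2} means a literal string in one and "two a's" in the other.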

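Extended regular expressions:

(1) Metacharacters:

.: matches any single character

[]: matches any single character in the listed set or range

[^ ]: matches any single character not in the listed set or range

(2) Quantifiers:

*: matches any number of times (zero or more)

.*: matches any characters of any length

+: matches one or more times

?: matches zero or one time

{m}: matches exactly m times

{m,n}: matches m to n times

{,n}: matches 0 to n times

{m,}: matches m or more times

(3) Position anchors:

Anchors restrict where in the text the pattern is allowed to match.

^: start-of-line anchor; placed at the far left of PATTERN.

$: end-of-line anchor; placed at the far right of PATTERN.

\< or \b: start of word; placed on the left side of a word

\> or \b: end of word; placed on the right side of a word

\<PATTERN\>: whole-word anchor

Note: unlike the other metacharacters, the "\" in front of the word anchors cannot be omitted

(4) Grouping and back-references:

(PATTERN): grouping; treats PATTERN as a single unit, so that a quantifier or similar operation can apply to the whole group.

\n: back-reference, where n is a number (\1, \2, ...); refers to the text matched by an earlier group.

(5) Alternation:

|: a|b matches either a or b; take care to combine it with grouping

For example, egrep "C|cat" FILE matches C or cat;

egrep "(C|c)at" FILE matches cat or Cat;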

Examples:

	# show the lines of /etc/passwd that do not end with bash
[root@localhost test]# egrep -v "bash$"  /etc/passwd
	# find the three-digit or four-digit numbers in /etc/passwd
[root@localhost test]# egrep "\b[0-9]{3,4}\b" /etc/passwd
	# show the entries for the users root, centos, or slackware on the current system
[root@localhost test]# egrep "^(root|centos|slackware)\>" /etc/passwd
	# echo an absolute path and use egrep to extract its base name
[root@localhost test]# echo "/etc/fstab" | egrep -o "[^/]+/?$"
	# find the numbers between 100 and 255 in the output of ifconfig
[root@localhost test]# ifconfig | egrep -e "\<1[0-9][0-9]\>" -e "\<2[0-4][0-9]\>" -e "\<25[0-5]\>"
	# using "|" with egrep
[root@localhost test]# egrep "H|hello" file
Hello
hello
H
[root@localhost test]# egrep -o "H|hello" file
H
hello
H
[root@localhost test]# egrep -o "(H|h)ello" file
Hello
hello
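The 100-255 search above uses three -e patterns. As a sketch of how alternation and grouping combine, the same range can be written as one ERE. The input line here is invented so that the result does not depend on a particular machine's ifconfig output, and grep -Eo stands in for egrep -o.

```shell
# Hypothetical line containing numbers inside and outside 100-255.
line='10.0.99.100 mtu 1500 mask 255.255.0.0 id 256'

# One group with three alternatives, wrapped in word anchors:
#   1[0-9]{2}    -> 100-199
#   2[0-4][0-9]  -> 200-249
#   25[0-5]      -> 250-255
in_range=$(printf '%s\n' "$line" |
    grep -Eo '\<(1[0-9]{2}|2[0-4][0-9]|25[0-5])\>')

printf '%s\n' "$in_range"
```

The word anchors sit outside the group so they apply to every alternative. Without them, 150 would be picked out of 1500.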

Reposted from: https://blog.51cto.com/1036416056/1748896