
Elasticsearch Index Mapping Definition

Source: original 2024/9/20 18:52:16

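Introduction

Mapping is one of the most important parts of an ES index's configuration, similar to a schema in a relational database. A mapping determines the data type of each document field along with other field properties, such as whether the field is indexed and which analyzer is applied to it. Beyond that, the mapping also determines how documents are stored and retrieved: a poorly designed mapping can degrade index performance and produce inaccurate search results.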

The following example shows one of the simplest possible mapping configurations:

PUT users
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "gender": { "type": "keyword" }
    }
  }
}

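Mapping definition

An index is a collection of documents, a document is a collection of fields, and each field has its own data type. When mapping your data, you create a mapping definition that contains the list of fields relevant to the document. A mapping definition also includes metadata fields, such as the _source field, which customize how a document's associated metadata is handled.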

Types of mapping

ES index mappings come in two forms: dynamic mapping and explicit mapping.

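Dynamic mapping

The advantage of dynamic mapping is that developers can index documents directly, without defining a mapping or specifying field names and types up front, which makes it quick to get started. The drawback is that the mapping ES infers may not be ideal, although this can be addressed by configuring dynamic templates.

When ES detects a new field in a document, by default it adds the field to the mapping dynamically. You can disable this by setting the dynamic parameter of the mapping to false. Dynamic mapping infers the type from the JSON value: true or false becomes boolean, a floating-point number becomes float, an integer becomes long, an object becomes object, an array is mapped according to its first non-null element, and a string becomes date if it passes date detection and otherwise text with a keyword sub-field. Any other type, such as ip or geo_point, has to be mapped explicitly.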

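In the following example we create a "users" index and index a document without defining a mapping; ES creates the mapping dynamically based on the JSON types of the document's fields:

// Create the index
PUT users

// Index a document
POST users/_doc
{
  "name": "Lisa"
}

// View the index mapping
GET users/_mapping

// Response
{
  "users": {
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}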
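The dynamic templates mentioned above can be sketched as follows. This is a minimal illustration rather than a recommended configuration: the template name strings_as_keywords is an arbitrary label chosen here, and the rule assumes every newly detected string field should be mapped as keyword instead of the default text with a keyword sub-field.

```
// Map every dynamically detected string field as keyword
// ("strings_as_keywords" is a hypothetical template name)
PUT users
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}
```

With this template in place, indexing {"name": "Lisa"} into a fresh index should produce a name field of type keyword, with no text analysis applied.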

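Explicit mapping

Although dynamic mapping is convenient, explicit mapping is recommended in most cases, since we know our data better than ES does. With explicit mapping, the developer defines the field mappings when creating the index, much like a schema in a relational database: the data is modeled before any document is indexed.

For safety, you can set the mapping's dynamic parameter to strict, in which case ES rejects a document that contains a new field with the error shown below. If the parameter is set to false instead, documents with new fields are still accepted and the new fields remain in _source, but they are not indexed and cannot be searched.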

{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "[3:9] mapping set to strict, dynamic introduction of [age] within [_doc] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "[3:9] mapping set to strict, dynamic introduction of [age] within [_doc] is not allowed"
  },
  "status": 400
}
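For reference, a request sequence that would lead to this kind of error might look like the sketch below. It assumes the strict behavior is configured through the dynamic parameter inside the mapping body, and the age field in the document is simply a field that the mapping does not declare.

```
// Create the index with a strict mapping
PUT users
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "keyword" },
      "gender": { "type": "keyword" }
    }
  }
}

// Index a document with an undeclared field "age"; ES rejects it
POST users/_doc
{
  "name": "Lisa",
  "age": 18
}
```

Replacing "strict" with false in the same position makes ES accept the document while leaving age unindexed.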

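Runtime fields

ES index mappings also support runtime fields. As the name suggests, a runtime field is computed at query time rather than at index time; it can be defined either in the mapping or in the search request.

Runtime fields have two main advantages. First, because they are not indexed, they take up no additional storage. Second, they can be used like any other field, for example in sorting and aggregations. They also have a drawback: because they are not indexed, their values must be computed from the document's data at query time, which takes time, and searching on them is slower still. Keep performance in mind when using runtime fields and balance search performance against flexibility.

In the following example, the users index stores only first_name and last_name, while full_name is implemented as a runtime field that needs no extra storage.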

PUT users
{
  "mappings": {
    "runtime": {
      "full_name": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['first_name'].value+' '+doc['last_name'].value)"
        }
      }
    },
    "properties": {
      "first_name": { "type": "keyword" },
      "last_name": { "type": "keyword" }
    }
  }
}

Next, index a document and run a search that asks for full_name to be returned:

// Index a document
POST users/_doc
{
  "first_name": "Michael",
  "last_name": "Jordan"
}

// Search and request the runtime field
GET users/_search
{
  "fields": ["full_name"]
}

// Response
{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1,
    "hits": [
      {
        "_index": "users",
        "_id": "4Os9mY4BODFb3LbQ3HQN",
        "_score": 1,
        "_source": {
          "first_name": "Michael",
          "last_name": "Jordan"
        },
        "fields": {
          "full_name": ["Michael Jordan"]
        }
      }
    ]
  }
}

Alternatively, you can define the runtime field in the search request itself, with the same effect:

GET users/_search
{
  "fields": ["full_name_v2"],
  "runtime_mappings": {
    "full_name_v2": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['first_name'].value+' '+doc['last_name'].value)"
      }
    }
  }
}
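Since a runtime field can be used like any other field, it can also drive sorting and aggregations. The sketch below assumes the users index from the mapping example above, where full_name is declared in the runtime section; the aggregation name by_full_name is a label chosen here for illustration.

```
// Sort hits by the runtime field and count documents per full name
GET users/_search
{
  "sort": [
    { "full_name": "asc" }
  ],
  "aggs": {
    "by_full_name": {
      "terms": { "field": "full_name" }
    }
  }
}
```

The script runs for every document the request touches, so on a large index this kind of query is where the performance cost of runtime fields tends to show up.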

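Data types

ES supports a large number of data types. Field types are grouped by family; types in the same family have exactly the same search behavior but may differ in space usage or performance characteristics.

Commonly used data types in ES:

binary: binary value encoded as a Base64 string
boolean: boolean type, with true and false values
Keywords: the keyword family, used for exact matching; includes keyword, constant_keyword, and wildcard
Dates: the date family; includes date and date_nanos
object: JSON object type
flattened: flattened object type, which treats an entire JSON object as a single field value and avoids a field explosion
nested: nested type, for arrays of objects whose elements need to be queried independently of each other
join: defines a parent/child relation between documents in the same index
Range: range types; includes long_range, double_range, date_range, and ip_range
ip: IP address type
text: text type, used for full-text search
geo_point: geographic coordinate type
Multi-fields: indexes the same field in different ways for different purposes; for example, when one field is needed for both full-text search and aggregations, it can be defined as both text and keyword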
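To make the list more concrete, here is a sketch of one mapping that combines several of these types. All field names (active, created_at, last_ip, location, valid_period, labels, addresses) are hypothetical and chosen only to illustrate where each type might fit.

```
PUT users
{
  "mappings": {
    "properties": {
      // Multi-field: full-text search on name, exact match on name.keyword
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "active":       { "type": "boolean" },
      "created_at":   { "type": "date" },
      "last_ip":      { "type": "ip" },
      "location":     { "type": "geo_point" },
      "valid_period": { "type": "date_range" },
      // Arbitrary key/value labels kept as a single flattened field
      "labels":       { "type": "flattened" },
      // Each address object is indexed as its own hidden nested document
      "addresses": {
        "type": "nested",
        "properties": {
          "city": { "type": "keyword" }
        }
      }
    }
  }
}
```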

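doc_values

Setting doc_values to true builds a column-oriented forward index for a field. An inverted index is very well suited to full-text search, but it is of little help for sorting and aggregations, which is where a forward index excels. ES therefore enables doc_values by default for all field types that support it, which excludes text. This takes extra storage but improves sorting and aggregation performance. If you are certain that a field will never be used for sorting, aggregations, script access, or geo filtering, you can disable doc_values to save storage.

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}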

fielddata

By default, a text field can be searched but cannot be used for sorting, aggregations, or scripting, because text values are analyzed into terms before being indexed and the text type does not support doc_values. Forcing an aggregation on a text field raises an exception:

// Create the index
PUT users
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "gender": { "type": "keyword" }
    }
  }
}

// Aggregate on the text field
GET users/_search
{
  "size": 0,
  "aggs": {
    "name_count": {
      "terms": { "field": "name" }
    }
  }
}

// Result
"error": {
  "root_cause": [
    {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on [name] in [users]. Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [name] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
    }
  ]
}

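What if you really must aggregate on a text field? You can enable the field's fielddata parameter.
fielddata is an in-memory data structure: ES reads the field's entire inverted index from disk, inverts the term-to-document relationship, and builds the fielddata in memory for use in sorting, aggregations, and similar operations. Building it is therefore expensive, so it is disabled by default and enabling it is generally not recommended.

In the following example, fielddata is enabled on a text field so that the field can be used in aggregations:

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fielddata": true
      },
      "gender": { "type": "keyword" }
    }
  }
}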

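Because building fielddata in memory is so expensive, if you really need both full-text search and sorting or aggregations on the same text field, it is better to use multi-fields and map the field as both text and keyword:

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "gender": { "type": "keyword" }
    }
  }
}

// Aggregate on name.keyword
GET users/_search
{
  "size": 0,
  "aggs": {
    "name_count": {
      "terms": { "field": "name.keyword" }
    }
  }
}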

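_source

By default, every document has a "_source" field that holds the original document as it was indexed. The _source field itself is stored but not indexed, which means it cannot be searched, but it is returned along with each hit.

In the following example, we index a user and the search returns the original document:

// Index a document
POST users/_doc
{
  "name": "张三",
  "gender": "男"
}

// Search
GET users/_search

// Response
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1,
    "hits": [
      {
        "_index": "users",
        "_id": "6-uhnI4BODFb3LbQGHSD",
        "_score": 1,
        "_source": {
          "name": "张三",
          "gender": "男"
        }
      }
    ]
  }
}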

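The _source field takes up extra storage. If you only need to search documents and never need the original document back, you can consider disabling it to save space.

PUT users
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "name": { "type": "keyword" },
      "gender": { "type": "keyword" }
    }
  }
}

Once _source is disabled, the update, update_by_query, and reindex APIs and highlighting are no longer available, because ES no longer has the original document.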

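store

By default, field values are indexed but not stored. This means you can search on a field but cannot retrieve its original value from the field itself. That is usually not a problem, because the _source field already contains every field value.
However, if _source has been disabled, or if you want to extract just a few specific fields rather than return the whole original document, you can enable the store parameter on individual fields so that they are stored separately.

In the following example, store is enabled on the name field so that a search can return only the name value:

PUT users
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "name": {
        "type": "keyword",
        "store": true
      },
      "gender": { "type": "keyword" }
    }
  }
}

// Search, requesting only the stored field
GET users/_search
{
  "stored_fields": ["name"]
}

// Response
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1,
    "hits": [
      {
        "_index": "users",
        "_id": "7Ou4nI4BODFb3LbQ93Sw",
        "_score": 1,
        "fields": {
          "name": ["张三"]
        }
      }
    ]
  }
}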
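For the case where _source is still enabled and only a few fields are wanted in the response, source filtering in the search request is a lighter option that needs no mapping change. A minimal sketch, assuming a users index whose _source has not been disabled:

```
// Return only the name field from each hit's _source
GET users/_search
{
  "_source": ["name"]
}
```

Each hit then carries a _source object containing only name. The trade-off is that ES still loads the whole stored _source and filters it, whereas a field with store enabled is read on its own.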

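null_value

By default, null values are neither indexed nor searchable: when a document field is null, ES treats the field as having no value. Business requirements, however, may call for searching on null values.

The following example tries to find users whose gender is null:

// Create the index
PUT users
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "gender": { "type": "keyword" }
    }
  }
}

// Index a document
POST users/_doc
{
  "name": "张三",
  "gender": null
}

// Search for a null gender
GET users/_search
{
  "query": {
    "term": {
      "gender": { "value": null }
    }
  }
}

The search fails with an exception, because the query value cannot be null:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "value cannot be null"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "value cannot be null"
  },
  "status": 400
}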

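In this case we can use ES's null_value parameter to replace null with a given value at index time, so that null values can be indexed and searched.
In the following example, the string "NULL" stands in for null:

PUT users
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "gender": {
        "type": "keyword",
        "null_value": "NULL"
      }
    }
  }
}

After indexing the document again, the search finds the document whose gender is null. Note that _source still shows the original null; only the indexed value is replaced.

GET users/_search
{
  "query": {
    "term": {
      "gender": { "value": "NULL" }
    }
  }
}

// Response
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "users",
        "_id": "7uvEnI4BODFb3LbQB3QL",
        "_score": 0.2876821,
        "_source": {
          "name": "张三",
          "gender": null
        }
      }
    ]
  }
}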
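When the goal is only to find documents that have no value for a field, a common alternative that does not require null_value is to negate an exists query. The sketch below assumes a users index in which gender is mapped as keyword without null_value; if null_value is configured, an explicit null is indexed as the replacement value, so those documents would count as having a value and would not be matched.

```
// Find documents where gender is null or missing
GET users/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "gender" }
      }
    }
  }
}
```

Unlike the null_value approach, this query cannot tell an explicit null apart from a field that was never sent.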