当前位置: 首页 > news >正文

hive sql 处理多层 json 数组

1. 背景

json 字符串值数据示例:

{"score": 1,"submitTime": 1712491933,"answerFlag": 1,"groupId": 1755547960,"answers": [{"value": "[1, 2, 3]","ids": [4,5,6],"isPic": 0,"duration": 22314,"status": 1,"tid": 1},{"value": "aabbcc","lessons": [44,55,66],"isPic": 0,"duration": 22314,"status": 2,"tid": 2}],"questionType": 65
}

现在这个 json 字符串形式的字段值在 hive 表的某个字段中,我需要获取到 “answers” 这个 json 数组,然后将其按照数组长度,列转行到多行数据,每行数据一个子 json ,并且从中获取到每个子 json 的 “tid” 和 “status” 值,理想情况下,我需要这行数据处理完之后,结果如下表所示。

tidstatus
11
22

2. 常见方案

通过 ge_json_object 函数,先获取到 “answers” 对应的 json 数组字符串,然后通过正则替换掉 [] 符号,之后将 },{ 符号替换为 }我是分隔符{,最后将 我是分隔符 作为 split 函数的分隔符号,将字符串分割,再通过 lateral view explode() 语法,将数组放到多行。多行数据,都是处理好的 json 对象,之后通过 get_json_object 函数后取需要的字段值即可。具体代码示例如下。

  1. 获取 “answers” 对应的 json 数组。

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select get_json_object(col, '$.answers') as answers
    from table1
    

    结果如下所示,正常获取到 “answers” 下的 json 数组,结果为字符串。

    answers
    [{“value”:“[1, 2, 3]”,“ids”:[4,5,6],“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:[44,55,66],“isPic”:0,“duration”:22314,“status”:2,“tid”:2}]
  2. 将最外层的 [] 符号去掉

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select regexp_replace(answers, '\\[|\\]', '') as answers
    from (select get_json_object(col, '$.answers') as answersfrom table1) as a
    

    结果如下所示。

    answers
    {“value”:“1, 2, 3”,“ids”:4,5,6,“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:44,55,66,“isPic”:0,“duration”:22314,“status”:2,“tid”:2}

    由于第一步处理的结果最外层是 json 数组,左右有 [] 符号,但是由于内层还有子 json 数组,这种直接全局替换的方式,会将内层子 json 数组的 [] 符号也一并去掉,可以查看下面的结果,"ids":4,5,6,,就是因为全局替换,造成整个 json 结构被破坏,之后将无法使用 get_json_object() 函数来获取想要的 key 对应的值了。

由此可见,这种方式只适合于 “answers” 下的 json 数组内的每个 json 对象中都只包含 json 对象才行,不能再包含 json 数组,否则就会造成处理错误,拿不到想要的数据。

3. 推荐方案

3.1 具体步骤

  1. 获取 “answers” 对应的 json 数组。

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select get_json_object(col, '$.answers') as answers
    from table1
    

    结果如下所示,正常获取到 “answers” 下的 json 数组,结果为字符串。

    answers
    [{“value”:“[1, 2, 3]”,“ids”:[4,5,6],“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:[44,55,66],“isPic”:0,“duration”:22314,“status”:2,“tid”:2}]
  2. 通过字符串截取的方式,将第一步的结果最前面和最后面的 [] 符号去掉

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select substring(answers, 2, length(answers) - 2) as answers
    from (select get_json_object(col, '$.answers') as answersfrom table1) as a
    

    结果如下所示,最前面和最后面的 [] 符号已经被去掉。

    answers
    {“value”:“[1, 2, 3]”,“ids”:[4,5,6],“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:[44,55,66],“isPic”:0,“duration”:22314,“status”:2,“tid”:2}
  3. 最重要的一步:通过正则替换,只要匹配到 [.*] 内容,就直接替换为数字 0。

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select substring(answers, 1, length(answers) - 3) as answers
    from (select get_json_object(col, '$.answers') as answersfrom table1) as a
    

    结果如下所示,字符串中所有 [.*] 的部分,都已经被替换为数字 0

    answers
    {“value”:“0”,“ids”:0,“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:0,“isPic”:0,“duration”:22314,“status”:2,“tid”:2}

    将 json 数字替换为数字 0,是为了兼容 "[.*]"[.*] 两种情况,不管 json 数组是不是被英文双引号包围,替换为数字 0,都是没问题的,都可以保证 json 的格式不被破坏。

  4. 由于已经去掉了所有的子 json 数组,之后就可以按照传统的方式,将 },{ 替换为 }我是分隔符号{

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select regexp_replace(a.answers, '\\}\\,\\{', '\\}我是分隔符号{') as answers
    from (select regexp_replace(answers, '\\[.*\\]', '0') as answersfrom (select substring(answers, 2, length(answers) - 2) as answersfrom (select get_json_object(col, '$.answers') as answersfrom table1) as a) as a) as a
    
    answers
    {“value”:“0”,“ids”:0,“isPic”:0,“duration”:22314,“status”:1,“tid”:1}我是分隔符号{“value”:“aabbcc”,“lessons”:0,“isPic”:0,“duration”:22314,“status”:2,“tid”:2}
  5. 通过 我是分隔符号 切换字符串,再通过 lateral view explode() 语法将 json 数组展开,最后通过 get_json_object() 函数,获取需要的值即可。

    with table1 as (select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )select get_json_object(b.answer, '$.tid') as tid, get_json_object(b.answer, '$.status') as status
    from (select regexp_replace(a.answers, '\\}\\,\\{', '\\}我是分隔符号{') as answersfrom (select regexp_replace(answers, '\\[.*\\]', '0') as answersfrom (select substring(answers, 2, length(answers) - 2) as answersfrom (select get_json_object(col, '$.answers') as answersfrom table1) as a) as a) as a) as a
    lateral view explode(split(regexp_replace(a.answers, '\\}\\,\\{', '\\}我是分隔符号{'), '我是分隔符号')) b as answer
    

    结果如下,可以看到,数据处理结果符合预期

    tidstatus
    11
    22

3.2 注意事项

上面的步骤 3,只适用于你想要展开最外层的这个 json 数组,并且完全不需要内部嵌套的子 json 数组才行,否则将内部的子 json 数组全部替换为数字 0 之后,你就获取不到子 json 数组数据了。

如果还想要内部的子 json 数组,单纯的 sql 应该是实现不了的,需要去自定义 udf,然后通过 java 代码一层一层解析了。

相关文章:

  • 北京网站建设多少钱?
  • 辽宁网页制作哪家好_网站建设
  • 高端品牌网站建设_汉中网站制作
  • 用R语言运用 Shiny 包打造基于鸢尾花数据集的交互式数据可视化应用
  • 探索Delphi的图形与图像处理能力:从基础到高级应用
  • 获利能力段-新增自定义特性、特性派生增强
  • 手撕初阶数据结构之---排序
  • SpringBoot的使用的注解以及监控监控和lombok使用
  • 数据结构——循环队列
  • C#高级:通过一个遍历实体的小案例去理解反射(基础版)
  • Python之字符串的创建、索引和分片
  • 深入理解 GO 语言并发
  • 双配置视觉 Transformer 在多模态中的突破 !
  • Linux服务器:Samba配置
  • Java - 正则表达式
  • Memecoin的火爆与AMM在Solana上的主导地位
  • 嵌入式八股-C++面试30题(20240814)
  • Hibernate Session在项目中的创建方式
  • IE9 : DOM Exception: INVALID_CHARACTER_ERR (5)
  • 【162天】黑马程序员27天视频学习笔记【Day02-上】
  • CentOS学习笔记 - 12. Nginx搭建Centos7.5远程repo
  • CNN 在图像分割中的简史:从 R-CNN 到 Mask R-CNN
  • create-react-app项目添加less配置
  • CSS相对定位
  • exif信息对照
  • java2019面试题北京
  • magento2项目上线注意事项
  • Promise面试题2实现异步串行执行
  • React 快速上手 - 07 前端路由 react-router
  • Redis的resp协议
  • 得到一个数组中任意X个元素的所有组合 即C(n,m)
  • 模型微调
  • 人脸识别最新开发经验demo
  • 使用agvtool更改app version/build
  • 为什么要用IPython/Jupyter?
  • 系统认识JavaScript正则表达式
  • 协程
  • 【运维趟坑回忆录】vpc迁移 - 吃螃蟹之路
  • zabbix3.2监控linux磁盘IO
  • 长三角G60科创走廊智能驾驶产业联盟揭牌成立,近80家企业助力智能驾驶行业发展 ...
  • !!【OpenCV学习】计算两幅图像的重叠区域
  • # 移动硬盘误操作制作为启动盘数据恢复问题
  • #Datawhale X 李宏毅苹果书 AI夏令营#3.13.2局部极小值与鞍点批量和动量
  • #pragam once 和 #ifndef 预编译头
  • #宝哥教你#查看jquery绑定的事件函数
  • $.ajax,axios,fetch三种ajax请求的区别
  • $.ajax中的eval及dataType
  • (4)事件处理——(6)给.ready()回调函数传递一个参数(Passing an argument to the .ready() callback)...
  • (Redis使用系列) Springboot 整合Redisson 实现分布式锁 七
  • (八)Flink Join 连接
  • (八)五种元启发算法(DBO、LO、SWO、COA、LSO、KOA、GRO)求解无人机路径规划MATLAB
  • (二)【Jmeter】专栏实战项目靶场drupal部署
  • (二)延时任务篇——通过redis的key监听,实现延迟任务实战
  • (附源码)springboot 房产中介系统 毕业设计 312341
  • (附源码)springboot宠物管理系统 毕业设计 121654
  • (附源码)ssm基于web技术的医务志愿者管理系统 毕业设计 100910
  • (九)c52学习之旅-定时器
  • (十一)图像的罗伯特梯度锐化