当前位置: 首页 > news >正文

基于Skywalking开发分布式监控(四)一个案例

上一篇我们简单介绍了基于SkyWalking自定义增强的基本架构,即通过把Trace数据导入数据加工模块进行加工,进行持久化,并赋能grafana展示。

现在我们给出一个例子,对于量化交易系统,市场交易订单提交,该订单可以走模拟盘也可以走实盘,可以自动提交,也可以走人工提交,订单提交后,会把交易所给到的订单信息反馈回来。 需要监控的需求很简单:可以按,自动实盘/虚拟盘,人工实盘/虚拟盘订单分类监控,提交和反馈流程,满足指标项:

1 每分钟延时、延时百分位(P50/75/90/95/99 MAX)、每分钟请求数,排名前5的慢请求等监控项(metrics)

2 以及按排名前5的慢请求对应的SPAN进行抓取,分析出最慢的SPAN

那么SW原生监控有啥问题呢?
1 需要根据该流程在不同阶段的特征才能定位该流程,按Trace-Span模型来说,即需要一个Trace链根据不同Span提供的特征才能抓取该Trace,SW并不支持

例如 分辨人工/自动订单实际上是按Trace相关EndpointName来的
人工订单走页面,EntrySpan的 endpointName为POST:/api/trade/order/send
但自动订单由程序发起,EntrySpan的 endpointName为“rpc.OrderTradeService.send

而分辨是否走实盘/虚拟盘,则是在后续Span,按tag systemFlag=1或2,来确认

在这里插入图片描述
而SW的搜索显然是不支持的

  1. 问题2 反馈消息是根据交易所API生成的,不是一个标准通讯架构,只能根据自定义用户增强(customize-enhance),生成的localSpan形成跟踪链,那SW原生Trace查询根本没法按endpoint名字搜索,只能按tag搜索,然后按时间取定位,效率非常低
  2. 还有一个上一篇说了,SW对Trace和Span不提供metric聚合项

那增强计算模块怎么解决上述问题
对问题1: 按人工、自动、虚拟、实盘,形成4个搜索项,然后定时(基本)同时执行,把搜索结果叠加到ES索引中,按订单编号trade_id更新索引项,利用ES的向量特征形加上业务标签,供下游按业务标签定位需要的Trace

对问题2: 按预先设计的Tag值标识反馈消息,然后按Tag搜索,把搜索结果叠加到ES索引中,按订单编号trade_id更新索引项,利用ES的向量特征形加上业务标签,供下游按业务标签定位需要的Trace

对问题3 按业务标签计算各监控项(metrics),并按时间点汇总最慢的5个Trace,查找Span

我们按配置config来说明
关于问题1,我们配置了4个搜索项

"tasks" : [{  #查找按EndpointName=rpc.OrderTradeService.send查找自动订单,并且在ES索引中增加业务标签 businessTag:: Auto"name": "task.QueryTraces",       "para" : {"serviceName" : "TradeService","endpointName" : "rpc.OrderTradeService.send","businessTag" : { "key": "businessTag", "value": "Auto"},"tags" : {},"traces_index" :  "traces-"    #索引名,xx-后面跟着日期},"switch" : "on",      #搜索项有效"interval" : "60"      #每隔60秒执行一次},{ #查找按EndpointName=POST:/api/trade/order/send查找人工订单,并且在ES索引中增加业务标签 businessTag:: manual"name" : "task.QueryTraces","para" : {"serviceName" : "TradeService","endpointName" : "POST:/api/trade/order/send","businessTag" : { "key": "businessTag", "value": "manual"},"tags" : {},"traces_index" :  "traces-"},"switch" : "on","interval" : "60"},{  #查找按tag: systemFlag=1 查找人工订单,并且在ES索引中增加业务标签 systemFlag:: 1 (实盘)"name" : "task.QueryTraces","para" : {"serviceName" : "TradeService","endpointName" : "","businessTag" : { "key": "systemFlag", "value": "sim"},"tags" : { "key": "systemFlag", "value": "1"},"traces_index" :  "traces-"},"switch" : "on","interval" : "60"},{   #查找按tag: systemFlag=2 查找人工订单,并且在ES索引中增加业务标签 systemFlag:: 2 (实盘)"name" : "task.QueryTraces","para" : {"serviceName" : "TradeService","endpointName" : "","businessTag" : { "key": "systemFlag", "value": "RealTime"},"tags" : { "key": "systemFlag", "value": "2"},"traces_index" :  "traces-"},"switch" : "on","interval" : "60"},

task.QueryTraces是查询程序,按每分钟1次的节奏,按Graphql接口查询,需要用到的接口,按ServiceName按SW内置查询searchService接口查ServiceId , 按SW内置查询searchEndpoint接口查EndpointId
然后根据ServiceId , EndpointId调用,或者ServiceId和预置Tag,按SW内置查询接口queryBasicTraces查询相关Traces,注意点如下:
1 查询窗口要注意,也就是要防止Trace形成前执行查询语句,建议做成滑动窗口,可以调节窗口的大小,或者隔几秒多试几次(比如10秒执行3次)
2 要注意应用多页查询,queryBasicTraces有页数限制,一次最多1000条,要查全需要比较完整多页查询结构
查询完更新ES索引之后
在这里插入图片描述
很容易根据业务标签,获取我们所需的Traces

同理对问题2,我们引入配置文件,实际上我们利用FIX报文msgtype=8 报文的特征来标识反馈消息,然后按ordStatus,表示是否是成交或者订单有效的报文,即按tags msgType=8, ordStatus=2/0 查询相关Traces

{"name" : "task.QueryTraces","para" : {"serviceName" : "APIService","endpointName" : "","businessTag" : { "key": "OrdStatus", "value": "deal"},"tags" : [{ "key": "msgType", "value": "8"},{"key": "ordStatus","value": "2"}],"traces_index" :  "traces-"},"switch" : "on","interval" : "60"},{"name" : "task.TracesQueryInfo","para" : {"serviceName" : "APIService","endpointName" : "","businessTag" : { "key": "OrdStatus", "value": "effect"},"tags" : [{ "key": "msgType", "value": "8"},{"key": "ordStatus","value": "0"}],"traces_index" :  "traces-"},"switch" : "on","interval" : "60"},

对于问题3,我们配置两种计算模块: 一是 task.Caculator用于计算各类Metrics,与SW无关,二是 task.SpanInfo计算 ES索引库中 按大于95%分位数延时的慢Traces,逐条查找全部Span

{ # 按业务标签查人工实盘的订单traces(businessTag=manual,systemFlag=RealTime),计算监控项"name": "task.Caculator","para" : {"businessTags" :[{ "key": "businessTag", "value": "manual"},{"key": "systemFlag","value": "RealTime"}],"traces_index" :  "traces-",    # 源索引"stat_index" : "traces_index-"   #监控项索引},"switch" : "on","interval" : "60","delay" : 10      # 比源索引执行慢10秒},{  # 按业务标签查自动虚拟盘的订单traces(businessTag=auto,systemFlag=sim),计算监控项"name": "task.Caculator","para" : {"businessTags" :[{ "key": "businessTag", "value": "Auto"},{"key": "systemFlag","value": "sim"}],"traces_index" :  "traces-","stat_index" : "traces_index-"},"switch" : "on","interval" : "60","delay" : 10},{ # 按业务标签查自动实盘的订单traces(businessTag=auto,systemFlag=Realtime),计算监控项"name": "task.Caculator","para" : {"businessTags" :[{ "key": "businessTag", "value": "Auto"},{"key": "systemFlag","value": "RealTime"}],"traces_index" :  "traces-","stat_index" : "traces_index-"},"switch" : "on","interval" : "60","delay" : 10},{ # 按业务标签查反馈提交有效订单(OrdStatus=effect,systemFlag=Realtime),计算监控项"name": "task.Caculator","para" : {"businessTags" : { "key": "OrdStatus", "value": "effect"},"traces_index" :  "traces-","stat_index" : "traces_index-"},"switch" : "on","interval" : "60","delay" : 10},{ # 计算 ES索引库中 按大于95%分位数延时的慢Traces,逐条查找全部Span"name": "task.SpanInfo","para" : {"percentile" : 0.95,"traces_index" :  "traces-","span_index" : "traces_index-"},"switch" : "on","interval" : "60","delay" : 10}

我们看一下订单提交计算结果索引
在这里插入图片描述

以及慢Trace相关Span的索引
在这里插入图片描述
关于task.QueryTraces,task.Caculator,task.SpanInfo,主要代码如下
task.QueryTraces

public class QueryTraces extends AbstractTraceQuery implements TaskService,Runnable{private static final Lock lock = new ReentrantLock();  //对不同任务的竞争性资源加锁ObjectMapper objectMapper = new ObjectMapper();String serviceName,serviceId,endpointName,endpointId,traces_index;ArrayNode businessTags;JsonNode businessTag,tags;DatasourceService datasource;TargetdbService targetdb;@Overridepublic void run() {logger.info("QueryInfo begin...");if("".equals(serviceId)){//防止获取不到serviceIdserviceId=this.datasource.queryServiceId(serviceName);if("".equals(serviceId)){//第二次获取不成功就终止线程logger.error("query serviceId fail");return;}}if(endpointName.equals("")){//检查tags是否为空,为空就终止线程if(tags.isNull() || tags.isMissingNode()) {logger.error("endpointName & tags is both empty");return;}} else{if("".equals(endpointId)){//防止获取不到endpointIdendpointId=this.datasource.queryEndPointId(endpointName,serviceName);if("".equals(endpointId)){//第二次获取不成功就终止线程logger.error("query endpointId fail");return;}}}targetdb.createForm(traces_index);String endTime=getTimeEndPoint(1,40);String startTime=getTimeEndPoint(3,41);int retry=3;  //重试次数int lastArraylistSize=0;ArrayNode traceList= JsonNodeFactory.instance.arrayNode();logger.info("QueryInfo startTime:: {}  endTime:: {}",startTime,endTime);try{while(retry>0){//查询SW的traces数据,注意有可能需要分页查询traceList=getMultiPageResult(datasource,serviceId,endpointId,startTime,endTime,tags);logger.info("traceList:: {} retry:: {}",traceList.toString(),retry);if(traceList.size()>lastArraylistSize){//如果查到结果,打业务标签,并按TraceId调批量更新目标库lastArraylistSize=traceList.size();Map<String, List<Map<String,Object>>> traceMap = genTraceMap(businessTags, traceList); //结果集合targetdb.updateDate(traces_index,traceMap);//打时间戳logger.info("TracesQuery update is done. {}",System.currentTimeMillis());}try {// 暂停执行5秒钟Thread.sleep(5000);} catch (InterruptedException e) {e.printStackTrace();}retry--;}}catch (Exception e) {e.printStackTrace();return;}}@Overridepublic void init(JsonNode paraData, DatasourceService datasourceService, TargetdbService targetdbService) {......}
}

task.Caculator

public class Caculator extends AbstractTraceQuery implements TaskService,Runnable {private final static Logger logger = LoggerFactory.getLogger(TracesQueryInfo.class);private static final Lock lock = new ReentrantLock();  //对不同任务的竞争性资源加锁String traces_index, stat_index;ArrayNode businessTags;JsonNode businessTag;DatasourceService datasource;TargetdbService targetdb;private Map<String,Object> traceProcess(Map<String,Object> sourceMap){//处理traces查询结果AtomicInteger durationSum= new AtomicInteger();AtomicInteger count= new AtomicInteger();AtomicInteger maxDuration=new AtomicInteger();double durationAvg,p50,p75,p90,p95,p99;ArrayList<Integer> durationArray = new ArrayList<>();;  //延时集合,用于计算分位数sourceMap.entrySet().stream().forEach((Map.Entry<String,Object> entry) -> {count.getAndIncrement();String traceId = entry.getKey();System.out.println("traceId::" + traceId);Integer duration = (int) Double.parseDouble(entry.getValue().toString());durationSum.addAndGet(duration);if (duration > maxDuration.get()) {maxDuration.getAndSet(duration);}durationArray.add(duration);});durationAvg=(durationSum.get())/(count.get());p50=percentile(durationArray.toArray(new Integer[durationArray.size()]),0.5);p75=percentile(durationArray.toArray(new Integer[durationArray.size()]),0.75);p90=percentile(durationArray.toArray(new Integer[durationArray.size()]),0.90);p95=percentile(durationArray.toArray(new Integer[durationArray.size()]),0.95);p99=percentile(durationArray.toArray(new Integer[durationArray.size()]),0.99);Map<String,Object> resultMap = new HashMap<>();resultMap.put("max_resp",maxDuration.get());resultMap.put("mean_resp",durationAvg);resultMap.put("count",count.get());resultMap.put("p50",p50);resultMap.put("p75",p75);resultMap.put("p90",p90);resultMap.put("p95",p95);resultMap.put("p99",p99);return resultMap;}@Overridepublic void run() {if(targetdb.isExisted(traces_index)){logger.info("TracesStatInfo begin...");String endTime =getTimeUtcEndPoint(1,30);String startTime=getTimeUtcEndPoint(2,31);logger.info("startTime:: {}  endTime:: {}",startTime,endTime);try{// 在es trace表中,按bussinesTagList 查找local_time_stamp在当前时间范围内的记录logger.info("statQuery queryDate begins ... {}",System.currentTimeMillis());Map<String, Object> dataMap=targetdb.queryData(traces_index,businessTags,startTime,endTime,"duration");Map<String, Object> resMap = new HashMap<>();if(null!=dataMap) {//Map<String, Object> resMap = new HashMap<>();logger.info("TracesStatInfo resultMap:: {} ", dataMap.toString());resMap = traceProcess(dataMap);// targetdb.createForm(stat_index);//targetdb.insertDate(stat_index, seqNo, resMap);}else{//找不到置0logger.info("StatInfo resultMap is null ");resMap.put("max_resp", 0);resMap.put("mean_resp", 0);resMap.put("count", 0);resMap.put("p50", 0);resMap.put("p75", 0);resMap.put("p90", 0);resMap.put("p95", 0);resMap.put("p99", 0);}//打业务标签和时间戳resMap = getMapWithTags(businessTags, resMap);String seqNo = generateSeqNo(); //生成序号// 加锁lock.lock();targetdb.createForm(stat_index);targetdb.insertDate(stat_index, seqNo, resMap)}catch(Exception e){e.printStackTrace();return;}finally {// 释放锁lock.unlock();}}else{logger.info("trace_index {} is not existed",traces_index);}}@Overridepublic void init(JsonNode paraData, DatasourceService datasourceService, TargetdbService targetdbService) {.....}
}

task.SpanInfo

public class SpanInfo extends AbstractTraceQuery implements TaskService,Runnable{private final static Logger logger = LoggerFactory.getLogger(SpanQueryInfo.class);private static final Lock lock = new ReentrantLock();  //对不同任务的竞争性资源加锁String traces_index, span_index;DatasourceService datasource;TargetdbService targetdb;double percentile;private Map<String,Object> findTraces(Map<String,Object> sourceMap,double percentile){ArrayList<Integer> durationArray = new ArrayList<>();;  //延时集合,用于计算分位数Map<String,Object> resultMap = new HashMap<>(); //结果集合//计算percentile分位sourceMap.entrySet().stream().forEach((Map.Entry<String,Object> entry) ->{Integer duration = (int) Double.parseDouble(entry.getValue().toString());durationArray.add(duration);});double percentileData = percentile(durationArray.toArray(new Integer[0]), percentile);logger.info("percentileData:: {}",percentileData);//查找超过percentile的traceIdsourceMap.entrySet().stream().forEach((Map.Entry<String,Object> entry) ->{double duration = (double) Double.parseDouble(entry.getValue().toString());if(duration>=percentileData){String traceId=entry.getKey().toString();resultMap.put(traceId,duration);}});return resultMap;}@Overridepublic void run() {logger.info("SpanInfo begin...");//建表targetdb.createForm(span_index);try{logger.info("SpanInfo try begin...");//找到当前trace_index索引中所有高出95%的值的traceId集合Map<String, Object> dataMap=targetdb.queryAllData(traces_index,"duration");if(null!=dataMap) {logger.info("SpanInfo resultMap:: {} ", dataMap.toString());//查找高于percentile分位数的值Map<String, Object> resMap = findTraces(dataMap, percentile);logger.info("spanInfo foundedMap:: {} ", resMap.toString());//遍历查询结果,如果span_index中不存在,则查询span后插入span_indexresMap.entrySet().stream().forEach((Map.Entry<String, Object> entry) -> {String traceId = entry.getKey();if (targetdb.isNotInTheIndex(span_index, "traceId", traceId)) {//按traceId查询spanArrayNode spanList = datasource.getTraceSpans(traceId);Map<String, List<Map<String, Object>>> spansMap = genSpanMap(traceId, spanList); //组成SpanList//插入span_indextargetdb.updateDate(span_index, spansMap);}});}else{logger.info("SpanInfo resultMap is null ");}}catch(Exception e){e.printStackTrace();return;}}@Overridepublic void init(JsonNode paraData, DatasourceService datasourceService, TargetdbService targetdbService) {....}
}

完成索引持久化后,就可以以grafana访问ES库形成展示,这部分不展开,看一下效果
在这里插入图片描述
在这里插入图片描述

姑且算抛砖引玉吧,希望各位大佬也分享一下方案

相关文章:

  • C及C++每日练习(2)
  • VSCode设置
  • [C语言]——C语言常见概念(1)
  • 【Appium问题】每次启动appium都会安装一次uiautomator
  • window环境下使用k8s部署.net core项目
  • [CSAWQual 2019]Web_Unagi ---不会编程的崽
  • 面试经典150题(101-104)
  • 电脑解锁后黑屏有鼠标--亲测!!不需要重装系统!!
  • gan, pixel2pixel, cyclegan, srgan图像超分辨率
  • git 如何将多个提交点合并为一个提交点 commit
  • chromedriverUnable to obtain driver for chrome using ,selenium找不到chromedriver
  • 【AI视野·今日NLP 自然语言处理论文速览 第八十二期】Tue, 5 Mar 2024
  • Java零基础 - try-catch-finally和throw语句
  • Mysql学习之MVCC解决读写问题
  • 新学习到的知识点【Python/C++】,持续更新中
  • 【跃迁之路】【444天】程序员高效学习方法论探索系列(实验阶段201-2018.04.25)...
  • Codepen 每日精选(2018-3-25)
  • css系列之关于字体的事
  • js算法-归并排序(merge_sort)
  • oldjun 检测网站的经验
  • Vue2.x学习三:事件处理生命周期钩子
  • webpack+react项目初体验——记录我的webpack环境配置
  • 翻译--Thinking in React
  • 聚类分析——Kmeans
  • 微信支付JSAPI,实测!终极方案
  • 我建了一个叫Hello World的项目
  • 智能网联汽车信息安全
  • 《码出高效》学习笔记与书中错误记录
  • 如何通过报表单元格右键控制报表跳转到不同链接地址 ...
  • ​一文看懂数据清洗:缺失值、异常值和重复值的处理
  • # 深度解析 Socket 与 WebSocket:原理、区别与应用
  • # 执行时间 统计mysql_一文说尽 MySQL 优化原理
  • #etcd#安装时出错
  • $redis-setphp_redis Set命令,php操作Redis Set函数介绍
  • (4)Elastix图像配准:3D图像
  • (52)只出现一次的数字III
  • (day6) 319. 灯泡开关
  • (react踩过的坑)antd 如何同时获取一个select 的value和 label值
  • (多级缓存)缓存同步
  • (附源码)springboot学生选课系统 毕业设计 612555
  • (附源码)springboot掌上博客系统 毕业设计063131
  • (附源码)ssm高校升本考试管理系统 毕业设计 201631
  • (黑马C++)L06 重载与继承
  • (每日持续更新)jdk api之StringBufferInputStream基础、应用、实战
  • (三)docker:Dockerfile构建容器运行jar包
  • (转载)PyTorch代码规范最佳实践和样式指南
  • (轉)JSON.stringify 语法实例讲解
  • .NET CLR基本术语
  • .NET WebClient 类下载部分文件会错误?可能是解压缩的锅
  • .net(C#)中String.Format如何使用
  • .NET6实现破解Modbus poll点表配置文件
  • @converter 只能用mysql吗_python-MySQLConverter对象没有mysql-connector属性’...
  • @hook扩展分析
  • @Import注解详解
  • @Validated和@Valid校验参数区别