Learning Flink window aggregation: AggregateFunction

Before learning this function, you first need to understand accumulators (acc). Here is a standalone example using Flink's accumulator API:


import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.Serializable;

/**
 * @Author: cc
 * @Date: 2022/4/28 16:21
 */
public class AccumulatorTest {
    private static final Logger logger = LoggerFactory.getLogger(AccumulatorTest.class);

    public static void main(String[] args) throws Exception {
        // initialize the Flink streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment
                .getExecutionEnvironment();
        env.setParallelism(4);
        DataStreamSource<Integer> source = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0);
        source.addSink(new Sink1());
        JobExecutionResult jobResult = env.execute("cas-job");
        // accumulator results are only available once the job has finished
        int nums = jobResult.getAccumulatorResult("elementCounter");
        System.out.println("Total counted by the accumulator: " + nums);
    }


    static class Sink1 extends RichSinkFunction<Integer> {
        private IntCounter elementCounter = new IntCounter(); // the accumulator
        Integer count = 0; // plain field, for comparison
        @Override
        public void open(Configuration parameters) throws Exception {
            // register the accumulator with the runtime context
            getRuntimeContext().addAccumulator("elementCounter", elementCounter);
            super.open(parameters);
        }

        @Override
        public void close() throws Exception {
            super.close();
        }

        @Override
        public void invoke(Integer value, Context context) throws Exception {
            this.elementCounter.add(1);
            count += 1; // plain field: only counts the elements seen by this parallel subtask
            Serializable localValue = getRuntimeContext().getAccumulator("elementCounter").getLocalValue();
            System.out.println("count=" + count + ", acc=" + localValue);
        }
    }
}


Why use an accumulator at all? Sometimes we want to know how much data was processed, e.g. the number of records where name=cc.

Why doesn't a plain count++ work? Because the computation is distributed: each parallel subtask increments its own local copy, and the per-subtask values are merged into one total only at the end of the job. In the example above, with parallelism 4, each subtask's plain count only sees its own share of the 20 elements, while getAccumulatorResult("elementCounter") returns the merged total of 20. This is essentially the same idea as Spark's accumulators.

public interface Accumulator<V, R extends Serializable> extends Serializable, Cloneable {
    /** @param value The value to add to the accumulator object */
    void add(V value); // logic executed on each subtask when the accumulator is updated

    /** @return local The local value from the current UDF context */
    R getLocalValue(); // read the local (per-subtask) value

    /** Reset the local value. This only affects the current UDF context. */
    void resetLocal(); // reset the local value to its initial state

    /**
     * Used by system internally to merge the collected parts of an accumulator at the end of the
     * job.
     *
     * @param other Reference to accumulator to merge in.
     */
    void merge(Accumulator<V, R> other); // merges the per-subtask values at the end of the job

    /**
     * Duplicates the accumulator. All subclasses need to properly implement cloning and cannot
     * throw a {@link java.lang.CloneNotSupportedException}
     *
     * @return The duplicated accumulator.
     */
    Accumulator<V, R> clone();
}
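
To make the four methods concrete, here is a minimal sketch of a custom accumulator, essentially a simplified version of the built-in IntCounter used above (the class name SimpleIntCounter is illustrative, not part of Flink):

import org.apache.flink.api.common.accumulators.Accumulator;

// A minimal custom accumulator: sums the Integers passed to add().
public class SimpleIntCounter implements Accumulator<Integer, Integer> {
    private int localValue = 0;

    @Override
    public void add(Integer value) { // called on the subtask that owns this instance
        localValue += value;
    }

    @Override
    public Integer getLocalValue() { // the value seen by the current subtask only
        return localValue;
    }

    @Override
    public void resetLocal() { // reset this subtask's partial value
        localValue = 0;
    }

    @Override
    public void merge(Accumulator<Integer, Integer> other) { // combine partial values at job end
        localValue += other.getLocalValue();
    }

    @Override
    public SimpleIntCounter clone() {
        SimpleIntCounter copy = new SimpleIntCounter();
        copy.localValue = localValue;
        return copy;
    }
}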

Where does this accumulator idea typically show up? In the window aggregation function, AggregateFunction. So what does that interface look like?

Note the type parameters: IN is the input type, ACC is the accumulator type, and OUT is the output type.

@PublicEvolving
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {

    /**
     * Creates a new accumulator, starting a new aggregate.
     *
     * <p>The new accumulator is typically meaningless unless a value is added via {@link
     * #add(Object, Object)}.
     *
     * <p>The accumulator is the state of a running aggregation. When a program has multiple
     * aggregates in progress (such as per key and window), the state (per key and window) is the
     * size of the accumulator.
     *
     * @return A new accumulator, corresponding to an empty aggregate.
     */
    ACC createAccumulator();

    /**
     * Adds the given input value to the given accumulator, returning the new accumulator value.
     *
     * <p>For efficiency, the input accumulator may be modified and returned.
     *
     * @param value The value to add
     * @param accumulator The accumulator to add the value to
     * @return The accumulator with the updated state
     */
    ACC add(IN value, ACC accumulator);

    /**
     * Gets the result of the aggregation from the accumulator.
     *
     * @param accumulator The accumulator of the aggregation
     * @return The final aggregation result.
     */
    OUT getResult(ACC accumulator);

    /**
     * Merges two accumulators, returning an accumulator with the merged state.
     *
     * <p>This function may reuse any of the given accumulators as the target for the merge and
     * return that. The assumption is that the given accumulators will not be used any more after
     * having been passed to this function.
     *
     * @param a An accumulator to merge
     * @param b Another accumulator to merge
     * @return The accumulator with the merged state
     */
    ACC merge(ACC a, ACC b);
}
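
Before the UV/PV case, a classic illustration of the three type parameters is computing an average (a sketch, not from the original post): IN is each input number, ACC carries the running (sum, count), and OUT is the final average.

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// IN = Long (each input number), ACC = Tuple2<sum, count>, OUT = Double (the average).
public class AverageAggregate implements AggregateFunction<Long, Tuple2<Long, Long>, Double> {
    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return Tuple2.of(0L, 0L); // empty aggregate: sum = 0, count = 0
    }

    @Override
    public Tuple2<Long, Long> add(Long value, Tuple2<Long, Long> acc) {
        return Tuple2.of(acc.f0 + value, acc.f1 + 1); // fold one element into the accumulator
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return acc.f1 == 0 ? 0.0 : (double) acc.f0 / acc.f1; // emitted when the window fires
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1); // combine two partial aggregates
    }
}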

Example: computing UV and PV at the same time. UV (unique visitors) is the number of distinct users who visited; PV (page views) is the total number of page clicks.

Sample data:

user1 url1

user2 url2

user2 url1

.......

First, be clear about how to compute UV: it obviously requires deduplication, so we need a HashSet.

How about PV? Every incoming record adds 1, so an int or a long is enough.

Now map this onto the interface's type parameters:

IN is the record type; we'll represent it with a class Event(String user, String url).

OUT is a tuple: we return Tuple2<Integer, Integer>.of(uv, pv).

How do we represent ACC? Each incoming Event must be folded in once, and we need to carry both the UV state and the PV state.

So again a tuple: as noted above, UV uses a set and PV uses an int, hence Tuple2<HashSet, Integer>.

Let's write it.
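
One note first: the Event POJO is imported from the book's com.atguigu.chapter05 package and its source isn't shown in this post. A minimal version consistent with how it is used below would look like this (assumed, for completeness):

// Minimal Event POJO (assumed): public no-arg constructor and public fields,
// so Flink can treat it as a POJO type.
public class Event {
    public String user;    // user name
    public String url;     // visited url
    public Long timestamp; // event time in epoch millis

    public Event() {}

    public Event(String user, String url, Long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + timestamp + "}";
    }
}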

import com.atguigu.chapter05.Event;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.Calendar;
import java.util.HashSet;
import java.util.Random;
public class WindowAggregateTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        }));

        // Give all records the same key so they land in one partition for a combined PV/UV count
        stream.keyBy(data -> true) // to get UV/PV per day and window instead, key by e.g. data -> data.day
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(2)))
                .aggregate(new AvgPv())
                .print("res");
        env.execute();
    }

    public static class AvgPv implements AggregateFunction<Event, Tuple2<HashSet<String>, Integer>, Tuple2<Integer,Integer>> {
        @Override
        public Tuple2<HashSet<String>, Integer> createAccumulator() {
            // create and initialize the accumulator
            // UV and PV normally start from 0 (though a company that wants to
            // inflate its numbers could start higher...)
            return Tuple2.of(new HashSet<String>(), 0);
        }

        @Override
        public Tuple2<HashSet<String>, Integer> add(Event value, Tuple2<HashSet<String>, Integer> accumulator) {
            // called once per record belonging to this window; returns the updated accumulator
            // f0 tracks UV, f1 tracks PV
            accumulator.f0.add(value.user);
            return Tuple2.of(accumulator.f0, accumulator.f1 + 1);
        }

        @Override
        public Tuple2<Integer,Integer> getResult(Tuple2<HashSet<String>, Integer> accumulator) {
            // called once when the window closes: incremental aggregation is done,
            // so emit the result downstream
            return Tuple2.of(accumulator.f0.size(),accumulator.f1);
        }

        @Override
        public Tuple2<HashSet<String>, Integer> merge(Tuple2<HashSet<String>, Integer> a, Tuple2<HashSet<String>, Integer> b) {
            System.out.println("merge被执行了");
            //如果是并行的多个结果会执行。
            HashSet<String> uv1 = a.f0;
            HashSet<String> uv2 = b.f0;
            HashSet<String> uv = new HashSet<>();
            for (String s : uv1) {
                uv.add(s);
            }
            for (String s : uv2) {
                uv.add(s);
            }
            Integer pv1 = a.f1;
            Integer pv2 = b.f1;
            return Tuple2.of(uv,pv1+pv2);
        }
    }

    static class ClickSource implements SourceFunction<Event> {
        // flag that controls data generation; volatile because cancel() is called from another thread
        private volatile boolean running = true;
        @Override
        public void run(SourceContext<Event> ctx) throws Exception {
            Random random = new Random(); // randomly pick from the fixed data sets below
            String[] users = {"Mary", "Alice", "Bob", "Cary"};
            String[] urls = {"./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2"};

            while (running) {
                ctx.collect(new Event(
                        users[random.nextInt(users.length)],
                        urls[random.nextInt(urls.length)],
                        Calendar.getInstance().getTimeInMillis()
                ));
                // timestamps and watermarks could also be emitted manually here:
                // ctx.collectWithTimestamp(...);
                // ctx.emitWatermark(new Watermark(1));
                // emit one click event per second, to make the output easy to follow
                Thread.sleep(1000);
            }
        }
        @Override
        public void cancel() {
            running = false;
        }

    }

}

Note that I added a print inside merge, but why does it never show up in the output?

Because I set parallelism to 1? No, that's not it...

Because keyBy(data -> true) sends everything to one partition, so nothing ever needs merging?

I later tested other setups and merge still never printed. The actual reason: in window aggregation, merge() is only invoked when the windows themselves merge, which only happens with merging window assigners such as session windows; for sliding and tumbling windows it is never called, regardless of parallelism. Here is the Javadoc again, with a sketch after it:

/**
 * Merges two accumulators, returning an accumulator with the merged state.
 *
 * <p>This function may reuse any of the given accumulators as the target for the merge and
 * return that. The assumption is that the given accumulators will not be used any more after
 * having been passed to this function.
 *
 * @param a An accumulator to merge
 * @param b Another accumulator to merge
 * @return The accumulator with the merged state
 */
ACC merge(ACC a, ACC b);
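
A minimal way to actually see merge() fire (a sketch, assuming the stream and AvgPv from the example above, plus an import of org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows): switch to event-time session windows, which are merging windows.

// Session windows merge whenever new events bridge the gap between two sessions;
// merging the window state invokes AggregateFunction.merge() on the accumulators.
stream.keyBy(data -> true)
        .window(EventTimeSessionWindows.withGap(Time.seconds(5)))
        .aggregate(new AvgPv())
        .print("session");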
