
Learning PyFlink from Scratch: Tumbling Time Windows

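In "Learning PyFlink from Scratch: Tumbling Count Windows", we saw that if the number of elements in a window never reaches the configured window size, the counting function is not triggered. For example, the elements (b,2) and (d,5), marked in red in the original figure, are never counted.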
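To make the problem concrete, here is a minimal pure-Python sketch of count-window behavior. It is not PyFlink code: the function `tumbling_count_windows`, the sample input, and the window size of 2 are assumptions chosen for illustration, so that (b,2) and (d,5) end up in windows that never fill.

```python
from collections import defaultdict


def tumbling_count_windows(elements, size):
    """Simulate per-key tumbling count windows (illustrative, not Flink's code).

    A window fires only when it holds exactly `size` elements for its key.
    """
    buffers = defaultdict(list)  # elements waiting in the open window, per key
    emitted = []                 # (key, count) results from windows that fired
    for key, value in elements:
        buffers[key].append(value)
        if len(buffers[key]) == size:
            emitted.append((key, size))  # window is full: fire it
            buffers[key] = []            # start a new window for this key
    # Whatever is still buffered belongs to windows that never fired.
    leftovers = {k: v for k, v in buffers.items() if v}
    return emitted, leftovers


# Hypothetical input: keys "b" and "d" each receive a single element.
data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5), ("c", 6)]
emitted, leftovers = tumbling_count_windows(data, size=2)
print(emitted)    # windows that fired
print(leftovers)  # elements that were never counted
```

In this sketch, keys "a" and "c" each fill one window, while the single elements for "b" and "d" stay buffered and never reach the counting step.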

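To make sure these elements are counted as well, we can use tumbling time windows. This kind of window does not depend on the number of elements; it is triggered by time. Once the window's time span has elapsed, the computation runs no matter how many elements the window contains.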
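The idea can be sketched in plain Python. This is only an illustration under stated assumptions: the arrival timestamps are invented, the helper names `window_start` and `tumbling_time_windows` are hypothetical, and the alignment rule shown (round the timestamp down to a multiple of the window size) ignores the offset that a real assigner may apply.

```python
from collections import defaultdict


def window_start(timestamp_ms, size_ms):
    """Start of the tumbling window containing a non-negative timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def tumbling_time_windows(timed_elements, size_ms):
    """Group (timestamp_ms, (key, value)) pairs into per-key time windows.

    Every window that received at least one element produces a result,
    regardless of how many elements it holds.
    """
    windows = defaultdict(list)
    for timestamp_ms, (key, value) in timed_elements:
        start = window_start(timestamp_ms, size_ms)
        windows[(key, start, start + size_ms)].append(value)
    return [(key, start, end, len(values))
            for (key, start, end), values in sorted(windows.items())]


# Hypothetical arrival times in milliseconds for the same kind of input.
arrivals = [(0, ("a", 1)), (1, ("b", 2)), (1, ("a", 3)),
            (2, ("c", 4)), (3, ("d", 5)), (5, ("c", 6))]
results = tumbling_time_windows(arrivals, size_ms=2)
for key, start, end, count in results:
    print(key, f"[{start}, {end})", count)
```

Here the lone elements for "b" and "d" each land in a window of their own and are still counted, because the window closes on time rather than on element count.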

We can slightly modify the example from "Learning PyFlink from Scratch: Tumbling Count Windows" so that every element uses the key "A". The modified code is shown below:


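class SumWindowFunction(WindowFunction[tuple, tuple, str, TimeWindow]):
    def apply(self, key: str, window: TimeWindow, inputs: Iterable[tuple]):
        # Materialize once so the elements can be printed and then counted.
        elements = list(inputs)
        print(*elements, window)
        # Despite the class name, this emits the number of elements in the window.
        return [(key, len(elements))]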

word_count_data = [("A",2),("A",1),("A",4),("A",3),("A",6),("A",5),("A",7),("A",8),("A",9),("A",10), ("A",11),("A",12),("A",13),("A",14),("A",15),("A",16),("A",17),("A",18),("A",19),("A",20)]

def word_count():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
    # write all the data to one file
    env.set_parallelism(1)

    source_type_info = Types.TUPLE([Types.STRING(), Types.INT()])
    # define the source
    # mapping
    source = env.from_collection(word_count_data, source_type_info)
    # source.print()

    # keying
    keyed = source.key_by(lambda i: i[0])

    # reducing
    reduced = keyed.window(TumblingProcessingTimeWindows.of(Time.milliseconds(2))) \
                   .apply(SumWindowFunction(),
                          Types.TUPLE([Types.STRING(), Types.INT()]))

    # define the sink
    reduced.print()

    # submit for execution
    env.execute()

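In this example we use a tumbling time window whose size is 2 milliseconds (Time.milliseconds(2)). Because the computation is triggered by time, every element is counted when this code runs, but the output can differ from one run to the next.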

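The results are not stable across runs, but every record is counted, unlike with a tumbling count window, where some records may never trigger the computation.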
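A small simulation can illustrate why the per-window counts vary while the total stays the same. The arrival times below are invented; in a real job they depend on how fast the source happens to emit elements relative to the wall clock. The helper `counts_per_window` is hypothetical and only mimics the grouping of a 2 ms tumbling window.

```python
from collections import Counter


def counts_per_window(arrival_times_ms, size_ms):
    """Return element counts per tumbling window, ordered by window start."""
    counts = Counter(t - (t % size_ms) for t in arrival_times_ms)
    return [counts[start] for start in sorted(counts)]


# Two hypothetical runs, each delivering the same 20 elements for key "A"
# but at different processing times (milliseconds).
run_1 = [0] * 5 + [1] * 3 + [2] * 7 + [3] * 5
run_2 = [1] * 12 + [2] * 8

first = counts_per_window(run_1, size_ms=2)
second = counts_per_window(run_2, size_ms=2)
print(first, sum(first))
print(second, sum(second))
```

The two runs split the elements across windows differently, yet both account for all 20 elements, which mirrors the behavior described above.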

The complete code is as follows:

from typing import Iterable

from pyflink.common import Types, Time
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode, WindowFunction
from pyflink.datastream.window import TimeWindow, TumblingProcessingTimeWindows


class SumWindowFunction(WindowFunction[tuple, tuple, str, TimeWindow]):
    def apply(self, key: str, window: TimeWindow, inputs: Iterable[tuple]):
        # Materialize once so the elements can be printed and then counted.
        elements = list(inputs)
        print(*elements, window)
        # Despite the class name, this emits the number of elements in the window.
        return [(key, len(elements))]


word_count_data = [("A",2),("A",1),("A",4),("A",3),("A",6),("A",5),("A",7),("A",8),("A",9),("A",10), ("A",11),("A",12),("A",13),("A",14),("A",15),("A",16),("A",17),("A",18),("A",19),("A",20)]


def word_count():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
    # write all the data to one file
    env.set_parallelism(1)

    source_type_info = Types.TUPLE([Types.STRING(), Types.INT()])
    # define the source
    # mapping
    source = env.from_collection(word_count_data, source_type_info)
    # source.print()

    # keying
    keyed = source.key_by(lambda i: i[0])

    # reducing
    reduced = keyed.window(TumblingProcessingTimeWindows.of(Time.milliseconds(2))) \
                   .apply(SumWindowFunction(),
                          Types.TUPLE([Types.STRING(), Types.INT()]))

    # define the sink
    reduced.print()

    # submit for execution
    env.execute()


if __name__ == '__main__':
    word_count()

Reference: https://www./link/dc61c1317e2c1637f0f8d2de7fd8da9b