
Hadoop Streaming API and MapReduce [Python]

news · Source: original · 2024/9/20 20:27:20

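Contents

1. Create mock text data
2. Use MapReduce to count the label distribution and extract specified labels
3. Run the map function and sort the results to simulate the reduce task
4. Run on a development machine without network access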

1. Create mock text data

1.1 Machine-generated mock data

import hashlib
from collections import namedtuple

from faker import Faker

# Initialize Faker
fake = Faker()

# Define a namedtuple type with id, subject, and text fields
GaokaoQuestion = namedtuple('GaokaoQuestion', 'id subject text')


# Generate mock data
def generate_faker_data(num_samples):
    data = []
    for _ in range(num_samples):
        # Use faker to generate the fields
        subject = fake.word()
        text = fake.sentence()
        # Use the MD5 hash of "text subject" as the id
        id_value = f"{text} {subject}"
        id_hash = hashlib.md5(id_value.encode('utf-8')).hexdigest()
        # Create the namedtuple instance
        question = GaokaoQuestion(id=id_hash, subject=subject, text=text)
        data.append(question)
    return data


# Generate 3 mock records
samples = generate_faker_data(3)

# Print the generated records
for sample in samples:
    print(sample)
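
The generator above only prints the namedtuples, while the later steps read JSON Lines from stdin. A minimal sketch of bridging the two, assuming the same `GaokaoQuestion` shape; the helper names `to_jsonl` and `write_jsonl` are hypothetical and not part of the original script:

```python
import json
from collections import namedtuple

# Same record shape as the generator above
GaokaoQuestion = namedtuple('GaokaoQuestion', 'id subject text')


def to_jsonl(records):
    """Serialize namedtuple records to JSON Lines, one object per line."""
    return "\n".join(
        json.dumps(r._asdict(), ensure_ascii=False) for r in records
    )


def write_jsonl(records, path):
    """Write the records to `path`, a file name chosen by the caller."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(records) + "\n")


if __name__ == "__main__":
    demo = [GaokaoQuestion(id="1", subject="Math", text="Math question")]
    print(to_jsonl(demo))
```

`ensure_ascii=False` keeps any Chinese subject labels readable in the output file instead of escaping them.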

1.2 Manual creation

cat > test_data.jsonl << EOF
{"id":"1", "subject":"Math", "text":"Math question"}
{"id":"2", "subject":"Science", "text":"Science question"}
{"id":"3", "subject":"Math", "text":"Another Math question"}
EOF

2. Use MapReduce to count the label distribution and extract specified labels

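#!/usr/bin/env python3
import sys
import json
from collections import defaultdict

# Subject labels to extract; adjust to the labels used in your data
TARGET_SUBJECTS = ["Math", "Physics"]


def mapper():
    # Emit "subject<TAB>1" per record, the format the reducer expects
    for line in sys.stdin:
        if not line.strip():
            continue
        data = json.loads(line)
        print(f"{data['subject']}\t1")


def extractor():
    # Print only the records whose subject is in TARGET_SUBJECTS
    for line in sys.stdin:
        if not line.strip():
            continue
        data = json.loads(line)
        if data['subject'] in TARGET_SUBJECTS:
            print(json.dumps(data, ensure_ascii=False))


def reducer():
    # Sum the counts for each subject
    counts = defaultdict(int)
    for line in sys.stdin:
        subject, count = line.strip().split('\t')
        counts[subject] += int(count)
    for subject, count in counts.items():
        print(f"{subject}\t{count}")


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else 'map'
    if mode == 'reduce':
        reducer()
    elif mode == 'extract':
        extractor()
    else:
        mapper()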
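
The reducer above keeps every subject in a dictionary until the input ends. Because Hadoop Streaming delivers reducer input sorted by key, a reducer can also sum one key at a time. A minimal sketch of that alternative, assuming `key<TAB>count` input lines; `parse` and `streaming_reduce` are illustrative names, not part of the original script:

```python
import sys
from itertools import groupby


def parse(lines):
    """Yield (key, count) pairs from 'key<TAB>count' lines, skipping blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, count = line.split("\t", 1)
        yield key, int(count)


def streaming_reduce(lines):
    """Sum counts per key, assuming equal keys arrive on adjacent lines."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


if __name__ == "__main__":
    for key, total in streaming_reduce(sys.stdin):
        print(f"{key}\t{total}")
```

This variant holds only the current key in memory, but it gives wrong totals if the input is not sorted, so the `sort -k1,1` step is required when running it locally.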

3. Run the map function and sort the results to simulate the reduce task:

cat test_data.jsonl | python3 mapper_reducer.py | sort -k1,1 | python3 mapper_reducer.py reduce
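
The same map, sort, reduce sequence can be checked in a single Python process before touching the shell. This is a rough sketch, assuming the mapper emits one `subject<TAB>1` line per record (the input format the reducer parses); `SAMPLE` repeats the three records from test_data.jsonl, and the function names are illustrative:

```python
import json
from collections import defaultdict

# The three records from test_data.jsonl
SAMPLE = [
    '{"id":"1", "subject":"Math", "text":"Math question"}',
    '{"id":"2", "subject":"Science", "text":"Science question"}',
    '{"id":"3", "subject":"Math", "text":"Another Math question"}',
]


def map_phase(lines):
    """Emit one 'subject<TAB>1' line per non-empty input record."""
    return [f"{json.loads(line)['subject']}\t1" for line in lines if line.strip()]


def reduce_phase(lines):
    """Sum the counts per subject, as the dictionary-based reducer does."""
    counts = defaultdict(int)
    for line in lines:
        subject, count = line.split("\t")
        counts[subject] += int(count)
    return dict(counts)


def simulate(lines):
    """map -> sort -> reduce, mirroring the shell pipeline."""
    return reduce_phase(sorted(map_phase(lines)))


if __name__ == "__main__":
    print(simulate(SAMPLE))  # {'Math': 2, 'Science': 1}
```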

4. Run on a development machine without network access

# Assume input_data.jsonl is the input file path on HDFS
# Assume output is the HDFS path for the results

# Download the input from HDFS
hadoop fs -get /path/to/input_data.jsonl input_data.jsonl

# Run the map task
python \Auser\tmp\mapper_reducer_script\mapper_reducer.py < input_data.jsonl | sort -k1,1 > mapped_output.txt

# Run the reduce task
python \Auser\tmp\mapper_reducer_script\mapper_reducer.py reduce < mapped_output.txt > reduced_output.txt

# Upload the result to HDFS
hadoop fs -put reduced_output.txt /path/to/output/
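
For comparison, when the job can be submitted to the cluster instead of being run by hand, the same script can be passed to Hadoop Streaming. The sketch below only assembles the command line. `-files`, `-input`, `-output`, `-mapper`, and `-reducer` are standard Hadoop Streaming options, but the jar location and the paths in the demo are placeholders that depend on your installation, and `build_streaming_command` is a hypothetical helper:

```python
import os
import shlex


def build_streaming_command(streaming_jar, input_path, output_path,
                            script="mapper_reducer.py"):
    """Return the argv list for a Hadoop Streaming job that uses one script
    for both the map and the reduce phase."""
    # Files shipped with -files appear in the task's working directory
    # under their base name, so the commands reference the base name only.
    name = os.path.basename(script)
    return [
        "hadoop", "jar", streaming_jar,
        "-files", script,          # generic option, must precede the others
        "-input", input_path,
        "-output", output_path,    # must not exist before the job runs
        "-mapper", f"python3 {name}",
        "-reducer", f"python3 {name} reduce",
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/path/to/hadoop-streaming.jar",   # placeholder jar location
        "/path/to/input_data.jsonl",
        "/path/to/streaming_output",       # placeholder output directory
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed command can be reviewed and then run with `subprocess.run(cmd, check=True)` on a machine that has the Hadoop client installed.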