
Troubleshooting errors when calling pyspark from Python

Preface

PyCharm is configured with an SSH interpreter pointing at the Anaconda Python on the server. If you have not set this up yet, see: Single-node big-data learning environment (8): installing Anaconda on a single Linux node and connecting PyCharm

The script run from PyCharm

Run the following Python script, pyspark_model.py, which builds a SparkSession and executes Spark SQL:

"""Script name: pyspark test from PyCharm
Tested feature: executing Spark SQL remotely from PyCharm
"""
from pyspark.sql import SparkSession
import os

os.environ['SPARK_HOME'] = '/opt/spark'
os.environ['JAVA_HOME'] = '/opt/jdk1.8'

spark = SparkSession.builder \
    .appName('pyspark_conda') \
    .master("yarn") \
    .config("spark.sql.warehouse.dir", "hdfs://bigdata01:8020/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://bigdata01:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql('select * from hostnames limit 10;').show()
spark.stop()

Error 1: pyspark version mismatch

For example, my cluster runs Spark 3.0.0 but the Python pyspark package is 3.5.0 — no version was pinned, so pip downloaded the latest.

The error is [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.; full output:

ssh://slash@bigdata01:22/opt/python3/bin/python3 -u /home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py
JAVA_HOME is not set
Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 7, in <module>
    spark = SparkSession.builder \
  File "/opt/python3/lib/python3.8/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 201, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 436, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/java_gateway.py", line 107, in launch_gateway
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

If you insist on keeping this pyspark version, other errors will follow even after setting JAVA_HOME as described under Error 2 — for example the Py4JError below. The thorough fix is to install a pyspark version that matches the cluster's Spark version.

Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 7, in <module>
    spark = SparkSession.builder \
  File "/opt/python3/lib/python3.8/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 203, in __init__
    self._do_init(
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 316, in _do_init
    self._jvm.PythonUtils.getPythonAuthSocketTimeout(self._jsc)
  File "/opt/python3/lib/python3.8/site-packages/py4j/java_gateway.py", line 1549, in __getattr__
    raise Py4JError(
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM
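A cheap way to catch this mismatch before it turns into a cryptic Py4J error is to compare the installed pyspark version against the cluster's Spark version at startup. This is a minimal sketch: `matches_cluster` is a hypothetical helper, and "3.0.0" stands in for whatever your cluster actually runs.

```python
from importlib.metadata import PackageNotFoundError, version

def matches_cluster(installed: str, cluster: str = "3.0.0") -> bool:
    """Return True when the installed pyspark shares major.minor with the cluster Spark."""
    return installed.split(".")[:2] == cluster.split(".")[:2]

try:
    installed = version("pyspark")
    if not matches_cluster(installed):
        # Fail fast with an actionable message instead of a gateway error later.
        print(f"pyspark {installed} does not match the cluster Spark; "
              f"run: pip install pyspark==3.0.0")
except PackageNotFoundError:
    print("pyspark is not installed in this interpreter")
```

Comparing only major.minor is a judgment call: patch releases of pyspark generally speak the same protocol as the matching Spark minor line.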

Error 2: JAVA_HOME not picked up

With pyspark reinstalled at version 3.0.0 (pin the version when installing: pip install pyspark==3.0.0), the error is Java gateway process exited before sending its port number. together with JAVA_HOME is not set; full output:

ssh://slash@bigdata01:22/opt/python3/bin/python3 -u /home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py
JAVA_HOME is not set
Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 7, in <module>
    spark = SparkSession.builder \
  File "/opt/python3/lib/python3.8/site-packages/pyspark/sql/session.py", line 186, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 371, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 128, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 320, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/java_gateway.py", line 105, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

The environment variables set in the script:

# with pyspark 3.5.0, setting SPARK_HOME and JAVA_HOME still fails
# with pyspark 3.0.0, the script runs successfully after setting them
os.environ['SPARK_HOME'] = '/opt/spark'
os.environ['JAVA_HOME'] = '/opt/jdk1.8'

Error 3: Python version problem

I initially installed the latest Anaconda, which ships Python 3.11. There, even pyspark 3.0.0 fails, with TypeError: code() argument 13 must be str, not int; full output:

ssh://slash@bigdata01:22/opt/anaconda3/bin/python3.11 -u /home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py
Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 1, in <module>
    from pyspark.sql import SparkSession
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/context.py", line 30, in <module>
    from pyspark import accumulators
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/serializers.py", line 71, in <module>
    from pyspark import cloudpickle
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/cloudpickle.py", line 209, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/cloudpickle.py", line 172, in _make_cell_set_template_code
    return types.CodeType(
           ^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int

After deleting the /opt/anaconda3 directory and reinstalling Anaconda from Anaconda3-2021.05-Linux-x86_64.sh (which ships Python 3.8), building a SparkSession with the pyspark 3.0.0 package against the Spark 3.0.0 engine and executing Spark SQL succeeded.
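The failure above happens at import time, because the cloudpickle bundled with pyspark 3.0.x calls types.CodeType with a signature that newer interpreters no longer accept. A guard like the hypothetical sketch below can fail fast with a readable message; the 3.6–3.8 range is an assumption based on this article's experience (3.8 works, 3.11 does not), not an official compatibility matrix.

```python
import sys

def python_ok_for_pyspark30(version_info=None) -> bool:
    """Return True when the interpreter is in the range pyspark 3.0.x is known to handle."""
    vi = version_info or sys.version_info
    return (3, 6) <= (vi[0], vi[1]) <= (3, 8)

if not python_ok_for_pyspark30():
    # Warn before `import pyspark` blows up inside cloudpickle.
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is likely "
          "incompatible with pyspark 3.0.x; use Python 3.8 "
          "(e.g. Anaconda3-2021.05)")
```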

