当前位置: 首页 > news >正文

[Spark][Python]DataFrame中取出有限个记录的例子

[Spark][Python]DataFrame中取出有限个记录的例子:

sqlContext = HiveContext(sc)

peopleDF = sqlContext.read.json("people.json")

peopleDF.limit(3).show()

===

[training@localhost ~]$ hdfs dfs -cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
[training@localhost ~]$


In [1]: sqlContext = HiveContext(sc)

In [2]: peopleDF = sqlContext.read.json("people.json")
17/10/05 05:03:11 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/05 05:03:11 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/05 05:03:11 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/05 05:03:14 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/05 05:03:14 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/05 05:03:15 INFO hive.metastore: Connected to metastore.
17/10/05 05:03:16 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-99a33db4-b69a-46a9-8032-f87d63299040/scratch/training
17/10/05 05:03:16 INFO session.SessionState: Created local directory: /tmp/4e1c5259-7ae8-482c-ae77-94d3a0c51f91_resources
17/10/05 05:03:16 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-99a33db4-b69a-46a9-8032-f87d63299040/scratch/training/4e1c5259-7ae8-482c-ae77-94d3a0c51f91
17/10/05 05:03:16 INFO session.SessionState: Created local directory: /tmp/training/4e1c5259-7ae8-482c-ae77-94d3a0c51f91
17/10/05 05:03:16 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-99a33db4-b69a-46a9-8032-f87d63299040/scratch/training/4e1c5259-7ae8-482c-ae77-94d3a0c51f91/_tmp_space.db
17/10/05 05:03:16 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/05 05:03:16 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/05 05:03:19 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/05 05:03:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/05 05:03:20 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:55073 (size: 21.6 KB, free: 208.8 MB)
17/10/05 05:03:20 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/05 05:03:20 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/05 05:03:21 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/05 05:03:21 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/05 05:03:21 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/05 05:03:21 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:55073 (size: 2.4 KB, free: 208.8 MB)
17/10/05 05:03:21 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/05 05:03:21 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/05 05:03:21 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/05 05:03:21 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/05 05:03:21 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/05 05:03:22 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/05 05:03:22 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.931 s
17/10/05 05:03:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 850 ms on localhost (1/1)
17/10/05 05:03:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/10/05 05:03:22 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.388410 s
17/10/05 05:03:23 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/05 05:03:23 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/05 05:03:23 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/05 05:03:23 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/05 05:03:23 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/05 05:03:23 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:55073 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/05 05:03:25 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/05 05:03:25 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/05 05:03:25 INFO hive.metastore: Connected to metastore.
17/10/05 05:03:25 INFO session.SessionState: Created local directory: /tmp/684b38e5-72f0-4712-81d4-4c439e093f5c_resources
17/10/05 05:03:25 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/684b38e5-72f0-4712-81d4-4c439e093f5c
17/10/05 05:03:25 INFO session.SessionState: Created local directory: /tmp/training/684b38e5-72f0-4712-81d4-4c439e093f5c
17/10/05 05:03:25 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/684b38e5-72f0-4712-81d4-4c439e093f5c/_tmp_space.db
17/10/05 05:03:25 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.

In [3]: peopleDF.limit(3).show()
17/10/05 05:04:09 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.5 KB, free 338.2 KB)
17/10/05 05:04:10 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.4 KB, free 359.6 KB)
17/10/05 05:04:10 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:55073 (size: 21.4 KB, free: 208.8 MB)
17/10/05 05:04:10 INFO spark.SparkContext: Created broadcast 2 from showString at NativeMethodAccessorImpl.java:-2
17/10/05 05:04:10 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 251.1 KB, free 610.7 KB)
17/10/05 05:04:11 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.6 KB, free 632.4 KB)
17/10/05 05:04:11 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:55073 (size: 21.6 KB, free: 208.7 MB)
17/10/05 05:04:11 INFO spark.SparkContext: Created broadcast 3 from showString at NativeMethodAccessorImpl.java:-2
17/10/05 05:04:12 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/05 05:04:12 INFO spark.SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:-2
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Got job 1 (showString at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (showString at NativeMethodAccessorImpl.java:-2)
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at showString at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/05 05:04:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 5.9 KB, free 638.2 KB)
17/10/05 05:04:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 3.3 KB, free 641.5 KB)
17/10/05 05:04:12 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:55073 (size: 3.3 KB, free: 208.7 MB)
17/10/05 05:04:12 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at showString at NativeMethodAccessorImpl.java:-2)
17/10/05 05:04:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/10/05 05:04:12 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/05 05:04:12 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
17/10/05 05:04:12 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/05 05:04:14 INFO codegen.GenerateUnsafeProjection: Code generated in 1563.240244 ms
17/10/05 05:04:14 INFO codegen.GenerateSafeProjection: Code generated in 182.529448 ms
17/10/05 05:04:15 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 2328 bytes result sent to driver
17/10/05 05:04:15 INFO scheduler.DAGScheduler: ResultStage 1 (showString at NativeMethodAccessorImpl.java:-2) finished in 2.549 s
17/10/05 05:04:15 INFO scheduler.DAGScheduler: Job 1 finished: showString at NativeMethodAccessorImpl.java:-2, took 2.852393 s
17/10/05 05:04:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2547 ms on localhost (1/1)
17/10/05 05:04:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
+----+-------+-----+-----+
| age| name|pcode| pcoe|
+----+-------+-----+-----+
|null| Alice|94304| null|
| 30|Brayden|94304| null|
| 19| Carla| null|10036|
+----+-------+-----+-----+


In [4]:







本文转自健哥的数据花园博客园博客,原文链接:http://www.cnblogs.com/gaojian/p/7629878.html,如需转载请自行联系原作者

相关文章:

  • ORM规范API通用格式及禁止联表查询方案实现ORM
  • swift基础学习(九)
  • MySQL Workbench关键字转成小写设置
  • iOS-关于autoresizingMask在7.x及以下版本的一个bug
  • XV Open Cup named after E.V. Pankratiev. GP of Three Capitals
  • View 和Activity生命周期
  • Swift 2 0 如何替代 pch
  • 使用阿里云Maven镜像的正确姿势
  • 高德地图系列web篇——目的地公交导航
  • iOS 错误提示 [NSTaggedPointerString countByEnumeratingWithState objects
  • Android Fragment 从源码的角度去解析(上)
  • 数据结构中的各种排序方法-JS实现
  • Asp.net缓存简介
  • Android鬼点子 使用Kotlin编写的颜色选择器
  • 合唱队形
  • 【译】JS基础算法脚本:字符串结尾
  • @angular/forms 源码解析之双向绑定
  • [deviceone开发]-do_Webview的基本示例
  • 「面试题」如何实现一个圣杯布局?
  • Apache Zeppelin在Apache Trafodion上的可视化
  • Docker容器管理
  • IE报vuex requires a Promise polyfill in this browser问题解决
  • js ES6 求数组的交集,并集,还有差集
  • Js基础——数据类型之Null和Undefined
  • js正则,这点儿就够用了
  • leetcode378. Kth Smallest Element in a Sorted Matrix
  • MYSQL如何对数据进行自动化升级--以如果某数据表存在并且某字段不存在时则执行更新操作为例...
  • node入门
  • PAT A1120
  • STAR法则
  • 成为一名优秀的Developer的书单
  • 初探 Vue 生命周期和钩子函数
  • 力扣(LeetCode)22
  • 如何学习JavaEE,项目又该如何做?
  • 推荐一个React的管理后台框架
  • 学习使用ExpressJS 4.0中的新Router
  •  一套莫尔斯电报听写、翻译系统
  • 用 vue 组件自定义 v-model, 实现一个 Tab 组件。
  • Oracle Portal 11g Diagnostics using Remote Diagnostic Agent (RDA) [ID 1059805.
  • #Linux(make工具和makefile文件以及makefile语法)
  • #QT(一种朴素的计算器实现方法)
  • #我与Java虚拟机的故事#连载13:有这本书就够了
  • (1)(1.11) SiK Radio v2(一)
  • (react踩过的坑)Antd Select(设置了labelInValue)在FormItem中initialValue的问题
  • (ZT)一个美国文科博士的YardLife
  • (二十五)admin-boot项目之集成消息队列Rabbitmq
  • (全注解开发)学习Spring-MVC的第三天
  • (十五)使用Nexus创建Maven私服
  • (十一)JAVA springboot ssm b2b2c多用户商城系统源码:服务网关Zuul高级篇
  • (一)基于IDEA的JAVA基础10
  • (原創) 如何將struct塞進vector? (C/C++) (STL)
  • (转)PlayerPrefs在Windows下存到哪里去了?
  • .NET Core跨平台微服务学习资源
  • .net framework4与其client profile版本的区别
  • .NET WebClient 类下载部分文件会错误?可能是解压缩的锅