How to print a snippet of an RDD in the spark-shell/pyspark?

When working in the spark-shell, I often want to inspect RDDs (similar to using head in Unix). How can I print a snippet of an RDD in the spark-shell or pyspark?

For example:

scala> val readmeFile = sc.textFile("input/tmp/README.md") 
scala> // how to inspect the readmeFile?

and ...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark")) 
scala> // how to inspect linesContainingSpark?

Source

2015-06-29 Chris Snow

I found out how to do this (here) and thought it would be useful to other users, so I am sharing it here. take(x) returns the first x elements of the RDD, and foreach prints each of them:

scala> val readmeFile = sc.textFile("input/tmp/README.md") 
scala> readmeFile.take(5).foreach(println) 
# Apache Spark 

Spark is a fast and general cluster computing system for Big Data. It provides 
high-level APIs in Scala, Java, and Python, and an optimized engine that 
supports general computation graphs for data analysis. It also supports a

and ...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark")) 
scala> linesContainingSpark.take(5).foreach(println) 
# Apache Spark 
Spark is a fast and general cluster computing system for Big Data. It provides 
rich set of higher-level tools including Spark SQL for SQL and structured 
and Spark Streaming. 
You can find the latest Spark documentation, including a programming

The examples below are equivalent, but use pyspark:

>>> readmeFile = sc.textFile("input/tmp/README.md") 
>>> for line in readmeFile.take(5): print(line) 
... 
# Apache Spark 

Spark is a fast and general cluster computing system for Big Data. It provides 
high-level APIs in Scala, Java, and Python, and an optimized engine that 
supports general computation graphs for data analysis. It also supports a

and

>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line) 
>>> for line in linesContainingSpark.take(5): print(line) 
... 
# Apache Spark 
Spark is a fast and general cluster computing system for Big Data. It provides 
rich set of higher-level tools including Spark SQL for SQL and structured 
and Spark Streaming. 
You can find the latest Spark documentation, including a programming
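
If you do this often, the take-then-print pattern above can be wrapped in a small helper. This is only a sketch: the name preview is hypothetical (it is not part of the Spark API), and it assumes nothing about its argument beyond a take(n) method, which RDDs provide.

```python
def preview(rdd, n=5):
    """Print the first n elements of an RDD, one per line (similar to Unix head).

    `preview` is a hypothetical helper name, not a Spark function.
    """
    # take(n) is an action: it returns at most n elements to the driver as a list.
    rows = rdd.take(n)
    for row in rows:
        print(row)
    # Return the list as well, so the result can be inspected further in the shell.
    return rows

# In the pyspark shell, where sc already exists, usage would look like:
#   preview(sc.textFile("input/tmp/README.md"))
#   preview(sc.textFile("input/tmp/README.md"), 10)
```

Because only n elements are brought back, this should stay cheap even on a large RDD, unlike collect(), which returns everything.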

Source

2015-06-29 12:35:49

You may have already realized this: 'take(5)' really does behave like 'head' in Unix, and the 'filter' you used in your question is like 'grep'. However, 'filter' on its own gave you no output because you never collected the results; the easiest fix is to add 'take' after 'filter'. – lrnzcig
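
The grep/head analogy in this comment can be sketched as a single function, roughly what `grep Spark README.md | head -n 5` does in a Unix shell. The name grep_head is hypothetical, and the sketch assumes an RDD of strings.

```python
def grep_head(rdd, word, n=5):
    """Return up to n lines of an RDD of strings that contain `word`.

    `grep_head` is a hypothetical helper name, not a Spark function.
    """
    # filter() is a transformation: it is evaluated lazily and prints nothing by itself.
    matching = rdd.filter(lambda line: word in line)
    # take() is the action that triggers the computation and returns a list.
    return matching.take(n)

# In the pyspark shell, where sc already exists, usage would look like:
#   for line in grep_head(sc.textFile("input/tmp/README.md"), "Spark"):
#       print(line)
```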
