
PySpark crashes on Windows 7 when I run the `first` or `take` method. I just run this command:

>>> lines.first() 

    15/11/18 17:33:35 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
    15/11/18 17:33:35 INFO DAGScheduler: Got job 21 (runJob at PythonRDD.scala:393) with 1 output partitions
    15/11/18 17:33:35 INFO DAGScheduler: Final stage: ResultStage 21(runJob at PythonRDD.scala:393)
    15/11/18 17:33:35 INFO DAGScheduler: Parents of final stage: List()
    15/11/18 17:33:35 INFO DAGScheduler: Missing parents: List()
    15/11/18 17:33:35 INFO DAGScheduler: Submitting ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43), which has no missing parents
    15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=619446, maxMem=555755765
    15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 4.7 KB, free 529.4 MB)
    15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(3067) called with curMem=624270, maxMem=555755765
    15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 3.0 KB, free 529.4 MB)
    15/11/18 17:33:35 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on localhost:55487 (size: 3.0 KB, free: 529.9 MB)
    15/11/18 17:33:35 INFO SparkContext: Created broadcast 24 from broadcast at DAGScheduler.scala:861
    15/11/18 17:33:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43)
    15/11/18 17:33:35 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
    15/11/18 17:33:35 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 33, localhost, PROCESS_LOCAL, 2148 bytes)
    15/11/18 17:33:35 INFO Executor: Running task 0.0 in stage 21.0 (TID 33)
    15/11/18 17:33:35 INFO HadoopRDD: Input split: file:/C:/Users/elqstux/Desktop/dtop.txt:0+112852
    15/11/18 17:33:36 INFO PythonRunner: Times: total = 629, boot = 626, init = 3, finish = 0
    15/11/18 17:33:36 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
    java.net.SocketException: Connection reset by peer: socket write error
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
      at java.io.DataOutputStream.flush(DataOutputStream.java:123)
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
    15/11/18 17:33:36 ERROR PythonRunner: This may have been caused by a prior exception:
    java.net.SocketException: Connection reset by peer: socket write error
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
      at java.io.DataOutputStream.flush(DataOutputStream.java:123)
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
    15/11/18 17:33:36 ERROR Executor: Exception in task 0.0 in stage 21.0 (TID 33)
    java.net.SocketException: Connection reset by peer: socket write error
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
      at java.io.DataOutputStream.flush(DataOutputStream.java:123)
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
    15/11/18 17:33:36 WARN TaskSetManager: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
      at java.io.DataOutputStream.flush(DataOutputStream.java:123)
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

    15/11/18 17:33:36 ERROR TaskSetManager: Task 0 in stage 21.0 failed 1 times; aborting job
    15/11/18 17:33:36 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool
    15/11/18 17:33:36 INFO TaskSchedulerImpl: Cancelling stage 21
    15/11/18 17:33:36 INFO DAGScheduler: ResultStage 21 (runJob at PythonRDD.scala:393) failed in 0.759 s
    15/11/18 17:33:36 INFO DAGScheduler: Job 21 failed: runJob at PythonRDD.scala:393, took 0.810138 s
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1317, in first
        rs = self.take(1)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
        res = self.context.runJob(self, takeUpToNumLeft, p)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
        port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
        return f(*a, **kw)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
      at java.io.DataOutputStream.flush(DataOutputStream.java:123)
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

    Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
      at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
      at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:483)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
      at py4j.Gateway.invoke(Gateway.java:259)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:207)
      at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketException: Connection reset by peer: socket write error
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
      at java.io.DataOutputStream.flush(DataOutputStream.java:123)
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

When I run `take`:

>>> lines = sc.textFile("C:\Users\elqstux\Desktop\dtop.txt") 

>>> lines.count()  # this works fine 

>>> lines.first()  # this crashes 

Here is the error report. It also crashes, and I cannot find the cause. Who can help me?
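
For reference, the whole session as one block. This is only a sketch of the repro above: the raw-string path prefix and the inline comments are additions here (the original session used a plain string literal), meant to rule out backslash-escape issues in the Windows path.

    # Sketch of the repro in the pyspark shell (spark-1.5.2-bin-hadoop2.6 on Windows 7);
    # `sc` is the SparkContext created automatically by the shell.
    # The raw-string prefix is an assumption added here, only to rule out
    # backslash-escape problems in the Windows path.
    lines = sc.textFile(r"C:\Users\elqstux\Desktop\dtop.txt")

    lines.count()   # completes and returns the line count
    lines.first()   # crashes the Python worker with the SocketException shown above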


Does count return the value you expect? When importing text files as RDDs, I have found the results are not formatted the way I expected; usually they are large strings that I have to format myself. – Jesse


I have the same problem, although the error seems random. Running `take` or `first` multiple times sometimes succeeds, but mostly it fails for me. I am running Python 3.5. – Tim

Answer


I was stuck for hours with the same problem on Windows 7 with Spark 1.5.0 (Python 2.7.11). I switched to Unix using exactly the same build. It is not an elegant solution, but I could not find any other way to resolve it.
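
A minimal sketch of the check after switching, assuming the same Spark build unpacked on the Linux machine and the text file copied there; the Linux path below is hypothetical.

    # Sketch only: the same commands, run in the pyspark shell of the identical
    # Spark build launched on Linux instead of Windows; `sc` is the shell's SparkContext.
    # The path is hypothetical: use wherever dtop.txt was copied to.
    lines = sc.textFile("/home/elqstux/dtop.txt")
    lines.count()
    lines.first()   # returns the first line here instead of crashing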