2016-03-31

Failed to retrieve leader gateway and port

I am trying to get JobManager HA working in the context of a per-job YARN session with the 1.0.0-rc3 from a few days ago, and I am running into a problem with task managers that have multiple network interfaces.

After manually killing the job manager process, the jobmanager.log on the newly allocated second job manager reads:

2016-03-02 18:01:09,635 WARN Remoting             
    - Tried to associate with unreachable remote address [akka.tcp://flink@10.127.68.136:34811]. 
Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. 
Reason: Connection refused: /10.127.68.136:34811 
2016-03-02 18:01:09,644 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever  
    - Failed to retrieve leader gateway and port. 
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@10.127.68.136:34811/), 
Path(/user/jobmanager)] 
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) 
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) 
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) 
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) 
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) 
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) 
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) 
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) 
    at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508) 
    at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541) 
    at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531) 
    at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87) 
    at akka.remote.EndpointWriter.postStop(Endpoint.scala:561) 
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:475) 
    at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415) 
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210) 
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172) 
    at akka.actor.ActorCell.terminate(ActorCell.scala:369) 
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462) 
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) 
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279) 
    at akka.dispatch.Mailbox.run(Mailbox.scala:220) 
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231) 
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 

where the address that cannot be found is that of the old job manager. Is this the expected behavior so far?

The problem then shows up on a new task manager, which also tries, unsuccessfully, to connect to the old job manager. The ZooKeeperLeaderRetrievalService starts cycling through the available network interfaces, as seen in the corresponding taskmanager.log:

2016-03-02 18:01:13,636 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService 
- Starting ZooKeeperLeaderRetrievalService. 
2016-03-02 18:01:13,646 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils   
    - Trying to select the network interface and address to use by connecting to the leading 
JobManager. 
2016-03-02 18:01:13,646 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils   
    - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics 
2016-03-02 18:01:13,712 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Retrieved new target address /10.127.68.136:34811. 
2016-03-02 18:01:14,079 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Trying to connect to address /10.127.68.136:34811 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address 'task.manager.eth0.hostname.com/10.127.68.136': Connection 
refused 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.127.68.136': Connection refused 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.120.193.110': Connection refused 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.127.68.136': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/127.0.0.1': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.120.193.110': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.127.68.136': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/127.0.0.1': Connection refused 
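
To reproduce this kind of probing outside of Flink, the following is a minimal sketch that binds a socket to each local address in turn and tries to reach a target address. It is not Flink's ConnectionUtils code; the class name `InterfaceProbe`, its methods, and the timeout are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Flink.
public class InterfaceProbe {

    /** Returns true if a TCP connection from localAddr to target succeeds within timeoutMs. */
    static boolean canConnectFrom(InetAddress localAddr, InetSocketAddress target, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Bind to the given local address and an ephemeral port before connecting.
            socket.bind(new InetSocketAddress(localAddr, 0));
            socket.connect(target, timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused", timeouts, and addresses that cannot be bound.
            return false;
        }
    }

    /** Lists, in enumeration order, the local addresses from which target is reachable. */
    static List<InetAddress> reachableFrom(InetSocketAddress target, int timeoutMs)
            throws SocketException {
        List<InetAddress> result = new ArrayList<>();
        Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
        if (interfaces == null) {
            return result;
        }
        for (NetworkInterface nif : Collections.list(interfaces)) {
            for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                if (canConnectFrom(addr, target, timeoutMs)) {
                    result.add(addr);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java InterfaceProbe <host> <port>");
            return;
        }
        InetSocketAddress target = new InetSocketAddress(args[0], Integer.parseInt(args[1]));
        System.out.println(reachableFrom(target, 1000));
    }
}
```

Running it against the stale address from the log (10.127.68.136, port 34811) would be expected to return an empty list, since every attempt is refused regardless of the local interface. That matches the point of the log above: a dead target tells you nothing about which interface is the right one.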

After five repetitions, the task manager gives up trying to retrieve the leader and, using the HEURISTIC strategy, ends up with eth1 (10.120.193.110) from then on:

2016-03-02 18:01:23,650 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService 
- Stopping ZooKeeperLeaderRetrievalService. 
2016-03-02 18:01:23,655 INFO org.apache.zookeeper.ClientCnxn        
    - EventThread shut down 
2016-03-02 18:01:23,655 INFO org.apache.zookeeper.ZooKeeper        
    - Session: 0x25229757cff035b closed 
2016-03-02 18:01:23,664 INFO org.apache.flink.runtime.taskmanager.TaskManager   
    - TaskManager will use hostname/address 'task.manager.eth1.hostname.com' (10.120.193.110) 
for communication. 

Following this, the new job manager is discovered and the task manager is able to register at the job manager using eth1. The problem is that connections to eth1 are not possible, so Flink should always use eth0. The exception we see later is:

java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'other.task.manager.eth1.hostname/10.120.193.111:46620' 
has failed. This might indicate that the remote task manager has been lost. 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196) 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131) 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83) 
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60) 
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115) 
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:388) 
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:411) 
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:108) 
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:175) 
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:65) 
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:224) 
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) 
    at java.lang.Thread.run(Thread.java:744) 

The root cause seems to be that the network interface selection is still using the old job manager location and is therefore unable to choose the right interface. In particular, it appears that the iteration order over the network interfaces differs between the HEURISTIC and SLOW strategies, which then leads to the wrong interface being selected.
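
Why the iteration order matters can be shown with a small sketch. It assumes a simplified heuristic that takes the first candidate address that is neither loopback nor link-local; this is an illustration only and not Flink's actual HEURISTIC strategy. The class `HeuristicOrder` and the method `pickFirstUsable` are hypothetical names, and the addresses are the ones from the logs above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Flink.
public class HeuristicOrder {

    /** Assumed heuristic: the first address that is neither loopback nor link-local wins. */
    static InetAddress pickFirstUsable(List<InetAddress> candidates) {
        for (InetAddress addr : candidates) {
            if (!addr.isLoopbackAddress() && !addr.isLinkLocalAddress()) {
                return addr;
            }
        }
        return null; // no usable address among the candidates
    }

    public static void main(String[] args) throws UnknownHostException {
        // IP literals are parsed directly, so no DNS lookup takes place here.
        InetAddress eth0 = InetAddress.getByName("10.127.68.136");
        InetAddress eth1 = InetAddress.getByName("10.120.193.110");
        InetAddress lo = InetAddress.getByName("127.0.0.1");

        // Same set of addresses, two different iteration orders.
        System.out.println(pickFirstUsable(Arrays.asList(eth0, eth1, lo)).getHostAddress());
        System.out.println(pickFirstUsable(Arrays.asList(eth1, eth0, lo)).getHostAddress());
    }
}
```

With identical candidates, the first call selects 10.127.68.136 and the second 10.120.193.110. Any strategy that cannot verify a candidate against a live job manager ends up depending on the enumeration order in this way.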

Can you state more clearly what your problem is?

I have edited the question.

Answer

Your problem should be fixed in the upcoming bugfix release 1.0.1. Alternatively, you can use the current 1.1-SNAPSHOT version of Flink.