Skip to content

Methods to Improve handling of GCP backend timeouts #7551

Open
@osha52101

Description

@osha52101

We randomly receive PKIX and RPC deadline errors on a small subset of GCP LS API and Batch runs, which always crash the main Cromwell java process. We have added local certs to our java install launching cromwell - this has reduced, but not eliminated PKIX errors. We have no remedy for these issues at the moment besides submitting again. Is there anything we can do as users to make cromwell more fault-tolerant of cloud-related issues like these?

PKIX error example:

Failed to query job status (projects/XXXXX/jobs/job-7638b0fb-XXXX-XXXX-81ca-9f609da4c664) from GCP
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1277)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:427)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.$anonfun$query$1(GcpBatchApiRequestHandler.scala:15)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.withClient(GcpBatchApiRequestHandler.scala:29)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.query(GcpBatchApiRequestHandler.scala:14)
at cromwell.backend.google.batch.actors.GcpBatchBackendSingletonActor$$anonfun$normalReceive$1.$anonfun$applyOrElse$3(GcpBatchBackendSingletonActor.scala:80)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at io.grpc.Status.asRuntimeException(Status.java:539)
... 14 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.handshakeException(ReferenceCountedOpenSslEngine.java:1907)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.wrap(ReferenceCountedOpenSslEngine.java:834)
at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:564)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1041)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrapNonAppData(SslHandler.java:927)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1409)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrapNonAppData(SslHandler.java:1327)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.access$1800(SslHandler.java:169)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.resumeOnEventExecutor(SslHandler.java:1718)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.access$2000(SslHandler.java:1609)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner$2.run(SslHandler.java:1770)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:388)
at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:271)
at java.base/sun.security.validator.Validator.validate(Validator.java:256)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:284)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:144)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:234)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:779)
at io.grpc.netty.shaded.io.netty.internal.tcnative.CertificateVerifierTask.runTask(CertificateVerifierTask.java:36)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:48)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:42)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.runAndResetNeedTask(ReferenceCountedOpenSslEngine.java:1496)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.access$700(ReferenceCountedOpenSslEngine.java:94)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine$TaskDecorator.run(ReferenceCountedOpenSslEngine.java:1471)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.run(SslHandler.java:1787)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
... 1 common frames omitted
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:383)
... 16 common frames omitted

Deadline RPC error example:

Failed to submit job (job-2fd08385-36fe-XXXX-XXXX-cf9e03ae29fe) to GCP, workflowId = 6fd00a65-XXXX-XXXX-84ac-2c0030ba224c
com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.

WorkflowManagerActor: Workflow 415b5011-XXXX-XXXX-953b-68923c164eb0 failed (during ExecutingWorkflowState): cromwell.core.CromwellFatalException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions