Automatically retry steps if they fail
Currently, when a step fails you have to turn guided failure on in order to retry the step. It would be handy to have a setting that says to always retry failed steps.
For example, if you are doing continuous deployment, where a build server triggers a deployment from a build, you have to:
- log in to Octopus
- select the correct project
- manually intervene and assign it to you
- retry it
It would be much better if the retry could be done automatically.
Vishnu S S commented
On top of facing intermittent network errors from Azure, the inability to auto retry such deployments from Octopus is hurting us real bad.
Sam Adams commented
I work with a lot of flaky, network related tests for APIs, and I'm not alone in this at my company. Having the ability to automatically re-try a failed step would make our experience
Jeff Smith commented
Not having retry step logic really diminishes the automation capabilities Octo provides. As many have stated below, the vast majority of the time a simple retry succeeds. Why not provide the tooling to allow this process to be automated? From the surface this seems like a "big bang for the buck" request.
Vishnu S S commented
Jeff L commented
Steps that interact with a remote cloud resource fail 5% of the time for simple networking issues and then nearly always work on the first retry. In this scenario an automatic step retry X times is far superior to forcing the deploy into manual guided mode.
Paul Spain commented
How was this not a default option anyway? The world of IT can be flaky: connections drop, files are in use, web addresses are slow to respond. The list is endless, and having a retry system is a must.
Pausing the deploy is so bad because it also pauses all other items behind it in the queue, and if you don't answer the question fast enough, all the other tasks behind it time out. The user prompt system makes this 'automated system' next to useless, as a human needs to be there. I just want to press a button and expect to come back to a result. That's literally why we are all using this tool. Can this issue please be a top priority? It should be easy to implement, as you already have the retry mechanics: just take the user step out and have a counter of (x) retries before failing.
It would be nice if you could add functionality so that when a step fails, it automatically retries itself up to the number of retries you want it to perform.
We are getting lots of error warnings when doing deployments, which require manual intervention. Sometimes we don't notice them, and that leads to running past our maintenance window.
A band-aid I would recommend for people who have these kinds of steps that routinely fail is to clone the step and, on the clone, update the Run Condition to "Failure: only run when a previous step failed".
Furthermore, if you have other steps that may fail beyond the one you know works with a retry, you can leverage the output variable run condition "Variable: only run when the variable expression is true". This is probably the cleanest way to gauge this, but it is more advanced for some. You can add a custom script with some PowerShell to define a true or false value.
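To make the output-variable idea concrete, here is a minimal sketch of the "define a true or false value" script. It is written in Python rather than PowerShell purely for illustration. The variable name `NeedsRetry` and the `publish` callback are hypothetical; in a real script step the callback would be replaced by whatever output-variable helper your script language provides, and the follow-up step's variable run condition would then reference that output variable.

```python
def retry_flag(step_succeeded):
    """Return the string a variable run condition could test."""
    return "False" if step_succeeded else "True"


def run_and_flag(action, publish):
    """Run action, catch its failure, and publish whether a retry is needed.

    publish(name, value) is a hypothetical stand-in for the platform's
    output-variable helper; it is not an Octopus API.
    """
    try:
        action()
        succeeded = True
    except Exception as error:
        print(f"Step failed: {error}")
        succeeded = False
    flag = retry_flag(succeeded)
    publish("NeedsRetry", flag)  # hypothetical variable name
    return flag


# Usage: a made-up action that fails, with a dict standing in for the
# place where output variables would be stored.
published = {}


def flaky_action():
    raise ConnectionError("network reset")


flag = run_and_flag(flaky_action, publish=published.__setitem__)
print(flag, published)
```

Note that this sketch catches the failure so the script itself finishes normally and leaves the decision to the next step's run condition. Whether that suits your process is a judgment call.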
Steve Morgan commented
I am using Azure App Service and my deployments fail regularly, so this feature would be fantastic. If possible, I would like it to retry the failed step and any previous steps that you specify.
Aaron Roydhouse commented
It would be great to combine this with timeouts for Steps, so that if an idempotent Step gets stuck, it can eventually time out and retry. Currently your Step can wedge for _days_, and you can only cancel the whole Task, not the Step.
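As a rough illustration of timeout plus retry, the sketch below runs a command with a time limit and treats either a timeout or a non-zero exit code as a failed attempt. The function name and parameters are made up for this example, and this is something you could do inside your own script step today, not a description of how Octopus would implement it.

```python
import subprocess
import sys


def run_with_timeout_and_retry(command, timeout_seconds, max_attempts=3):
    """Run command; a timeout or non-zero exit code counts as a failed attempt.

    Returns True as soon as one attempt succeeds, False if all attempts fail.
    Only safe for idempotent commands, since a timed-out attempt may have
    done part of its work before being killed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires.
            completed = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            print(f"Attempt {attempt} timed out after {timeout_seconds}s")
            continue
        if completed.returncode == 0:
            return True
        print(f"Attempt {attempt} exited with code {completed.returncode}")
    return False


# Usage: stand-in commands built from the current Python interpreter.
quick = [sys.executable, "-c", "print('ok')"]
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]

quick_ok = run_with_timeout_and_retry(quick, timeout_seconds=10)
stuck_ok = run_with_timeout_and_retry(stuck, timeout_seconds=0.5, max_attempts=2)
print(quick_ok, stuck_ok)
```

The wedged command is abandoned after each timeout instead of blocking forever, which is the behaviour being asked for at the Step level.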
Steve Land commented
In our case, we frequently have failures with several causes, such as:
- Flaky proxy server
- SFTP errors
- Delay between Terraform steps completing & services becoming live
- Azure Zipdeploy failures
In 90% of these cases an immediate retry succeeds, and since the majority of our deployments are CI triggered, it would be a really great user experience if we didn't have to do this manually.
Bryan Roth commented
This would be very useful. I've noticed built-in Octopus steps failing because of a locked file, and simply retrying that step often succeeds. It would be nice to have an option to retry a step if a failure occurs, up to a certain number of retries.
I know it's possible to bake in retry functionality into script steps or step templates that you create.
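For anyone who wants to bake retries into their own script steps in the meantime, a minimal sketch of such a wrapper might look like the following. The names `retry`, `max_attempts` and `delay_seconds` are my own, and the locked-file action is a made-up stand-in for whatever your step really does.

```python
import time


def retry(action, max_attempts=3, delay_seconds=0.0):
    """Call action() until it succeeds or max_attempts is reached.

    Returns whatever action() returns; re-raises the last exception
    if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of retries: let the step fail as usual
            print(f"Attempt {attempt} of {max_attempts} failed: {error}; retrying")
            time.sleep(delay_seconds)


# Usage: a hypothetical action that fails once on a locked file.
calls = {"count": 0}


def copy_files():
    calls["count"] += 1
    if calls["count"] == 1:
        raise OSError("file is locked")
    return "deployed"


result = retry(copy_files, max_attempts=3)
print(result)
```

This only helps for scripts you control, which is why having it built in for the standard steps would still be valuable.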
Duangchan Ueta commented
Please add this. It would help us overcome failures when turning on the VMs.
This would be brilliant, i.e. try 3 times and then stop with failure, etc.
This feature would be greatly appreciated, especially for long-running projects. We have sporadic timeout issues with big files and it would be extremely useful.
Please add this. Many of our deploys are on Windows and there are so many things that can go wrong where a simple retry, just once, seems to correct it. To get around this issue, we've had to add our own custom retry methods, which have helped a lot, but we still have many step templates to port over. It would be so much more convenient and scalable to have this built into Octo.
Mathew Gallagher commented
This feature would be helpful. Our particular case involves an issue we have starting one of our Windows services. Due to old code, the service manager will sometimes time out. Simply retrying the step in a guided failure usually gets it to start. We would love to automate this error test and correction.
This feature would be useful. Have a system var that you can set for Deployment.MaxRetry. We have sporadic file access issues, but the main one is 'log4net:WARN Cannot RollFile', which clears itself on first retry.
Lee Cherry commented
This would also be useful in instances of locked files, believe it or not. I have seen lots of instances where the deployment is waiting on a retry because it could not update a file. I don't do anything other than hit retry and it works, which suggests that at that moment in time a file was locked.
Another instance where it would be useful is when multiple deployments are happening to the same machine, e.g. IIS websites. You may want to stop IIS for one of the sites to be deployed, and then start it again. If there were multiple different sites on the same box, it would be good to deploy all projects at once and let Octopus do the retry should IIS be stopped when it needs to be started.