Jeroen sent me an interesting challenge: he would like to reload the router when the 3G WAN interface gets stuck (I thought my Nokia phone was the only one exhibiting this problem, but obviously I was wrong). The reload-on-failed-ping EEM applet I’ve published would be a perfect solution, but it uses track delay, and the maximum delay timeout is three minutes, while Jeroen would like to wait 15 minutes before reloading the router.
I had two off-the-cuff ideas: execute reload in X command when SLA fails and reload cancel when SLA recovers, or use a second EEM applet with event timer watchdog that is triggered (and stopped) by the SLA-tracking applets. Both options are pretty messy so I was not really happy with either one ... and then Jeroen managed to find a third, totally unexpected solution.
He decided to use the SNMP value event detector to detect SLA failure (each SLA measurement has its own MIB variables) and combined it with a trigger saying “execute this applet if the OID value is below the threshold X times in X sampling intervals.” Here’s his SLA definition (he gets extra bonus points for starting SLA measurements 30 minutes after power up) ...
ip sla 10
 icmp-echo 10.255.251.64 source-interface Loopback0
 request-data-size 16384
 frequency 10
ip sla schedule 10 life forever start-time after 00:30:00
... and the EEM applet (the last number in the OID string has to match ip sla entry number and the polling frequency should match the ip sla frequency):
event manager applet vodafone_down_RELOAD
 event snmp oid 1.3.6.1.4.1.9.9.42.1.2.9.1.6.10 get-type exact entry-op lt entry-val "2" poll-interval 10
 trigger occurs 179 period 1790
 action 01.0 syslog msg "No ping response last 30 min."
 action 02.0 syslog msg "Reloading now to see if things get better..."
 action 03.0 reload
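The trigger values follow from the polling interval: 179 polls at 10-second intervals span 1790 seconds, or just under 30 minutes. The same pattern could presumably be scaled down to the 15-minute wait mentioned at the start. The following is an untested sketch, assuming the same SLA entry (10) and the same 10-second polling interval; the applet name and syslog text are made up for illustration.

```
! Sketch only: 89 polls x 10 seconds = 890 seconds (just under 15 minutes)
! Applet name and message text are hypothetical
event manager applet vodafone_down_RELOAD_15min
 event snmp oid 1.3.6.1.4.1.9.9.42.1.2.9.1.6.10 get-type exact entry-op lt entry-val "2" poll-interval 10
 trigger occurs 89 period 890
 action 01.0 syslog msg "No ping response last 15 min."
 action 02.0 reload
```

If the ip sla frequency is changed, the poll-interval, occurs and period values would need to be recalculated to match.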

Awesome!!!!!!!!!
Just a thought. But in my experience it's usually enough to do a shut/no shut on the cellular interface to get the 3G back up and running.
I got this same request a while ago: reload the router if 3G has been down for a few minutes.
(This was based on the customer's experience with other 3G solutions, so it seems common that 3G users have to reload their equipment...)
But we ended up using EEM/Tcl to do a shut/no shut on the cellular interface before reloading the router (with different timers). So if shut/no shut fixed the problem, SLA recovered and the router didn't have to reload. (And we preserve the logging buffers, the recovery is quicker, etc.)
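The first stage of such a setup might look roughly like the sketch below. This is not the commenter's actual configuration: the track object number (100), the applet name and the interface name (Cellular0) are all assumptions, and the probe is the ip sla 10 entry from the post. The reload applet would stay in place as the later, slower stage.

```
! Hypothetical track object tied to the ip sla 10 probe from the post
track 100 ip sla 10 reachability
!
! Sketch only: bounce the cellular interface when the tracked SLA goes down
! Interface name Cellular0 is an assumption; adjust to the actual hardware
event manager applet cellular_bounce
 event track 100 state down
 action 1.0 cli command "enable"
 action 2.0 cli command "configure terminal"
 action 3.0 cli command "interface Cellular0"
 action 4.0 cli command "shutdown"
 action 5.0 cli command "no shutdown"
 action 6.0 cli command "end"
 action 7.0 syslog msg "Bounced Cellular0 after SLA failure"
```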
There's also another issue regarding 3G.
Most 3G equipment can fall back to GPRS/EDGE if the 3G signal is too weak or unavailable, and this can happen automatically.
However, from what I've heard*, the 3G equipment will not try to go back to 3G even if the 3G signal is available, if there is any data flowing. It will wait until there's no data transfer going on before going from GPRS/EDGE back to 3G.
(* I've not verified this myself, but I heard this from someone who's more familiar with 3G equipment than I am.)
You can also just reboot the cellular modem using "test cellular 0 2 modem-power-cycle".
A provider we hired to configure our 3G DMVPN OOB routers had this problem as well. He got in contact with TAC and, after some troubleshooting, they provided him with a working IOS. Don't know about public release though...
Ivan,
I have done similar EEM scripts in my role. But I don't reload the router, I only reload the 3G-HWIC instead, and I do it after I miss 8 consecutive IP SLA pings at 1-minute intervals with the default ping timeout of 5s.
I can share my config if you wish, let me know.
Cheers,
Joe.
That would be fantastic. Just paste it as a comment or post a link to somewhere.
Thank you! Ivan
Is it necessary to have this in your conf:
snmp-server enable traps ipsla
Joe,
please share
I would appreciate any help with this one:
I have an IP SLA that pings a host.
If the syslog message "%TRACKING-5-STATE: 222 ip sla 223 reachability Up->Down" has occurred 2 times in 3 minutes, it installs a null route.
What I would like to know is how I can make it so that this null route is removed only once 30 minutes have passed since the last syslog message "%TRACKING-5-STATE: 222 ip sla 223 reachability Down->Up"?
The thing is, I need to know I have a reliable backup link, with a mechanism to verify it (the 30-minute safe period).
track 222 ip sla 223 reachability
ip sla 223
icmp-echo x.x.x.x source-ip y.y.y.y
threshold 500
frequency 5
ip sla schedule 223 life forever start-time now
ip sla reaction-configuration 223 react timeout threshold-type xOfy 2 5 action-type trapOnly
!
event manager applet IPSEC_TUNNEL_2_FAIL
event syslog pattern "%TRACKING-5-STATE: 222 ip sla 223 reachability Up->Down"
trigger occurs 2 period 180
action 1.0 cli command "enable"
action 2.0 cli command "config t"
action 3.0 cli command "ip route 192.168.255.5 255.255.255.255 Null0 name NULL_WHEN_IPSLA223_FAIL"
action 3.1 cli command "exit"
action 4.0 syslog msg "IPSEC_VPN_TUNNEL2 TIMEOUT - MOVING TO IPSEC_TUNNEL1"
I was thinking of using a watchdog timer, but I understand it counts down from the time of a trigger. That's great, but if the SLA is flapping and I get two "Down->Up" messages, I think it would start the specific EEM applet multiple times, no? If yes, then in case of continuous flapping I'll get into trouble...
Thank you