This error is sometimes seen in large scale deployments where large number of client authentication requests fill up the RADIUS queues on the WLC. The WLC should typically be able to recover automatically but in certain cases if there is a lot of latency in the AAA server responses the WLC gets stuck in this state for a long time (have seen recovery times of more than 30-40 mins).
Typical response times from AAA server in normal functioning are in low milliseconds but can spike up to 1-10 second period.
Most customers that I have worked with have typical radius timeout value set to 5 seconds and radius retries set to default of 5 retries.
(Cisco Controller) >show radius summary Vendor Id Backward Compatibility............ Disabled Credentials Caching......................... Disabled Call Station Id Type........................ IP Address Administrative Authentication via RADIUS.... Enabled Aggressive Failover......................... Disabled Keywrap..................................... DisabledAuthentication Servers !--- This portion of code has been wrapped to several lines due to spatial! --- concerns. Idx Type Server Address Port State Tout RFC3576 --- ---- ---------------- ------ -------- ---- ------- 1 N 10.48.76.50 1812 Enabled 2 Enabled
this is how to configure the RADIUS timeout
This is how to configure:
config radius auth retransmit-timeout 1 5
During the RADIUS Queue full issue on the WLC, the WLC is start rejecting NEW authentication requests from the clients till more space is created in the RADIUS Queue to accommodate new auth requests. In the meanwhile, the WLC will continue to empty the Queue and processing the current authentications requests.
On WLC with version 7.6 you can use the devshell command below to see the accounting and authentication queues
(5508-5) >devshell printRadiusPktIdInQ
Printing Radius IDs in Queue
Queue [0] Count [0] ----> Accounting Queue
Queue [1] Count [143] -----> Authentication Queue
value = 2 = 0x2
(5508-5) >devshell printRadiusPktIdInQ
Printing Radius IDs in Queue
Queue [0] Count [0]
Queue [1] Count [255]
value = 2 = 0x2
On WLC running version 8.0
(5508-5) >show radius queue
Radius Queue
Auth Alloc : 223642
Acct Alloc : 0
Auth Free : 223642
Acct Free : 0
Auth Alloc Err : 0
Acct Alloc Err : 0
Queue depth in above case is calculated by Auth alloc - Auth Free values.
The Queue depth for both Accounting and Authentication is 256 [0 - 255]
The WLC keeps rolling from 0 - 255 and back to 0 and so on.
Here if you take a WLC port channel capture (capture filter udp port 1812), you can see the RADIUS exchange between the WLC and the AAA server. I have a display filter applied sourced from the WLC's ip address and have applied a Wireshark column to display the 'RADIUS IDs'. if you see closely the ids rotate from 0-255 and back to 0.
You can have a quick view of the auth requests the WLC sends out and the responses by plotting a Wireshark IO graph. Here my graph 1 in black is the auth requests sent by the WLC (filter: radius.code == 1)
and graph2 in red is responses coming back from the AAA server (access challenge, access accept and access-rejects) (filter: radius.code==2 or radius.code==3 or radius.code==4)
typically you would like to see both the graphs follow close with each other. This can also show any spikes in the authentications per second seen/sent by the WLC.
The client debugs will show something like this
Client RADIUS authentication fails. "debug client" shows a message similar to this: *Dot1x_NW_MsgTask_7: Dec 17 11:43:36.983: 00:11:22:33:44:55 Entering Backend Auth Response state for mobile f0:d1:a9:24:d8:a7 *Dot1x_NW_MsgTask_7: Dec 17 11:43:36.985: 00:11:22:33:44:55 Processing AAA Error 'Out of Memory' (-2) for mobile f0:d1:a9:24:d8:a7 *Dot1x_NW_MsgTask_7: Dec 17 11:43:36.999: 00:11:22:33:44:55 Sent Deauthenticate to mobile on BSSID 20:37:06:00:11:22 slot 0(caller 1x_auth_pae.c:1394) at the same time, the msglog shows a message similar to this: *Dot1x_NW_MsgTask_7: Dec 17 12:30:23.296: #DOT1X-3-ABORT_AUTH: 1x_bauth_sm.c:447 Authentication Aborted for client 00:11:22:33:44:55 and the traplog shows a message like this: 297 Mon Dec 17 12:36:29 2012 Client Deauthenticated: MACAddress:00:11:22:33:44:55 Base Radio MAC:20:37:06:00:11:22 Slot: 1 User Name: unknown Ip Address: unknown Reason:Unspecif ied ReasonCode: 1
Remediations/ Workarounds:
1. try and reduce the auth storm/ reduce number of auths coming in
2. reduce latency between WLC-AAA.
3. can use multiple radius servers and point individual dot1s WLANs to different radius servers.
4. enable client exclusion to help alleviate the high auth rate
5. Can use multiple WLCs to split up the client load
No comments:
Post a Comment