Sunday, March 8, 2015

Processing AAA Error 'Out of Memory'

This error is sometimes seen in large scale deployments where large number of client authentication requests fill up the RADIUS queues on the WLC. The WLC should typically be able to recover automatically but in certain cases if there is a lot of latency in the AAA server responses the WLC gets stuck in this state for a long time (have seen recovery times of more than 30-40 mins).

Typical response times from AAA server in normal functioning are in low milliseconds but can spike up to 1-10 second period. 

Most customers that I have worked with have typical radius timeout value set to 5 seconds and radius retries set to default of 5 retries. 


(Cisco Controller) >show radius summary 
Vendor Id Backward Compatibility............ Disabled
Credentials Caching......................... Disabled
Call Station Id Type........................ IP Address
Administrative Authentication via RADIUS.... Enabled
Aggressive Failover......................... Disabled
Keywrap..................................... DisabledAuthentication Servers

!--- This portion of code has been wrapped to several lines due to spatial!
--- concerns.
Idx  Type  Server Address    Port    State    Tout RFC3576
---  ----  ----------------  ------  --------  ----  -------
1    N     10.48.76.50       1812    Enabled   2    Enabled


this is how to configure the RADIUS timeout

This is how to configure:
config radius auth retransmit-timeout 1 5

During the RADIUS Queue full issue on the WLC, the WLC is start rejecting NEW authentication requests from the clients till more space is created in the RADIUS Queue to accommodate new auth requests. In the meanwhile, the WLC will continue to empty the Queue and processing the current authentications requests.

On WLC with version 7.6 you can use the devshell command below to see the accounting and authentication queues


(5508-5) >devshell printRadiusPktIdInQ

 Printing Radius IDs in Queue
 Queue [0]  Count [0]               ----> Accounting Queue

 Queue [1]  Count [143]           -----> Authentication Queue

value = 2 = 0x2


(5508-5) >devshell printRadiusPktIdInQ

 Printing Radius IDs in Queue
 Queue [0]  Count [0]

 Queue [1]  Count [255]

value = 2 = 0x2

On WLC running version 8.0 

(5508-5) >show radius queue
Radius Queue
  Auth Alloc         : 223642
  Acct Alloc         : 0
  Auth Free          : 223642
  Acct Free          : 0
  Auth Alloc Err     : 0
  Acct Alloc Err     : 0

Queue depth in above case is calculated by Auth alloc - Auth Free values.


The Queue depth for both Accounting and Authentication is 256 [0 - 255]

The WLC keeps rolling from 0 - 255 and back to 0 and so on.

 Here if you take a WLC port channel capture (capture filter udp port 1812), you can see the RADIUS exchange between the WLC and the AAA server. I have a display filter applied sourced from the WLC's ip address and have applied a Wireshark column to display the 'RADIUS IDs'. if you see closely the ids rotate from 0-255 and back to 0.

You can have a quick view of the auth requests the WLC sends out and the responses by plotting a Wireshark IO graph. Here my graph 1 in black is the auth requests sent by the WLC (filter: radius.code == 1)

and graph2 in red is responses coming back from the AAA server (access challenge, access accept and access-rejects) (filter: radius.code==2 or radius.code==3 or radius.code==4)

typically you would like to see both the graphs follow close with each other. This can also show any spikes in the authentications per second seen/sent by the WLC.




The client debugs will show something like this

Client RADIUS authentication fails.  "debug client" shows a message similar to this:

*Dot1x_NW_MsgTask_7: Dec 17 11:43:36.983: 00:11:22:33:44:55 Entering Backend Auth Response state for mobile f0:d1:a9:24:d8:a7
*Dot1x_NW_MsgTask_7: Dec 17 11:43:36.985: 00:11:22:33:44:55 Processing AAA Error 'Out of Memory' (-2) for mobile f0:d1:a9:24:d8:a7
*Dot1x_NW_MsgTask_7: Dec 17 11:43:36.999: 00:11:22:33:44:55 Sent Deauthenticate to mobile on BSSID 20:37:06:00:11:22 slot 0(caller 1x_auth_pae.c:1394)

at the same time, the msglog shows a message similar to this:

*Dot1x_NW_MsgTask_7: Dec 17 12:30:23.296: #DOT1X-3-ABORT_AUTH: 1x_bauth_sm.c:447  Authentication Aborted for client 00:11:22:33:44:55

and the traplog shows a message like this:

297 Mon Dec 17 12:36:29 2012 Client Deauthenticated: MACAddress:00:11:22:33:44:55
                             Base Radio MAC:20:37:06:00:11:22 Slot: 1 User
                             Name: unknown Ip Address: unknown Reason:Unspecif
                             ied  ReasonCode: 1


Remediations/ Workarounds:

1. try and reduce the auth storm/ reduce number of auths coming in
2. reduce latency between WLC-AAA. 
3. can use multiple radius servers and point individual dot1s WLANs to different radius servers.
4. enable client exclusion to help alleviate the high auth rate
5. Can use multiple WLCs to split up the client load

No comments:

Post a Comment