Thursday, February 11, 2010

Antivirus - Paged Pool vs Non-Paged Pool

Let me explain non-paged pool and paged pool in more detail. The paged and non-paged pools are the memory resources that the operating system and device drivers use to store their data structures. As to your question of why the server worked fine before but the cluster resource has failed frequently of late: non-paged pool depletion can have multiple causes. For example, installing Windows Server 2003 SP2 enables a new feature, TCP Chimney, which also occupies extra non-paged pool (for details, see http://support.microsoft.com/kb/945977), although far less than the Symantec driver (SavE). To sum up, non-paged pool usage increases as more drivers are deployed on the server. Non-paged pool is a precious system resource, and the system may become unstable when it is nearly exhausted (here, when total non-paged pool usage approaches 100 MB).

One thing we can confirm now is that the Symantec driver (tag SavE) occupies too much non-paged pool (52 MB+), about 50% of the total non-paged pool. Non-paged pool usage would return to a normal level if the SavE allocation were brought down to a normal level.





Problem description:
==================
The Exchange cluster HTTP resource failed its IsAlive check.

Problem analysis:
================
We captured poolmon output on the server and determined that this is a non-paged pool pressure issue, caused mainly by the tag SavE, which belongs to the Symantec antivirus software. On this 32-bit Windows Server 2003 system, the total non-paged pool is around 120 MB because we enabled the /3GB switch as Exchange requires (without /3GB, the size would be doubled in most cases), and this pool is shared by most of the drivers.

In our case, the Symantec driver has occupied more than 52 MB of non-paged pool, almost half of the total, which leaves too little non-paged pool for other drivers and services to allocate. At some point, service requests may fail because of this, as happened when the Exchange HTTP resource failed its IsAlive check.
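To make the proportions concrete, here is a small arithmetic sketch using only the figures quoted in this write-up (about 120 MB of non-paged pool with /3GB, 52 MB held by SavE, and 108 MB of total usage from the performance log). The variable and function names are illustrative, not part of any tool.

```python
# Figures quoted in this write-up, in MB. The names are illustrative only.
TOTAL_NONPAGED_POOL_MB = 120   # approximate size with /3GB enabled
SAVE_TAG_USAGE_MB = 52         # non-paged pool held by the SavE tag
OBSERVED_TOTAL_USAGE_MB = 108  # total usage seen in the performance log


def share_percent(part_mb: float, whole_mb: float) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part_mb / whole_mb, 1)


save_share_of_pool = share_percent(SAVE_TAG_USAGE_MB, TOTAL_NONPAGED_POOL_MB)
save_share_of_usage = share_percent(SAVE_TAG_USAGE_MB, OBSERVED_TOTAL_USAGE_MB)
headroom_mb = TOTAL_NONPAGED_POOL_MB - OBSERVED_TOTAL_USAGE_MB

print(f"SavE share of the ~120 MB pool: {save_share_of_pool}%")
print(f"SavE share of the 108 MB in use: {save_share_of_usage}%")
print(f"Headroom left for new allocations: {headroom_mb} MB")
```

Measured against either base, the single tag accounts for a little under half of the pool, and only about a tenth of the pool remains free for every other driver and service.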

For this kind of issue, the best approach is to work with Symantec to reduce the driver's non-paged pool usage or to change the way it allocates resources. This releases a large amount of non-paged pool and eases the pressure. After uninstalling the previous Symantec antivirus software and installing a new version, we saw that the Symantec driver no longer allocates non-paged pool, and around 50 MB of non-paged pool was released.

That means Symantec has recognized this problem and changed the behavior in the new product version. (The paged pool is around 250 MB on a 32-bit Windows Server 2003 system even with /3GB enabled.)

Regarding your concern about why the Exchange server became unstable after we installed some hotfixes: we have reviewed the list you sent us, and all of them are security updates. These hotfixes only fix security vulnerabilities; they do not change the resource buffer size. However, we cannot confirm whether the Symantec drivers consume more non-paged pool after these hotfixes are installed, since we did not collect performance data before this failure.

However, the key point is still how to release the non-paged pool occupied by the Symantec driver, since it used almost half of it. The new version of the Symantec product has fixed this problem by changing the way it allocates system resources, using paged pool instead. We therefore expect the Exchange server to become more stable, since total non-paged pool usage is now only 47 MB.

Suggestions:
============
From a long-term perspective, we suggest migrating the server to a 64-bit system, which would avoid this issue.

Meanwhile, please keep running poolmon and Performance Monitor to track system resource usage.
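One possible way to keep an eye on the counter between poolmon runs is to log it with the built-in typeperf tool, for example `typeperf "\Memory\Pool Nonpaged Bytes" -si 60 -o nonpaged.csv`, and then scan the CSV for samples above the level where this server became unstable. The sketch below assumes a two-column CSV (timestamp, bytes) with a header row; the host name and sample values are made up for illustration, and the 100 MB threshold is the figure from this case, not a general limit.

```python
import csv
import io

MB = 1024 * 1024
WARN_THRESHOLD_MB = 100  # level at which this server became unstable


def samples_over_threshold(csv_text, threshold_mb=WARN_THRESHOLD_MB):
    """Return (timestamp, megabytes) pairs whose value exceeds threshold_mb.

    Expects two columns per row: a timestamp and a byte count.
    The first row is treated as a header and skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    flagged = []
    for row in reader:
        if len(row) < 2:
            continue
        try:
            value_mb = float(row[1]) / MB
        except ValueError:
            continue  # blank or non-numeric sample
        if value_mb > threshold_mb:
            flagged.append((row[0], round(value_mb, 1)))
    return flagged


# Hypothetical sample data; the host name and values are invented.
SAMPLE_CSV = r'''"(PDH-CSV 4.0)","\\EXCH01\Memory\Pool Nonpaged Bytes"
"12/03/2009 04:02:25.000","99614720.000000"
"12/03/2009 04:03:25.000","110100480.000000"
"12/03/2009 04:04:25.000","113246208.000000"
'''

for timestamp, megabytes in samples_over_threshold(SAMPLE_CSV):
    print(f"{timestamp}: {megabytes} MB of non-paged pool in use")
```

A scan like this only shows when the total crossed the line; poolmon is still needed to attribute the usage to a pool tag.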


Cluster HTTP virtual server resource failed to be brought online.

Object:
======
Exchange cluster 2k3 sp2.

Assessment & Plan:
======
- Check cluster log:
- 00000e7c.00000810::2009/12/03-04:04:25.511 ERR Microsoft Exchange DAV Server Instance : [EXRES] DwCheckProtocolBanner: failed in send. Error 10054.
- 00000e7c.00000810::2009/12/03-04:04:25.511 ERR Microsoft Exchange DAV Server Instance : [EXRES] ExchangeCheckIsAlive: IsAlive failed, will retry in 100 msec.
- 00000e7c.00000810::2009/12/03-04:04:25.620 ERR Microsoft Exchange DAV Server Instance : [EXRES] DwCheckProtocolBanner: failed in send. Error 10053.
- 00000e7c.00000810::2009/12/03-04:04:25.620 ERR Microsoft Exchange DAV Server Instance : [EXRES] ExchangeCheckIsAlive: IsAlive failed, will retry in 200 msec.
- 00000e7c.00000810::2009/12/03-04:04:25.823 ERR Microsoft Exchange DAV Server Instance : [EXRES] DwCheckProtocolBanner: failed in send. Error 10054.
- 00000e7c.00000810::2009/12/03-04:04:25.823 ERR Microsoft Exchange DAV Server Instance : [EXRES] ExchangeCheckIsAlive: IsAlive failed, will retry in 400 msec.
- 00000e7c.00000810::2009/12/03-04:04:26.214 ERR Microsoft Exchange DAV Server Instance : [EXRES] DwCheckProtocolBanner: failed in send. Error 10053.
- Check the performance log on the server: total non-paged pool usage exceeds 108 MB.
- Follow http://support.microsoft.com/kb/934878 to check EnableAggressiveMemoryUsage:
- Click Start, click Run, type regedit in the Open box, and then click OK.
- Click the following registry subkey:
- HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HTTP\Parameters
- On the Edit menu, point to New, and then click DWORD Value.
- Type EnableAggressiveMemoryUsage, and then press ENTER.
- On the Edit menu, click Modify.
- In the Value data box, make sure the value is set to 1.
- On the File menu, click Exit to exit Registry Editor.
- Restart the HTTP service. To do this, follow these steps:
- Click Start, click Run, type cmd in the Open box, and then click OK.
- At the command prompt, type net stop http /y, and then press ENTER.
- At the command prompt, type iisreset /restart, and then press ENTER.
- Run the poolmon tool: tag SavE occupies around 52 MB+ of non-paged pool.
- As we know, the SavE tag belongs to the Symantec driver module.
- The system becomes unstable when total non-paged pool usage exceeds 100 MB.
- We tried disabling the Symantec services, but with no luck: the driver still occupies 52 MB+ of non-paged pool after a reboot.
- Since we cannot uninstall the Symantec software at present, we followed these steps to disable the TCP Chimney feature on the server using the Netsh.exe tool:

1. Click Start, click Run, type cmd, and then click OK.

2. At the command prompt, type:

Netsh int ip set chimney DISABLED

3. Press the ENTER key.

- We could then bring the cluster resources online.
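The cluster log check in the assessment above can be done by script when the log is long. The following is a minimal sketch that counts the send failures per Winsock error code and the IsAlive failures, using the log excerpt from this case as input. The mapping of 10053 and 10054 to their standard Winsock names is general knowledge rather than something stated in the log; the function name is illustrative.

```python
import re
from collections import Counter

# Standard Winsock names for the two codes seen in the excerpt.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (software caused connection abort)",
    10054: "WSAECONNRESET (connection reset by peer)",
}

SEND_ERROR_RE = re.compile(
    r"DwCheckProtocolBanner: failed in send\. Error (\d+)\."
)


def summarize(log_lines):
    """Count send failures per Winsock code, plus IsAlive failures."""
    codes = Counter()
    isalive_failures = 0
    for line in log_lines:
        match = SEND_ERROR_RE.search(line)
        if match:
            codes[int(match.group(1))] += 1
        elif "ExchangeCheckIsAlive: IsAlive failed" in line:
            isalive_failures += 1
    return codes, isalive_failures


PREFIX = "00000e7c.00000810::2009/12/03-04:04:"
RESOURCE = " ERR Microsoft Exchange DAV Server Instance : [EXRES] "
SEND = "DwCheckProtocolBanner: failed in send. Error "
ALIVE = "ExchangeCheckIsAlive: IsAlive failed, will retry in "

# The seven entries from the cluster log excerpt in this case.
LOG_EXCERPT = [
    PREFIX + "25.511" + RESOURCE + SEND + "10054.",
    PREFIX + "25.511" + RESOURCE + ALIVE + "100 msec.",
    PREFIX + "25.620" + RESOURCE + SEND + "10053.",
    PREFIX + "25.620" + RESOURCE + ALIVE + "200 msec.",
    PREFIX + "25.823" + RESOURCE + SEND + "10054.",
    PREFIX + "25.823" + RESOURCE + ALIVE + "400 msec.",
    PREFIX + "26.214" + RESOURCE + SEND + "10053.",
]

codes, isalive_failures = summarize(LOG_EXCERPT)
for code, count in sorted(codes.items()):
    print(f"{count} x {code}: {WINSOCK_ERRORS.get(code, 'unknown code')}")
print(f"IsAlive failures: {isalive_failures}")
```

The alternating reset and abort errors show only that the HTTP connection was being dropped; it was the poolmon and performance data that tied the failures to non-paged pool pressure.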

Next Action Plan:
======
1. Please involve the vendor, Symantec, to check whether updated drivers can fix this issue, because it is abnormal for one application to occupy 50 MB+ of non-paged pool.



Others
First of all, you can check this information in Performance Monitor.
There are Pool Nonpaged Bytes and Pool Paged Bytes performance counters
under the Process performance object (per process); the Memory object
carries counters of the same names for the system-wide totals.

In fact, I would like to add some more here. For troubleshooting, the
key point is to find out which driver consumes the nonpaged pool or paged
pool memory. When a driver allocates pool memory from the system, it
provides a pool tag, so what we need to do is find out which pool tag is
allocating an unusual amount of paged or nonpaged pool. For example,
suppose there is a paged pool leak and the total paged pool grows from
80MB to 160MB in half a day. You then need to find out which pool tag
contributes most of the rise, and then find out which driver uses that
pool tag.

You can use the Poolmon tool to identify how many bytes are allocated per pool tag.
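The comparison described above, finding the tag that contributes most of the rise, amounts to subtracting two snapshots taken some hours apart. Here is a minimal sketch that assumes the per-tag byte counts have already been extracted from the poolmon logs into dictionaries. The function name, the tags TagA and TagB, and all byte values are hypothetical; only the SavE tag name comes from this case.

```python
MB = 1024 * 1024


def top_growing_tags(before, after, limit=3):
    """Rank pool tags by byte growth between two snapshots.

    before and after map a pool tag to the bytes it currently holds.
    Returns (tag, growth_in_bytes) pairs, largest growth first.
    """
    growth = {tag: after[tag] - before.get(tag, 0) for tag in after}
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical snapshots; the values are invented for illustration.
snapshot_morning = {"SavE": 20 * MB, "TagA": 10 * MB, "TagB": 8 * MB}
snapshot_evening = {"SavE": 52 * MB, "TagA": 11 * MB, "TagB": 8 * MB}

for tag, grown in top_growing_tags(snapshot_morning, snapshot_evening):
    print(f"{tag}: +{grown / MB:.0f} MB")
```

Ranking by growth rather than by absolute size helps separate a leak from a tag that is large but steady; a tag that is large and flat, as SavE was here after reboot, needs the absolute figure instead.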


a. Save the attached file to your server and unzip it to install the tool.
b. Run the installed PoolMon Log Creator tool. You only need to change
its Snapshot Interval to 60 minutes, and then click the "Start Poolmon
Logging Service" button at the bottom to start.

NOTE: The tool requires .NET Framework 1.1. If you receive a related
error message, please download it from the following link and then
install it:
http://www.microsoft.com/downloads/details.aspx?familyid=262D25E3-F589-4842-8157-034D1E7CF3A3&displaylang=en

c. Leave the tool running until the problem occurs, then stop logging and
send the log file to Microsoft Support Professionals, who will let you
know the answer.

For more information:

How to use Memory Pool Monitor (Poolmon.exe) to troubleshoot kernel mode
memory leaks
http://support.microsoft.com/default.aspx/kb/177415
