Marc Lognoul's IT Infrastructure Blog

Cloudy with a Chance of On-Prem

SharePoint: PAL from the Field

Introduction

Performance Analysis of Logs (PAL) is a free tool that analyzes Windows Perfmon logs against predefined thresholds. The thresholds are defined in configuration files, each usually mapped to a Microsoft technology (.NET, IIS…) or product (SQL Server, SharePoint…). It produces reports in HTML or XML format, the former also including eye-candy charts.

In a nutshell, PAL almost completely removes the hassle of reading and interpreting performance logs.

However, making sense of PAL reports in real life also takes time and experimentation and, unfortunately, very little guidance can be found on the web. I therefore wanted to close the gap a little.

This post assumes you are minimally familiar with PAL. If this is not the case, many other blogs detail its installation and basic usage. The CodePlex project also includes a useful introduction.

What to Expect from PAL

PAL is the perfect tool for investigating mostly infrastructure-related performance problems affecting Microsoft products and technologies.

It helps translate Perfmon logs into human-readable reports, with added value brought by charts, recommended thresholds and generic guidance. A report is roughly made of two sections: chronologically ordered alerts, and statistical figures with their matching charts.

In my opinion, PAL is not designed to help you with trending or with building up your capacity planning in the long run. For this purpose, a product such as SCOM should be preferred. Likewise, PAL should not be used as a performance monitoring tool. Finally, PAL will not help you drill down into the code and will not cover end-to-end performance monitoring or troubleshooting. For that, a real APM or tracing tool should be preferred.

Prerequisites

Make sure your performance counters are healthy. I can't remember the number of times I had to fix broken counters before anything else could take place.

Practice a little with Perfmon captures and PAL in a test environment. It seems obvious, but many organizations I worked for started directly in their production environment with a full counter set, a high capture frequency and abnormally long capture periods. This wastes time when generating reports and causes a lot of frustration and confusion, since the reports contain too much information to actually be helpful.

Decide whether you will generate PAL reports on a computer dedicated to this purpose or on the monitored server during off-peak hours. Keep in mind that while capturing counters has little to no effect on performance, performing a PAL analysis is extremely CPU and disk I/O intensive.

Although PAL interprets counters for you, make sure you understand what each counter really means and what it means in your own environment. Avg. Disk Queue Length and Current Disk Queue Length are good examples of misleading or misinterpreted counters.

Correctly identify your environment: which processes are running (at least the relevant ones), what the physical and logical disks are and their purpose, how memory is sized (physical and virtual) and, of course, the CPU characteristics.

In Perfmon/Perflogs, preferably identify processes by their PID instead of their instance ID. This is particularly useful with SharePoint and IIS, where multiple IIS worker processes (w3wp.exe) can be running, even in the most basic implementations.

While some SharePoint counters refer directly to SharePoint applications, others won't. It is therefore always useful to have scripts at hand that map worker processes to applications for you.

On Server 2003/IIS 6, using a command prompt:

cd %windir%\System32
cscript.exe iisapp.vbs

From Server 2008/IIS 7, using a command prompt:

cd %windir%\System32\inetsrv
appcmd list wp

Using PowerShell:

gwmi win32_process -filter "name='w3wp.exe'" | Select ProcessId, CommandLine

Be watchful with process IDs: they may change during the capture, since when a process crashes, a new one is usually started with its own ID. The same happens when a worker process recycles.
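One way to spot such changes is to save the PID-to-application-pool mapping at the start and at the end of a capture and compare the two. Below is a minimal Python sketch, not part of PAL. It assumes the output of `appcmd list wp` was saved as text with lines of the form `WP "1234" (applicationPool:PoolName)`, and that each pool runs a single worker process (no web garden). The function names, PIDs and pool names are hypothetical.

```python
import re

# Assumed line format of `appcmd list wp`: WP "1234" (applicationPool:PoolName)
WP_LINE = re.compile(r'WP "(\d+)" \(applicationPool:(.+)\)')


def parse_worker_processes(appcmd_output):
    """Return {app_pool_name: pid} parsed from saved `appcmd list wp` text."""
    pools = {}
    for line in appcmd_output.splitlines():
        match = WP_LINE.search(line)
        if match:
            pools[match.group(2)] = int(match.group(1))
    return pools


def changed_pids(before, after):
    """Return {pool: (old_pid, new_pid)} for pools whose PID differs."""
    return {
        pool: (pid, after[pool])
        for pool, pid in before.items()
        if pool in after and after[pool] != pid
    }


# Hypothetical snapshots taken at the start and at the end of a capture.
start = parse_worker_processes(
    'WP "4120" (applicationPool:SharePoint - 80)\n'
    'WP "5312" (applicationPool:SecurityTokenServiceApplicationPool)'
)
end = parse_worker_processes(
    'WP "6044" (applicationPool:SharePoint - 80)\n'
    'WP "5312" (applicationPool:SecurityTokenServiceApplicationPool)'
)
print(changed_pids(start, end))
```

Any pool reported here was recycled or restarted during the capture, so its counters are split across two process instances in the log.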

Also take time to benchmark PAL:

  • Estimate the storage used by captures
  • Estimate the time taken by PAL to produce reports
  • Estimate the storage used by PAL reports

As a reference point, a 2-hour capture using the default SharePoint 2010 counter set generates a BLG file of 30 to 50 MB and takes about 10 minutes to process. Longer captures or larger counter sets quickly push both figures much higher.

Some counters (like the ones related to processes and SharePoint's publishing cache) can inflate both the capture size and the report generation time, because they are multiplied by the number of running processes or existing site collections.
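For a first rough budget, the reference figures above can be extrapolated. The Python sketch below assumes that size and processing time scale linearly with capture duration and with the number of counter instances. That linearity is my simplifying assumption, not a PAL guarantee, and `instance_factor` is a hypothetical knob, so replace the constants with your own benchmark as soon as you have one.

```python
# Reference figures from a 2-hour default SharePoint 2010 capture.
REF_HOURS = 2.0
REF_SIZE_MB = (30.0, 50.0)   # low and high BLG size
REF_PROCESS_MIN = 10.0       # PAL processing time


def estimate_capture(hours, instance_factor=1.0):
    """Linearly extrapolate BLG size and PAL processing time.

    instance_factor is a hypothetical multiplier: 2.0 means roughly twice
    as many counter instances (processes, site collections) as the
    reference capture.
    """
    scale = (hours / REF_HOURS) * instance_factor
    return {
        "size_mb_low": REF_SIZE_MB[0] * scale,
        "size_mb_high": REF_SIZE_MB[1] * scale,
        "process_min": REF_PROCESS_MIN * scale,
    }


# An 8-hour capture with the same counter instances as the reference.
print(estimate_capture(8))
```

Under these assumptions, a full working day of capture already means 120 to 200 MB of BLG and around 40 minutes of PAL processing, before any extra processes or site collections are counted.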

And finally, download and install PAL on the computer you selected for this purpose. Remember, PAL is only used to generate reports, not to capture data or to read reports. There is therefore no strict requirement to install it on every SharePoint server.

Planning Performance Captures

To make your life easier, generate the Perfmon configuration files directly from PAL: start PAL, go to the Threshold File tab, select the threshold file corresponding to the workload and finally click the Export to Perfmon Template File button.

Select the format according to the operating system version the captures will be taken from. The LOGMAN format is the best choice if your goal is to automate the capture process as much as possible.

Carefully plan the capture period. Usually, the warm-up of an ASP.NET/SharePoint application generates a lot of noise that is not really relevant to your performance troubleshooting, so it is preferable to start capturing when the application is already in cruise mode, unless of course the performance problem occurs at compilation time. The same applies to crawl performance troubleshooting: preferably start capturing when the crawl is effectively running, not while it is starting.

Keep the sampling interval between 5 and 15 seconds. Less than 5 seconds does not help because it tends to make things look worse than they actually are (very short CPU peaks or bursts of intensive disk I/O…), while more than 15 seconds may make the capture inaccurate because too many data points are missing. In most cases, 15 seconds will do fine.
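To get a feel for how the interval shapes what you see, here is a small Python illustration using synthetic data, not a real Perfmon log. It assumes a counter whose sampled value is the average over the interval, which is a simplification, and a made-up load: 10% CPU for one minute with a 3-second burst at 100%.

```python
def resample_mean(per_second, interval):
    """Average a per-second series over consecutive windows of `interval` seconds."""
    return [
        sum(window) / len(window)
        for window in (
            per_second[i:i + interval]
            for i in range(0, len(per_second), interval)
        )
    ]


# Synthetic load: 60 seconds at 10% CPU, with a burst at 100% from second 20 to 22.
cpu = [10] * 60
cpu[20:23] = [100, 100, 100]

for interval in (1, 5, 15):
    samples = resample_mean(cpu, interval)
    print(f"{interval:>2}s interval: {len(samples):>2} samples, peak {max(samples):.0f}%")
```

With these numbers the 3-second burst shows up as a 100% peak at a 1-second interval, 64% at 5 seconds and 28% at 15 seconds, while the number of samples to store and analyze drops from 60 to 4.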

Keep the format binary (BLG): although not human-readable, it is far more compact and directly usable by PAL. Note: some tools can convert Perfmon logs whenever needed; I will discuss that at a later time.

Finally, if you run a multi-server farm (with a remote SQL Server, for example), decide whether you prefer to put captures from the various servers into the same log file or to use separate logs. Remember that in most cases the footprint of Perfmon is negligible. If you choose per-server captures, make sure you have sufficient control to run them simultaneously.
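For per-server captures, one option is to script `logman` so that every collector is created first and all of them are started back to back. The Python sketch below only builds the command lines and does not execute anything. It assumes the standard `logman` options `-s` (remote computer), `-cf` (counter list file), `-si` (sample interval), `-f bin` (BLG format) and `-o` (output path). The server names, collector name, counter file and output folder are hypothetical, so check the result against `logman /?` on your own systems before running it.

```python
def logman_commands(servers, counter_file, output_dir,
                    interval_seconds=15, collector="PAL_Capture"):
    """Build logman command lines: create every collector, then start them all."""
    commands = []
    for server in servers:
        commands.append([
            "logman", "create", "counter", collector,
            "-s", server,                  # target computer
            "-cf", counter_file,           # counter list, e.g. exported from PAL
            "-si", str(interval_seconds),  # sampling interval in seconds
            "-f", "bin",                   # binary (BLG) log format
            "-o", f"{output_dir}\\{server}",
        ])
    # Start commands are grouped at the end so captures begin close together.
    for server in servers:
        commands.append(["logman", "start", collector, "-s", server])
    return commands


# Hypothetical two-server farm: one web front end and one SQL Server.
for command in logman_commands(["WFE01", "SQL01"],
                               "sharepoint_counters.txt", r"D:\PerfLogs"):
    print(" ".join(command))
```

Creating all collectors before starting any of them keeps the start times within a few seconds of each other, which makes the per-server logs easier to line up afterwards.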

Happy performance troubleshooting!

Marc

Author: Marc Lognoul

Relentless cloud professional. Restless rider. Happy husband. Proud father. Opinions are my own.
