One of the first things a consultant usually does when starting a consulting contract is they document the environments. In my last contract with one of the large investment banks in New York City, I was doing just that. In my discovery phase I ran into a strange anomaly that didn't make sense.
This story is about the physical and logical CPUs reported from the following query.
Most DBAs should recognize the query:
SELECT cpu_count AS [Logical CPU Count],
hyperthread_ratio AS [Hyperthread Ratio],
cpu_count/hyperthread_ratio AS [Physical CPU Count],
physical_memory_in_bytes/1048576 AS [Physical Memory (MB)]
First step is documenting the lower environments, and in my discovery interview with the client I was told that the UAT environment was almost identical to the production. There should be very little differences.
The UAT environment had 10 HP DL580 G7 Servers with Windows 2008 R2 Enterprise installed. All were running SQL Server 2008 Enterprise running SP3 without any CUs applied. The machines each had 4 sockets populated with Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz which are 10 core processors.
One of the machines appeared to only have two sockets populated. And this is where the story gets interesting, as a consultant, it is difficult to come in as an outsider and ask the question, "How many physical CPUs does this server have" and then ask, "are you sure" without annoying them. Sorry, I had to ask as I only see two of the four processors.
So the puzzle becomes interesting when I look at the production environment and notice 5 of the 10 servers also have the same issue, but also gives us some clues about the issue. The production environment has 5 servers with Intel(R) Xeon(R) CPU X7560 @ 2.27GHz processors which are 8 core processors and 5 of the same processors being used in UAT. The servers with 8 core processors were reporting the correct number of physical processors and 64 logical processors. The Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz servers were showing 2 physical processors with 40 logical processors.
All of the servers in the production environment have hyperthreading turned on, 1 server in UAT has hyperthreading enabled... guess which one?
Working in a bank requires extreme amounts of patience, as the change management process is very strict, and there is a significant segregation of duties that makes it hard to build a trust quickly with the server group and the database administrators quick enough to grease the path for the change management ticket implementation. In other words, lots of questions and lots of hesitation to change. The change management process is very important, as the cost of downtime is high.
Ticket created to disable hyperthreading in the one UAT server that was having the processor issue as soon as the realization that hyperthreading appear to be the issue and it had something to do with 64 vs 80 logical processors.
After a week of patiently waiting, hyperthreading was disabled (which requires a server reboot) on the one UAT server ... voila, I can see all 4 processors. Now came the need for an explanation as the explanation is going to be necessary to implement in production.
This is where having friends that are SQL Server MVPs with deep knowledge of Windows comes in handy. After brainstorming with Denny Cherry, he has a thought and says, "SQL Server 2008 doesn't understand what a Processor Group is". That is the ultimate clue into why we can't see more than 2 processors, but trying to put the puzzle together and explain it to the management is the trick. Why would we only see 2 physical processors, and only see 40 logical processors? It would make more sense to me that we'd see 64 logical processors vs. 40 logical processors. By the way even verified the number of schedulers in the DMV sys.dm_os_schedulers.
With one of the MVPs and a couple of CSS support engineers (and a little research), this following is the best answer to what we were experiencing:.
First, let me explain the Kernal group ( K-Group aka processor group) concept. In Windows 2008, the maximum logical processors was 64. With the number of cores per processor reaching levels greater than 10, the probability of exceeding the 64 logical processor max with hyperthreading was very likely. In Windows 2008 R2, Microsoft added 4 K groups, each that had a maximum of 64 logical processors, so they could get to 256 maximum (4 K groups).
So, a processor group looks like this:
For a four socket machine, a K-group can contain multiple NUMA nodes with a physical processor which contains multiple logical processors. The key is each K-group has a maximum logical processor limit of 64. More info about NUMA.
The following is the best representation of what happens when you have 4 sockets populated with 10 core processors:
As Windows 2008 R2 starts up, it determines it is going to need two processor groups, as 80 logical processors exceeds the 64 maximum per K-group. Windows "balances" the logical processors between the two groups with 40 in group 0 and 40 in group 1.
The trouble comes in when SQL Server 2008 (and coincidentally SQL Server 2008 R2) starts up. Since SQL Server 2008 does not understand the K-groups, it only uses K-group 0 which only has 2 physical processors and 40 logical processors.
In summary, if you have an installation of Windows 2008 R2 with SQL Server 2008 or 2008 R2, and you're only seeing two of your four installed processors in the DMV, sys.dm_os_sysinfo then you are most likely suffering from the same issue as described in this blog.
By the way, you are also suffering from half of your memory being accessed as foreign memory instead of local memory (cross numa), which can be a performance issue. In addition, the below graph shows the results of disabling hyperthreading in this scenario.
The dotted line is the point where hyperthreading was disabled. This was the weekend before a client migration that increased the transactions considerably (shown in load), but the waits decreased considerably (MS-SQL) and CPU usages decrease considerably (CPU).
Solution: Disable hyperthreading for the scenario where you are running Windows 2008 R2 and SQL Server 2008 R2 or below. Use this blog as documentation for management justification. It was a challenge justifying and convincing management to turn hyperthreading off. An additional argument by the system engineers that all servers should either have hyperthreading enabled or disabled was even more of a challenge.
The fact that the servers didn't have matching resources should have been justification enough to only change the servers experiencing the issue until further testing would confirm the proper hyperthreading for the client's workload.
By the way, Denny blogged about this same issue here.
Related Post: As The World Turns: SQL Server NUMA Memory Node and the Operating System Proximity by Bob Dorr