Windows provides pools of paged and
nonpaged memory that drivers and other components can allocate. The Executive
component of the operating system manages the memory in the pools and exposes
the ExAllocatePoolXxx functions for use by drivers. Pool memory is a subset of
available memory and is not necessarily contiguous. The size of each pool is
limited; it depends on the amount of physical memory that is available and
varies greatly among Windows releases.
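For example, a driver might allocate and free a general-purpose buffer with ExAllocatePoolWithTag, one of the ExAllocatePoolXxx family. The following is a minimal sketch; the buffer size and the 'MyDd' tag are illustrative:

PVOID buffer;

/* Allocate a 256-byte buffer from the nonpaged pool. The four-character */
/* tag identifies this driver's allocations in debugging and             */
/* pool-tracking tools ('MyDd' is an illustrative tag).                  */
buffer = ExAllocatePoolWithTag(NonPagedPool, 256, 'MyDd');
if (buffer != NULL) {
    /* ... use the buffer ... */
    ExFreePoolWithTag(buffer, 'MyDd');
}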
The paged pool is exactly what its name
implies: a region of virtual memory that is subject to paging. The size of the
paged pool is limited and depends on both the amount of available physical
memory on each individual machine and the specific operating system release.
For example, the maximum size of the paged pool is about 491 MB on 32-bit
hardware running Windows XP and about 650 MB on Windows Server 2003 SP1.
The nonpaged pool is a region of system
virtual memory that is not subject to paging. Drivers use the nonpaged pool for
many of their storage requirements because it can be accessed at any IRQL. Like
the paged pool, the nonpaged pool is limited in size. On a 32-bit x86 system that
is started without the /3GB switch, the nonpaged pool is limited to 256 MB; with
the /3GB switch, the limit is 128 MB. On 64-bit systems, the nonpaged pool
currently has a limit of 128 GB.
The pool sizes and maximums may vary
greatly for different Windows releases.
IRQL Considerations
When you design your driver, keep in mind
that the system cannot service a page fault at IRQL DISPATCH_LEVEL or higher.
Therefore, drivers must use nonpaged pool for any data that can be accessed at
DISPATCH_LEVEL or higher. You cannot move a buffer that was allocated from the
paged pool into the nonpaged pool, but you can lock a paged buffer into memory
so that it is temporarily nonpaged.
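For example, a driver that must touch a pageable buffer at DISPATCH_LEVEL can lock it down beforehand at PASSIVE_LEVEL by using an MDL. In the following sketch, PagedBuffer and BufferLength are illustrative names for the pageable buffer and its size:

PMDL mdl;

/* At PASSIVE_LEVEL: describe the paged buffer with an MDL and lock */
/* its pages so that they stay resident.                            */
mdl = IoAllocateMdl(PagedBuffer, BufferLength, FALSE, FALSE, NULL);
if (mdl != NULL) {
    __try {
        MmProbeAndLockPages(mdl, KernelMode, IoWriteAccess);
        /* The buffer can now be safely accessed at DISPATCH_LEVEL. */
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        mdl = NULL;
    }
}

/* ... later, after the last access at raised IRQL ... */
if (mdl != NULL) {
    MmUnlockPages(mdl);
    IoFreeMdl(mdl);
}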
Locks must never be allocated in the
paged pool because the system accesses them at DISPATCH_LEVEL or higher, even
if the locks are used to synchronize code that runs below DISPATCH_LEVEL.
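For example, a spin lock must live in nonpaged storage even if the code it guards runs at PASSIVE_LEVEL. A minimal sketch, in which the context structure and tag are illustrative:

typedef struct _MY_CONTEXT {
    KSPIN_LOCK Lock;     /* accessed at DISPATCH_LEVEL; must be nonpaged */
    LIST_ENTRY Queue;    /* data that the lock protects */
} MY_CONTEXT, *PMY_CONTEXT;

PMY_CONTEXT context;

/* NonPagedPool, never PagedPool, for a structure that contains a lock. */
context = (PMY_CONTEXT)ExAllocatePoolWithTag(NonPagedPool,
                                             sizeof(MY_CONTEXT), 'MyCx');
if (context != NULL) {
    KeInitializeSpinLock(&context->Lock);
    InitializeListHead(&context->Queue);
}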
Storage for the following items can
generally be allocated from the paged pool, depending on how the driver uses
them:
·         Information about device resources, relations, capabilities, interfaces, and other details that are handled in IRP_MN_QUERY_* Plug and Play requests. The Plug and Play manager sends all of these queries at PASSIVE_LEVEL, so unless the driver must reference this information at a higher IRQL, it can safely store the data in paged memory.
·         The registry path passed to DriverEntry. Some drivers save this path for use during WMI initialization, which occurs at PASSIVE_LEVEL.
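For example, a driver can copy the registry path into a paged-pool buffer in DriverEntry, because the copy is made and later used only at PASSIVE_LEVEL. This is a sketch; g_RegistryPath and the tag are illustrative:

UNICODE_STRING g_RegistryPath;   /* illustrative global */

NTSTATUS
DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    /* Paged pool is fine here: DriverEntry and later WMI */
    /* initialization both run at PASSIVE_LEVEL.          */
    g_RegistryPath.Length = 0;
    g_RegistryPath.MaximumLength = RegistryPath->Length + sizeof(WCHAR);
    g_RegistryPath.Buffer = (PWCH)ExAllocatePoolWithTag(PagedPool,
                                g_RegistryPath.MaximumLength, 'MyRp');
    if (g_RegistryPath.Buffer == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    RtlCopyUnicodeString(&g_RegistryPath, RegistryPath);
    /* Free g_RegistryPath.Buffer in the driver's unload routine. */

    /* ... remainder of initialization ... */
    return STATUS_SUCCESS;
}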
While running at DISPATCH_LEVEL or below,
a driver can allocate memory from the nonpaged pool. A driver can allocate
paged pool only while it is running at PASSIVE_LEVEL or APC_LEVEL because
APC_LEVEL synchronization is used within the pool manager code for pageable
requests. Furthermore, if the paged pool is nonresident, accessing it at DISPATCH_LEVEL—even
to allocate it—would cause a fatal bug check.
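A debug-time check can catch violations of this rule. For example, in a sketch using the checked-build ASSERT macro (the size and tag are illustrative):

PVOID  buffer;
SIZE_T size = 64;   /* illustrative */

/* Paged-pool allocations are legal only at PASSIVE_LEVEL or APC_LEVEL. */
ASSERT(KeGetCurrentIrql() <= APC_LEVEL);
buffer = ExAllocatePoolWithTag(PagedPool, size, 'MyPg');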
For a complete list of standard driver
routines and the IRQL at which each is called, see “Scheduling, Thread Context,
and IRQL,” which is listed in the Resources section at the end of this paper.
In addition, the Windows DDK lists the IRQL at which system and driver routines
can be called.
Lookaside Lists
A lookaside list is a set of fixed-size, reusable
buffers, designed for structures that a driver might need to allocate
dynamically and frequently. The driver defines the size, layout, and contents
of the entries in the list to suit its requirements, and the system maintains
list status and adjusts the number of available entries according to demand.
When a driver initializes a lookaside
list, Windows creates the list and holds the buffers in reserve for future use
by the driver. The number of buffers that are in the list at any given time
depends on the amount of available memory and the size of the buffers. Lookaside
lists are useful whenever a driver needs fixed-size buffers and are especially appropriate
for commonly used and reused structures, such as I/O request packets (IRPs). The
I/O manager allocates its own IRPs from a lookaside list.
A lookaside list can be allocated from either
the paged or the nonpaged pool, according to the driver’s requirements. After the
list has been initialized, all buffers from the list come from the same pool.
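The following sketch shows a typical life cycle for a nonpaged lookaside list; the entry structure, names, and tag are illustrative:

NPAGED_LOOKASIDE_LIST g_RequestLookaside;

typedef struct _MY_REQUEST {
    LIST_ENTRY Link;
    ULONG      Operation;
} MY_REQUEST, *PMY_REQUEST;

/* Initialize once, typically in DriverEntry. The system adjusts the */
/* list's reserve of buffers according to demand.                    */
ExInitializeNPagedLookasideList(&g_RequestLookaside,
                                NULL,      /* default allocate routine */
                                NULL,      /* default free routine */
                                0,         /* flags */
                                sizeof(MY_REQUEST),
                                'MyRq',    /* pool tag */
                                0);        /* depth: reserved, must be 0 */

/* Allocate and free entries as needed. */
PMY_REQUEST request = (PMY_REQUEST)
    ExAllocateFromNPagedLookasideList(&g_RequestLookaside);
if (request != NULL) {
    /* ... use the entry ... */
    ExFreeToNPagedLookasideList(&g_RequestLookaside, request);
}

/* Delete the list when the driver unloads. */
ExDeleteNPagedLookasideList(&g_RequestLookaside);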
Caching
Drivers
can allocate cached or noncached memory. Caching improves performance,
especially for access to frequently used data. As a general rule, drivers
should allocate cached memory. The x86, x64, and Itanium architectures all
support cache-coherent DMA, so drivers can safely use cached memory for DMA
buffers.
Drivers
rarely require noncached memory. A driver should allocate no more noncached
memory than it needs and should free the memory as soon as it is no longer
required.
Alignment
The alignment of the data structures in a
driver can have a big impact on the driver’s performance and efficiency. Two
types of alignment are important:
·         Natural alignment for the data size
·         Cache-line alignment
Natural Alignment
Natural alignment means aligning data according
to its type. The Microsoft C compiler aligns individual data items on an
appropriate boundary for their size. For example, UCHARs are aligned on 1-byte
boundaries, and ints, LONGs, and ULONGs on 4-byte boundaries.
Individual data items within a structure
are also naturally aligned; the compiler adds padding bytes if required. When
you compile, structures are aligned according to the alignment requirements of
their largest member. Unions are aligned according to the requirements of
their most strictly aligned member. When you compile a 32-bit driver,
pointers are 32 bits wide and occupy 4 bytes. When you compile a 64-bit driver,
pointers are 64 bits wide and occupy 8 bytes. A structure that contains a pointer,
therefore, might require different amounts of padding on 32-bit and 64-bit
systems. If the structure is used only internally within the driver,
differences in padding are not important. However, you must ensure that the
padding is the same on 32-bit and 64-bit systems in either of the following situations:
·         The structure is used by both 32-bit and 64-bit processes running on a 64-bit machine.
·         The structure might be passed to or used on 32-bit hardware as a result of being saved on disk, sent over the network, or used in a device I/O control request (IOCTL).
You can resolve this issue by using
pragmas (as described below) or by adding explicit dummy variables to the
structure just for padding. For cross-platform compatibility, you should explicitly
align data on 8-byte boundaries on both 64-bit and 32-bit systems.
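For example, a structure passed in an IOCTL can use fixed-size types and an explicit padding field so that its layout is identical in 32-bit and 64-bit builds. This sketch is illustrative:

typedef struct _IOCTL_PARAMS {
    ULONG   Flags;      /* offset 0 on both platforms */
    ULONG   Reserved;   /* explicit padding: guarantees that Value is  */
                        /* at offset 8 regardless of packing settings  */
    ULONG64 Value;      /* fixed-size rather than pointer-sized, so the */
                        /* member is identical in 32- and 64-bit builds */
} IOCTL_PARAMS;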
Proper alignment enables the processor to
access data in the minimum number of operations. For example, a 4-byte value
that is naturally aligned can be read or written in one cycle. Reading a 4-byte
value that does not start on a 4-byte (or multiple) boundary requires an
additional cycle, and the requested bytes must be pieced together into a single
4-byte unit before the value can be returned.
If the processor tries to read or write
improperly aligned data, an alignment fault can occur. On x86 hardware,
alignment faults are invisible to the user; the hardware fixes up the access as
described in the previous paragraph. On x64 hardware, alignment checking is
disabled by default, and the hardware similarly fixes up the access. On the Intel
Itanium architecture, however, if an alignment fault occurs while 64-bit
kernel-mode code is running, the hardware raises an exception. (For user-mode
code, raising an exception is the default behavior, which an individual
application can change, although disabling alignment exceptions on the Itanium
can severely degrade performance.)
To prevent exceptions and performance problems
that are related to misalignment, you should lay out your data structures
carefully. When allocating memory, ensure that you allocate enough space to
hold not just the natural data size, but also the padding that the compiler
adds. For example, the following structure includes a 32-bit value and an
array. The array elements can be either 32 or 64 bits long, depending on the
hardware.
struct Xx {
    DWORD NumberOfPointers;   /* number of elements in Pointers */
    PVOID Pointers[1];        /* variable-length array; storage is */
                              /* allocated at run time */
};
When this declaration is compiled for
64-bit hardware, the compiler adds an extra 4 bytes of padding after NumberOfPointers
to align the Pointers array on an 8-byte boundary. Therefore, the driver must allocate enough
memory for the padded structure. For example, if the array could have a maximum
of 100 elements, the driver should calculate the memory requirements as
follows:
FIELD_OFFSET(struct Xx, Pointers) + 100 * sizeof(PVOID)
The FIELD_OFFSET macro returns the byte
offset of the Pointers array in the structure Xx. Using this value in the
calculation accounts for any bytes of padding that the compiler might add after
the NumberOfPointers field.
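Putting this together, the allocation might look like the following sketch; the pool type and tag are illustrative:

SIZE_T size;
struct Xx *p;

/* FIELD_OFFSET covers the header and any compiler-inserted padding; */
/* the array storage is added explicitly.                            */
size = FIELD_OFFSET(struct Xx, Pointers) + 100 * sizeof(PVOID);
p = (struct Xx *)ExAllocatePoolWithTag(NonPagedPool, size, 'MyXx');
if (p != NULL) {
    p->NumberOfPointers = 100;
}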
To force alignment on a particular byte
boundary, a driver can use any of the following:
·         The storage-class qualifier __declspec(align()) or the DECLSPEC_ALIGN() macro
·         The pack() pragma
·         The PshpackN.h and Poppack.h header files
To change the alignment of a single
variable or structure, you can use __declspec(align())
or the DECLSPEC_ALIGN() macro, which is defined in the Windows DDK. The
following type definition sets alignment for the ULONG_A16 type at 16 bytes,
thus aligning the two fields in the structure and the structure itself on
16-byte boundaries:
typedef DECLSPEC_ALIGN(16) ULONG ULONG_A16;

typedef struct {
    ULONG_A16 a;
    ULONG_A16 b;
} TEST;
You can also use the pack() pragma to specify the alignment of structures. This pragma applies
to all declarations that follow it in the current file and overrides any
compiler switches that control alignment. By default, the DDK build environment
uses pack(8). This setting means that any data item whose natural alignment
is 8 bytes or less is naturally aligned, but not necessarily 8-byte aligned, and
anything larger than 8 bytes is aligned on an 8-byte boundary. Thus, two
adjacent ULONG fields in an 8-byte-aligned structure are packed together, with
no padding between them.
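For example, the following sketch of an illustrative on-wire header shows how the pragma changes member offsets:

#pragma pack(push, 1)
typedef struct _WIRE_HEADER {
    UCHAR Type;     /* offset 0 */
    ULONG Length;   /* offset 1 under pack(1); would be offset 4 */
                    /* under the default pack(8)                 */
} WIRE_HEADER;      /* sizeof is 5 under pack(1), 8 by default */
#pragma pack(pop)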
Another way to change the alignment of
data structures in your code is to use the header files PshpackN.h (pshpack1.h, pshpack2.h, pshpack4.h,
pshpack8.h, and pshpack16.h) and Poppack.h, which are installed as part of the
Windows DDK. The PshpackN.h files
change alignment to a new setting, and Poppack.h returns alignment to its
setting before the change was applied. For example:
#include <pshpack2.h>
typedef struct _STRUCT_THAT_NEEDS_TWO_BYTE_PACKING {
    /* contents of structure ... */
} STRUCT_THAT_NEEDS_TWO_BYTE_PACKING;
#include <poppack.h>
In the example, the pshpack2.h file sets
2-byte alignment for everything that follows it in the source code, until the
poppack.h file is included. You should always use these header files in pairs.
Like the pack() pragma, they
override any alignment settings specified by compiler switches.
For more information about alignment and
the Microsoft compilers, see the Windows DDK and the MSDN library, which are listed
in the Resources section of this paper.
Cache-Line Alignment
When you design your data structures, you
can further increase the efficiency of your driver by considering cache-line
alignment in addition to natural alignment.
Memory that is cache-aligned starts at a processor
cache-line boundary. When the hardware updates the processor cache, it always
reads an entire cache line rather than individual data items. Therefore, using
cache-aligned memory can reduce the number of cache updates necessary when the
driver reads or writes the data and can prevent other components from
contending for updates of the same cache line. Any memory that starts on a page
boundary is cache-aligned.
Drivers typically allocate nonpaged,
cache-aligned memory to hold frequently accessed driver data. If possible, lay
out data structures so that individual fields are unlikely to cross cache line
boundaries. The size of a cache line is generally from 16 to 128 bytes,
depending on the hardware. The KeGetRecommendedSharedDataAlignment
function returns the recommended alignment on the current hardware.
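For example, a driver can round a per-item allocation size up to the recommended alignment so that consecutive items never share a cache line. In this sketch, the ITEM structure is illustrative:

typedef struct _ITEM { ULONG Data[8]; } ITEM;   /* illustrative */

ULONG  alignment;
SIZE_T itemSize;

/* KeGetRecommendedSharedDataAlignment returns the recommended      */
/* alignment (typically the cache-line size); in practice the value */
/* is a power of 2, which the mask arithmetic below assumes.        */
alignment = KeGetRecommendedSharedDataAlignment();
itemSize  = (sizeof(ITEM) + alignment - 1) & ~((SIZE_T)alignment - 1);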
Cache-line alignment is also important
for shared data that two or more threads can access concurrently. To reduce the
number of cache updates, fields that are protected by the same lock and are updated
together should be in the same cache line. Structures that are protected by
different locks and can therefore be accessed simultaneously on two different
processors should be in different cache lines. Laying out data structures in
this way prevents processors from contending for the same cache line, which can
have a profound effect on performance.
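For example, assuming 64-byte cache lines, a layout like the following keeps each lock and the data it protects together while separating the two independently locked groups. The structure is illustrative:

typedef struct _QUEUE_STATE {
    /* First cache line: one lock plus the fields it protects          */
    /* (64 bytes is an assumption; query the real size at run time).   */
    DECLSPEC_ALIGN(64) KSPIN_LOCK InLock;
    LIST_ENTRY InQueue;

    /* Second cache line: an independently acquired lock and its data, */
    /* so two processors do not contend for the same line.             */
    DECLSPEC_ALIGN(64) KSPIN_LOCK OutLock;
    LIST_ENTRY OutQueue;
} QUEUE_STATE;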
For more information about cache line alignment
on multiprocessor systems, see “Multiprocessor Considerations for Kernel-Mode
Drivers,” which is listed in the Resources section at the end of this paper.