SSIS – An Inside View Part 3
This article is part 3 of a 4 part series that explores the internals of SQL Server Integration Services. This article looks at Buffer Management in SSIS.
SSIS Buffer Management
As explained in the last article, the data flow engine requires a buffer (a buffer is a managed storage object that holds the data in memory) to store incoming data from the source, do the necessary transformation in-memory if any and upload it the destination. The creation, allocation and management of a buffer is done by the Buffer Manager. SSIS has its own buffer manager which manages data specific to the data flow pipeline engine and is different from the SQL Server buffer manager.
The data stored in buffers may be shared among several components of the pipeline (sources, transforms, destinations). The buffers provide the “wiring” by which data is moved, without performing unnecessary copy operations; instead it passes the pointers to an interface to access the buffer and its contents.
The source adapter reads rows from the source and adds them to a buffer. Once the current buffer is full, the buffer is marked as full and source adapter passes the buffer to the first transform downstream of the source adapter. The downstream transform grabs a pointer to the buffer and accesses the data one row at a time doing the component-specific processing. Once the input has been processed, a synchronous transformation pushes the existing buffer downstream to the next transform; whereas an asynchronous transformation would define a new buffer type and allocate a new buffer to store results. When the original input buffer to the asynchronous transformation is no longer needed, it will signal to the buffer manager and the buffer will be de-allocated or re-used. Once all the rows of a buffer have been processed, it is either passed downstream to the next transform if the current processing transform is a synchronous transformation, or it will be thrown away if the current processing transform is an asynchronous transformation.
What is the size of a buffer and how many records does a buffer contain? It depends, SSIS automatically tunes buffer sizes to achieve maximum memory utilization based on five parameters discussed below.
Estimated Row size/length – plays a major role in determination of buffer size, it specifies the size/length of a row when all columns contain possible largest values. SSIS calculations are based on the metadata collected about the source data at design time. The row size can be shrunk by identifying the smallest possible data types for all columns as early in the data flow as possible. This is especially important for flat file data sources because each column is automatically read into SSIS as a string data type unless the column’s data type has been explicitly configured.
It is important to note that for a given character data type, the maximum memory allocated for the type is used. For instance, if the column is a varchar (1000) and the data it has is “Arshad” even in that case 1000 bytes (maximum for the type) would be allocated to store the six-character string. It is important in terms of memory and performance to choose types accordingly, so that as little memory can be used as possible which would allow for more rows to pack on a buffer object.
DefaultBufferSize – a DFT (Data Flow Task) property which specifies maximum size for an individual buffer in bytes. By default its value is 10,485,760 bytes (10 MB). As this setting is configured consider that its upper bound is constrained by an internal SSIS parameter called MaxBufferSize which is set to 100 MB and cannot be changed.
DefaultBufferMaxRows – a DFT (Data Flow Task) property that specifies a maximum number of rows that can be held in an individual buffer. By default the value is set to 10,000, though this figure will be reduced if (based on row size estimates) the memory consumption would be greater than the DefaultBufferSize.
MaxBufferSize – a SSIS internal parameter which cannot be changed, it dictates upper limit for the DefaultBufferSize property of DFT as discussed above. Its value is 100 MB.
MinBufferSize – a SSIS internal parameter which cannot be changed, it dictates lower limit for the DefaultBufferSize property of DFT as discussed above. Normally its value is 64 KB but it depends on operating system’s virtual memory allocation functionality so may differ from machine to machine.
The data flow engine begins the task of sizing its buffers by calculating the estimated size of a single row of data, then it multiplies the estimated size of a row by the value with DefaultBufferMaxRows to obtain a preliminary working value for the buffer size, there could be three scenarios as discussed below:
Scenario 1: If the anticipated buffer size (i.e. Estimated Row Size * DefaultMaxBufferRows) is greater than the value of MaxBufferSize, the engine reduces the number of rows in a buffer to keep only those rows whose size doesn’t exceed the value of MaxBufferSize.
For example, if SSIS calculates an Estimated Row Size of 15,000 bytes per record, then the anticipated buffer size would be calculated as 15,000 bytes/record * 10,000 records, which is approximately 143 MB and about 1.5 times greater than the MaxBufferSize of 100 MB. Because the anticipated size exceeds the MaxBufferSize, SSIS reduces the number of records per buffer approximately by a factor of 1.5 to get below the 100 MB threshold. In this scenario, each buffer is sized to hold approximately 6,600 records.
Scenario 2: If the anticipated buffer size (i.e. Estimated Row Size * DefaultMaxBufferRows) is less than MinBufferSize, the engine increases the number of rows in a buffer to accommodate as many as rows it can have to maximize memory utilization.
For example, if the Estimated Row Size is 5 bytes per record (much smaller than the previous example), 5 bytes/record* 10,000 records, is approximately 48 KB which is less than a MinBufferSize of 64 KB. Because the anticipated size is less than the MinBufferSize, SSIS slightly increases the records to reach the 64 KB threshold.
Scenario 3: If the anticipated buffer size (i.e. Estimated Row Size * DefaultMaxBufferRows) is somewhere in between MinBufferSize and DefaultMaxBufferSize, then SSIS attempts to size the buffer as closely possible to result of Estimated Row Size * DefaultMaxBufferRows using a multiple of the MinBufferSize to increase memory utilization.
For example, if the Estimated Row Size is 500 bytes per record, 500 bytes/record* 10,000 records, is approximately 4.8 MB which is less than DefaultMaxBufferSize of 10 MB, but greater than a MinBufferSize of 64 KB. In this scenario, SSIS will make adjustments to size the buffer as closely as possible to 4.8 MB.
What should be avoided is creating a situation where all available memory is consumed by buffers. This causes the buffers to be written out to the disk temporarily and has a big performance impact. Ultimately it is not always possible to ensure that buffers do not get spooled to the disk. Even the most aggressive buffer strategies using the most memory-efficient components can be tripped up by a solution requirement necessitating the use of a more memory-intensive operation. In those cases, having the buffer spooling out to a disk on a separate high-speed channel can have a suitable performance difference. To change the spooling location, two properties of the Data Flow task must be modified—BufferTempStoragePath and BLOBTempStoragePath.
Binary large object (BLOB) data is typically large in size, thus it’s easy to see that BLOB data could max out a 10 MB buffer. If the BLOB fits in the buffer it will be stored there. However if the BLOB does not fit on the buffer, it will be stored to a file location in a temporary BLOB storage. This BLOB storage file is configurable, but uses the path in the TEMP environment variables by default.
Since larger BLOB data that does not fit on an in-memory buffer is not stored in memory, there is an inherent performance penalty of having to access the disk. During the execution of the pipeline, BLOB data is read and written to disk as needed. It’s a good idea to configure this to use efficient/fast disk subsystem with adequate space assuming that eventually there will be enough BLOB data to spill onto the disk.
In case of buffer overflow or memory pressure, buffer content is spooled/swapped to file system; the location will be determined by BLOBTempStoragePath and BufferTempStoragePath properties of Data Flow Task, in case no values are specified for these properties, default values will taken from TMP/TEMP environment variables.
If buffer swapping occurs, it is highly likely that more physical RAM on the server in question would improve performance. It may also mean the package designer might have an opportunity to fine-tune the amount of data that is being fed into the pipeline.
Note: There are two possible causes which cause SSIS to move data to disk as below:
- When a memory allocation fails.
- When Windows signals the low memory resource notification event.
Buffer Performance Counters
The SSIS pipeline provides several Performance Counters that can be used to gain visibility into how resources are being utilized during package execution. For example, viewing the number of rows read and the number of buffers in use as the package executes. A few of the most common counters used while troubleshooting/tuning are:
Buffers in use – number of pipeline buffers in use throughout the pipeline.
Private buffers in use (A private buffer is created by a transform (not by the pipeline) and used as a workspace to assist with the processing of transformation use for temporary work only. For example, the Aggregation transformation uses private buffers) – number of private transformation buffers in use throughout the pipeline.
Flat buffers in use (A flat buffer is a special type of private buffer used when a transform needs a chunk of memory, but doesn’t need it divided into rowscolumns. It is used as “chunk” memory so that transforms can still have their memory tracked and managed by the buffer manager. An example is the Aggregate Transform which might use this to allocate space for a hash table.) – number of flat buffers in use throughout the pipeline.
Buffers spooled – if SSIS runs out of physical memory during package execution, or the process executing the package runs out of virtual memory, SSIS begins to spool buffers to files. It has an initial value of 0. When it goes above 0, it indicates that the engine has started memory swapping. Once this occurs it indicates that the computer has insufficient memory, performance will degrade significantly. It is good practice to monitor this setting and ensure that enough memory for operations is available.
Rows read – number of rows read from all the data sources in total.
Rows written – number of rows written to all the data destination in total.
BLOB bytes written/BLOB bytes read – number of BLOB bytes written/read to/from all data sources in total.
“Buffers in use”, “Private buffers in use” and “Flat buffers in use” are useful to discover leaks. During package execution time, these counters will fluctuate. Once the package finishes execution, their values should return to the same value as before execution. Otherwise, it is an indication that buffers have leaked.
“Rows read” and “Rows written” show how many rows the entire Data Flow has processed. Giving an overall idea about the execution progress.
The image below shows how to select performance counter category and individual counter in that particular category.
The following image shows the changing values of performance counters as time passes. Note at the last point in time recorded, 14 instances of packages were running whereas the average package instance running is 12.
If using 64-bit SQL Server, the performance counters will not be listed in PerfMon. That is due to a known bug in SQL Server 2005 x64, which can be read about in KB 941154. Though there is another way or workaround for this, run the performance counter mmc in 32-bit (Start | Run | mmc.exe /32 perfmon.msc or c:windowssysWow64Perfmon.exe).
To recap what has been covered about buffer management:
- MaxBufferSize and MinBufferSize are two SSIS internal settings which cannot be changed and has 100MB and 64KB values respectively.
- DefaultMaxBufferRows is a property of data flow, which has 10000 rows as its default value.
- DefaultMaxBufferSize is a property of data flow, which has 10 MB as its default value.
- In case of buffer overflow or memory pressure, buffer content is spooled/swapped to file system; the location will be determined by BufferTempStoragePath and BLOBTempStoragePath properties of Data Flow Task, in case no values are specified for these properties, default values will taken from TEMP/TMP environment variables.
- Turn on logging of BufferSizeTuning event to see its behavior details related during runtime.