Wednesday, October 29, 2014

DB2 for z/OS: Thoughts on Table Space and Index Page Size Selection

The other day I got a note, from a DB2 for z/OS DBA, in which I was asked for my thoughts regarding the selection of page size for a table space. I typed out what I hoped would be a useful response, and found myself thinking, as I sometimes do, "Hey -- I could turn this into a blog entry." And so I have. I'll start here by offering some guidelines on page size selection for table spaces, and then I'll shift to the index perspective on the matter.

Table spaces

A table space (or index) gets its page size from the buffer pool to which it is assigned; so, a table space assigned to a buffer pool with buffers of 8K bytes in size will have pages of 8K bytes in size.

First, and obviously, the row length of a table is an important factor in selecting the page size for the associated table space. Row length determines the minimum page size that you can use for a table space. If the rows of a table are 7000 bytes in length, the pages of the associated table space will have to be at least 8K bytes in size, because a table row cannot span pages of a table space (an exception to this rule: a long LOB or XML value would span multiple pages in a LOB table space or an XML table space).

Once you've determined the minimum possible size for the pages of a table space, you can think about whether or not a page size larger than the minimum would be a good choice. Why might you want to go with a larger-than-minimum page size? One reason would be optimization of disk and server memory space utilization. Suppose that the row length for a table is 5000 bytes. The minimum page size for the associated table space would be 8K bytes; however, as rows can't span pages, that page size would waste quite a bit of space on disk and in memory (in each page you'd have 5K bytes occupied by a table row, and the remaining 3K bytes of space would be empty). In this case, a 16K page size would boost space utilization considerably: 15K bytes in each page could be occupied by 3 rows, leaving only 1K bytes of unused space). If you're wondering whether data compression would change this calculus, the answer is probably not -- that is to say, the right page size for an uncompressed table space is probably the right page size to use if you want to compress the table space; so, base page size selection on the uncompressed length of a table's rows.

Here's another reason to consider a page size that is larger than the minimum possible size for a table space (which, again, is determined by uncompressed row length): data access patterns. If data in a table space is typically accessed sequentially (i.e., if sequential scans are the rule for a table space, versus random single-page access), a larger page size could mean fewer GETPAGEs for those sequential scans, and that could boost CPU efficiency for applications that drive the sequential scans. Note that "sequential scan" doesn't necessarily mean a table space scan. It could be a matching index scan that would drive dynamic prefetch.

Just as access patterns for data in a table could lead you to select a larger-than-minimum page size for a table space, so they could push you in the other direction. When access to data in a table space is dominated by random, single-page reads, going with the minimum possible page size is probably the right move, so as to maximize buffer pool effectiveness -- in particular when SELECT statements targeting a table tend to have single-row result sets, you don't want to use an overly-large page size, because larger pages would bring into memory many rows that are not needed by a query, in addition to the one row that is needed (exception: if singleton SELECTs are executed via a "do loop" in a batch program, each time issued using a search value obtained from a file that is ordered in such a way that access to the target table is in fact sequential, dynamic prefetch would be expected to kick in and a larger page size could therefore be an efficiency-booster).

One more consideration: if you want to utilize LOB in-lining for a table (a DB2 10 for z/OS new-function mode feature), that could affect your table space page-size decision. Suppose a table has a LOB column, and the length of the table's rows without the LOB values would make 4K bytes a good page size for the associated base table space. If you determined that a high percentage of the table's LOB values could be completely in-lined with (for example), a 16K page, and if that inlining would deliver performance benefits for you, you might go with that 16K page instead of the minimum 4K page. You can read more about LOB inlining in an entry that I posted to this blog a couple of years ago.


Indexes

Prior to DB2 9 for z/OS, there was one choice for the page size of an index: 4K bytes. Starting with DB2 9 in new-function mode, the page size for an index could be 4K, 8K, 16K, or 32K bytes. Why might you go with a larger-than-4K page size for an index on a DB2 table? One reason would be to enable index compression -- something that requires the use of a greater-than-4K index page size (as noted in an entry I posted a few years ago to the blog that I maintained while working as an independent DB2 consultant).

Compression isn't the only reason to consider a larger-that-4K page size for an index. Another motivator could be a desire to reduce index page split activity. If an index is defined on a key whose values are not continuously-ascending (i.e., if key values for rows newly inserted into the underlying table are likely to be less than the key's current maximum value), throughput for high-volume insert operations could be constrained by index page split activity. [Unlike rows in a table, entries in an index MUST be physically ordered by key value, and that leads to index page splits when new key values are inserted into the "middle" of an index, as will be the case for an index defined on a non-continuously-ascending key -- if the page into which a new index entry must go is full, the page will be split, and a portion of the page's entries will be moved to a previously empty page in the index.] Index page splits can have a particularly significant impact on insert throughput in a DB2 data sharing environment, owing to an attendant increase in log write activity. When index pages are larger, index page split activity tends to decrease; so, if a primary concern is optimized throughput for high-volume insert operations, an index page size of 16K or 32K bytes could be preferred over the traditional 4K size.

Larger index page sizes can also improve the performance of index scans by reducing associated GETPAGE activity. On top of that, larger page sizes could reduce the number of levels for an index (referring to the root page level on top (logically speaking), the leaf page level on the bottom, and one or more non-leaf levels in-between in the B-tree index structure), and that would in turn result in reduced GETPAGE activity for index probes.  

On the other hand, if compression is not desired for an index, and if the primary performance concern is optimization of data retrieval involving index access, and if access to entries in an index tends to be random (versus sequential) in nature, a 4K page size is probably your best choice.

A note on changing an object's page size

If you implement a table space or an index with a certain page size and you later determine that a different page size would deliver better performance, you can make that change by way of ALTER TABLESPACE or ALTER INDEX, with a buffer pool specification that would reassign the object to a buffer pool with larger- or smaller-sized buffers, as desired. Starting with DB2 10 for z/OS running in new-function mode, a change to an object's page size is a pending DDL operation that can be non-disruptively accomplished with an ALTER (as mentioned above) followed by an online REORG of the target object. Note, however, that page-size changing via pending DDL is only possible if the table space being altered (or the underlying table space on which a to-be-altered index is defined) is of the universal variety -- one of several incentives to get to universal table spaces (prior to DB2 10, a table space could not be altered to have a different page size, and altering an index to change the page size would place the index in rebuild-pending status).

In conclusion...

Put some thought into your table space and index page-size decisions, and know that with universal table spaces (and indexes defined thereon), you can pretty easily change your mind down the road.

Monday, October 20, 2014

DB2 for z/OS Stored Procedures: Native SQL, or Java?

Recently, a mainframe DB2 DBA put a question to me: when should DB2 for z/OS native SQL procedures be used versus Java stored procedures, and vice versa?

Why he asked: the DBA worked with a group of developers who, collectively, could code both of these DB2 stored procedure types. Within this group, individuals tended to go with what they were used to. In other words, the decision to code a native SQL or a Java stored procedure tended to be based more on momentum (e.g., "I used a Java stored procedure last time, so I'll go that route again this time") than on consideration of factors that might make one type of stored procedure more appropriate than the other for a given situation. The DBA wanted to provide the developers with guidelines that would help them to make more informed decisions regarding the use of native SQL versus Java stored procedures. In response to this request, I communicated my take on the matter, and via this blog post I'll share those thoughts with you.

I'll start out with this statement: if I needed to create a DB2 for z/OS stored procedure, I'd go with a native SQL procedure unless I needed to do something that couldn't be done (or couldn't be done in a practical sense) with that type of stored procedure. If I did have a requirement to use an external stored procedure (i.e., one with an external-to-DB2 executable that would run in a stored procedure address space), I'd look at Java as a good choice for the associated programming language, unless I needed to do something that would best be done with a COBOL stored procedure program (more on this momentarily).

Why my "native-first" mind set? Three reasons:


1) Native SQL procedures can be developed (and maintained) by lots of different people. You don't have to be proficient in a programming language such as Java (or COBOL) to create a native SQL procedure. If you know SQL, you can quickly pick up the "control" SQL statements (referring to logic flow control -- statements such as ITERATE, LOOP, GOTO, etc.) that enable coding of a SQL PL routine such as a native SQL procedure ("SQL PL" is short for SQL procedure language, the language used in a DB2 environment to write stored procedures and user-defined functions using only SQL). On top of SQL PL being a pretty easy means of writing DB2 stored procedures, the free and downloadable IBM Data Studio product provides a user-friendly GUI tool for native SQL procedure development and debugging.

2) Native SQL procedures, when called by DRDA requesters (i.e., through DDF), are zIIP-offloadable. More specifically, zIIP-offloadable to the tune of about 60%. That's because they run under the caller's task and a DRDA caller's task is an enclave SRB in the DDF address space. A Java stored procedure, like any external stored procedure, will always run under its own TCB in a stored procedure address space. Yes, execution of that stored procedure's Java code will be zAAP- or zIIP-eligible (the latter via what's called zAAP-on-zIIP, which you get when zAAP-eligible work runs on a system that has zIIP engines but no zAAP engines), but execution of the SQL statements issued by the Java stored procedure will not be zIIP-eligible (you'd actually get a little zIIP offload for these SQL statements, but not what you'd get with a native SQL procedure).

3) Native SQL procedures are where you're likely to see the bulk of DB2 for z/OS stored procedure functionality enhancements. Example: with DB2 10, the XML data type can be used for a stored procedure input or output parameter, but only for a native SQL procedure. Another example: DB2 11 introduced the autonomous stored procedure (it has a separate unit of work from that of the calling process, so that a data change made by an autonomous procedure will persist even if the calling transaction subsequently fails and is rolled back by DB2). Only a native SQL procedure can be autonomous. Yet another example: DB2 11 delivered support for array-type input and output parameters for a stored procedure, but only if it's a native SQL procedure.

When might I use an external stored procedure (such as a Java stored procedure)? I'd go this route if I needed to do something that couldn't be done (either at all, or in a practical sense) with a native SQL procedure. Maybe there is a need for a level of sophistication in the logic of the stored procedure that I can't implement with SQL PL (though in that case I might ask, "How sophisticated does the logic in a data-layer program have to be?"). I might go with an external stored procedure (such as a Java stored procedure) if there were a need to access resources (data or services) outside of DB2 (a native SQL procedure can issue SQL statements, period). Note that even in that case, there would be capabilities available in a native SQL procedure that might provide the functionality required. Keep in mind that native SQL procedures can utilize DB2 functions, which include things like MQRECEIVE and MQSEND if there is a need to interact with WebSphere MQ, and SOAPHTTPNV if I need to access a Web service. I might take the Java path if I wanted to use the same (or nearly the same) Java stored procedure program in a DB2 for z/OS and a non-DB2 relational database environment -- a Java stored procedure likely wouldn't have to be much different in the two environments, whereas a SQL PL routine created for DB2 would have to be re-created in the SQL programming language of a different relational database server in order to function properly in that non-DB2 environment (a native SQL procedure provides portability within the DB2 family of relational database systems).


If I wanted to use an external stored procedure, Java would be a fine choice for the programming language. There was a time, not so long ago, when I was not a fan of Java code running on mainframe servers, but now I'm a proponent of Java in a z/OS environment, as I noted in an entry recently posted to this blog. As pointed out in said blog entry, DB2 11 for z/OS delivered some important enhancements that boost scalability and CPU efficiency for Java stored procedures.

Are there circumstances that would lead me to choose COBOL instead of Java for an external stored procedure? Yes. One scenario that comes to my mind involves invocation of an existing COBOL subroutine. That could be done very efficiently via a COBOL CALL from a COBOL stored procedure (I wrote of a real-world example of this scenario in a blog entry I posted this past spring). I might also lean towards COBOL over Java if, in my organization, there were COBOL-knowledgeable developers but not Java-knowledgeable developers available at the time I needed a DB2 external stored procedure program to be coded.

The bottom-line message is this: SQL PL (for native SQL procedures) and Java (especially in a DB2 11 for z/OS system) are both very viable choices when it comes to development of DB2 stored procedures. Rather than taking a "what we're used to" approach to choosing one versus the other, consider the relative strengths of each language in light of the objectives you have for a given stored procedure. A thoughtful and informed choice is likely to be the right one for you.