Optimization II: Optimizing InterBase SQL and Metadata

2003 Borland Convention

by Craig Stuntz

http://blogs.teamb.com/craigstuntz/articles/IBOptimization2.aspx

This session is part of a three-part series on creating high-performance systems using InterBase as the database back end. Part one (Optimization I: Optimizing InterBase Applications) explains how to optimize applications; part two, this paper, details how to optimize InterBase SQL and metadata. "InterBase Server Performance Guide: How to Select and Configure Hardware" completes the series.

Introduction

While almost anyone can write SQL, the ability to optimize an InterBase database (and use of it) can make the difference between acceptable performance and an application which really flies.

Because PLAN statements can be rather terse in the simplest cases and inscrutable when they become complicated, I've written a freeware utility called InterBase PLANalyzer which graphically displays InterBase query execution PLANs, and also provides information about the indices used by a query and the internal reads performed by InterBase when running it.  This article describes both the text PLAN syntax displayed by InterBase tools such as IBConsole and ISQL and the graphical plan displays produced by InterBase PLANalyzer.  Please feel free to use whichever tool you prefer. 

This article presumes that the reader already has some understanding of how to write a good SQL statement, and it makes no real attempt to explain how to tune an InterBase server installation; both are prerequisites which should be addressed before attempting to optimize individual statements or tables. This paper will get you started on the SQL side, and the third article in this series (a CD-only paper for this conference) explains how to configure your InterBase server.

The Golden Rule of Optimization

If you've read the first paper in this series, you've already seen this. That's OK. Read it again. It really is that important:

"Optimize" only the slow stuff; leave the rest alone.

It is important to recognize that in a typical application, a relatively small fraction of the code and metadata design will be responsible for most observed performance problems. Before optimizing anything, ensure that you are working on the slowest part of the system, as this will give you the greatest improvement for your optimization effort. Furthermore, it's senseless to obfuscate code or metadata design which is already fast enough for the job at hand by "optimizing" it into unreadability; this makes maintenance a nightmare.

In short, you want to optimize use of your own time, as well as your application!

Corollary to the Golden Rule of Optimization for database applications

Always develop database applications using a database populated with real data.

Many performance problems will not stand out until an application is used with a production database filled, perhaps even over-filled, with real data.  Nearly any application can be fast with an empty database! Most developers spend at least as much time in active development of an application as they do in formal testing, if not more. By using a real, production database during development, you will discover serious performance problems early in the development process. If radical changes are required to fix the problems, such as a dramatic change in database metadata, it is much easier to do this early in the development process. 

Furthermore, make sure you're using the application in the same way the customer will. On one project I worked on a while back, I was in the habit of logging in as an administrator because the security in the application frequently interfered with the testing I needed to do — some steps of the process I was testing required administrator privileges. Certain performance problems, however, only surfaced when logged in as a non-administrative user, because they were caused by confidentiality rules which didn't apply to administrators. Only by using the process as the customer will (in this case doing some parts of the test as an administrator and other parts as a non-administrator) can you be sure that you will find performance problems before your customers will.

Using a populated database forces you to "eat your own dog food" — if a portion of the system you are developing is too slow to make development productive when used against a large database, your customers are not likely to be productive with it, either. Use of a populated database helps you obey the Golden Rule.  Since some tables (imagine a table of U.S. states, for example) are unlikely to ever be large, you don't need to spend time optimizing their use.  It's harder to determine which tables are likely to be large when looking at an empty database.

Also, optimizing against a local DB server will disguise lag imposed by network traffic, so I strongly advise optimizing against a DB server running on another machine on your network.

Optimization Fundamentals

The fact that InterBase can accept a PLAN as part of a DSQL statement fools many people into trying to write their own.  Aside from the fact that PLAN expressions can be difficult to understand -- never mind write -- PLANs are, in general, a bad way to optimize a statement.  See the How InterBase Optimizes a Statement section for details as to why.

Rather than telling the optimizer to exit the scene entirely, we need to help it do its job.  There are several ways we can do this:

  1. Rewrite the SQL statement
  2. Change the metadata
  3. Maintain the database

We will look at each of these strategies individually.  But the first step is to figure out why the query is slow in the first place.  From there, we can determine which of the three strategies will fix the problem.

Diagnosing Optimization Problems

To start the optimization process you need the following items:

  1. A populated database
  2. A query which works (but perhaps not quickly enough)
  3. A network DB server (recommended)

NOTE: I'm going to use Employee.gdb for all examples in this document.  There are only 42 rows in the EMPLOYEE table in Employee.GDB, so it's difficult to see changes in execution speed, especially if you use a local connection.  As I mentioned earlier, it's important to use a populated database when optimizing queries.  Nevertheless, I will persist in using Employee.GDB for the examples in this document because I want readers to be able to experiment with the queries I use.  Also, including a 2 GB sample database with this document would make distribution impractical.  :)  Please feel free to fill up the tables using a test data generator for more useful examples.

Understanding Execution and Fetch Times

When you run a query several things happen:

  1. The SQL is parsed into BLR.  This happens at the time you prepare a DSQL statement, or, in a stored proc or VIEW, at the time the object is created.
  2. The query is optimized.  This always happens at the time the statement is prepared.
  3. The query is executed -- records are read from disk, and the server performs requested operations on these records.
  4. Rows from the result set are fetched to the client, by the client.  IB does not return rows until the client asks for them.
  5. When the query is run in the context of a client application, other actions may be triggered by event handlers in the application.

Parsing and optimization generally don't take much time (and are only required when the query is prepared, not each time it is run), so the first thing to do when a query is not running quickly is to run it outside the context of the client application.  If that fixes the problem then you have a coding issue and not an InterBase problem.  Using an SQL Monitor is the fastest way to figure out what is causing the problem.

[Side note: It is occasionally observed that Prepares take an appreciable amount of time in an application.  Some of this time is taken by the IB server, but some of it is taken by the database access components, which frequently do things like look up metadata values during a Prepare.]

When you run the query outside of your application, however, remember that most IB administration tools don't do a FetchAll, meaning they fetch only enough records from the result set to fill one screen and fetch additional records as the user scrolls, instead of fetching the entire result set at once.  (InterBase PLANalyzer is an exception, as a FetchAll is necessary for accurate read statistics.)  This will affect how fast the query seems to run, so make sure you're comparing apples to apples when executing the statement outside of the context of your program if your program is doing a FetchAll.  FetchAlls are sometimes necessary (for example, in a stateless application server), sometimes not, so if this is causing a speed problem you may wish to consider whether you can eliminate the FetchAll from your client application or reduce the number of records in the result set.

If it's still slow outside the context of your application, note whether the problem is in the speed of executing the query or fetching the records from the result set.  InterBase PLANalyzer gives you both numbers separately on the Statistics tab. It is important not to fall into the trap of looking only at the execution time, as both the execution time and the fetch time affect the final speed perceived by the end user of the program.  For example, sorting a result set using an index makes execution faster but fetching records slower, and sorting the same result set without the index has the opposite effect.  When both times are added together it is usually faster overall to sort without the index.  So you can't understand the nature of the speed problem unless you consider both numbers.

Large Result Sets

If the fetch time is very large then check the number of records your query returns.  Ask yourself if the client application actually needs every single record in the result set.  If the answer is "no" then it will be necessary to change the query to restrict the result sets to records your client application actually needs (usually with a WHERE clause or a JOIN).  This is especially critical on slow networks.

Records Read vs. Records Returned

If the number of records in your result set is small then the next thing to check is the number of records InterBase is reading internally to perform your query.  If no index is available to optimize a WHERE clause, for example, InterBase must read every record in the table to see if it meets the WHERE condition.  

Use the Statistics tab in InterBase PLANalyzer to determine how many records InterBase is reading to perform your query.  It should be fairly close to the number of rows returned.  Obviously, in the case of a query which combines two tables using an INNER JOIN, InterBase will have to read at least one record from each table for every row in the result set, so in this case the number of records read internally should be at least twice the number of rows returned.

If you determine that InterBase is reading significantly more records than are needed for the result set then performance for this query can be improved using an index.  If an index is already defined for this column but is not being used then you should consider reasons why InterBase might not use the index for your query.  If an index (or multiple indices) is being used but too many records are being read then you need to see if you can improve the selectivity of the index (or combination of indices).

Fetching Strategies

If the fetch time is fairly high but the number of records seems correct, it may be that use of an index is slowing down fetching.  Read the sections on table reads and ordering for more information, and also consider that the effective selectivity of an index may be less than the displayed selectivity.  An index will only improve speed if it reduces the number of records returned!

Network Throughput

A slow network, particularly a dial-up connection, can severely degrade client/server database performance.  But even on a 100 Mbps Ethernet connection, network throughput is still one of the biggest performance bottlenecks.  That's why the essence of good client/server application design is to reduce the amount of data which must be sent from the DB server to the client application.  Use WHERE clauses, avoid SELECT * when you don't need all columns, etc.  But sometimes you've done all of this and network throughput is still a problem.  The sections which follow describe the next steps for squeezing data through a slow pipe.

Fixing SQL Problems

Once you're certain that your query is returning only the necessary rows and columns, to minimize the amount of data which must be sent over the wire, the next step is to examine the SQL itself.  It may be necessary to create, remove, or change indices on the tables used.

Fixing Database Maintenance Problems

Rebalancing Indices

InterBase indices use a b-tree structure, and sometimes the b-trees can become out of balance, particularly after deleting or updating a large number of rows.  What this means is somewhat complex and beyond the scope of this paper (see Julian Bucknall's excellent book The Tomes of Delphi: Algorithms and Data Structures for a thorough background on this topic), but the short answer to the problem is that it's a good idea to back up and restore an InterBase database from time to time.  In addition to creating a backup (always a good idea), the restore portion of this process fills a blank database from scratch, which ensures that all indices are well-balanced.

Obviously, although InterBase's backup utility can function when users are active in the DB, it is necessary to bring the database offline before starting the backup if you intend to replace the DB with the restored version.
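
A minimal sketch of the backup/restore cycle using the gbak command-line utility, assuming the default SYSDBA login (adjust file names and credentials to your installation):

gbak -b -user SYSDBA -password masterkey employee.gdb employee.gbk
gbak -r -user SYSDBA -password masterkey employee.gbk employee.gdb

Here -b backs the database up to employee.gbk, and -r restores it, replacing the original file.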

You can diagnose an out-of-balance index by using the gstat command-line utility included with InterBase or with the Statistics function in IBConsole (Database->Maintenance->Database Statistics).

Individual indices can be brought back into balance by setting the index inactive and then setting it active again.  The following two commands in ISQL will accomplish this:

ISQL> ALTER INDEX INDEX_NAME INACTIVE;
ISQL> ALTER INDEX INDEX_NAME ACTIVE;

This is safe to do while users are active in the DB, but be aware that it may temporarily make queries which use the table in question very slow!  Note that indices used in a constraint (such as a primary or foreign key constraint) cannot be made inactive.  These indices can only be rebuilt using a backup and restore procedure.

When diagnosing with gstat or IBConsole's Database Statistics command, note that it is possible to generate statistics for indices only, instead of trying to find the index statistics in the middle of all of the other information these tools can generate.  Rebuild the indices if the fill distribution is low.
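
For example, a sketch using gstat's index-only switch:

gstat -i employee.gdb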

Setting Index Statistics

The term "statistics" is used in two different ways in InterBase when discussing indices.  It can mean the statistics displayed by the gstat command line program, or it can mean the value of the RDB$STATISTICS column in the RDB$INDICES system table.  

The SET STATISTICS command refers to the latter meaning.  This is just another way of expressing the selectivity of the index.  Because computing the statistics can be time-consuming (especially for a table with a lot of data in it), it doesn't happen very often.  In extreme cases (particularly after you've just modified a lot of data in a table) the index statistics can be inaccurate, causing InterBase's optimizer to select the wrong PLAN for your query.  The fix is very simple: Tell IB to recalculate statistics for that index.  You do this using the SET STATISTICS command in ISQL or the "Set Statistics" button on the PLAN Analysis tab of InterBase PLANalyzer.  Backing up and restoring the DB should also fix this.
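
For example, to recalculate the stored selectivity of the NAMEX index in Employee.gdb from ISQL:

ISQL> SET STATISTICS INDEX NAMEX;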

How InterBase Optimizes a Statement

When you send a SELECT or another SQL statement to the InterBase server, what happens?  It is important to understand the distinction between optimization and other parts of SQL statement processing.

Statement Execution Internals

Before InterBase can run an SQL statement, it first translates it into a kind of byte code called BLR.  BLR is just another way of expressing the statement -- think of it as a less-readable version of SQL.  So the first thing that needs to happen is that InterBase will parse your SQL, and, provided that there are no errors, convert it to BLR.  No optimization is done at this time.

Now that InterBase has the query in a language the engine can understand, it moves on to optimization.  InterBase will look at various strategies it can use to run your query, and estimate the cost (essentially, the performance hit) of each one.  For example it could use an index to find a particular record, or it could use a sequential (NATURAL) search.  InterBase adds up the estimated cost of each strategy and picks the one which is predicted to be the least "expensive."  (This is an oversimplification.)

The result of this optimization step is a plan for how InterBase will execute the statement.  InterBase can (partially) express this via the PLAN statement, and you can tell InterBase how to optimize the statement by feeding it a PLAN as a part of your SQL.
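
For illustration only, here is what an explicit PLAN looks like attached to a query (this uses a query and PLAN which appear later in this paper; as explained next, this is a technique to recognize, not to use):

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
WHERE
  LAST_NAME STARTING WITH 'S'
PLAN (EMPLOYEE INDEX (NAMEX));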

But while PLAN statements created by InterBase are very useful for diagnosing and repairing optimization problems, I strongly recommend that you do not use an explicit (i.e., included in the query) PLAN to tell IB which optimization strategy to use. 

The important thing to understand here is that the optimization strategy is data-dependent -- the best strategy for running a particular query can change depending upon how many records are in a particular table, and which values are stored in those records, because this changes the selectivity of the indices used.  

As you may know, when you create a stored procedure or a VIEW, InterBase will compile it into BLR and store the BLR.  This way, the SQL parsing step is eliminated when you want to work with the object you've created.  The optimization plan is not stored, however; it is recreated each time you run the query.  This is important because the correct strategy can change depending upon the nature of the data in the DB.

A more detailed description of the internal optimization process is outside of the scope of this paper, but interested readers might want to look at Paul McGee's paper, Managing Your InterBase Server.

Why Not to Use Explicit PLAN Statements

By using an explicit PLAN, you're telling InterBase that there is only one way to perform the query -- period, end of sentence.  InterBase's optimizer cannot change this strategy in response to changing data conditions.  A PLAN, in other words, disables the optimizer altogether.

Also, you cannot control the names of indices which InterBase creates for foreign keys and primary keys.  They will have names like RDB$PRIMARY148.  The number at the end of the index name may be different each time you recreate the database from DDL.  So you can't use primary or foreign key index names in a PLAN statement, because you can't predict the name of the index.

So what do you do when InterBase picks an optimization strategy which you think is incorrect?  It's time to put on our detective cap and figure out why the optimizer is being fooled.

Optimizing SQL Statements

Optimizing Index Use in a WHERE Clause

Here is a very simple SQL statement:

SELECT
  *
FROM
  EMPLOYEE E
WHERE
  E.PHONE_EXT = :PhoneExt;

The PLAN it produces looks like this:

PLAN (E NATURAL)

This PLAN means that InterBase will internally read every row in the EMPLOYEE table and return rows which match the WHERE clause.  

This PLAN is fairly easy to read, but more complex queries can produce very wordy PLANs.  In that case InterBase PLANalyzer's graphical display will make the PLAN much easier to decipher.

In a table like EMPLOYEE, which has fewer than 100 rows, there's no real problem with this, but on a larger table, say, a few hundred thousand records or so, this could be a problem.  (Here, we would do well to recall the golden rule of optimization.)  Since Employee.gdb doesn't have any 100,000 row tables to work with, let's pretend that the EMPLOYEE table is much larger than it really is.

If this query were running too slowly, the first thing to do is look at the indices available on the table and make sure that there is not already an appropriate index which isn't being used for some reason.

If there is an appropriate index available that the optimizer does not seem to be finding, check for the issues discussed in the rest of this paper which can prevent an index from being used -- poor selectivity, out-of-date index statistics, or an expression applied to the indexed column, for example.

If no appropriate index is available, you may choose to create one.  
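
For example, a minimal sketch for the query above (the index name is my own invention):

CREATE INDEX EMP_PHONE_EXT_IDX ON EMPLOYEE (PHONE_EXT);

With this index in place, the PLAN should change from (E NATURAL) to an indexed read.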

Optimizing LIKE, STARTING WITH, and CONTAINING

Because InterBase uses prefix compression when storing its indices, it can only match index entries using the beginning of the value (if not the complete value).  For example, this statement can use an index:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
WHERE
  LAST_NAME LIKE 'St%';

...because St will be the first two letters of the index entry.  But this statement cannot:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
WHERE
  LAST_NAME LIKE '%t%';

...because IB cannot use an index to find entries containing t at any position within the string.

This becomes especially important when using parameterized queries.  Although the first statement given above can use an index if one is available, this one cannot:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
WHERE
  LAST_NAME LIKE :LastName;

...even if the parameter 'LastName' is set to 'St%'!  The reason is that the application could just as easily set the parameter to '%t%' and InterBase must create a PLAN which works for all parameter values at the time the statement is prepared.  The fix is to use STARTING WITH instead of LIKE:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
WHERE
  LAST_NAME STARTING WITH :LastName;

Now set the parameter value to 'St' (without the %) and IB will deliver the same results, but it will use an index to find the records, if one is available.

CONTAINING can never use an index, for the reasons described above.  The best way to do full text searching is to use a word list algorithm.  Commercial word list tools can be purchased from http://www.fulltextsearch.com/, http://www.ibobjects.com/ibofts.html and http://www.softcomplete.com/products/ftsib/index.html, among other places, or you can write your own.

Optimizing Statements with Multiple PLANs

Let's examine a problem SQL statement:

SELECT
  *
FROM
  EMPLOYEE E
WHERE
  E.PHONE_EXT = (SELECT
                   E2.DEPT_NO
                 FROM
                   EMPLOYEE E2
                 WHERE
                   E2.EMP_NO = E.DEPT_NO);

Obviously, this statement has been constructed to demonstrate potential optimization issues rather than be an example of a realistic query!  Ignoring for a moment the obvious fact that the query itself is nonsense, let's look at the PLAN expression it produces and how we can use it to diagnose problems with the statement.

Here is the PLAN expression:

PLAN (E2 INDEX (RDB$PRIMARY7))
PLAN (E NATURAL)

The first thing we notice is that the PLAN expression includes two PLANs, one for the subquery and one for the outer query.  Subqueries and UNION are the SQL features which produce more than one PLAN.

More than one PLAN for an SQL statement which includes subqueries or UNION is normal, and it's not by itself something to be concerned with.  If it's possible to write an expression without using a subquery or UNION [ALL] which produces the same results, that's probably a good idea, but sometimes you need these features, and InterBase usually does a good job executing statements which make use of them.

However, there are a couple of things which can make these statements slow.

Correlated Subqueries

When writing an SQL statement containing a subquery, take great care to make sure that the subquery is not correlated.  A correlated subquery is one which must be re-executed for every row in the result set of the outer query.  There is no easy way to tell from the PLAN whether or not the subquery is correlated; you'll have to use your own powers of inspection.  In the example given, the subquery is correlated because there is no way to execute the subquery once for all rows in the outer query's result set.  In order to produce meaningful results, it must be executed once for each value of E.PHONE_EXT in the outer query's result set.

Correlated subqueries can usually be replaced with a JOIN.  In this case, we can get the same results with the following query:

SELECT
  *
FROM
  EMPLOYEE E
  INNER JOIN EMPLOYEE E2
    ON (E2.DEPT_NO = E.PHONE_EXT) AND (E2.EMP_NO = E.DEPT_NO)

This query produces a PLAN expression with only one PLAN:

PLAN JOIN (E2 NATURAL,E INDEX (RDB$FOREIGN8))

...but more importantly the query is only run once.  Internally we can see that InterBase is reading the table in natural (storage) order to read in the EMP_NO values and for each row using the index RDB$FOREIGN8 to determine if there is a row in the table with a matching value for DEPT_NO.  (By clicking on the index in the InterBase PLANalyzer tree view, we can see that the index RDB$FOREIGN8 is on the column DEPT_NO).

UNION

With an SQL statement which requires the UNION keyword more than one PLAN will obviously be required.  The only thing worth mentioning here is to use UNION ALL instead of UNION whenever possible.  UNION removes duplicate rows from the result set, and this requires sorting the result set of each query in the statement.  UNION ALL does not do this duplicate checking, removing the requirement for the sort and (sometimes dramatically) speeding up the query.  Note that many queries containing UNION will not result in duplicate rows anyway.

For example, as of this writing InterBase does not have a NULLIF function.  One workaround is to use UNION ALL.  What if we chose UNION instead? 

SELECT
  FULL_NAME, PHONE_EXT
FROM
  EMPLOYEE
WHERE
  PHONE_EXT IS NOT NULL
UNION
SELECT
  FULL_NAME, CAST('None' AS VARCHAR(4))
FROM
  EMPLOYEE
WHERE
  PHONE_EXT IS NULL

This query forces InterBase to sort the result set of each SELECT statement in the query.  But since PHONE_EXT cannot be NULL and non-NULL for the same row, there is no possible way for a row to appear in both queries.  By replacing UNION with UNION ALL, we can speed up the query by reducing the amount of work which InterBase must do to perform the query.

The optimizer could conceivably be altered to do this for you, but it would mean that it would have to be slowed down in order to implement rules which DBAs are expected to follow anyway.

Lists of Constants, Indices, and Subqueries

InterBase does not perform well in one specific situation.  If a query contains a subquery, and if that subquery compares a single column to more than one constant, and if that column is indexed, IB will treat the subquery like a correlated subquery, even if it isn't.  Here are a couple of examples of potentially problematic SQL:

SELECT
  *
FROM
  EMPLOYEE_PROJECT
WHERE
  EMP_NO IN (SELECT EMP_NO FROM EMPLOYEE 
             WHERE LAST_NAME IN ('Nelson', 'Young'));
SELECT
  *
FROM
  EMPLOYEE_PROJECT
WHERE
  EMP_NO IN (SELECT EMP_NO FROM EMPLOYEE 
             WHERE (LAST_NAME = 'Nelson') OR (LAST_NAME = 'Young'));

The PLAN produced is the same for both queries:

PLAN (EMPLOYEE INDEX (RDB$PRIMARY7,NAMEX,NAMEX))
PLAN (EMPLOYEE_PROJECT NATURAL)

Notice that the NAMEX index is used twice.  If there were three constants in the list (e.g., another last name) then NAMEX would appear three times, and so on.  This is not a problem in and of itself -- using the index repeatedly is probably the fastest way to pull rows from the table.  The problem is that InterBase internally does these lookups once for each row in the result set.  If the number of rows returned from the statement is large, this can take a long time.

Before trying any of the workarounds listed below, ensure that the speed problem really is coming from this issue, rather than from somewhere else.  The easiest way to prove that this is the problem is to use only one constant in the IN clause.  If the speed improves dramatically, then this issue is the problem.  If not, the problem is elsewhere.

Workaround #1: Use Table Data Instead of Constants

There are a couple of workarounds for the issue.  One is to select the matching values from a table instead of using constants in the query.  For example, we could create a table called SELECTED_EMPLOYEES and INSERT two records for employees Nelson and Young.  We could then rewrite the first statement as follows:

SELECT
  *
FROM
  EMPLOYEE_PROJECT
WHERE
  EMP_NO IN (SELECT EMP_NO FROM SELECTED_EMPLOYEES);

This fixes the problem.
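
For completeness, here is a sketch of how the helper table might be created and populated (the table design is illustrative; EMP_NO is a SMALLINT in Employee.gdb):

CREATE TABLE SELECTED_EMPLOYEES (
  EMP_NO SMALLINT NOT NULL);

INSERT INTO SELECTED_EMPLOYEES
  SELECT EMP_NO FROM EMPLOYEE
  WHERE LAST_NAME IN ('Nelson', 'Young');

The IN list still appears, but only in this one-time setup statement, not in a subquery which InterBase re-executes for every row of the outer query.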

Workaround #2: Use UNION or UNION ALL

When only one constant is used there is no problem.  Using UNION or UNION ALL to combine the results of two queries with only one constant each is often (considerably) faster than the single statement with two constants.  The following statement gives the same result as the first statement in this section, but may be considerably faster on a large table:

SELECT
  *
FROM
  EMPLOYEE_PROJECT
WHERE
  EMP_NO = (SELECT EMP_NO FROM EMPLOYEE 
             WHERE LAST_NAME = 'Nelson')
UNION ALL
SELECT
  *
FROM
  EMPLOYEE_PROJECT
WHERE
  EMP_NO = (SELECT EMP_NO FROM EMPLOYEE 
             WHERE LAST_NAME = 'Young');

Workaround #3: Force IB to Use a Sequential Scan

By forcing IB not to use an index, you work around the problem.  This workaround is obviously practical only if the table in the subquery is fairly small.

SELECT
  *
FROM
  EMPLOYEE_PROJECT
WHERE
  EMP_NO IN (SELECT EMP_NO FROM EMPLOYEE WHERE LAST_NAME || '' IN ('Nelson', 'Young'));

By concatenating LAST_NAME with an empty string, we defeat IB's ability to use an index to find matching records.  The PLAN produced is:

PLAN (EMPLOYEE INDEX (RDB$PRIMARY7))
PLAN (EMPLOYEE_PROJECT NATURAL)

So looking up each EMP_NO from EMPLOYEE is slower -- significantly slower if the EMPLOYEE table is large -- but the query no longer behaves as though it had a correlated subquery, meaning it will run faster when there are a large number of rows in the result set.

Understanding Table Reads

There are two ways that records can be read from a table:

  1. In storage order (i.e., in whatever order the records happen to be stored on the disk)
  2. Out of storage order

In order to understand a PLAN statement you need to be able to distinguish these two methods of reading data and understand their consequences.  

Reading Records in Storage Order

The fastest way to read records from the disk is to read them in storage order.  Moving the hard drive read head to a random place on the drive is one of the slowest things the DB server has to do, and fetching records in storage order minimizes the number of times the read head has to move, and allows full use of the read-ahead cache.

When a table is read in storage order, it will appear in the PLAN statement like this:

PLAN (EMPLOYEE NATURAL)

The word NATURAL following the table name indicates that the EMPLOYEE table will be read in storage order.

Reading Records Out of Storage Order

It is usually much slower to read records out of storage order because the hard drive head generally must be moved frequently (unless the records just happen to be in the same order on disk as they are in the index, which can sometimes happen with primary key indices).  However, the loss in speed is worth it when the index can be used to significantly reduce the number of records which must be read.

There are two ways that InterBase can indicate that a table is being read out of storage order.  The first syntax indicates that an index (or multiple indices) is being used to optimize a WHERE clause or a JOIN:

PLAN (EMPLOYEE INDEX (NAMEX,RDB$PRIMARY7))

The INDEX (NAMEX,RDB$PRIMARY7) portion indicates that InterBase is combining two indices and using the combination to locate matching records in the table.

In InterBase PLANalyzer's graphical display, the checkmarks and the numbers beside the indices indicate their selectivity, and the tabbed notebook icon beside the table is another reminder that the table is being read with an index, rather than in storage order.

InterBase implements this PLAN by reading each index, storing it internally as a bitmap, and then doing a boolean AND operation to combine the two bitmaps.  In this way, a combination index is created which InterBase can then use to find matching records.  There is some overhead for doing this combination, so depending upon what else you need to do with the table, a compound index might be more appropriate.

The second syntax indicates that InterBase is using an index to perform an ORDER BY:

PLAN (EMPLOYEE ORDER NAMEX)

If InterBase is using the index both to perform an ORDER BY and to optimize a WHERE clause or a JOIN, the second syntax will be used.

It is very important that you do not confuse this second syntax for reading records out of storage order with the PLAN syntax for reading records in storage order and then performing an in-memory sort.  These are two very different operations, although they both produce an ordered result set.

Understanding ORDER BY

InterBase can order a result set either by reading the records from the hard disk drive in order (using an index) or by reading the records in storage order and then sorting them.  In general, the latter is faster when both execution and fetch times are considered.

Fetching in Storage Order Plus In-memory SORTs

Consider this query:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
ORDER BY
  FIRST_NAME

This query causes InterBase to return the following PLAN:

PLAN SORT ((EMPLOYEE NATURAL))

This means that InterBase first reads the entire EMPLOYEE table in storage order, and uses an in-memory sort to order the results.  If there were too many rows in the table to sort in memory on the server, InterBase would page the records to temporary files as necessary.  

If the EMPLOYEE table had more rows, we would have seen that sorting the records in memory caused the query execution time to be longer than a query against the same table and rows but without a SORT in the PLAN statement, but that fetching the records is relatively quick.  This is because the records are now in order in a temporary sort file on the server, and the ordered rows can be read in storage order.  

Fetching Out of Storage Order and ORDER BY

Now let's modify the query:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
ORDER BY
  LAST_NAME, FIRST_NAME

This query causes InterBase to return the following PLAN:

PLAN (EMPLOYEE ORDER NAMEX)

This PLAN statement means that InterBase will not read the table in storage order.  Instead it will first read the index and then use the information in the index to visit each record on disk in alphabetical order.

Because InterBase does not have to read the entire table and sort it in memory in order to deliver the first record to the client, the execution time of this query will be as fast as the query without the ORDER BY statement.  But fetching the records may be much slower.  In the worst possible case, InterBase may have to move the read head on the hard disk drive in order to read each successive record.  In the best case the records will just happen to be in order on the disk anyway, so a lot of hard drive read head movement won't be required.

Fetching Out of Storage Order and WHERE

Sometimes fetching records out of order can significantly improve query speed.  In particular, it helps if we can reduce the number of records read:

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
WHERE
  LAST_NAME STARTING WITH 'S'

This query causes InterBase to return the following PLAN:

PLAN (EMPLOYEE INDEX (NAMEX))

Notice the difference between this PLAN and the one above (for the query which ended with "ORDER BY LAST_NAME, FIRST_NAME").  Instead of reading the entire table using the index, InterBase is using the NAMEX index to go directly to the records with a value for LAST_NAME that begins with S.  Since the query has no ORDER BY clause we cannot presume that the records returned will be in any particular order.  If we add "ORDER BY LAST_NAME, FIRST_NAME" to this query (keeping the WHERE clause shown) the PLAN will change to the same PLAN we saw for the query without this WHERE clause.  However, we can tell that only three records are read internally by InterBase by looking at the Statistics tab in InterBase PLANalyzer.  By cutting the number of records InterBase must read from the hard drive from 42 to three, we gain more than enough speed to make up for the cost of reading the three records out of storage order.

ORDER BY Performance Summary

As illustrated above, the two methods of reading records (in storage order and out of storage order) and the two methods of ordering records (using an in-memory sort or reading the records in sort order using an index) affect different parts of the query execution cycle and interact in somewhat complex ways.  Furthermore, the effects on speed vary depending upon whether or not the records happen to be stored on disk in something close to the order specified in the ORDER BY clause of the query.  So the following conclusions are a generalization and may vary depending upon your query and your DB.

In general, the fastest road to an ordered result set is an in-memory sort, when you consider both the time to run the query and the time to fetch all records in the result set.  Many people have a hard time believing this, but the simple fact is that moving the hard drive read head is very slow in comparison to a QuickSort.  Even when the server must spool records to disk while sorting it can do this in order, making the read and write very efficient.  You can gain additional speed by putting the InterBase temp directory (where the sort files go) and the DB itself on two different physical (not logical) hard drives.  That way the head on a single drive doesn't have to thrash when copying records from the DB to the sort file.

Of course, many applications do not do a FetchAll.  While you shouldn't be SELECTing records you have no intention of fetching, there are times when a fast initial response is more important than a fast overall execution time.  For initial response, an ORDER BY performed using an index (showing up in the PLAN statement as a table read with ORDER beside it) will always be faster than an in-memory SORT, except when the times are so close together as to be indistinguishable.

A Hack

InterBase will use an index to perform an ORDER BY if one is available.  As noted in this section, this can make the combined execution and fetch time slower than if the same ORDER BY were performed without an index.  You can prevent InterBase from using the index to optimize the ORDER BY clause (without resorting to a PLAN statement, which would disable the optimizer altogether) by adding a meaningless operation to one of the fields, as shown in the following examples.  This allows you to "tweak" the optimization of the ORDER BY but still allow the optimizer to do its job for the rest of the query.

SELECT
  EMP_NO, FIRST_NAME, LAST_NAME || '' 
FROM
  EMPLOYEE
ORDER BY
  3, 2

Concatenating an empty string to the end of the string column makes IB not recognize the column as a part of the index, but doesn't change the result set.  The PLAN is now:

PLAN SORT ((EMPLOYEE NATURAL))

For a numeric field, add 0 to the column.
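
For example, EMP_NO is covered by the EMPLOYEE table's primary key index; here is a sketch of the same trick for that column, again ordering by ordinal position:

SELECT
  EMP_NO + 0, FIRST_NAME, LAST_NAME
FROM
  EMPLOYEE
ORDER BY
  1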

Understanding Indices

An index is a data structure used by a database to find records quickly.  The exact way in which InterBase's indices work is outside the scope of this document, but essentially an index entry allows InterBase to go directly to one or more matching records meeting certain conditions.  It's a little like an index in a book; if you look up EXECUTE PROCEDURE in the index of the InterBase Operations guide, you can see that you'll find relevant information on pages 147 & 159, allowing you to find that information faster than you could by starting on page 1 and looking for the words "EXECUTE PROCEDURE" on each successive page.

Uniqueness

Because an index can tell you quickly whether a certain value exists in a table (it's not even necessary to read the record in question; simply look and see if the value exists in the index), indices are useful for enforcing unique constraints.  It is so much faster to enforce a unique constraint with an index than without one that InterBase automatically creates a corresponding index when you create a unique constraint.  Indices can be non-unique (the default) or unique.

When an index is not unique, it will contain references to multiple database records for some of the values stored in the index.  

Ascending vs. Descending indices

InterBase uses prefix compression when storing values in its indices.  That is, it stores the first value in the index, and then when it stores the next value it only stores the characters from the point at which they begin to differ from the first value.  So if InterBase needed to store references to the following two values in an index:

AAAAAA
AAAABB

It would actually store something like this in the index itself:

AAAAAA
4 BB

This compression saves a lot of space, and it works quite quickly at runtime.  However, it makes it impossible for InterBase to read the index backwards: if you looked up the last value in the index, you might have to read all the way back to the beginning of the index to reconstruct it.

Because of this InterBase offers two different kinds of indices: ASCENDING and DESCENDING.  Most indices are ASCENDING.  ASCENDING indices are used when looking up records which meet specific criteria, such as:

SELECT
  FOO
FROM
  BAR
WHERE
  FOO_ID = 1000

DESCENDING indices are mostly useful for the MAX function.  MAX in InterBase is very efficient with a DESCENDING index, and very inefficient without it, because it uses a non-indexed table scan when a DESCENDING index is not available.
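
A sketch using the BAR table from the example above (the index name is my own invention):

CREATE DESCENDING INDEX BAR_FOO_ID_DESC ON BAR (FOO_ID);

SELECT
  MAX(FOO_ID)
FROM
  BAR;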

When are Indices Useful?

It's worth mentioning that some queries run faster when an index isn't used.  Indices are, generally speaking, useful when you want to read a small subset of the records in a table.  Since disk I/O takes a while, it's much faster to go directly to the few records you want to read than to scan every record in the table looking for them.

When InterBase uses an index to find a record, it looks up the database page (essentially, the general area on disk where the record can be found) and the offset (the precise location within that general area) in the index, and then goes directly there.  When an index is not used, InterBase will read every record in the table until it finds the record(s) you're looking for.

When you want to read every record in the table (or even most of them), it is often faster to not use the index.  For example, if you want every record in the table, sorted alphabetically, it is probably faster to read the records in storage order and then sort them in memory, paging to disk if necessary.  This is because it's much faster to read the records in storage order, moving the read head as little as possible, than to read the records out of order, moving the read head a lot.

When are Indices Not Useful?

Indices are not useful when they don't significantly reduce the number of records InterBase must look through to find a match. The most obvious example is a BOOLEAN column containing roughly equal numbers of TRUE and FALSE values. Consider this table:

CREATE TABLE PERSON (
  PERSON_ID INTEGER NOT NULL PRIMARY KEY,
  NAME VARCHAR(60),
  IS_MALE BOOLEAN);

Creating an index on the BOOLEAN column would be unlikely to help InterBase run any query, even one like this:

SELECT
  PERSON_ID, NAME
  FROM 
    PERSON
  WHERE
    IS_MALE = FALSE
  ORDER BY
    NAME, PERSON_ID;

Why? Because if an index is used then finding matching records is a two-step process. First IB must look for a match in the index, second it must find the record itself on disk. Since index order is probably not the same as disk storage order, retrieving the record is relatively slow. By contrast, without the index InterBase can go through the records in storage order and reject those which don't match the WHERE condition. This is much faster than an indexed lookup. The index shown here would help the query shown above if the data in the table represented 100,000 men and 10 women. The important point is that indices are only useful when they significantly reduce the number of records that InterBase must examine -- and "significantly," here, means a reduction by orders of magnitude, not a few percent.

The measure of the usefulness of an index for an average query (i.e., discounting unusual distribution of data like the 100,000 men / 10 women example) is called selectivity.  We'll discuss selectivity in more detail later on.

Analyzing Indices

Besides the selectivity numbers displayed in InterBase PLANalyzer, another useful tool for examining an index at a very low level is the output of the command-line gstat program (also available as Database Statistics in IBConsole).

Understanding Compound Indices

In the Understanding Indices section we considered how an index can speed up a simple query like:

SELECT
  FOO
FROM
  BAR
WHERE
  FOO_ID = 1000

Indices are obviously useful for simple queries like the one above.  But what about more complicated queries like:

SELECT
  FOO
FROM
  BAR
WHERE
  FOO_COLOR = 'Blue'
    AND
  FOO_DATE = CURRENT_DATE
    AND
  FOO_DSC STARTING WITH 'Foo';

When we design our metadata, there are a few approaches we can take:

  1. Create a single index for only one of the columns.  If there are only 10 records in the table for any given date, for example, then a single index on FOO_DATE is enough to allow us to find any record in the table quickly.
  2. Create one index for each column in the WHERE clause (three total).  InterBase will combine the indices when the query is run.
  3. Create a single index on all three columns (one index total).

If solution (1) works for your particular application, great.  If not, we need to examine (2) and (3).

Let's imagine you choose solution (2), and create indices on FOO_DATE, FOO_COLOR, and FOO_DSC.  When you type a WHERE clause like the one above, the InterBase optimizer will notice that all three indices could be useful in performing the query.  It makes a kind of bitmap of each index, where a bit is turned on if there's an index node for a particular value, and turned off if there isn't.  It then combines the bitmaps using a binary AND operation, and uses the resulting bitmap as a single index into the table.  There is some overhead in doing this, but it isn't too bad in most cases.

If you choose solution (3), then InterBase can use your multi-column index as-is, without any combination of separate indices.  So (3) may produce the fastest query results for this particular SELECT.  But what about other statements; will the index still be useful? 

Order of Columns is Important

The answer is "maybe."  InterBase can use a multi-column index in some cases even when you don't use all of the columns in the index in your query.  In particular, InterBase can use a multi-column index when you use the first n columns in the index in a query.  In other words, the order in which the columns are listed in the index is important!  Let's presume we create the following index to help with the query above:

CREATE INDEX XBAR ON BAR (FOO_DATE, FOO_COLOR, FOO_DSC);

This index will, obviously, be very useful on the query above.  It will also be helpful with the following query:

SELECT
  FOO
FROM
  BAR
WHERE
  FOO_COLOR = 'Blue'
    AND
  FOO_DATE = CURRENT_DATE;

...because FOO_DATE and FOO_COLOR are the first two columns in the index.  InterBase cannot use this index for this query:

SELECT
  FOO
FROM
  BAR
WHERE
  FOO_DSC = 'Foo';

...however, because FOO_DSC is not the first column in the index.

Compound Primary Keys

This is particularly important if you have compound primary keys on some of your tables.  When you have a compound primary key on a table, the index can only be used if you specify the leading column(s) of the key; to locate a single specific record, you must specify the full primary key value, all columns included.

Imagine you're creating tables to keep track of employees and salary grades.  The employee and salary grade tables are simple enough:

CREATE TABLE EMPLOYEE (
  ID INTEGER NOT NULL,
  NAME VARCHAR(60),
  PRIMARY KEY (ID));
CREATE TABLE SALARY_GRADE (
  ID INTEGER NOT NULL,
  NAME VARCHAR(60),
  SALARY NUMERIC(16,2), /* We can always hope... */
  PRIMARY KEY (ID));

Now we need a way to store which employees are in which salary grade.  One possible way to design the table would be to use a compound primary key:

CREATE TABLE EMPLOYEE_SALARY_GRADE (
  EMPLOYEE_ID INTEGER NOT NULL,
  SALARY_GRADE_ID INTEGER NOT NULL,
  PRIMARY KEY(EMPLOYEE_ID, SALARY_GRADE_ID));

But this design has a substantial drawback:  With this metadata design, the primary key index will only be used if we search for a specific record by using both the EMPLOYEE_ID and the SALARY_GRADE_ID, or if we search for a specific EMPLOYEE_ID.  The index cannot be used if we want a list of all employees in a certain salary grade.  We could reverse the fields in the primary key definition and therefore be able to use the index in a search for all employees in a certain salary grade, but we wouldn't be able to use the index when searching for a particular employee's record.
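
For example, with the compound key above, this search for all employees in a grade cannot use the primary key index (the grade value is arbitrary):

SELECT
  EMPLOYEE_ID
FROM
  EMPLOYEE_SALARY_GRADE
WHERE
  SALARY_GRADE_ID = 3;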

A more flexible solution would be to give the EMPLOYEE_SALARY_GRADE table its own primary key, and create individual indices for the EMPLOYEE_ID and SALARY_GRADE_ID columns.  The table will now be defined as follows:

CREATE TABLE EMPLOYEE_SALARY_GRADE (
  ID INTEGER NOT NULL,
  EMPLOYEE_ID INTEGER NOT NULL,
  SALARY_GRADE_ID INTEGER NOT NULL,
  PRIMARY KEY(ID));

But before creating indices on the EMPLOYEE_ID or SALARY_GRADE_ID columns it will be necessary to consider whether or not we'll be creating FOREIGN KEY constraints on the EMPLOYEE_ID and SALARY_GRADE_ID columns, in order to make the InterBase server enforce referential integrity.  If we will be creating these constraints, it is not necessary to create indices on these columns manually because InterBase will create these indices automatically to help it enforce the constraints.  If we will not be creating FOREIGN KEY constraints on these two columns, then we may want to create an index on each column in order to speed up the queries discussed above.
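
For example, a sketch of adding the two constraints (the constraint names are my own invention):

ALTER TABLE EMPLOYEE_SALARY_GRADE
  ADD CONSTRAINT FK_ESG_EMPLOYEE
  FOREIGN KEY (EMPLOYEE_ID) REFERENCES EMPLOYEE (ID);

ALTER TABLE EMPLOYEE_SALARY_GRADE
  ADD CONSTRAINT FK_ESG_GRADE
  FOREIGN KEY (SALARY_GRADE_ID) REFERENCES SALARY_GRADE (ID);

InterBase creates an index on each foreign key column as a side effect, so no explicit CREATE INDEX statements are needed.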

Understanding Selectivity

Selectivity is, in very general terms, a measurement of how useful an index is in locating an average record in a table.  It is equivalent to the average number of records matching any entry in the index.

Index Selectivity

Some indices are UNIQUE -- that is, for every value  in the index, you will find one and only one record in the corresponding table.  A good example is a PRIMARY KEY.  PRIMARY KEYs are always associated with a unique index.

Some indices are not unique.  Imagine a table of employees with an index on the "last name" field.  Since many employees might have the same last name, each leaf in the index might point to one record, or to many records.

Selectivity is a number which represents the average number of records corresponding to each value in the index.

Indices with low selectivity numbers allow InterBase to find records in a large table very quickly.  Indices with high selectivity numbers mean that the index may be less useful for finding a record.  For example, imagine a table of people with 10000 records.  Imagine that there is a column called "eye color," and that there is an index on this column with a selectivity of 2000.  In other words, an average of 2000 people will have any given eye color.  This index will not be very helpful in locating a record for a particular individual, since even if we know they have blue eyes we'll still have to look through 2000 or more records to find their individual record.  In fact, this index might not even be useful for selecting a list of people with blue eyes, because it's probably faster for the InterBase server to read the entire table and ignore records with an eye color other than blue than to move the disk read head repeatedly in order to go directly to the "blue eye" records.

Obviously, selectivity can depend upon the nature of the data you're storing.  If you have 10000 different color values in your table of people, then the "eye color" index starts to make sense.  You can tell InterBase to recalculate the selectivity of an index based on the current state of the database by clicking on the index in the plan tree view (on the "PLAN analysis" tab) and then clicking the "Set Statistics" button.  "Set statistics" may take a while on a large table.  See the InterBase Operations Guide for information about setting statistics ("Database and Server Performance" chapter, "Database tuning tasks" section, "Tuning Indices").

Effective Selectivity

The selectivity of the index or indices used in a query does not by itself tell you how useful the index is to a particular query, however.  Several things can make an index behave as though it were more or less selective than it really is for the specific query you run; an uneven distribution of values, like the 100,000 men / 10 women example above, is one common cause.

InterBase will make the best guess it can, but there is no efficient way to always determine the exact, effective selectivity of an index for any particular query, and this can cause InterBase's optimizer to choose PLANs which are not the most efficient way of running a particular statement.  It is therefore necessary to test the statements your software will be using against a populated database and to look closely at the ones which turn out to be slow.

Creating Indices

The act of creating an index is very simple, but care should be given to designing the optimum strategy for indexing the tables in a database.  There is no single "right" way to design indices.  Instead, the DBA must carefully consider how the table will be used.  In other words, the right strategy for indexing a table depends upon the SQL which will be run against it.

SELECT vs. INSERT and UPDATE

Optimization of databases nearly always begins with SELECT statements.  This is because SELECT is a very common operation.  In fact, UPDATE does a SELECT internally before altering the record.  But it is important to understand that there are performance tradeoffs which impact statements which alter a record (INSERT, UPDATE, and DELETE) but not SELECTs.  Depending upon the use of a table, these tradeoffs may be important.

For example, a table with no indices can be filled using 100,000 INSERT statements much more quickly than a table with 10 indices, because each INSERT statement updates each index.  But the indices may make subsequent SELECTs slower.  It is possible to disable indices other than indices used to enforce primary and foreign key constraints (using the ALTER INDEX command in ISQL), and this can indeed dramatically speed up the initial population of a database, but this technique should not be used after the DB is deployed since it is a global change which affects all users of the DB.
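
A sketch of that pre-deployment pattern, using the NAMEX index from Employee.gdb (constraint indices, as noted above, cannot be deactivated):

ISQL> ALTER INDEX NAMEX INACTIVE;
/* ... run the INSERT statements ... */
ISQL> ALTER INDEX NAMEX ACTIVE;

Setting the index active again rebuilds it from the newly loaded data in a single pass.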

Instead, the DBA should design the best compromise of SELECT speed vs. INSERT speed, based on the anticipated usage of the table.

Don't Index a Column Twice

Obviously, it is not useful to have two indices covering the exact same column (except for the special case of having an ASCENDING and a DESCENDING index on one or more columns), but this situation can crop up in some unexpected places.  

For example, if the same column is in both a simple index (on that column alone) and a compound index (on that column plus one or more other columns) this can confuse the InterBase optimizer.  The optimizer will often pick the compound index because it appears to be more highly selective from its statistics.  There is no way to tell what the "real" selectivity is when referencing only one column in a multicolumn index.  

Also, it's important to remember that InterBase creates indices to enforce PRIMARY KEY, FOREIGN KEY, and UNIQUE constraints.  If you have already created a constraint on a column, there is no need to create an index on that column because InterBase has already done it for you.  

Create Highly Selective Indices

In fact, the indices created by InterBase for a FOREIGN KEY constraint can be a problem in a few cases.  If there are only 10 values in the related table (or if there are 1,000,000 values, but only 10 are actually used in the related table), then the index created to help enforce the constraint is not going to be very selective.  The only workaround for this problem is to use triggers instead of a formal constraint to enforce the business rule.

An index which is not highly selective is not useful, unless it can be combined with other indices to produce a more selective effective index for a particular query.

Simple vs. Compound Indices

This tradeoff is discussed in more detail in the Understanding Compound Indices section.  But look also at the Understanding Selectivity section and especially the subsection on Effective Selectivity for a full understanding of this complex topic.

Create a DESCENDING Index to Optimize the MAX Function

The MAX() aggregate function can only use a DESCENDING index, and MIN() can only use an ASCENDING index.

Avoid Indices for ORDER BY

ORDER BY clauses can actually be slowed by the use of an index.

The DBA must carefully consider all SQL statements which are likely to reference the table in question when designing indices.  In addition, it is a good idea to revisit tables from time to time using real customer data, since the data in question will influence the selectivity of the indices.  InterBase's optimizer will try to avoid using indices which are not highly selective, but it cannot create an index where none exists.

Understanding GSTAT Output

The statistics for an index look like this:

Index NAMEX (1) 
	Depth: 1, leaf buckets: 1, nodes: 43 
	Average data length: 15.00, total dup: 1, max dup: 1 
	Fill distribution: 
	     0 - 19% = 0 
	    20 - 39% = 1 
	    40 - 59% = 0 
	    60 - 79% = 0 
	    80 - 99% = 0 

Here is what these numbers mean:

Depth
The maximum depth of the index tree.  The index above has only one level.  This number should usually be fairly low.
Leaf buckets
This term is somewhat misleading.  It really means the number of database pages occupied by the bottom level of the index.  This index uses only one page since there are very few values and Nodes.
Nodes
The number of nodes in the index tree.  This is fairly close to the number of table rows pointed to by the index, but there are higher-level nodes in the tree, as well (one, in this case).
Average data length
This is the average amount of space in the DB occupied by the data field of a typical node in this index.  Note that this can be very different from the average length of the value represented, due to prefix compression.  In extreme cases (very poor selectivity) the average data length can be close to zero, since IB does not store duplicates literally in an index, but instead refers back to the duplicated value.  If this number is close to zero, check the selectivity number (not part of the statistics display, but available on the PLAN tab in IB PLANalyzer) or the Total dup number (see below); if the duplication is large, consider dropping the index.
Total dup
The number of Nodes in the index which refer to the same value as one or more other index nodes.  Total dup for a primary key index, for example, is always zero.
Max dup
In a non-unique index, each value can refer to one or many Nodes (rows).  Max dup is the number of Nodes referred to by the most-duplicated value in the index.  If this number is a significant fraction of the Nodes number, it's another good clue that selectivity for this index may be poor and that the index might not be useful.
Fill distribution
Indices are stored on database pages just like table data.  And, as with table data, InterBase normally leaves some space on each page in order to ensure that additional entries/versions can be stored reasonably close to the original.  As the DB is used, this space will be occupied and freed up, depending upon the nature of the operations performed.  The numbers displayed in this section indicate how many pages are in a particular range of "fullness."  In the example above, one page (the only page in the index) is somewhere between 20 and 39 percent full.  This index is OK since there are so few entries (there is not enough data to fill up the single page in the index), but on a large index having sparsely populated pages can lead to poor performance since it means the data is spread across the hard disk.

Conclusion

This paper is intended to be a comprehensive reference to InterBase query optimization and as such is a bit long. Sorry about that!