Objective: Performance!

Hi all,

This is a message received today from IS management:

“Your mission today, should you accept it, is to implement a new table in the database, perform an initial data load with an execution time of less than 10 minutes, and provide the team of developers with a way to best manage the response times imposed on a given query which, according to the application testing team reporting to the Director of Operations, has an unacceptable response time. You have IBM Informix Dynamic Server V11.70 FC2 Innovator-C Edition and a Linux server with one socket / 4 cores and 16 GB of RAM fully at your disposal. Oh! I forgot something: the table has 70 million rows to be loaded, but its schema is not very complex. In addition, you will have to load about 500,000 rows every night.”

Such requests will certainly remind you of situations you have experienced as a DBA or as a developer. At moments like these one generally observes a degree of uncertainty in the eyes of the recipient, along with characteristic symptoms of depression or a ferocious desire to deny the request on the grounds that “it is unrealistic”. It is true that in earlier versions of IDS such a request was a real technical challenge.

Aside from the High Performance Loader, which is not so simple to implement under pressure, we must admit that there were at least reasons to be disturbed by this request. Fortunately, we are in 2011, and the IBM Informix laboratories have concocted a number of useful enhancements that will enable us to meet the challenge with IDS 11.70.

Let’s first take a look at the table schema:

1) The new table contains flight tickets on routes between France, Germany, England, Italy, Spain and the United States since 1 January 2005.
Here is the schema:
create table tickets (
    ticketno    bigint,        -- unique ticket ID
    lastname    varchar(60),   -- traveler's last name
    firstname   varchar(30),   -- traveler's first name
    gender      char(3),       -- MR / MRS / INF
    flightdate  date,          -- flight date
    flightno    char(6),       -- flight number
    fileno      char(8),       -- file number
    prize       money(8,2),    -- price of the flight
    carrier     char(2),       -- airline
    departfrom  char(3),       -- departure airport
    destination char(3)        -- destination airport
);
create index tickets3 on tickets (flightdate);
create index tickets7 on tickets (destination);

And finally the query that causes problems for our friends from the development team:
select destination, count(*)
from tickets, airports
where tickets.destination = airports.code
and tickets.destination between 'AAA' and 'ZZZ'
and airports.country = 'France'
and flightdate between '21/03/2006' and '31/07/2010'
group by 1;

I had previously managed to convince the developers to stick to single-column indexes, to avoid having too many B-tree levels, which would hurt performance. Evoking the figure of 70 million initial rows, growing by about one million every two days, is likely to raise at least a minimum of interest, at worst a clear concern about the stated objectives.

We will not discuss fragmentation of this table today, although it could provide a solution if significant I/O problems were detected. Nor will we discuss table compression, a technique inherited from XPS and implemented starting with version 11.50.xC6, which, while reducing disk space by about half (or more), significantly increases performance on “big” tables. This feature is currently optional (read: chargeable), and is therefore not an option to consider in our case.

So let’s focus on the first part of the challenge: load 70 million rows in less than 10 minutes. We thought about the High Performance Loader, which is a very effective tool, but a little too complex to implement for a first time and within the allotted time.

This is why I wanted to mention external tables, also inherited from XPS and implemented since version 11.50.xC6. The principle is very simple: when you create an external table, you create a regular Informix table structure (i.e. a schema) that, instead of pointing somewhere inside an IDS chunk, points to an ASCII file, with fixed-length or delimited fields, that already exists on a file system.

This is what we did for today:

CREATE EXTERNAL TABLE ext_tickets
SAMEAS tickets
USING (
    DATAFILES ("DISK:/home/eric/StagingArea/ObjectifPerf/tickets.unl"),
    DELIMITER "|",
    EXPRESS,
    MAXERRORS 10,
    REJECTFILE "/tmp/x_tickets.out"
);

Let us explain:
“SAMEAS tickets”: the external table will have the same schema as the regular table “tickets”.
“DATAFILES”: the location and name of our data file on disk.
“DELIMITER”: the file structure has delimited fields, here separated by the “pipe” character.
“EXPRESS”: the load mode, as opposed to DELUXE mode. It bypasses the buffer pool when reading this table, which significantly enhances the performance of our load.
“MAXERRORS”: beyond this number of errors, the load stops.
“REJECTFILE”: the file that will contain the rejected rows and load errors.

An external table can be queried with all kinds of SELECT statements (order by, group by, predicates), but cannot be the target of UPDATE, DELETE or INSERT statements.
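Since SELECT statements are allowed, a quick sanity check on the staging file can be run directly against the external table before the real load (a hypothetical check; any read-only query would do):

-- count the rows per airline straight from the flat file
select carrier, count(*)
from ext_tickets
group by 1
order by 1;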

We have measured the load operation execution time, first with the tickets table created with the RAW type:

INSERT INTO tickets SELECT * FROM ext_tickets;

70000000 row(s) inserted.

real 6m20.937s
user 0m0.005s
sys 0m0.005s
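For reference, here is one way to reproduce such a measurement from the shell (a sketch; the database name “flightsdb” and the script name are hypothetical):

# load_tickets.sql contains: INSERT INTO tickets SELECT * FROM ext_tickets;
time dbaccess flightsdb load_tickets.sql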

We compared this with the classic “LOAD FROM file INSERT INTO table” method, and the result is clear: we stopped that load after 24 minutes, with only 24 million rows loaded.

Let me draw your attention to the following: we created the table with the “RAW” type. This table type has the notable property of disabling transaction logging on the table. Consequently, load performance improves drastically thanks to the absence of logical-log disk writes. The difference is striking! The big advantage is that it is possible to switch to type STANDARD (logged) after loading with a simple ALTER TABLE, and then return to RAW for subsequent loads, just like this:
ALTER TABLE tickets TYPE (RAW);
ALTER TABLE tickets TYPE (STANDARD);

This method is also a smart way to avoid the “long transaction” pitfall.
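Put together, the nightly load of the 500,000 daily rows boils down to a short script such as this sketch (same table and external table as above; keep in mind that a level-0 backup is recommended after switching back to STANDARD, to guarantee recoverability of the newly loaded data):

ALTER TABLE tickets TYPE (RAW);                 -- no logical-log writes during the load
INSERT INTO tickets SELECT * FROM ext_tickets;  -- EXPRESS-mode read from the flat file
ALTER TABLE tickets TYPE (STANDARD);            -- restore logging for normal operations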

We have therefore reached the first goal of going below 10 minutes of load time, and found an efficient and stable solution for the upcoming nightly loads.

Let’s now face the second challenge: the tickets table has a respectable number of rows (70 million), and at this scale it is necessary to take system resource consumption, as well as response times, seriously.
We started from the baseline proposed by the development team and took a performance measurement.
We gave up on the baseline test scenario, now understanding the reasons for the developers’ complaints: the query was interrupted after 60 minutes, before completion. Here is the query plan:

Estimated Cost: 3126460
Estimated # of Rows Returned: 775663
Temporary Files Required For: Group By

1) eric.airports: INDEX PATH

    Filters: (eric.airports.code >= 'AAA' AND eric.airports.code <= 'ZZZ')

    (1) Index Name: eric.i_airp3
        Index Keys: country (Serial, fragments: ALL)
        Lower Index Filter: eric.airports.country = 'France'

2) eric.tickets: INDEX PATH

    Filters: (eric.tickets.flightdate >= 21/03/2006 AND eric.tickets.flightdate <= 31/07/2010)

    (1) Index Name: eric.tickets7
        Index Keys: destination (Serial, fragments: ALL)
        Lower Index Filter: eric.tickets.destination = eric.airports.code

NESTED LOOP JOIN

So we get a nested-loop join, using the index on country for the airports table (10,000 rows) and the index on destination (tickets7) for the tickets table, the flightdate filter being applied to the qualifying rows. onstat -p reports 7,509,528 rows read.
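As a side note, this rows-read figure is easiest to interpret when the profile counters are zeroed just before the test (a sketch, to be done on an otherwise quiet instance):

onstat -z     # reset the profile counters
# ... run the query under test ...
onstat -p     # read the profile, in particular the rows-read counter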

Now is the right time to introduce the new MULTI_INDEX optimizer directive, which allows the use of multiple indexes on a single table, unlike in previous versions. The syntax is:

SELECT --+ MULTI_INDEX(tickets tickets3 tickets7)
       distinct(destination), count(*) ...
from tickets, airports ...

This tells the optimizer to use indexes tickets3 AND tickets7 on the tickets table, in order to read a minimum of data pages and consequently reduce the buffer/disk I/O for this query.
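To capture the plan, we simply wrapped the full query in SET EXPLAIN (the output goes to the sqexplain.out file):

SET EXPLAIN ON;
SELECT --+ MULTI_INDEX(tickets tickets3 tickets7)
       destination, count(*)
FROM tickets, airports
WHERE tickets.destination = airports.code
AND tickets.destination BETWEEN 'AAA' AND 'ZZZ'
AND airports.country = 'France'
AND flightdate BETWEEN '21/03/2006' AND '31/07/2010'
GROUP BY 1;
SET EXPLAIN OFF;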
Here is the result of the set explain:

DIRECTIVES FOLLOWED: MULTI_INDEX ( tickets tickets3 tickets7 )
DIRECTIVES NOT FOLLOWED:

Estimated Cost: 113176408
Estimated # of Rows Returned: 775663
Temporary Files Required For: Group By

1) eric.tickets: MULTI INDEX PATH (SKIP SCAN)

    (1) Index Name: eric.tickets7
        Index Keys: destination (Serial, fragments: ALL)
        Lower Index Filter: eric.tickets.destination >= 'AAA'
        Upper Index Filter: eric.tickets.destination <= 'ZZZ'
    AND
    (2) Index Name: eric.tickets3
        Index Keys: flightdate (Serial, fragments: ALL)
        Lower Index Filter: eric.tickets.flightdate >= 21/03/2006
        Upper Index Filter: eric.tickets.flightdate <= 31/07/2010

2) eric.airports: INDEX PATH

    Filters: Table Scan Filters: (eric.airports.code >= 'AAA' AND eric.airports.code <= 'ZZZ')

    (1) Index Name: eric.i_airp3
        Index Keys: country (Serial, fragments: ALL)
        Lower Index Filter: eric.airports.country = 'France'

DYNAMIC HASH JOIN
    Dynamic Hash Filters: eric.tickets.destination = eric.airports.code

The query plan has changed completely: we can see that the MULTI_INDEX directive on the tickets table has been followed, which has the effect of using the tickets7 and tickets3 indexes simultaneously and finishing the plan with a HASH JOIN.

onstat -p indicates that only 2,294,947 rows were read, a significant decrease.

Now, what about the response time? Here it is:

155 row(s) unloaded.

real 5m22.672s
user 0m0.005s
sys 0m0.008s

Your eyes read correctly: the result is 5 min 22 s. The objective of improving the response time is reached. Since the data is not spread across multiple disks, we cannot take advantage of multiple scan threads, which could further improve performance. And since this test was not run on the Ultimate Edition, we cannot benefit from PDQPRIORITY, which would probably have improved the results slightly.
We have clearly benefited from the MULTI_INDEX directive, which reduced the response time from over one hour (maybe much more) to 5 min 22 s.
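For the record, on an edition that supports it, parallelism is requested per session like this (not applicable to our Innovator-C instance):

SET PDQPRIORITY 50;  -- request up to 50% of the PDQ resources for this session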

You should also know that this technique is extremely efficient for queries with “OR” predicates on different indexed columns of a table. How many times have you had to rewrite a query of this type as a UNION? With this technique, you don’t need to rewrite anything: just add the relevant optimizer directive, et voilà!
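For example (a hypothetical query on our schema), an OR across two differently indexed columns can use both indexes thanks to the single directive, with no UNION rewrite:

SELECT --+ MULTI_INDEX(tickets tickets3 tickets7)
       count(*)
FROM tickets
WHERE flightdate = '01/07/2010'
   OR destination = 'CDG';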

We also encourage you to use the onmode option “-Y sessionid 1 outputfile”, which produces the session’s query plan file without having to add the “SET EXPLAIN ON” command to the application code. However, this is to be used sparingly in a production environment, because this mode is very verbose and can generate unwanted disk-write contention.
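A minimal usage sketch (session id 42 is hypothetical; session ids can be listed with “onstat -g ses”):

onmode -Y 42 1 /tmp/session42_explain.out   # start dumping query plans for session 42
onmode -Y 42 0                              # switch it off as soon as the plan is captured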

Well, mission accomplished! Thanks to EXPRESS-mode external tables, the RAW table type that temporarily suppresses logging, and the new MULTI_INDEX optimizer directive.

See you on our favorite station for new challenges.
