Thursday, November 30, 2006

Precaution while defining data types in Oracle

I was reading a wonderful article on Tom Kyte’s blog about the repercussions of ill-defined data types.

Some of the examples he mentions are:

- Varchar2(40) Vs Varchar2(4000)
- Date Vs varchar2


Varchar2(40) Vs Varchar2(4000)

Developers generally ask for the larger size to avoid issues in the application; they want the uppermost limit so the application never hits a length error. Their argument is that the VARCHAR2 data type does not reserve the full 4000 characters of space, so disk space is not an issue. What they overlook is how costly these over-sized columns are elsewhere.

Repercussions

- Generally the application does an “array fetch” from the database, i.e. it selects 100 (maybe more) rows in one go. So if you are selecting 10 VARCHAR2 columns, the effective RAM (not storage) usage will be 4000 (chars) x 10 (cols) x 100 (rows) = 4 MB of RAM. On the contrary, had the columns been defined with a size of 40, the usage would have been 40 x 10 x 100 ~ 40 KB (approx)!! See the difference; and we didn’t even multiply by the number of sessions. That could be another shock!!

- Later on, it will be difficult to know what the column was meant for. For example, if you define first_name as VARCHAR2(4000), it is confusing for a new developer to figure out what this column is for. A sensible sizing sketch follows below.
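For illustration, a minimal sketch of sensibly sized columns (the table and column names here are made up, not from the original article):

-- sizes chosen to match the real data, not the maximum allowed
create table customers
( customer_id  number,
  first_name   varchar2(40),
  last_name    varchar2(40),
  email        varchar2(80)
);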

Date Vs varchar2

Again, lots of developers define date columns as VARCHAR2 (or CHAR) for their convenience. What they give up is not only data integrity (a stored value like 01/01/03 is ambiguous: which part is dd, which mm, which yy?) but also performance.

While doing an “index range scan” with a predicate like “col BETWEEN value1 AND value2”, the optimizer will not be able to use an index on a VARCHAR2 “date” column as efficiently as it can use an index on a real DATE column, because a string range matches many values that are not in the intended date range.


I suggest you read the full article by the maestro himself.

Wednesday, November 29, 2006

Friendly note

Friends,

The present (or future) articles on this blog may not be originally mine.
After reading an interesting article on the web or in a book, if I feel it could help others, I write about it here.



Motive of this blog:

1. Help the community using Oracle (many others are doing it, I'm just trying).
2. Learn from others' feedback.
3. Share what I have learned. Obviously, Oracle was not written by me; many experts have interpreted it and written books. I just analyse those and add my feedback.

Tuesday, November 28, 2006

Oracle table compression

Compress your tables

This feature was introduced in 9i Release 2 and is most useful in a warehouse environment (for fact tables).

How to compress? Simple

SQL> alter table test move compress;

Table altered.

How does Oracle implement compression?

Oracle compresses data by eliminating duplicate values within a data block. Any repeated occurrence of a value in a block is replaced by a reference to a single entry in a “symbol table” stored within that data block. So if, for example, deptno=10 appears 5 times within a data block, the value is stored only once in the symbol table and the 5 rows simply reference that entry.
It is important to note that every data block is self-contained and sufficient to rebuild the uncompressed form of the data.
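For example, you can also create a compressed table directly and verify it in the dictionary; a quick sketch (SALES is just a placeholder table here):

SQL> create table sales_comp compress as select * from sales;

SQL> select table_name, compression
  2  from   user_tables
  3  where  table_name in ('SALES', 'SALES_COMP');

In 9i Release 2 and later, USER_TABLES has a COMPRESSION column, which shows ENABLED for tables created with COMPRESS (or moved with ALTER TABLE ... MOVE COMPRESS).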

Table compression can significantly reduce disk and buffer cache requirements for database tables while improving query performance. Compressed tables use fewer data blocks on disk, reducing disk space requirements.

Identifying tables to compress:

First, create the following function, which estimates the compression ratio for a given table:

create function compression_ratio (tabname varchar2)
return number is
  -- sample percentage
  pct     number := 0.000099;
  blkcnt  number := 0;
  blkcntc number;
begin
  execute immediate 'create table TEMP$$FOR_TEST pctfree 0 as select * from '
                    || tabname || ' where rownum < 1';
  while ((pct < 100) and (blkcnt < 1000)) loop
    execute immediate 'truncate table TEMP$$FOR_TEST';
    execute immediate 'insert into TEMP$$FOR_TEST select * from '
                      || tabname || ' sample block (' || pct || ',10)';
    execute immediate 'select count(distinct(dbms_rowid.rowid_block_number(rowid)))
                       from TEMP$$FOR_TEST' into blkcnt;
    pct := pct * 10;
  end loop;
  execute immediate 'alter table TEMP$$FOR_TEST move compress';
  execute immediate 'select count(distinct(dbms_rowid.rowid_block_number(rowid)))
                     from TEMP$$FOR_TEST' into blkcntc;
  execute immediate 'drop table TEMP$$FOR_TEST';
  return (blkcnt/blkcntc);
end;
/


SQL> declare
  2    a number;
  3  begin
  4    a := compression_ratio('TEST');
  5    dbms_output.put_line(a);
  6  end;
  7  /

2.91389728096676737160120845921450151057

PL/SQL procedure successfully completed.


1 select bytes/1024/1024 "Size in MB" from user_segments
2* where segment_name='TEST'
SQL> /

Size in MB
----------
18

SQL> alter table test move compress;

Table altered.

SQL> select bytes/1024/1024 "Size in MB" from user_segments
2 where segment_name='TEST';

Size in MB
----------
6

After compressing the table (with ALTER TABLE ... MOVE), you need to rebuild its indexes, because the rowids have changed and the indexes are left UNUSABLE.
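For example (TEST_INDX is a hypothetical index on the TEST table; after the move it would show up as UNUSABLE):

SQL> select index_name, status from user_indexes where table_name = 'TEST';

INDEX_NAME                     STATUS
------------------------------ --------
TEST_INDX                      UNUSABLE

SQL> alter index test_indx rebuild;

Index altered.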

Notes:

- This feature is best utilized in a warehouse environment where there are lots of duplicate values (fact tables). In fact, a larger block size makes compression more effective, because each duplicate value is stored only once per block.

- This feature has no negative effect on query performance; in fact, it accelerates queries that access large amounts of data.

- I suggest you read the following white paper by Oracle, which explains the whole algorithm in detail along with industry-recognized TPC test cases.

http://www.vldb.org/conf/2003/papers/S28P01.pdf

I wrote the above article after reading Oracle Magazine. I suggest you read the full article on the Oracle site.

The Clustering factor

The Clustering Factor

The clustering factor is a number which represents the degree to which data is randomly distributed in a table.

In simple terms it is the number of “block switches” while reading a table using an index.

Figure: Bad clustering factor

The above diagram shows how scattered the rows of the table are. The first index entry (from the left of the index) points to the first data block, and the second index entry points to the second data block. So while doing an index range scan or a full index scan, the optimizer has to switch between blocks and may revisit the same block more than once, because the rows are scattered. The number of times the optimizer makes these block switches is what is termed the “clustering factor”.

Figure: Good clustering factor

The above image represents a good CF. During an index range scan, the optimizer will rarely have to jump to the next data block, as most of the consecutive index entries point to the same data block. This helps significantly in reducing the cost of your SELECT statements.

The clustering factor is stored in the data dictionary and can be viewed in dba_indexes (or user_indexes):

SQL> create table sac as select * from all_objects;

Table created.

SQL> create index obj_id_indx on sac(object_id);

Index created.

SQL> select clustering_factor from user_indexes where index_name='OBJ_ID_INDX';

CLUSTERING_FACTOR
-----------------
545

SQL> select count(*) from sac;

COUNT(*)
----------
38956

SQL> select blocks from user_segments where segment_name='OBJ_ID_INDX';

BLOCKS
----------
96

The above example shows that the index would have to make 545 block visits (switches) to return the full table data if you read every row through the index.

Note:
- A good CF is equal (or close) to the number of blocks in the table.

- A bad CF is equal (or close) to the number of rows in the table.

Myth:
- Rebuilding the index can improve the CF. (It cannot; the CF depends on how the rows are ordered in the table, not on the index.)

Then how to improve the CF?

- To improve the CF, it is the table that must be rebuilt (and reordered), as in the sketch below.
- If the table has multiple indexes, careful consideration needs to be given to which index the table should be ordered by.
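A minimal (offline) sketch using the SAC table from the example above, reordering it by OBJECT_ID:

SQL> create table sac_ordered as select * from sac order by object_id;

SQL> drop table sac;

SQL> rename sac_ordered to sac;

SQL> create index obj_id_indx on sac(object_id);

SQL> exec dbms_stats.gather_table_stats(user, 'SAC', cascade => true);

SQL> select clustering_factor from user_indexes where index_name = 'OBJ_ID_INDX';

The clustering factor should now come out close to the number of blocks in the table rather than the number of rows.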



Important point: The above is my interpretation of the subject after reading Jonathan Lewis's book on the cost-based optimizer.

Friday, November 24, 2006

Star Vs Snowflake schema

Star Schemas

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in the figure.

[Figure: star schema, fact table at the center with dimension tables at the points]
Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated.


Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values.
Dimension tables are generally small in size compared to the fact table.


- To understand with an example, assume this schema belongs to a retail chain (like Wal-Mart or Carrefour).

The fact will be revenue (money). How you want to slice and view that data is a dimension.

[Figure: the revenue fact surrounded by its dimensions]

In the above figure, you can see that the fact is revenue and that there are many dimensions through which to view the same data. You may want to look at revenue based on time (what was the revenue last quarter?), or based on a certain product (what was the revenue for chocolates?), and so on.
In all these cases the fact is the same; only the dimension changes as per the requirement.



Note: In an ideal Star schema, all the hierarchies of a dimension are handled within a single table.
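A minimal sketch of such a star schema (all names are hypothetical); note that the whole time hierarchy (day, month, quarter, year) lives in a single dimension table:

create table time_dim
( time_id        number primary key,
  calendar_date  date,
  month          varchar2(20),
  quarter        varchar2(10),
  year           number );

create table product_dim
( product_id    number primary key,
  product_name  varchar2(100),
  category      varchar2(50) );

create table sales_fact
( time_id     number references time_dim,
  product_id  number references product_dim,
  revenue     number );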

Star Query

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them.
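Against the hypothetical tables sketched above, a typical star query might look like this; the dimensions join only to the fact table, never to each other:

select t.quarter, p.product_name, sum(f.revenue) total_revenue
from   sales_fact f, time_dim t, product_dim p
where  f.time_id    = t.time_id
and    f.product_id = p.product_id
and    t.year       = 2006
group  by t.quarter, p.product_name;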






Snowflake Schema

The snowflake schema is a variation of the star schema used in a data warehouse.

The snowflake schema (sometimes called a snowflake join schema) is a more complex schema than the star schema, because the tables which describe the dimensions are normalized.
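For example, in a snowflake the hypothetical PRODUCT_DIM table from the star sketch above would itself be normalized:

create table category_dim
( category_id    number primary key,
  category_name  varchar2(50) );

-- the category attribute moves out into its own table
create table product_dim
( product_id    number primary key,
  product_name  varchar2(100),
  category_id   number references category_dim );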







Flip side of "snowflaking"



- In a data warehouse, the fact table (in which the data values are stored, along with its indexes) is typically responsible for 90% or more of the storage requirements, so the space saved by normalizing the dimension tables is normally insignificant.

- Normalization of the dimension tables ("snowflaking") can impair the performance of a data warehouse. Whereas conventional databases can be tuned to match the regular pattern of usage, such patterns rarely exist in a data warehouse. Snowflaking will increase the time taken to perform a query, and a design goal of many data warehouse projects is to minimize these response times.


Benefits of "snowflaking"



- If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate.



- A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalised. A snowflake schema will hence be easier to implement.

- A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organisation of the database.

- Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools such as Cognos PowerPlay to form such a query, especially if provision for these forms of query wasn't anticipated when the data warehouse was first designed.

In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.


Source: Oracle documentation, wikipedia

Touch-Count Algorithm




Touch-Count Algorithm:

As an advancement over the basic LRU/MRU scheme, Oracle 8i moved to a more efficient way of managing the buffer cache: the touch-count algorithm.

[Figure: buffer cache with a hot (MRU) region and a cold (LRU) region; new buffers enter at the mid-point]

The buffers in the data buffer cache are managed as shown in the figure above. Prior to Oracle 8, when data was fetched into the buffer cache from disk, it was automatically placed at the head of the MRU end. From Oracle 8 onwards, new data buffers are placed in the middle of the buffer chain. After loading a data block, Oracle keeps track of the “touch” count of that block.

According to the number of touches, Oracle then moves the data buffers either towards the MRU (hot) end or the LRU (cold) end.
This is a big advancement in the management of data buffers in Oracle 8.
There are hot and cold areas in each buffer pool (DEFAULT, RECYCLE, KEEP).
The size of the hot region is configured by the following newly added (hidden) init.ora parameters; a query to view them follows the list below:

a) _db_percent_hot_default.
b) _db_percent_hot_keep
c) _db_percent_hot_recycle
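Because these are hidden (underscore) parameters, they do not show up in V$PARAMETER. Connected as SYS, you can peek at them with a query along these lines (a sketch against the X$ views):

SQL> select a.ksppinm name, b.ksppstvl value
  2  from   x$ksppi a, x$ksppcv b
  3  where  a.indx = b.indx
  4  and    a.ksppinm like '_db_percent_hot%';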

Finding Hot Blocks inside the Oracle Data-Buffers
Oracle 8i provides an internal view, X$BH, that shows the relative performance of the data buffer pools.

The following columns are of most interest:

a) tim : The tim column is related to the new _db_aging_touch_time init.ora
parameter and governs the amount of time between touches.

b) tch : represents the number of times a buffer has been touched by user accesses. This touch count relates directly to the promotion of buffers from the cold region to the hot region within a buffer pool.


SQL> select b.object_name object, a.tch touches
  2  from   x$bh a, dba_objects b
  3  where  a.obj = b.object_id and a.tch > 100
  4  order  by a.tch desc;

OBJECT                  TOUCHES
--------------------- ---------
APPBROKERINFO               335
SYS_C003474                 335
I_SYSAUTH1                  278
I_SYSAUTH1                  269
I_SYSAUTH1                  268
SYSAUTH$                    259
PROPS$                      258
SESSIONDATA                 236

8 rows selected.


The above query can be very useful for DBAs to track down those objects which are perfect candidates to be moved from the DEFAULT pool to the KEEP pool.
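For example, to move one of the hot objects from the listing above (APPBROKERINFO) into the KEEP pool, assuming the KEEP pool has already been sized (buffer_pool_keep in 8i, db_keep_cache_size from 9i onwards):

SQL> alter table appbrokerinfo storage (buffer_pool keep);

Table altered.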


The above article was inspired by a Steve Adams article for which I have lost the link.

Bad SQL design

Important point

If the statement is designed poorly, there is not much the optimizer or the indexes can do.

A few well-known thumb rules:

–Avoid Cartesian joins

–Use UNION ALL instead of UNION – if possible

–Use the EXISTS clause instead of IN (wherever appropriate)

–Use ORDER BY only when you really require it – it is very costly

–When joining 2 views that themselves select from other views, check that the 2 views that you are using do not join the same tables!

–Avoid NOT IN or NOT = on indexed columns. They prevent the optimizer from using indexes. Use WHERE amount > 0 instead of WHERE amount != 0


- Avoid writing WHERE col IS NOT NULL; nulls can prevent the optimizer from using an index

- Avoid calculations on indexed columns. Write WHERE approved_amt > 26000*3 instead of WHERE approved_amt/3 > 26000

- The query below will return any record where bmm_code = 'cORE', 'Core', 'CORE', 'COre', etc.

select appl_appl_id from nci_appl_elements_t where upper(bmm_code) LIKE 'CORE%'

But this query can be very inefficient as it results in a full table scan. It cannot make use of the index on bmm_code.

Instead, write it like this:

select appl_appl_id from nci_appl_elements_t where (bmm_code like 'C%' or bmm_code like 'c%') and upper(bmm_code) LIKE 'CORE%'

This results in Index Range Scan.

You can also make this more efficient by using 2 characters instead of just one:

where ((bmm_code like 'CO%' or bmm_code like 'Co%' or bmm_code like 'cO%' or bmm_code like 'co%') and upper(bmm_code) LIKE 'CORE%')


Inviting Experts

Friends, feel free to correct me. I would also appreciate it if you added your comments.

Consider declaring NOT NULL columns

Consider declaring NOT NULL columns


People sometimes do not bother to define columns as NOT NULL in the data dictionary, even though these columns should not contain nulls, and indeed never do contain nulls because the application ensures that a value is always supplied. You may think that this is a matter of indifference, but it is not. The optimizer sometimes needs to know that a column is not nullable, and without that knowledge it is constrained to choose a less than optimal execution plan.

1. An index on a nullable column cannot be used to drive access to a table unless the query contains one or more predicates against that column that exclude null values.

Of course, it is not normally desirable to use an index based access path unless the query contains such predicates, but there are important exceptions.

For example, if a full table scan would otherwise be required against the table and the query can be satisfied by a fast full scan (scan for which table data need not be read) against the index, then the latter plan will normally prove more efficient.

Test-case for the above reasoning

SQL> create index color_indx on automobile(color);

Index created.

SQL> select distinct color,count(*) from automobile group by color;


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=4 Card=1046 Bytes=
54392)

1 0 SORT (GROUP BY) (Cost=4 Card=1046 Bytes=54392)
2 1 TABLE ACCESS (FULL) OF 'AUTOMOBILE' (TABLE) (Cost=3 Card
=1046 Bytes=54392)


SQL> alter table automobile modify color not null;

Table altered.

SQL> select distinct color,count(*) from automobile group by color;


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=4 Card=1046 Bytes=
54392)

1 0 SORT (GROUP BY) (Cost=4 Card=1046 Bytes=54392)
2 1 INDEX (FAST FULL SCAN) OF 'COLOR_INDX' (INDEX) (Cost=3 C
ard=1046 Bytes=54392)



2. If you are calling a sub-query from a parent query using the NOT IN predicate, the index on the column (in the WHERE clause of the parent query) will not be used.

This is because, as per the optimizer, rows of the parent query should be returned only when there is no matching value from the sub-query. If the sub-query can potentially contain a NULL value (UNKNOWN, incomparable), the parent query has no way to compare its values against that NULL, so it cannot use the index.

Test-case for the above Reasoning

SQL> create index sal_indx on emp(sal);

Index created.

SQL> create index ename_indx on emp(ename);

Index created.

SQL> select * from emp where sal not in (select sal from emp where ename='JONES');


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=17 Card=13 Bytes=4
81)

1 0 FILTER
2 1 TABLE ACCESS (FULL) OF 'EMP' (TABLE) (Cost=3 Card=14 Byt
es=518) –> you can see a full table scan even though an index exists on SAL

3 1 TABLE ACCESS (BY INDEX ROWID) OF 'EMP' (TABLE) (Cost=2 C
ard=1 Bytes=10)

4 3 INDEX (RANGE SCAN) OF 'ENAME_INDX' (INDEX) (Cost=1 Car
d=1)



SQL> alter table emp modify sal not null;

Table altered.

SQL> select * from emp where sal not in (select sal from emp where ename='JONES');


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=5 Card=12 Bytes=56
4)

1 0 MERGE JOIN (ANTI) (Cost=5 Card=12 Bytes=564)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'EMP' (TABLE) (Cost=2 C
ard=14 Bytes=518)

3 2 INDEX (FULL SCAN) OF 'SAL_INDX' (INDEX) (Cost=1 Card=1
4) -> Here you go, your index is getting used now

4 1 SORT (UNIQUE) (Cost=3 Card=1 Bytes=10)
5 4 TABLE ACCESS (BY INDEX ROWID) OF 'EMP' (TABLE) (Cost=2
Card=1 Bytes=10)

6 5 INDEX (RANGE SCAN) OF 'ENAME_INDX' (INDEX) (Cost=1 C
ard=1)


The above article was inspired by an article on ixora. The original was missing some test cases, so I thought of adding a few for newbies to relate to.

IN Vs EXISTS in SQL

IN Vs EXISTS

The two are processed quite differently.


IN Clause

Select * from T1 where x in ( select y from T2 )

is typically processed as:

select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;

The subquery is evaluated, distinct'ed, indexed and then
joined to the original table -- typically.


As opposed to the "EXISTS" clause:

select * from t1 where exists ( select null from t2 where y = x )

That is processed more like:


for x in ( select * from t1 )
loop
   if ( exists ( select null from t2 where y = x.x ) )
   then
      OUTPUT THE RECORD
   end if
end loop

It always results in a full scan of T1 whereas the first query can make use of
an index on T1(x).


So, when is EXISTS appropriate and when is IN appropriate?

Let's say the result of the subquery
( select y from T2 )

is "huge" and takes a long time. But the table T1 is relatively small and
executing ( select null from t2 where y = x.x ) is fast (nice index on
t2(y)). Then the exists will be faster as the time to full scan T1 and do the
index probe into T2 could be less then the time to simply full scan T2 to build
the subquery we need to distinct on.


Let's say the result of the subquery is small -- then IN is typically more
appropriate.


If both the subquery and the outer table are huge -- either might work as well
as the other -- depends on the indexes and other factors.
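As a concrete pair of equivalent queries, assuming the standard SCOTT demo schema (DEPT and EMP):

-- IN form: the subquery on EMP is built and de-duplicated first
select * from dept
where  deptno in ( select deptno from emp );

-- EXISTS form: DEPT is scanned and EMP is probed once per DEPT row
select * from dept d
where  exists ( select null from emp e where e.deptno = d.deptno );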

I wrote this note after searching on the same issue and finding a simple explanation by Tom Kyte.
Here is the link to the original.