Posts Tagged ‘parallel’

For OLTP, Oracle has proven to be the best choice for performance and scalability, and Exadata only extends that lead. But for data warehousing, can Exadata compete and succeed against Teradata by offering more IOPS, a larger cache, and higher interconnect throughput?

Teradata has built-in parallelism, and running in parallel is cheaper there because its shared-nothing architecture incurs less coordination overhead. For Oracle to run in parallel, slaves in different slave sets have to exchange messages. The total memory for message exchange is X*degree*degree*parallel_execution_message_size, where X is 3 for non-RAC and 4 for RAC. Huge Teradata databases can have more than 2,000 nodes. For Oracle to reach the same parallelism, how much memory would it need just to exchange messages between slaves? If we set parallel_execution_message_size to 64K, the maximum allowed (and for large queries, the bigger parallel_execution_message_size the better), Oracle needs 3*2000*2000*64K = 768G of memory for non-RAC, or 4*2000*2000*64K = 1T for RAC. Since Exadata is typically run as RAC, that is 1T of memory purely for message exchange. And what is the overhead of managing this 1T? I have no box to test it on.

Another problem is flow control: when a producer generates messages faster than the consumer can consume them, the message buffer (of parallel_execution_message_size) fills up and the producer has to wait until there is free space. This is not a problem if the producer is always faster than the consumer, but it is a problem when sometimes the producer is faster and sometimes the consumer is.
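The arithmetic above can be sketched as a quick query (the figures are this post's assumptions: degree 2000, 64K messages, X=3 for non-RAC and 4 for RAC):

```sql
-- Back-of-the-envelope check of the message-buffer estimate, in KB.
SELECT 3 * 2000 * 2000 * 64 AS non_rac_kb,  -- 768,000,000 KB, roughly 768G
       4 * 2000 * 2000 * 64 AS rac_kb       -- 1,024,000,000 KB, roughly 1T
FROM dual;
```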

At any time at most two slave sets can be active, and each slave set works on the same step of the execution plan, so if a SQL runs at degree 1000 and one slave is slow, the other 999 slaves have to wait for it to finish before moving on to the next step. For a hash join of 10 tables there are 9 joins in total, and each join completes only when its slowest slave completes. The Oracle documentation suggests denormalizing for data warehouses, while the Teradata documentation says denormalization is not needed because joins in Teradata are cheap. Do you like denormalization? More often than not we denormalize because of performance problems; if performance were not a problem, why denormalize?

Without a partitioned table, Oracle has to hash every row of the table to decide which slave the row should go to. With a partitioned table AND the join column as the partition key, Oracle can use a full or partial partition-wise join, and doesn't need to hash every row. Teradata also has to redistribute the rows on one node to the other nodes when the join column is not the primary index (which determines which node a row lives on), but the mechanism behind it is simple and efficient: it redistributes the rows from each node to all other nodes based on the join column and puts the redistributed rows in spool; there is no limit like parallel_execution_message_size in Oracle. The spool Teradata uses is like a temporary tablespace and needs physical I/O. You might say memory is faster than disk, but for a large query, who expects the data to be fully cached in memory?
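As a sketch of the Oracle side (table and column names are made up for illustration), hash-partitioning both tables on the join column with the same number of partitions is what enables a full partition-wise join:

```sql
-- Hypothetical schema: both tables hash-partitioned on the join key,
-- so slaves can join matching partition pairs directly instead of
-- redistributing rows between slave sets.
CREATE TABLE orders (
  order_id NUMBER,
  cust_id  NUMBER,
  amount   NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

CREATE TABLE customers (
  cust_id   NUMBER,
  cust_name VARCHAR2(100)
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

-- A join on the common partition key; the parallel plan should show a
-- PX PARTITION HASH iterator above the hash join rather than
-- PX SEND / PX RECEIVE redistribution steps.
SELECT /*+ parallel(o,16) parallel(c,16) */
       c.cust_name, SUM(o.amount)
FROM   orders o
JOIN   customers c ON o.cust_id = c.cust_id
GROUP  BY c.cust_name;
```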

Read Full Post »

We have a huge partition table (25 partitions) on db1 and need to copy it to another database, db2. The most straightforward way I could think of is to insert in parallel over the network, so I ran this command on db2:

insert /*+ append parallel(a,4) */ into tab1 a select /*+ parallel(b,4) */ * from tab1@db1;

At the remote side (db1), I could see 4 slaves waiting on "PX Deq Credit: send blkd" and the coordinator waiting on "more data to client". That means the 4 slaves are waiting for the coordinator to consume their data, and the coordinator is waiting for the data to be sent over the network. The bottleneck here is the network. To improve the transfer speed over the network, I need to kick off more non-parallel SQLs instead of 1 parallel SQL.
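One way to watch this from the db1 side (a sketch against the standard v$ views, assuming DBA access, not the exact query I ran):

```sql
-- Show each PX slave, its query coordinator (QCSID) and its current wait.
SELECT p.qcsid, s.sid, s.event
FROM   v$px_session p
JOIN   v$session    s ON s.sid = p.sid AND s.serial# = p.serial#
ORDER  BY p.qcsid, s.sid;
```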

For a partition table with 25 partitions, if I use 5 separate non-parallel SQLs (each loading 5 partitions), we get 5 inter-database connections instead of 1, and with them better network throughput. (Update 25 Dec 2009: I have seen many cases where the client side maxed out a single CPU, so network throughput couldn't increase even though plenty of bandwidth was still free; increasing the number of sessions helps there too.)

So what needs to be done is:

1: Remove the parallel(b,4) hint. It is of no use, because reading the data from the table is faster than sending it over the network; the full table scan is not the bottleneck.

2: Use one SQL statement per partition, and kick off 5 sessions, each running 5 of the following SQLs. Name the target partition in the INSERT as well: a direct-path (append) insert without it takes an exclusive lock on the whole table, so the 5 sessions would serialize; with the partition named, each session locks only its own partition. Also commit after each insert, because a session cannot run another statement against the table after a direct-path insert until it commits.

insert /*+ append */ into tab1 partition (part00) select * from tab1@db1 partition (part00);
commit;
insert /*+ append */ into tab1 partition (part01) select * from tab1@db1 partition (part01);
commit;
insert /*+ append */ into tab1 partition (part02) select * from tab1@db1 partition (part02);
commit;
insert /*+ append */ into tab1 partition (part03) select * from tab1@db1 partition (part03);
commit;
...
insert /*+ append */ into tab1 partition (part24) select * from tab1@db1 partition (part24);
commit;
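Rather than typing all 25 statements by hand, they can be generated from the dictionary (a sketch, assuming the partition names match on both databases):

```sql
-- Generate one direct-path INSERT (plus COMMIT) per partition of TAB1.
SELECT 'insert /*+ append */ into tab1 partition (' || partition_name ||
       ') select * from tab1@db1 partition (' || partition_name || ');' ||
       chr(10) || 'commit;' AS stmt
FROM   user_tab_partitions
WHERE  table_name = 'TAB1'
ORDER  BY partition_position;
```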

After the change, the amount of data transferred per second increased a lot.

Read Full Post »