RAC Cache Fusion

Cache Fusion

I mentioned above Cache Fusion in myGRD section, here I go into great detail on how it works, I will also provide a number of walk through examples on my RAC system.

Cache Fusion uses the most efficient communications as possible to limit the amount of traffic used on the interconnect, now you don't need this level of detail to administer a RAC environment but it sure helps to understand how RAC works when trying to diagnose problems. RAC appears to have one large buffer but this is not the case, in reality the buffer caches of each node remain separate, data blocks are shared through distributedlocking andmessagingoperations. RAC copies data blocks across the interconnect to other instances as it is more efficient than reading the disk, yes memory and networking together are faster than disk I/O.

Ping

The transfer of a data block from instances buffer cache to another instances buffer cache is know as a ping. As mentioned already when an instance requires a data block it sends the request to the lock master to obtain a lock in the desiredmode, this process is known as blocking asynchronous trap (BAST). When an instance receives a BAST it downgrades the lock ASAP, however it might have to write the corresponding block to disk, this operation is known as disk ping or hard ping. Disk pings have been reduce in the later versions of RAC, thus relaying on block transfers more, however there will always be a small amount of disk pinging. In the newer versions of RAC when a BAST is received sending the block or downgrading the lock may be deferred by tens of milliseconds, this extra time allows the holding instance to complete an active transaction and mark the block header appropriately, this will eliminate any need for the receiving instance to check the status of the transaction immediately after receiving/reading a block. Checking the status of a transaction is an expensive operation that may require access (and pinging) to the related undo segment header and undo data blocks as well. The parameter _gc_defer_time can be used to define the duration by which an instance deferred downgrading a lock.

Past Image Blocks (PI)

In theGRD section I mentioned Past Images (PIs), basically they are copies of data blocks in the local buffer cache of an instance. When an instance sends a block it has recently modified to another instance, it preserves a copy of that block, marking as a PI. The PI is kept until that block is written to disk by the current owner of the block. When the block is written to disk and is known to have a global role, indicating the presents of PIs in other instances buffer caches, GCS informs the instance holding the PIs to discard the PIs. When a checkpoint is required it informs GCS of the write requirement, GCS is responsible for finding the most current block image and informing the instance holding that image to perform a block write. GCS then informs all holders of the global resource that they can release the buffers holding the PI copies of the block, allowing the global resource to be released. You can view the past image blocks present in the fixed table X$BH

PIs select state, count(state) from X$BH group by state;

Note: the state column with 8 is the past images.

Cache Fusion I

Cache Fusion I is also know as consistent read server and was introduced in Oracle 8.1.5, it keeps a list of recent transactions that have changed a block.the original data contained in the block is preserved in the undo segment, which can be used to provide consistent read versions of the block.

In a single instance the following happens when reading a block

When a reader reads a recently modified block, it might find an active transaction in the block

The reader will need to read the undo segment header to decide whether the transaction has been committed or not

If the transaction is not committed, the process creates a consistent read (CR) version of the block in the buffer cache using the data in the block and the data stored in the undo segment

If the undo segment shows the transaction is committed, the process has to revisit the block and clean out the block (delay block cleanout) and generate the redo for the changes.

In an RAC environment if the process of reading the block is on an instance other than the one that modified the block, the reader will have to read the following blocks from the disk

data block to get the data and/or transaction ID and Undo Byte Address (UBA)

undo segment header block to find the last undo block used for the entire transaction

undo data block to get the actual record to construct a CR image

Before these blocks can be read the instance modifying the block will have to write those's blocks to disk, resulting in 6 I/O operations. In RAC the instance can construct a CR copy by hopefully using the above blocks that are still in memory and then sending the CR over the interconnect thus reducing 6 I/O operations.

As from Oracle 8 introduced a new background process called the Block Server Process makes the CR fabrication at the holders cache and ships the CR version of the block across the interconnect, the sequence is detailed in the table below

An instance sends a message to the lock manager requesting a shared lock on the block

Following are the possibilities in the global cache

If there is no current user for the block, the lock manager grants the shared lock to the requesting instance

if the other instance has an exclusive lock on the block, the lock manager asks the owning instance to build a CR copy and ship it to the requesting instance.

Based on the result, either of the following can happen

if the lock is granted, the requesting instance reads the block from disk

The owning instance creates a CR version of the buffer in its own buffer cache and ships it to the requesting instance over the interconnect

The owning instance also informs the lock manager and requesting instance that it has shipped the block

The requesting instance has the locked granted, the lock manager updates the IDLM with the new holders of that resource

While making a CR copy, the holding instance may refuse to do so if

it does not find any of the blocks needed in its buffer cache, it will not perform a disk read to make a CR copy for another instance

It is repeatedly asked to send a CR copy of the same block, after sending the CR copies four times it will voluntarily relinquish the lock, write the block to the disk and let other instances get the block from the disk. The number of copies it will serve before doing so is governed by the parameter _fairness_threshold

Cache Fusion II

Read/Write contention was addressed in cache fusion I, cache fusion II addresses the write/write contention

An instance sends a message to the lock manager requesting an exclusive lock on the block

Following are the possibilities in the global cache

If there is no current user for the block, the lock manager grants the exclusive lock to the requesting instance

if the other instance has an exclusive lock on the block, the lock manager asks the owning instance to release the lock

Based on the result, either of the following can happen

if the lock is granted, the requesting instance reads the block from disk

The owning instance sends the current block to the requesting instance via the interconnect, to guarantee recovery in the event of instance death, the owning instance writes all the redo records generated for the block to the online redolog file. It will keep a past image of the block and inform the master instance that it has sent the current block to the requesting instance

The lock manager updates the resource directory (GRD) with the current holder of the block

Cache Fusion in Operation

A quick recap of GCS, a GCS resource can be local or global, if it is local it can be acted upon without consulting other instances, if it is global it cannot be acted upon without consulting or informing remote instances. GCS is used as a messaging agent to coordinate manipulation of a global resource. By default all resources are in NULL mode (remember null mode is used to convert from one type to another (share or exclusive)).

The table below denotes the different states of a resource

Mode/Role Local

Global

Null (N) NL

Shared (S) SL

Exclusive (X) XL

States

SL it can serve a copy of the block to other instances and it can read the block from disk, since the block is not modified there is no need to write to disk

XL it has sole ownership and interest in that resource, it has exclusive right to modify the block, all changes to the blocks are in the local buffer cache and it can write the block to the disk. If another instance wants the block it can to come via the GCS

NL used to protect consistent read block, if an instance wants it in X mode, the current instance will send the block to the requesting instance and downgrades its role to NL

SG a block is present in one or more instances, an instance can read the read from disk and serve it to other instances

XG a block can have one or more PIs, the instance with the XG role has the latest copy of the block and is the most likely candidate to write the block to the disk. GCS can ask the instance to write the block and serve it to other instances

NG after discarding PIs when instructed to by GCS, the block is kept in the buffer cache with NG role, this serves only as the CR copy of the block.

Below are a number of common scenarios to help understand the following

reading from disk

reading from cache

getting the block from cache for update

performing an update on a block

performing an update on the same block

reading a block that was globally dirty

performing a rollback on a previously updated block

reading the block after commit

We will assume the following

Four RAC environment (Instances A, B, C and D)

Instance D is the master of the lock resource for the data block BL

We will only use one block and it will reside at SCN 987654

We will use a three-letter code for the lock states first letter will indicate the lock mode - N = Null, S = Shared and X = Exclusive

second latter will indicate lock role - G = Global, L = Local

The third letter will indicate the PIs - 0 = no PIs, 1 = a PI of the bloc

for example a code of SL0 means a global shared lock with no past images (PIs)

Reading a block from disk

instance C want to read the block it will request a lock in share mode from the master instance

Instance C requests the block by sending a shared lock request to master D

The block has never been read into the buffer cache of any instance and it is not locked. Master D grants the lock to instance C. The lock granted is SL0 (see above to work out three-letter code)

Instance C reads the block from the shared disk into its buffer cache

Instance C has the block in shard mode, the lock manager updates the resource directory.

Reading a block from the cache

Carrying on from the above example, Instance B wants to read the same block that is cached in instance C buffer.

Instance B sends a shared lock request to master instance D

The lock master knows that the block may be available at instance C and sends a ping message to instance C

Instance C sends the block to instance B via theinterconnect, along with the block instance C indicates that instance B should take the current lock mode and role from instance C, instance C keeps a copy of the block

Instance B sends a message to instance D that it has assumed the SL lock for the block. This message is not critical for the lock manager, thus the message is sent asynchronously

Getting a (Cached) clean block for update

Carrying on from the above example, instance A wants to modify the same block that is already cached in instance B and C (block 987654)

Instance A sends an exclusive lock request to master D

The lock master knows that the block may be available at instance B in SCUR mode and at instance C in CR mode. it also sends a ping message to the shared lock holders. The most recent access was at instance B and instance D sends aBAST message to instance B

Instance B sends the block to instance A via the interconnect and closes it shared lock. The block may still be in its buffer to be as CR, but all locks are released

Instance A now has the exclusive lock on the block and sends an assume message to instance D, the lock is in XL0

Instance A modifies the block in its buffer cache, the changes are not committed and thus the block has not been written to disk, thus the SCN remains at 987654

Getting a (Cached) modified block for update and commit

Carrying on from the above example, instance C now wants to modify the block, if it tries to modify the same row it will have to wait until instance A either commits or rolls back. However in this case instance C wants to modify a different row in the same block.

Instance C sends an exclusive lock request to master D

The lock master knows that instance A holds an exclusive lock on the block and hence sends a ping message to instance A

Instance A sends the dirty buffer to instance C via the interconnect, it downgrades the lock from XCR to NULL, it keeps a PI version of the block and disowns any lock on that buffer. Before shipping the block, Instance A has to create aPI image and flush any pending redo for the block change, the block mode on instance A is now NG1

Instance C sends a message to instance D indicating it has the block in exclusive mode. The block role G indicates that the block is in global mode and if it needs to write the block to disk it must coordinate it with other instances that have past images (PIs) of that block. Instance C modifies the block and issues a commit, the SCN is now 987660.

Commit the previously modified block and select the data

Carrying on from the above example, instance A now issues a commit to release the row level locks held by the transaction and flush the redo information to the redologs

Instance A wants to commit the changes, commit operations do not require any synchronous modifications to the block

The lock status remains the same as the previous state and change vectors for the commits are written to the redologs.

Write the dirty buffers to disk due to a checkpoint

Carrying on from the above example, instance B writes the dirty blocks from the buffer cache due to a checkpoint (this is were it gets interesting and very clever)

Instance B sends a write request to master D with the necessary SCN

The master knows that the most recent copy of the block may be available at instance C and hence sends a message to instance C asking to write

Instance C initiates a disk write and writes a BWR into the redolog file

Instance C get the write notification that the write is complete

Instance C notifies the master that the write is completed

On receipt of the notification, instance D tells all PI holders to discard their PIs, and the lock at instance C writes the modified block to the disk

All instances that have previously modified this block will also have to write a BWR. The write request by instance C has now been satisfied and instance C can now proceed with its checkpoint as usual

Master instance crashes

Carrying on from the above example

the master instance D crashes

The Global Resource Directory is frozen momentarily and the resources held by master instance D will be equally distributed in the surviving nodes, also know as remastering (seeremastering for more details).

Select the rows from Instance A

Carrying on from the above example, now instance A queries the rows from that table to get the most recent data

Instance A sends a shared lock to now the new master instance C

Master C knows the most recent copy of the block may be in instance C and asks the holder to ship the CR block to instance A

Instance C ships the CR block to instance A via the interconnect

The above sequence of events can be seen in the table below

Example

Operation on Node

Buffer Status

read block from disk

SCUR

read the block from cache

SCUR

update the block

XCUR

update the same block

XCUR

commit the changes

XCUR

trigger checkpoint

XCUR

instance crash

select the rows

XCUR

Previous Menu Next

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。