[JF-gofer:10019] [Release] linux-3.6 gfs2-glocks.txt

Tue, 2 Oct 2012 07:52:54 JST

This is Kaneko. I am releasing the document named in the subject line.
---------- >8 ---------- >8
TITL: gfs2-glocks
CONT: Explanation of the internal locks of the Linux global file system
NAME: filesystems/gfs2-glocks.txt
JDAT: 2012/10/02
BVER: 3.6
AUTH: unknown
TRNS: Seiji Kaneko < skaneko at a2 dot mbn dot or dot jp >
---------- >8 ---------- >8
=========================================================
This is a Japanese translation of
Linux-3.6/Documentation/filesystems/gfs2-glocks.txt.
Translation group: JF Project < http://linuxjf.sourceforge.jp/ >
Last updated     : 2012/10/02
Translator       : Seiji Kaneko < skaneko at a2 dot mbn dot or dot jp >
=========================================================
                   Glock internal locking rules
                  ------------------------------

This document describes the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_spin) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.
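
The interplay of the two locks can be illustrated with a toy model. This
is only a sketch in Python, not kernel code: the class ToyGlock and its
methods are hypothetical, a mutex stands in for the gl_spin spinlock, and
a boolean stands in for the GLF_LOCK bit. It shows the rule that whoever
releases the bit lock must run the queue so that pending work is not lost.

```python
import threading


class ToyGlock:
    """Toy model (hypothetical): gl_spin plus a non-blocking bit lock."""

    def __init__(self):
        self.gl_spin = threading.Lock()  # protects the fields below
        self.glf_lock = False            # stands in for the GLF_LOCK bit
        self.pending = []                # work queued while the bit was held
        self.done = []                   # work that has been completed

    def try_glf_lock(self):
        # Non-blocking: test-and-set under gl_spin, never wait for the bit.
        with self.gl_spin:
            if self.glf_lock:
                return False
            self.glf_lock = True
            return True

    def glf_unlock(self):
        with self.gl_spin:
            self.glf_lock = False
        # Releasing the bit obliges the caller to run the queue.
        self.run_queue()

    def run_queue(self):
        with self.gl_spin:
            work, self.pending = self.pending, []
        self.done.extend(work)

    def submit(self, task):
        # Queue the task; it is processed when the bit lock is next released.
        with self.gl_spin:
            self.pending.append(task)
        if self.try_glf_lock():
            self.glf_unlock()
```

A task submitted while another thread holds the bit stays queued and is
picked up when that thread calls glf_unlock().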

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted strictly in the order in which they
are queued, except for those marked LM_FLAG_PRIORITY, which are
used only during recovery, and even then only for journal locks.
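
As a rough illustration of "held locks are contiguous at the head of the
list", the sketch below computes how many requests at the head of a queue
could be held together. It is a simplification under stated assumptions:
the function name grantable_prefix is hypothetical, requests are reduced
to their mode strings, SH is assumed compatible only with SH, DF only
with DF, EX with nothing, and LM_FLAG_PRIORITY is ignored.

```python
def grantable_prefix(queue):
    """Count the requests at the head of the queue that can be held together.

    queue is a list of mode strings ("SH", "DF" or "EX") in queueing order.
    Granting stops at the first incompatible request, even if a later
    request would be compatible, because order is strict.
    """
    if not queue:
        return 0
    first = queue[0]
    if first == "EX":
        return 1  # assumed: an exclusive holder is alone
    count = 0
    for mode in queue:
        if mode != first:
            break
        count += 1
    return count
```

For the queue SH, SH, EX, SH only the first two requests are granted; the
trailing SH waits behind the EX request.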

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:

Glock mode    | DLM lock mode
------------------------------
    UN        |    IV/NL  Unlocked (no DLM lock associated with glock) or NL
    SH        |    PR     (Protected read)
    DF        |    CW     (Concurrent write)
    EX        |    EX     (Exclusive)
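
The mapping can be written down as a lookup table. The sketch below also
encodes the part of the standard DLM mode compatibility matrix that covers
NL, PR, CW and EX; that matrix is not part of this document and is an
assumption here, as are the names GLOCK_TO_DLM, dlm_compatible and
glocks_compatible. It shows why two nodes can both hold SH, or both hold
DF, but not one of each.

```python
# Glock mode -> DLM lock mode(s), copied from the table above.
GLOCK_TO_DLM = {
    "UN": ("IV", "NL"),
    "SH": ("PR",),
    "DF": ("CW",),
    "EX": ("EX",),
}

# Assumed subset of the DLM compatibility matrix: NL is compatible with
# everything, PR with PR, CW with CW, and EX only with NL.
_COMPATIBLE = {("NL", m) for m in ("NL", "PR", "CW", "EX")}
_COMPATIBLE |= {("PR", "PR"), ("CW", "CW")}


def dlm_compatible(a, b):
    """True if DLM modes a and b may be granted at the same time."""
    return (a, b) in _COMPATIBLE or (b, a) in _COMPATIBLE


def glocks_compatible(mode_a, mode_b):
    """Compare two glock modes via their DLM modes (UN is treated as NL)."""
    return dlm_compatible(GLOCK_TO_DLM[mode_a][-1], GLOCK_TO_DLM[mode_b][-1])
```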

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:

Glock mode   |  Cache data | Cache Metadata | Dirty Data | Dirty Metadata
--------------------------------------------------------------------------
    UN       |     No      |       No       |     No     |      No
    SH       |     Yes     |       Yes      |     No     |      No
    DF       |     No      |       Yes      |     No     |      No
    EX       |     Yes     |       Yes      |     Yes    |      Yes
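
The cache rules lend themselves to a small lookup table. The following is
a sketch only; the names CACHE_RULES and cache_allowed and the key names
are hypothetical, while the Yes/No values are copied from the table above.

```python
# Columns: may cache data, may cache metadata, may hold dirty data,
# may hold dirty metadata. Values copied from the cache rules table.
CACHE_RULES = {
    "UN": {"data": False, "metadata": False, "dirty_data": False, "dirty_metadata": False},
    "SH": {"data": True, "metadata": True, "dirty_data": False, "dirty_metadata": False},
    "DF": {"data": False, "metadata": True, "dirty_data": False, "dirty_metadata": False},
    "EX": {"data": True, "metadata": True, "dirty_data": True, "dirty_metadata": True},
}


def cache_allowed(mode, what):
    """Return True if a glock held in `mode` may cache `what`."""
    return CACHE_RULES[mode][what]
```

For example, cache_allowed("DF", "data") is False: a node doing direct
I/O under DF may keep metadata cached but not file data.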

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. For example, only inode glocks use the DF mode.

Table of glock operations and per type constants:

Field            | Purpose
----------------------------------------------------------------------------
go_xmote_th      | Called before remote state change (e.g. to sync dirty data)
go_xmote_bh      | Called after remote state change (e.g. to refill cache)
go_inval         | Called if remote state change requires invalidating the cache
go_demote_ok     | Returns boolean value of whether it is ok to demote a glock
                 | (e.g. checks timeout, and that there is no cached data)
go_lock          | Called for the first local holder of a lock
go_unlock        | Called on the final local unlock of a lock
go_dump          | Called to print content of object for debugfs file, or on
                 | error to dump glock to the log.
go_type          | The type of the glock, LM_TYPE_.....
go_callback      | Called if the DLM sends a callback to drop this lock
go_flags         | GLOF_ASPACE is set, if the glock has an address space
                 | associated with it

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.
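
The effect of the minimum hold time can be sketched as a simple delay
calculation. This is an illustration, not the kernel's logic: the function
name demote_delay is hypothetical and the time unit is left abstract
("ticks"), since this document does not state it.

```python
def demote_delay(now, granted_at, min_hold):
    """How long a remote demote request should still be deferred.

    now, granted_at and min_hold are in the same arbitrary time unit.
    Returns 0 once the lock has been held for at least min_hold.
    """
    remaining = granted_at + min_hold - now
    return max(0, remaining)
```

A demote request arriving 3 ticks after a grant with a minimum hold time
of 10 ticks would be deferred for another 7 ticks; one arriving after 25
ticks would be acted on immediately.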

There is a plan to try and remove the go_lock and go_unlock callbacks
if possible, in order to try and speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.

Locking rules for glock operations:

Operation     |  GLF_LOCK bit lock held |  gl_spin spinlock held
-----------------------------------------------------------------
go_xmote_th   |       Yes               |       No
go_xmote_bh   |       Yes               |       No
go_inval      |       Yes               |       No
go_demote_ok  |       Sometimes         |       Yes
go_lock       |       Yes               |       No
go_unlock     |       Yes               |       No
go_dump       |       Sometimes         |       Yes
go_callback   |       Sometimes (N/A)   |       Yes

N.B. Operations must not drop either the bit lock or the spinlock
if it is held on entry. go_dump and go_demote_ok must never block.
Note that go_dump will only be called if the glock's state
indicates that it is caching up-to-date data.

Glock locking order within GFS2:

 1. i_mutex (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
     lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. Page lock  (always last, very important!)
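
One way to picture this ordering is as a sort key. The sketch below is an
illustration under assumptions: the Lock tuple, its "depth" field (0 for a
parent, 1 for its children, and so on) and the function lock_order are
hypothetical, and the real code does not sort a list like this. It only
shows that class rank comes first, then parents before children, then
lock number.

```python
from collections import namedtuple

# kind: one of the six classes above; depth and number only matter for
# inode glocks in this sketch.
Lock = namedtuple("Lock", "kind depth number")

RANK = {
    "i_mutex": 1,
    "rename": 2,
    "inode": 3,
    "rgrp": 4,
    "transaction": 5,
    "page": 6,
}


def lock_order(locks):
    """Return the locks in the order in which they should be acquired."""
    return sorted(locks, key=lambda l: (RANK[l.kind], l.depth, l.number))
```

Acquiring locks in a single agreed order like this is what prevents two
tasks from deadlocking by taking the same pair of locks in opposite order.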

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per-rgrp basis.
In general, local locks are taken before cluster locks.

                            Glock Statistics
                           ------------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are gathered on a per-cpu basis in order to
reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

The same information is gathered for both the super block and
the glock statistics. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to the calculation
of round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.
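
A minimal sketch of such a smoothed estimator follows. It assumes the
gains of 1/8 for the mean and 1/4 for the variance that are usual for this
kind of round-trip-time estimator; the function name update_stats is
hypothetical and the code is an illustration in Python rather than the
kernel implementation. All values are plain integers (nanoseconds), with
no scaling.

```python
def update_stats(mean, var, sample):
    """Fold one sample into a smoothed (mean, variance) pair.

    Assumed gains: 1/8 for the mean, 1/4 for the variance estimate.
    The "variance" is a smoothed mean deviation, as in the RTT estimators
    the text refers to. Right shifts keep everything in integers.
    """
    delta = sample - mean
    mean += delta >> 3                # mean moves 1/8 of the way to the sample
    var += (abs(delta) - var) >> 2    # deviation moves 1/4 of the way
    return mean, var
```

Starting from (0, 0), a single sample of 800 ns moves the pair to
(100, 200); a following sample equal to the mean leaves the mean alone and
lets the variance estimate decay towards zero.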

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any request where (a) the current state of
the lock is exclusive, i.e. a lock demotion, (b) the requested
state is either null or unlocked (again, a demotion), or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.
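
The three conditions can be condensed into a small predicate. This is a
sketch only: the function name is_non_blocking and its parameters are
hypothetical, modes are given as glock mode strings, and "null or
unlocked" is represented by the single glock mode UN.

```python
def is_non_blocking(current, requested, try_lock=False):
    """Classify a request as non-blocking per conditions (a), (b), (c)."""
    if current == "EX":      # (a) demotion from exclusive
        return True
    if requested == "UN":    # (b) demotion to null/unlocked
        return True
    return bool(try_lock)    # (c) "try lock" flag set
```

Everything else, for example a plain request to go from SH to EX, counts
as a (potentially) blocking request and feeds the second mean/variance
pair.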

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
counts the queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.
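
The lag can be made concrete with a short experiment. The sketch below
assumes a smoothing gain of 1/8 for the mean and 1/4 for the variance
(shifts of 3 and 2), which matches the 8 and 4 sample figures above but is
an assumption about the implementation; the function name
smoothed_value_after is hypothetical. It starts an estimate at 0 and feeds
it a constant new value.

```python
def smoothed_value_after(step, samples, shift):
    """Smoothed estimate after `samples` identical inputs of size `step`.

    The estimate starts at 0 and moves 1/2**shift of the remaining
    distance towards the input on every sample (integer arithmetic).
    """
    value = 0
    for _ in range(samples):
        value += (step - value) >> shift
    return value
```

Under these assumptions the estimate has covered roughly two thirds of a
step of 1000 after 8 samples with shift 3, and likewise after 4 samples
with shift 2, so the sample counts are better read as time constants than
as hard cutoffs.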

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Per-sb (super block) stats can be found here:
/sys/kernel/debug/gfs2/<fsname>/sbstats
Per-glock stats can be found here:
/sys/kernel/debug/gfs2/<fsname>/glstats

These paths assume that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.
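
A small helper for building the two paths might look like this. It is a
sketch: the function name gfs2_stats_paths is hypothetical, the debugfs
mount point is a parameter because it may differ from the default, and
the filesystem name in the usage note is made up.

```python
import posixpath


def gfs2_stats_paths(fsname, debugfs="/sys/kernel/debug"):
    """Return the (sbstats, glstats) paths for one gfs2 filesystem."""
    base = posixpath.join(debugfs, "gfs2", fsname)
    return posixpath.join(base, "sbstats"), posixpath.join(base, "glstats")
```

For a filesystem called "mycluster:myfs" (a hypothetical name) this
yields /sys/kernel/debug/gfs2/mycluster:myfs/sbstats and the matching
glstats path, which can then be read like ordinary text files.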

The abbreviations used in the output are as follows:

srtt     - Smoothed round trip time for non-blocking dlm requests
srttvar  - Variance estimate for srtt
srttb    - Smoothed round trip time for (potentially) blocking dlm requests
srttvarb - Variance estimate for srttb
sirt     - Smoothed inter-request time (for dlm requests)
sirtvar  - Variance estimate for sirt
dlm      - Number of dlm requests made (dcnt in glstats file)
queue    - Number of glock requests queued (qcnt in glstats file)

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.
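
Extracting the mean/variance pairs from a glstats line could be sketched
as follows. The exact line layout is not given in this document, so the
field names rtt, rttb and irt and the sample line in the usage note are
assumptions made for illustration; only the "mean/variance" form of each
timing value comes from the text above. The function name
parse_timing_pairs is hypothetical.

```python
import re

# Assumed token form: <name>:<mean>/<variance>, for the three timing stats.
_PAIR = re.compile(r"\b(rtt|rttb|irt):(-?\d+)/(-?\d+)")


def parse_timing_pairs(line):
    """Return {name: (mean, variance)} for the timing stats found in line."""
    return {name: (int(mean), int(var)) for name, mean, var in _PAIR.findall(line)}
```

Given a made-up line such as
"G: n:2/805b rtt:0/0 rttb:1480/750 irt:7000/3500 dcnt: 3 qcnt: 10",
the function returns the three pairs keyed by rtt, rttb and irt and
ignores the remaining fields.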

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

status - The status of the dlm request
flags  - The dlm request flags
tdiff  - The time taken by this specific request
(remaining fields as per above list)

---------- >8 ---------- >8

-- 
Seiji Kaneko