Optimal In-Place Suffix Sorting

This chapter introduces an in-place suffix sorting algorithm with linear time complexity¹ (Optimal In-Place Suffix Sorting).

Warning

This chapter is recommended only for readers who are already very familiar with SA-IS²³.

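Global conventions

The target string is \(\texttt{Pat}\) and the suffix array is \(\texttt{SA}\). String indices start from 0, and the last character is a sentinel, which we may assume to be 0.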

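Suffix sorting over an integer alphabet

This part can in fact be viewed as an in-place version of the SA-IS algorithm.

It is the algorithm whose details are clearest in the original paper, it is relatively simple to implement, and it is the basis for understanding the later algorithms, so it is the focus of this article.

The idea behind making the algorithm in-place is to use a renamed \(\texttt{Pat}\) in place of the S and L buckets, and to spend an extra \(O(n)\) operations in place of the type array.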

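Renaming the target string Pat

Simply put, we rename \(\texttt{Pat}\) without changing the relative order of its suffixes, and use the renamed \(\texttt{Pat}\) instead of the original S and L buckets to indicate the bucket head or bucket tail.

The renaming replaces every S-type character in \(\texttt{Pat}\) with the index of the tail of its bucket, and every L-type character with the index of the head of its bucket.

As shown below: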

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{Pat}:\qquad&\texttt{ 2 1 1 3 3 1 1 3 3 1 2 1 0} \\ \texttt{Type}:\qquad&\texttt{ L S S L L S S L L S L L S} \\ \texttt{Bucket}:\qquad&\texttt{(0)}\texttt{ }\texttt{(1}\texttt{ }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 1 }\texttt{ } \texttt{ 1) }\texttt{(2 }\texttt{ }\texttt{ 2) }\texttt{(3}\texttt{ }\texttt{ }\texttt{ 3 }\texttt{ }\texttt{ 3 }\texttt{ }\texttt{ 3)} \end{aligned} \]

The renamed \(\texttt{Pat'}\) (from here on the renamed \(\texttt{Pat'}\) is simply called \(\texttt{Pat}\)):

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{Pat'}:\qquad&\texttt{ 7 6 6 9 9 6 6 9 9 6 7 1 0} \end{aligned} \]

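Within a bucket, suffixes that start with an L-type character are smaller and sit at the bucket head, while suffixes that start with an S-type character are larger and sit at the bucket tail, so the relative order of the suffixes is preserved.

The concrete renaming steps are:

As in SA-IS, count every character of \(\texttt{Pat}\) and compute the prefix sums (counting sort) to build the S/L buckets, except that here the prefix sums are stored in \(\texttt{SA}\);
Scan \(\texttt{Pat}\) from tail to head. This way only the type of the previously scanned character has to be remembered in order to determine each character's type on the fly, and the character is then renamed according to the prefix sums.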
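The two steps above can be sketched as follows. This is a simplified illustration rather than the in-place procedure: the function name `rename_sketch` is hypothetical, and the bucket boundaries are kept in separate vectors (`head`, `tail`) instead of being stored in \(\texttt{SA}\).

```rust
/// Renames `pat` so that every S-type character becomes the index of its
/// bucket tail and every L-type character the index of its bucket head.
/// Simplification (an assumption of this sketch): bucket boundaries live in
/// separate vectors and a new vector is returned, instead of reusing SA.
fn rename_sketch(pat: &[usize]) -> Vec<usize> {
    let n = pat.len();
    let sigma = pat.iter().copied().max().map_or(0, |m| m + 1);

    // Counting sort: count[c] = number of occurrences of character c.
    let mut count = vec![0usize; sigma];
    for &c in pat {
        count[c] += 1;
    }

    // head[c] / tail[c] = first / last index of bucket c in the suffix array.
    let mut head = vec![0usize; sigma];
    let mut tail = vec![0usize; sigma];
    let mut sum = 0;
    for c in 0..sigma {
        head[c] = sum;
        sum += count[c];
        if count[c] > 0 {
            tail[c] = sum - 1;
        }
    }

    // Scan from tail to head; only the type of the character to the right
    // has to be remembered.
    let mut renamed = vec![0usize; n];
    let mut right_is_s = true; // the sentinel at the end is S-type
    for i in (0..n).rev() {
        let is_s = i + 1 == n
            || pat[i] < pat[i + 1]
            || (pat[i] == pat[i + 1] && right_is_s);
        renamed[i] = if is_s { tail[pat[i]] } else { head[pat[i]] };
        right_is_s = is_s;
    }
    renamed
}
```

Calling `rename_sketch` on the example string `2 1 1 3 3 1 1 3 3 1 2 1 0` gives `7 6 6 9 9 6 6 9 9 6 7 1 0`, the renamed string used in the rest of this section.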

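Sorting the LMS characters

The key point here is a trick that uses an internal counter.

Initialization

Initially every entry of \(\texttt{SA}\) is set to E (EMPTY).

Scan \(\texttt{Pat}\) from tail to head. Whenever an LMS character \(\texttt{Pat[i]}\) is found, update the marker in \(\texttt{SA[Pat[i]]}\):

If \(\texttt{SA[Pat[i]]}\) is E, set it to U (UNIQUE);

If \(\texttt{SA[Pat[i]]}\) is U, set it to M (MULTIPLE);

Otherwise, do nothing.

The result is shown below: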

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{Pat}:\qquad&\texttt{ 7 6 6 9 9 6 6 9 9 6 7 1 0} \\ \texttt{LMS}:\qquad&\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ ∗ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ * }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ * }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ }\texttt{ * } \\ \texttt{SA}:\qquad&\texttt{(}\underline{\color{red}{\texttt{U}}}\texttt{) }\texttt{(E)}\texttt{ }\texttt{(E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{M}}}\texttt{) }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)} \end{aligned} \]
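The marking pass can be sketched as below. The function name `mark_lms_buckets` and the concrete marker values are assumptions of this sketch; the input is expected to be the already renamed \(\texttt{Pat}\), so every character is a valid index into \(\texttt{SA}\).

```rust
// Marker values; the concrete numbers are assumptions of this sketch, chosen
// so that they cannot collide with real suffix indices.
const E: usize = usize::MAX - 2; // EMPTY
const U: usize = usize::MAX - 1; // UNIQUE
const M: usize = usize::MAX; // MULTIPLE

/// Marks, at the bucket tail of every LMS character, whether the bucket
/// holds one (U) or several (M) LMS characters. `pat` must already be renamed.
fn mark_lms_buckets(pat: &[usize]) -> Vec<usize> {
    let n = pat.len();
    let mut sa = vec![E; n];
    let mut right_is_s = true; // type of pat[i + 1]; the sentinel is S-type
    for i in (0..n.saturating_sub(1)).rev() {
        let is_s = pat[i] < pat[i + 1] || (pat[i] == pat[i + 1] && right_is_s);
        if !is_s && right_is_s {
            // pat[i] is L-type and pat[i + 1] is S-type, so pat[i + 1] is LMS.
            let slot = pat[i + 1];
            if sa[slot] == E {
                sa[slot] = U;
            } else if sa[slot] == U {
                sa[slot] = M;
            }
            // Already M: nothing to do.
        }
        right_is_s = is_s;
    }
    sa
}
```

On the renamed example this leaves U at index 0 (the sentinel bucket) and M at index 6, matching the figure above.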

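Placing the indices of the LMS characters into SA

Scan \(\texttt{Pat}\) from tail to head. For each LMS character \(\texttt{Pat[i]}\), branch on the marker stored in \(\texttt{SA[Pat[i]]}\):

U: simply set \(\texttt{SA[Pat[i]] = i}\).

M: the bucket contains at least two LMS characters.

If the bucket contains at least three LMS characters: use the second-to-last slot of the bucket as a temporary counter that records how many LMS characters have been placed in the bucket so far (the last slot holds the marker M). New LMS characters are inserted starting from the third-to-last slot and moving leftward, and the counter is incremented by 1 each time. If the bucket turns out to be full, shift all entries from the bucket head up to the third-to-last slot right by 2 positions, then put the new entry into the second slot of the bucket (the first slot is set to E).
If the bucket contains exactly 2 LMS characters, no counter is needed; simply insert them from right to left.

A normal value (an index rather than a marker):

By the discussion above, whether the bucket holds two LMS characters or more than two, this means that \(\texttt{i}\) is the position of the last LMS character still to be inserted into the bucket, so it is enough to scan leftward from the bucket tail, find the first slot marked E, and set it to \(\texttt{i}\).

Finally, scan \(\texttt{SA}\) once from tail to head to clear any leftover marker M (it remains when the bucket was not filled completely, so M and the counter were never overwritten).

This is done by shifting the LMS characters in the bucket right by 2 positions, as in the step above, and filling the slots vacated on the left with E.

As shown below: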

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{Pat}:\qquad&\texttt{ }\texttt{ 7 6 6 9 9 6 6 9 9 6 7 1 0} \\ \texttt{SA}:\qquad&\texttt{(}\underline{\color{red}{\texttt{12}}}\texttt{)}\texttt{ (E)}\texttt{ (E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ M}\texttt{) }\texttt{(E }\texttt{ } \texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)}\\ \texttt{SA}:\qquad&\texttt{(12) }\texttt{(E)}\texttt{ (E }\texttt{ }\texttt{ E }\texttt{ } \texttt{ }\underline{\color{red}{\texttt{9}}}\texttt{ }\texttt{ }{\color{red}{\texttt{ 1 }}}\texttt{ }\texttt{ } {\color{red}{\texttt{M}}}\texttt{) }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E } \texttt{ }\texttt{ E }\texttt{ }\texttt{ E)} \\ \texttt{SA}:\qquad&\texttt{(12) }\texttt{(E)}\texttt{ (E }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{5}}}} \texttt{ }\texttt{ }\texttt{ }{\texttt{9}}\texttt{ }\texttt{ }{\color{red}{\texttt{ 2 }}}\texttt{ }\texttt{ } {\color{red}{\texttt{M}}}\texttt{) }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E } \texttt{ }\texttt{ E }\texttt{ }\texttt{ E)}\\ \texttt{SA}:\qquad&\texttt{(12) }\texttt{(E)}\texttt{ (}\underline{\color{red}{\texttt{1}}}\texttt{ }\texttt{ }{\texttt{ 5}}\texttt{ }\texttt{ }\texttt{ }{\texttt{9}}\texttt{ }\texttt{ }{\color{red}{\texttt{ 3 }}}\texttt{ }\texttt{ }{\color{red}{\texttt{M}}}\texttt{) }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)}\\ \texttt{SA}:\qquad&\texttt{(12) }\texttt{(E)}\texttt{ (}\texttt{E }\texttt{ }\texttt{ E }\texttt{ }{\color{red}{\texttt{ 1 }}\texttt{ }{\texttt{ 5 }}\texttt{ }\texttt{ }{\texttt{9}}}\texttt{) }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)} \end{aligned} \]

In this stage every bucket is shifted and scanned only once, so the time complexity is \(O(n)\).
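The final clean-up pass (the last row of the figure above) can be sketched as follows. The function name `clear_leftover_markers` and the marker values are assumptions of this sketch. It assumes that every slot still holding the M marker has a valid counter of at least 1 directly to its left.

```rust
// Marker values are assumptions of this sketch.
const EMPTY: usize = usize::MAX - 1;
const MULTI: usize = usize::MAX;

/// Clears leftover `MULTI` markers. For a bucket tail `t` that still holds
/// `MULTI`, `sa[t - 1]` is the counter `c`, and the `c` inserted entries sit
/// in `sa[t - 1 - c..t - 1]`. They are shifted right by 2 positions and the
/// two slots vacated on the left become `EMPTY`.
fn clear_leftover_markers(sa: &mut [usize]) {
    for t in (0..sa.len()).rev() {
        if sa[t] != MULTI {
            continue;
        }
        let c = sa[t - 1];
        // Copy from right to left so that no entry is overwritten before it
        // has been moved.
        for j in (1..=c).rev() {
            sa[t - c + j] = sa[t - 2 - c + j];
        }
        sa[t - c - 1] = EMPTY;
        sa[t - c] = EMPTY;
    }
}
```

Applied to the fourth row of the figure, `(12) (E) (1 5 9 3 M) ...`, this yields the fifth row, `(12) (E) (E E 1 5 9) ...`.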

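Induced sorting of LMS substrings

Induced sorting of LMS prefixes

The LMS prefixes are sorted by induced sorting. As in SA-IS, this step is identical to the later induced sorting of the suffixes (it uses the same function), so it is skipped here.

The sorted result is given directly: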

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{(}\texttt{12}\texttt{)}\texttt{(11)}\texttt{ (1 }\texttt{ }\texttt{ 5 }\texttt{ }\texttt{ 9 }\texttt{ }\texttt{ 2 }\texttt{ }\texttt{ 6}\texttt{) }\texttt{(10 }\texttt{ }\texttt{ 0) }\texttt{(4}\texttt{ }\texttt{ }\texttt{ 8 }\texttt{ }\texttt{ 3 }\texttt{ }\texttt{ 7)} \end{aligned} \]

Moving the sorted LMS substrings to the tail of SA

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{ }\texttt{E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }{\underline{\color{red}{\texttt{12}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{5}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{9}}}} \end{aligned} \]

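Constructing the reduced sub-problem string Pat1

Scan the LMS substrings at the tail of \(\texttt{SA}\) from left to right, determine their relative order and "rename" them accordingly, storing the name of \(\texttt{SA[i]}\) in \(\texttt{SA}\left[\left\lfloor\frac{\texttt{SA}[i]}{2} \right\rfloor\right]\).

Because LMS characters are never adjacent, no two names collide. This places the names in the same order in which the substrings they represent appear in \(\texttt{Pat}\):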

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{ }\underline{\color{red}{\texttt{1}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{1}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{2}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{0}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ 12 }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 5 }\texttt{ }\texttt{ 9 } \end{aligned} \]

Then scan \(\texttt{SA}\) and gather these names at the head of \(\texttt{SA}\):

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{ }\underline{\color{red}{\texttt{1}}}\texttt{ }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{1}}}\texttt{ }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{2}}}\texttt{ }\texttt{ }\texttt{ }\underline{\color{red}{\texttt{0}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ 12 }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 5 }\texttt{ }\texttt{ 9 } \end{aligned} \]
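The naming and gathering steps can be sketched as follows. This is an illustration under simplifying assumptions: the helper names are hypothetical, suffix types are stored in an auxiliary `Vec<bool>` (the in-place algorithm recomputes them instead), empty slots are modelled with `Option` rather than the E marker, and two LMS substrings are compared as whole slices.

```rust
/// `s_type[i]` is true when suffix `i` is S-type. The extra O(n) memory is a
/// simplification of this sketch.
fn suffix_types(pat: &[usize]) -> Vec<bool> {
    let n = pat.len();
    let mut s_type = vec![true; n];
    for i in (0..n.saturating_sub(1)).rev() {
        s_type[i] = pat[i] < pat[i + 1] || (pat[i] == pat[i + 1] && s_type[i + 1]);
    }
    s_type
}

fn is_lms(s_type: &[bool], i: usize) -> bool {
    i > 0 && s_type[i] && !s_type[i - 1]
}

/// The LMS substring starting at `p`: up to and including the next LMS position.
fn lms_substring<'a>(pat: &'a [usize], s_type: &[bool], p: usize) -> &'a [usize] {
    let n = pat.len();
    let mut end = p + 1;
    while end < n && !is_lms(s_type, end) {
        end += 1;
    }
    &pat[p..(end + 1).min(n)]
}

/// Gives every LMS substring a name (its rank among distinct substrings) and
/// stores the name of position `p` in slot `p / 2`.
fn name_lms_substrings(pat: &[usize], sorted_lms: &[usize]) -> Vec<Option<usize>> {
    let s_type = suffix_types(pat);
    let mut names = vec![None; pat.len()];
    let mut rank = 0;
    for k in 0..sorted_lms.len() {
        let p = sorted_lms[k];
        if k > 0
            && lms_substring(pat, &s_type, p) != lms_substring(pat, &s_type, sorted_lms[k - 1])
        {
            rank += 1;
        }
        // LMS positions are at least 2 apart, so p / 2 never collides.
        names[p / 2] = Some(rank);
    }
    names
}
```

For the example, `name_lms_substrings(&pat, &[12, 1, 5, 9])` stores the names 1, 1, 2, 0 in slots 0, 2, 4, 6; collecting the non-empty slots from left to right (for instance with `names.iter().flatten()`) gives `Pat1 = 1 1 2 0`.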

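Solving Pat1 recursively to finish sorting the LMS suffixes

As in SA-IS, recursively solve the suffix sorting of the reduced \(\texttt{Pat1}\) stored at the head of \(\texttt{SA}\), and store the result at the tail of \(\texttt{SA}\):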

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{ }\texttt{1 }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 2 }\texttt{ }\texttt{ 0 }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{3}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{0}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{2}}}} \end{aligned} \]
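When experimenting with the algorithm, the result of the recursive call can be cross-checked against a naive suffix sort. The function below is not part of the in-place algorithm; it is a quadratic-space-free but slow reference (comparison sort over suffix slices), and its name is an assumption of this sketch.

```rust
/// Naive reference: sorts suffix start positions by comparing the suffixes
/// directly. Far slower than linear time, intended only for checking small
/// examples.
fn naive_suffix_array(text: &[usize]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    // Slices compare lexicographically, which is exactly suffix order.
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}
```

For `Pat1 = 1 1 2 0` this returns `3 0 1 2`, the \(\texttt{SA1}\) shown at the tail of \(\texttt{SA}\) in the figure above.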

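Move \(\texttt{SA1}\) from the tail of \(\texttt{SA}\) to the head of \(\texttt{SA}\), then scan \(\texttt{Pat}\) from tail to head once more and place its LMS characters at the tail of \(\texttt{SA}\), in the order in which they appear in \(\texttt{Pat}\):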

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{ }{\underline{\color{red}{\texttt{3}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{0}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{2}}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{5}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{9}}}}\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{12}}}} \end{aligned} \]

Using the "lookup table" at the tail of \(\texttt{SA}\), map \(\texttt{SA1}\) at the head of \(\texttt{SA}\) back to the indices of the corresponding LMS suffixes in \(\texttt{Pat}\):

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&{\underline{\color{red}{\texttt{12}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{5}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{9}}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 5 }\texttt{ }\texttt{ 9 }\texttt{ 12 } \end{aligned} \]
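The lookup step can be sketched in a few lines. The function name `restore_lms_indices` is hypothetical. The sketch assumes that the head region (`lms_cnt` entries of \(\texttt{SA1}\)) and the tail region (the table of LMS positions in text order) do not overlap, and, like the figure above, it leaves the table in place.

```rust
/// Replaces every rank stored in `sa[..lms_cnt]` by the LMS position it
/// refers to, using the table of LMS positions in `sa[sa.len() - lms_cnt..]`.
fn restore_lms_indices(sa: &mut [usize], lms_cnt: usize) {
    let table_start = sa.len() - lms_cnt;
    for i in 0..lms_cnt {
        let rank_in_pat1 = sa[i]; // index into Pat1, i.e. the k-th LMS character
        sa[i] = sa[table_start + rank_in_pat1];
    }
}
```

With the head `3 0 1 2` and the table `1 5 9 12`, the head becomes `12 1 5 9`.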

Place the sorted LMS suffixes at the head of \(\texttt{SA}\) into their buckets in order (filling each bucket from its tail):

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{(}{\underline{\color{red}{\texttt{12}}}}\texttt{)}\texttt{ }\texttt{(E)}\texttt{ (}\texttt{E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{5}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{9}}}}\texttt{)}\texttt{ }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)} \end{aligned} \]
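This scattering step can be sketched as below, closely following the corresponding loop in the reference code. The function name and the marker value are assumptions of this sketch; `pat` is the renamed string, so the character of an LMS suffix is directly the index of its bucket tail, and every slot outside the head region is assumed to be empty on entry.

```rust
// Marker value is an assumption of this sketch.
const EMPTY: usize = usize::MAX;

/// Moves the sorted LMS suffixes from `sa[..lms_cnt]` to the tails of their
/// buckets. Processing from the largest suffix downwards guarantees that the
/// target slot is never to the left of the slot being read.
fn scatter_to_bucket_tails(pat: &[usize], sa: &mut [usize], lms_cnt: usize) {
    let mut tail = EMPTY; // bucket currently being filled
    let mut free = EMPTY; // rightmost free slot in that bucket
    for i in (1..lms_cnt).rev() {
        // sa[0] is the sentinel suffix and already sits in its bucket.
        let suffix = sa[i];
        if pat[suffix] != tail {
            tail = pat[suffix];
            free = tail;
        }
        sa[i] = EMPTY;
        sa[free] = suffix;
        free -= 1;
    }
}
```

Starting from `12 1 5 9 E E ...`, the result is `(12) (E) (E E 1 5 9) (E E) (E E E E)`, as in the figure above.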

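Induced sorting of all suffixes of Pat

This part reuses the internal counter trick from earlier to carry out induced sorting in place.

Suppose the sorted LMS suffixes are already available (at the bucket tails), and we use them to induce the L-type suffixes⁴: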

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{Pat}:\qquad&\texttt{ }\texttt{ 7 6 6 9 9 6 6 9 9 6 7 1 0} \\ \texttt{SA}:\qquad&\texttt{(12) }\texttt{(E)}\texttt{ (}\texttt{E }\texttt{ }\texttt{ E }\texttt{ }{\texttt{ 1 }\texttt{ }{\texttt{ 5 }}\texttt{ }\texttt{ }{\texttt{9}}}\texttt{) }\texttt{(E }\texttt{ }\texttt{ E) }\texttt{(E}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)} \end{aligned} \]

As when sorting the LMS characters, first count the L-type characters using the special markers:

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{Pat}:\qquad&\texttt{ }\texttt{ 7 6 6 9 9 6 6 9 9 6 7 1 0} \\ \texttt{SA}:\qquad&\texttt{(}{{\texttt{12}}}\texttt{)}\texttt{ }\texttt{(}{\underline{\color{red}{\texttt{U}}}}\texttt{)}\texttt{ }\texttt{(E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ 1 }\texttt{ }\texttt{ 5 }\texttt{ }{\texttt{ 9}}\texttt{) }\texttt{(}{\underline{\color{red}{\texttt{M}}}}\texttt{ }\texttt{ }\texttt{ E) }\texttt{(}{\underline{\color{red}{\texttt{M}}}}\texttt{ }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E }\texttt{ }\texttt{ E)} \end{aligned} \]

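Scan SA from left to right. This works like sorting the LMS characters; the slightly more involved part is determining the type of \(\texttt{suf[SA[i] - 1]}\), which requires a case analysis (see the code for details):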

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{(}{\overrightarrow{\color{red}{\texttt{12}}}\texttt{)}\texttt{(}{\underline{\color{red}{\texttt{11}}}}}\texttt{)}\texttt{ (E E 1 5 9) (M E) (M E E E)}\\ \texttt{SA}:\qquad&\texttt{(}\texttt{12}\texttt{)}\texttt{(}{\overrightarrow{\color{red}{\texttt{11}}}}\texttt{)}\texttt{ (E E 1 5 9)}\texttt{(}{\underline{\color{red}{\texttt{10}}}}\texttt{ }\texttt{ }\texttt{ E)}\texttt{ (M E E E)}\\ \texttt{SA}:\qquad&\texttt{(12)(11)}\texttt{ (E E }\texttt{ }\texttt{ } {\overrightarrow{\color{red}{\texttt{1}}}}\texttt{ }\texttt{ 5 9)}\texttt{(10 }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{0}}}}\texttt{)}\texttt{ (M E E E)}\\ \texttt{SA}:\qquad&\texttt{(12)(11)}\texttt{ (E E 1 }\texttt{ }\texttt{ } {\overrightarrow{\color{red}{\texttt{5}}}}\texttt{ }\texttt{ 9)}\texttt{(10 0)}\texttt{ (}{\color{red}{\texttt{M 1}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{4}}}}\texttt{ }\texttt{ }\texttt{ E)}\\ \texttt{SA}:\qquad&\texttt{(12)(11)}\texttt{ (E E 1 5}\texttt{ }\texttt{ } {\overrightarrow{\color{red}{\texttt{9}}}}\texttt{)}\texttt{(10 0)}\texttt{ (}{\color{red}{\texttt{M 2}}}\texttt{ }\texttt{ }\texttt{ 4 }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{8}}}}\texttt{)}\\ \texttt{SA}:\qquad&\texttt{(12)(11)}\texttt{ (E E 1 5 9)(10 0)}\texttt{ (}{\overrightarrow{\color{red}{\texttt{4}}}}\texttt{ }\texttt{ 8 }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{3}}}}\texttt{ }\texttt{ }\texttt{ E}\texttt{)}\\ \texttt{SA}:\qquad&\texttt{(12)(11)}\texttt{ (E E 1 5 9)(10 0)}\texttt{ (4 }\texttt{ }\texttt{ }{\overrightarrow{\color{red}{\texttt{8}}}}\texttt{ }\texttt{ 3 }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{7}}}}\texttt{)} \end{aligned} \]
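For comparison, the same left-to-right pass is easy to write when extra memory is allowed. The sketch below is not the in-place procedure: it keeps the suffix types in a `Vec<bool>` and the per-bucket fill counts in a separate `used` vector, which are exactly the two pieces of state that the in-place version replaces with the on-the-fly type test and the counters stored inside \(\texttt{SA}\). The function name and marker value are assumptions; `pat` is the renamed string.

```rust
// Marker value is an assumption of this sketch.
const EMPTY: usize = usize::MAX;

/// Induces the L-type suffixes from the LMS suffixes already placed in `sa`.
/// Uses O(n) auxiliary memory, unlike the in-place algorithm.
fn induce_l(pat: &[usize], sa: &mut [usize]) {
    let n = pat.len();
    // s_type[i] is true when suffix i is S-type.
    let mut s_type = vec![true; n];
    for i in (0..n.saturating_sub(1)).rev() {
        s_type[i] = pat[i] < pat[i + 1] || (pat[i] == pat[i + 1] && s_type[i + 1]);
    }
    // used[h] = number of L-type suffixes already placed in the bucket whose
    // head index is h.
    let mut used = vec![0usize; n];
    for i in 0..n {
        if sa[i] == EMPTY || sa[i] == 0 {
            continue;
        }
        let j = sa[i] - 1;
        if !s_type[j] {
            let head = pat[j]; // a renamed L-type character is its bucket head
            sa[head + used[head]] = j;
            used[head] += 1;
        }
    }
}
```

Run on `(12) (E) (E E 1 5 9) (E E) (E E E E)`, it produces `(12) (11) (E E 1 5 9) (10 0) (4 8 3 7)`, the last row of the figure above.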

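Unlike SA-IS, after the induced sorting of one type of character the LMS characters must be cleared, so that they do not interfere with the subsequent in-place induced sorting: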

\[ \begin{aligned} \texttt{Index}:\qquad&\texttt{ }\texttt{ 0 1 2 3 4 5 6 7 8 9 10 11 12} \\ \texttt{SA}:\qquad&\texttt{(12)(11)}\texttt{ (E E }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{E}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{E}}}}\texttt{ }\texttt{ }\texttt{ }{\underline{\color{red}{\texttt{E}}}}\texttt{)}\texttt{(10 0)}\texttt{ (4 8 3 7)} \end{aligned} \]

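Inducing the S-type suffixes from the L-type suffixes is completely symmetric to inducing the L-type suffixes from the LMS suffixes, so it is not described further here.

At this point the induced sorting is complete.

Implementation

In running time there is no significant difference from SA-IS, while space usage drops to less than \(\dfrac{1}{3}\) of the original (at the cost of roughly twice as much code), which lives up to the title of the original paper, Optimal In-Place Suffix Sorting¹.

Reference code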

use std::cmp::max;
use std::cmp::Ordering;
use std::slice::from_raw_parts_mut;


const LTYPE: bool = false;
const STYPE: bool = true;
const MAX_SA_VALUE: usize = usize::MAX / 2;
const EMPTY: usize = MAX_SA_VALUE + 1;
const UNIQUE: usize = MAX_SA_VALUE + 2;
const MULTI: usize = MAX_SA_VALUE + 3;  // >= 258


fn lms_str_cmp<E: Ord>(l1: &[E], l2: &[E]) -> Ordering {
    for (x, y) in l1.iter().zip(l2.iter()) {
        let cmp_res = x.cmp(&y);

        if cmp_res != Ordering::Equal { return cmp_res; }
    }

    Ordering::Equal
}

#[inline]
fn pat_char_type(cur: usize, prev: usize, last_scanned_type: bool) -> bool {
    if cur < prev || cur == prev && last_scanned_type == STYPE { STYPE }
    else { LTYPE }
}


fn rename_pat(pat: &mut [usize], sa: &mut [usize]) {
    let patlastpos = pat.len() - 1;
    // first rename every character to the tail index of its bucket
    //sa.fill(0);
    for i in 0..sa.len() { sa[i] = 0 }

    for i in 0..pat.len() { sa[pat[i]] += 1 }
    for i in 1..sa.len() { sa[i] += sa[i - 1] }

    for i in 0..pat.len() - 1 {
        pat[i] = sa[pat[i]] - 1;
    };
    // then move the L-type characters down to the head index of their bucket
    //sa.fill(0);
    for i in 0..sa.len() { sa[i] = 0 }

    for i in 0..pat.len() { sa[pat[i]] += 1 }
    let mut last_scanned_type = STYPE;
    pat[patlastpos] = 0;
    for i in (0..pat.len() - 1).rev() {
        if pat_char_type(pat[i], pat[i + 1], last_scanned_type) == STYPE {
            last_scanned_type = STYPE;
        } else {
            pat[i] -= sa[pat[i]] - 1;
            last_scanned_type = LTYPE;
        }
    }

}


fn sort_lms_char(pat: &mut [usize], sa: &mut [usize]) -> usize {
    //sa.fill(EMPTY);
    for i in 0..sa.len() { sa[i] = EMPTY }

    let mut last_scanned_type = STYPE;
    for i in (0..pat.len() - 1).rev() {
        if pat_char_type(pat[i], pat[i + 1], last_scanned_type) == STYPE {
            last_scanned_type = STYPE;
        } else {
            if last_scanned_type == STYPE {  // pat[i + 1] is LMS type
                sa[pat[i + 1]] += 1;
            }

            last_scanned_type = LTYPE;
        }
    }

    let mut lms_cnt = 0;
    last_scanned_type = STYPE;
    for i in (0..pat.len() - 1).rev() {
        if pat_char_type(pat[i], pat[i + 1], last_scanned_type) == STYPE {
            last_scanned_type = STYPE;
        } else {
            let e_i = i + 1;
            let e = pat[e_i];

            if last_scanned_type == STYPE {  // pat[i + 1] is LMS type
                lms_cnt += 1;
                if sa[e] == UNIQUE {
                    sa[e] = e_i;
                } else if sa[e] >= MULTI && sa[e - 1] == EMPTY {
                    if sa[e - 2] == EMPTY {
                        sa[e - 2] = e_i;
                        sa[e - 1] = 1;  // set counter
                    } else {  // MUL = 2
                        sa[e] = e_i;
                        sa[e - 1] = EMPTY;
                    }
                } else if sa[e] >= MULTI && sa[e - 1] != EMPTY {
                    let c = sa[e - 1];  // get counter

                    if sa[e - 2 - c] == EMPTY {
                        sa[e - 2 - c] = e_i;
                        sa[e - 1] += 1;  // update counter
                    } else {
                        for j in (1..c + 1).rev() {
                            sa[e - c + j] = sa[e - 2 - c + j]
                        }
                        sa[e - c] = e_i;
                        sa[e - c - 1] = EMPTY;
                    }
                } else if sa[e] < EMPTY {
                    for j in (0..e).rev() {
                        if sa[j] == EMPTY {
                            sa[j] = e_i;
                            break;
                        }
                    }
                }
            }

            last_scanned_type = LTYPE;
        }
    }

    for i in (0..pat.len()).rev() {
        if sa[i] >= MULTI {
            let c = sa[i - 1];
            for j in (1..c + 1).rev() {  // iterate in reverse so earlier entries do not overwrite later ones
                sa[i - c + j] = sa[i - 2 - c + j];
            }
            sa[i - c - 1] = EMPTY;
            sa[i - c] = EMPTY;
        }
    }

    lms_cnt
}


fn sort_lms_substr(pat: &mut [usize], sa: &mut [usize]) {
    // step 1
    induced_sort(pat, sa);

    // step 2
    let pat_last_pos = pat.len() - 1;
    let mut lms_cnt = 0;
    let mut i = pat_last_pos;
    let mut bucket_tail_ptr = pat_last_pos + 1;  // for renamed bucket ver
    let mut bucket = EMPTY;  // could be omitted; kept to make the code easier to write
    let mut num = 0;  // S type number of bucket
    while i > 0 {
        if pat[sa[i]] != bucket {  // reach new bucket
            num = 0;

            let mut l = 0;
            while pat[sa[i - l]] == pat[sa[i]] {  // scan the bucket to count its S-type characters; by definition the loop must stop when l == i
                let pat_i = sa[i - l];             // l < i, i.e. i - l > 0, so 0 <= pat_i < patlen - 1
                if pat[pat_i] < pat[pat_i + 1] {
                    let mut k = pat_i;
                    while k > 0 && pat[k - 1] == pat[pat_i] { k -= 1 }
                    num += pat_i - k + 1;
                } else {
                    break;   // the bucket contains no S-type characters; stop scanning
                }

                l += 1;
            }

            bucket_tail_ptr = i;
            bucket = pat[sa[bucket_tail_ptr]];
        }

        if num > 0
        && i > bucket_tail_ptr - num
        && sa[i] > 0
        && pat[sa[i]] < pat[sa[i] - 1]  {
            sa[pat_last_pos - lms_cnt] = sa[i];
            lms_cnt += 1;
        }

        i -= 1;
    }

    sa[pat_last_pos - lms_cnt ] = sa[i];  // i = 0
    lms_cnt += 1;
    //sa[0..pat_last_pos - lms_cnt + 1].fill(EMPTY);
    for i in 0..pat_last_pos - lms_cnt + 1 { sa[i] = EMPTY }
}


fn construct_pat1(pat: &mut [usize], sa: &mut [usize], lms_cnt: usize) -> bool {
    let patlen = pat.len();

    let mut prev_lms_str_len = 1;
    let mut rank = 0;
    sa[(patlen - 1) / 2] = rank;
    let mut has_duplicated_char = false;
    for i in patlen - lms_cnt + 1..patlen {  // start from the entry after the sentinel
        let mut j = sa[i];
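        while pat[j] <= pat[j + 1] { j += 1 } // find the first L-type character to the right of suf(sa[i]); the sentinel LMS suffix is excluded, so this cannot go out of bounds
        let mut k = j;
        while k + 1 < patlen && pat[k] >= pat[k + 1] { k += 1 }  // find the first LMS character to the right of suf(sa[i])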
        let cur_lms_str_len = k + 1 - sa[i];
        let cmp_res = lms_str_cmp(&pat[sa[i]..sa[i] + cur_lms_str_len], &pat[sa[i - 1]..sa[i - 1] + prev_lms_str_len]);

        if  cmp_res != Ordering::Equal {
            rank += 1
        }

        if rank == sa[sa[i - 1] / 2] {
            has_duplicated_char = true;
        }
        let rank_index = sa[i] / 2;  // integer division
        sa[rank_index] = rank;

        prev_lms_str_len = cur_lms_str_len;
    }

    // move to head of sa
    let mut j = 0;
    for i in 0..patlen - lms_cnt {
        if sa[i] != EMPTY {
            sa[j] = sa[i];
            if i > j {
                sa[i] = EMPTY;
            }
            j += 1;
        }
    }
    //sa[lms_cnt..patlen].fill(EMPTY);
    for i in lms_cnt..patlen { sa[i] = EMPTY }

    has_duplicated_char
}

fn sort_lms_suf(pat: &mut [usize], sa: &mut [usize], lms_cnt: usize, has_duplicated_char: bool) {
    // solve T1 recursively
    let patlen = pat.len();
    let salen = sa.len();
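    // SAFETY: pat1 covers sa[..lms_cnt] and sa1 covers sa[patlen - lms_cnt..].
    // LMS characters are never adjacent, so lms_cnt <= patlen / 2 and the two
    // slices do not overlap.
    unsafe {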
        let sa_ptr = sa.as_mut_ptr();
        let mut pat1 = from_raw_parts_mut(sa_ptr, lms_cnt);
        let mut sa1 = from_raw_parts_mut(sa_ptr.offset((patlen - lms_cnt) as isize), salen - (patlen - lms_cnt));

        if has_duplicated_char {
            _compute_suffix_array_16_1(&mut pat1, &mut sa1);
        } else {
            for i in 0..lms_cnt { sa1[pat1[i]] = i }
        }
    }

    // move SA1 to SA[0...n1-1]
    for i in 0..lms_cnt {
        sa[i] = sa[patlen- lms_cnt + i];
    }

    // put all LMS-suffixes in SA tail
    let mut last_scanned_type = STYPE;
    let mut j = 0;
    for i in (0..pat.len() - 1).rev() {
        if pat[i] < pat[i + 1] || pat[i] == pat[i + 1] && last_scanned_type == STYPE {
            last_scanned_type = STYPE;
        } else {
            if last_scanned_type == STYPE {
                sa[patlen - 1 - j] = i + 1;
                j += 1;
            }

            last_scanned_type = LTYPE;
        }
    }

    // backward map the LMS-suffixes rank
    for i in 0..lms_cnt {
        let relative_rank = sa[i];
        sa[i] = sa[patlen - lms_cnt + relative_rank];
        sa[patlen - lms_cnt + relative_rank] = EMPTY;
    }

    let mut tail = EMPTY;
    let mut rfp = EMPTY;
    for i in (1..lms_cnt).rev() { // sa[0] stays in place
        if pat[sa[i]] != tail {
            tail = pat[sa[i]];
            rfp = tail;
        }

        sa[rfp] = sa[i];
        if rfp != i { sa[i] = EMPTY }
        rfp -= 1;
    }
}

// PASS!
fn induced_sort(pat: &mut [usize], sa: &mut [usize]) {
    let patlen = pat.len();

    // place L-suff in SA
    // init
    let mut last_scanned_type = STYPE;
    for i in (0..patlen - 1).rev() {
        if pat_char_type(pat[i], pat[i + 1], last_scanned_type) == LTYPE {
            sa[pat[i]] += 1;  // >= EMPTY
            last_scanned_type = LTYPE;
        } else {
            last_scanned_type = STYPE;
        }
    }
    //place
    let mut i = 0;
    while i < patlen {
        if sa[i] < EMPTY && sa[i] > 0 {
            let j = sa[i] - 1;
            let mut is_ltype = false;
            if pat[j] > pat[j + 1] {
                is_ltype = true;
            } else if pat[j] == pat[j + 1] {  // decide whether sa[i] is the index of an L-type suffix
                let next_i = sa[pat[sa[i]]];
                if next_i >= MULTI {
                    is_ltype = true;
                } else if next_i < EMPTY && pat[sa[i]] + 1 < patlen {
                    if sa[pat[sa[i]] + 1] == EMPTY {
                        is_ltype = true;
                    } else if sa[pat[sa[i]] + 1] < EMPTY {
                        if pat[sa[pat[sa[i]] + 1]] == pat[sa[i]] {
                            is_ltype = true;
                        }
                    }
                }
            }

            if is_ltype {
                if sa[pat[j]] == UNIQUE {
                    sa[pat[j]] = j;
                } else if sa[pat[j]] >= MULTI && sa[pat[j] + 1] == EMPTY {
                    if sa[pat[j]] - EMPTY > 2 {
                        sa[pat[j] + 2] = j;
                        sa[pat[j] + 1] = 1;  // set counter
                    } else {
                        sa[pat[j]] = j;
                    }
                } else if sa[pat[j]] >= MULTI && sa[pat[j] + 1] != EMPTY {
                    let e = pat[j];
                    let c = sa[e + 1];
                    let lfp = e + c + 2;
                    if  c + 2 < sa[pat[j]] - EMPTY {  // bucket tail not reached yet
                        sa[lfp] = j;
                        sa[e + 1] += 1;  // update counter
                    } else {
                        for k in 1..c + 1 {
                            sa[e + k - 1] = sa[e + k + 1];
                        }
                        sa[e + c] = j;
                        sa[e + c + 1] = EMPTY;
                        if i >= e + 2 && i <= e + c + 1 {
                            i -= 2;
                        }
                    }
                } else if sa[pat[j]] < EMPTY {
                    for k in pat[j]..patlen {
                        if sa[k] == EMPTY {
                            sa[k] = j;
                            break;
                        }
                    }
                }
            }
        } else if sa[i] >= MULTI {
            i += 1;
        }

        i += 1;
    }

    // remove LMS suffixes from SA; one bucket may hold several LMS suffixes
    last_scanned_type = STYPE;
    for i in (0..pat.len() - 1).rev() {
        if pat_char_type(pat[i], pat[i + 1], last_scanned_type) == STYPE {
            last_scanned_type = STYPE;
        } else {
            if last_scanned_type == STYPE {  // pat[i + 1] is LMS type
                if sa[pat[i + 1]] <= EMPTY {
                    sa[pat[i + 1]] = UNIQUE;
                } else {
                    sa[pat[i + 1]] += 1;
                }
            }

            last_scanned_type = LTYPE;
        }
    }
    i = patlen - 1;
    while i > 0 {
        if sa[i] > EMPTY {
            let c = sa[i] - EMPTY;
            for k in 0..c {
                sa[i - k] = EMPTY;
            }
            i -= c - 1;
        }

        i -= 1;
    }
    sa[0] = pat.len() - 1;

    // place S-suff in SA
    // init
    let mut last_scanned_type = STYPE;
    for i in (0..patlen - 1).rev() {
        if pat_char_type(pat[i], pat[i + 1], last_scanned_type) == STYPE {
            if sa[pat[i]] >= EMPTY {
                sa[pat[i]] += 1;
            } else {
                sa[pat[i]] = UNIQUE;
            }
            last_scanned_type = STYPE;
        } else {
            last_scanned_type = LTYPE;
        }
    }
    i = patlen - 1;
    while i > 0 {
        if sa[i] < EMPTY && sa[i] > 0 {
            let j = sa[i] - 1;
            let mut is_stype = false;
            if pat[j] < pat[j + 1] {
                is_stype = true;
            } else if pat[j] == pat[j + 1] {  // decide whether sa[i] is the index of an S-type suffix
                let next_i = sa[pat[sa[i]]];
                if next_i >= MULTI {
                    is_stype = true;
                } else if next_i < EMPTY && pat[sa[i]] - 1 > 0 {
                    if sa[pat[sa[i]] - 1] == EMPTY {
                        is_stype = true;
                    } else if sa[pat[sa[i]] - 1] < EMPTY {
                        if pat[sa[pat[sa[i]] - 1]] == pat[sa[i]] {
                            is_stype = true;
                        }
                    }
                }
            }

            if is_stype {
                if sa[pat[j]] == UNIQUE {
                    sa[pat[j]] = j;
                } else if sa[pat[j]] >= MULTI && sa[pat[j] - 1] == EMPTY {
                    if sa[pat[j]] - EMPTY > 2 {
                        sa[pat[j] - 2] = j;
                        sa[pat[j] - 1] = 1;  // set counter
                    } else {
                        sa[pat[j]] = j;
                    }
                } else if sa[pat[j]] >= MULTI && sa[pat[j] - 1] != EMPTY {
                    let e = pat[j];
                    let c = sa[e - 1];
                    let num = sa[pat[j]] - EMPTY;
                    if c + 2 < num {  // bucket head not reached yet
                        let rfp = e - c - 2;
                        sa[rfp] = j;
                        sa[e - 1] += 1;
                    } else {
                        for k in 1..c + 1 {
                            sa[e - k + 1] = sa[e - k - 1];
                        }
                        sa[e - c] = j;
                        sa[e - c - 1] = EMPTY;
                        if i >= e - num + 1 && i <= e - 2 {
                            i += 2;
                        }
                    }
                } else if sa[pat[j]] < EMPTY {
                    for k in (0..pat[j]).rev() {
                        if sa[k] == EMPTY {
                            sa[k] = j;
                            break;
                        }
                    }
                }
            }
        } else if sa[i] >= MULTI {
            i -= 1;
        }
        i -= 1;
    }
}

fn _compute_suffix_array_16_1(pat: &mut [usize], sa: &mut [usize]) {
    rename_pat(pat, sa);
    let lms_cnt = sort_lms_char(pat, sa);
    sort_lms_substr(pat, sa);
    let has_duplicated_char = construct_pat1(pat, sa, lms_cnt);
    sort_lms_suf(pat, sa, lms_cnt, has_duplicated_char);
    induced_sort(pat, sa);
}

pub fn suffix_array_16(pat: &[u8]) -> Vec<usize> {
    let mut pat = pat.into_iter().map(|x| *x as usize).collect::<Vec<usize>>();
    pat.push(0);
    let mut sa = vec![0; max(pat.len(), 256) * 1];
    _compute_suffix_array_16_1(&mut pat[..], &mut sa[..]);

    sa
}

fn input() -> String {
    use std::io;

    let mut input = String::new();
    io::stdin().read_line(&mut input).unwrap();
    String::from(input.trim())
}


fn main() {
    let pat = input();

    let sa_16 = suffix_array_16(pat.as_bytes());

    for i in 1..pat.len() + 1 { print!("{} ", sa_16[i] + 1) }
}

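Suffix sorting over a read-only integer alphabet

This variant uses a more elaborate method for a harder problem: divide and conquer is applied to cope with the tight space budget.

The difficulty in implementing the algorithm lies in building BitMaps⁵ on top of \(\texttt{SA}\), to replace the bucket tail/head positions that were previously indicated by the renamed T (\(\texttt{Pat}\) in this article).

Here BitMaps refers to an ordered dictionary (multiset) represented with bit vectors, which is a compact data structure.

For now, interested readers can only consult the original paper and the BitMaps paper cited in this article.

Suffix sorting over a read-only general alphabet

The prerequisites are merge sort and heap sort.

Because the author has doubts about the time complexity of the method used there to determine character types, it is not covered here either; readers are advised to consult the original paper.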

Notes

1. Li, Zhize; Li, Jian; Huo, Hongwei (2016). Optimal In-Place Suffix Sorting. Proceedings of the 25th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. 11147. Springer. pp. 268–284. arXiv:1610.08305. doi:10.1007/978-3-030-00479-8_22. ISBN: 978-3-030-00478-1.
2. Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In Data Compression Conference (DCC), pages 193–202. IEEE, 2009.
3. Recommended reading: the blog post and its issue list.
4. If LML suffixes are used instead, the S-type suffixes are induced first; the only difference is that the sentinel must also be counted when computing the LML suffixes.
5. Gonzalo Navarro and Eliana Providel. Fast, small, simple rank/select on bitmaps. In Proc. 11th International Symposium on Experimental Algorithms (SEA), pages 295–306, 2012.
