Gepostet vor 2021-01-2116 minutes lesen (Über 2468 Wörter)

Bomb Lab

本文记录 CSAPP 的 Bomb Lab 完成方案。

bomb 1

在 phase_1 中，调用 strings_not_equal 函数：

000000000000140f <phase_1>:
    140f:	48 83 ec 08          	sub    $0x8,%rsp
    1413:	48 8d 35 36 1d 00 00 	lea    0x1d36(%rip),%rsi        # 3150 <_IO_stdin_used+0x150>
    141a:	e8 f0 04 00 00       	callq  190f <strings_not_equal>
    141f:	85 c0                	test   %eax,%eax
    1421:	75 05                	jne    1428 <phase_1+0x19>
    1423:	48 83 c4 08          	add    $0x8,%rsp
    1427:	c3                   	retq   
    1428:	e8 ee 05 00 00       	callq  1a1b <explode_bomb>
    142d:	eb f4                	jmp    1423 <phase_1+0x14>

如果字符串不相等，函数返回 1，jne 指令发生跳转，进入 explode_bomb 函数；如果字符串相等的话，函数返回 0，jne 指令不发生跳转，直接退出。
strings_not_equal 函数有两个参数，分别为%rdi和%rsi:

000000000000190f <strings_not_equal>:
    190f:	41 54                	push   %r12
    1911:	55                   	push   %rbp
    1912:	53                   	push   %rbx
    1913:	48 89 fb             	mov    %rdi,%rbx
    1916:	48 89 f5             	mov    %rsi,%rbp
    1919:	e8 d4 ff ff ff       	callq  18f2 <string_length>
    ...

一个参数为字符串输入，另一个参数由 phase_1 传入。因此，用 gdb 在 strings_not_equal 处进行断点调试，便可得出答案：

(gdb) break strings_not_equal
Breakpoint 1 at 0x190f
(gdb) run
Starting program: /mnt/c/ubuntu/bomb65/bomb 
Welcome to my fiendish little bomb. You have 6 phases with
which to blow yourself up. Have a nice day!
test

Breakpoint 1, 0x000000000800190f in strings_not_equal ()
(gdb) p (char*)$rdi
$1 = 0x80056a0 <input_strings> "test"
(gdb) p (char*)$rsi
$2 = 0x8003150 "I am just a renegade hockey mom."

第一关答案为 I am just a renegade hockey mom. 。

bomb 2

直接看 phase_2：

000000000000142f <phase_2>:
    ...
    143e:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
    1443:	31 c0                	xor    %eax,%eax
    1445:	48 89 e6             	mov    %rsp,%rsi
    1448:	e8 f4 05 00 00       	callq  1a41 <read_six_numbers>
    144d:	83 3c 24 00          	cmpl   $0x0,(%rsp)              -> arr[0] = 0
    1451:	75 07                	jne    145a <phase_2+0x2b>
    1453:	83 7c 24 04 01       	cmpl   $0x1,0x4(%rsp)           -> arr[1] = 1
    1458:	74 05                	je     145f <phase_2+0x30>
    145a:	e8 bc 05 00 00       	callq  1a1b <explode_bomb>
    145f:	48 89 e3             	mov    %rsp,%rbx                -> %rbx = arr[0]
    1462:	48 8d 6b 10          	lea    0x10(%rbx),%rbp          -> %rbp = arr[4]
    1466:	eb 0e                	jmp    1476 <phase_2+0x47>
    1468:	e8 ae 05 00 00       	callq  1a1b <explode_bomb>
    146d:	48 83 c3 04          	add    $0x4,%rbx                -> %rbx = arr[1]
    1471:	48 39 eb             	cmp    %rbp,%rbx                -> if (%rbx == arr[4])
    1474:	74 0c                	je     1482 <phase_2+0x53>
    1476:	8b 43 04             	mov    0x4(%rbx),%eax           -> %eax = arr[1]
    1479:	03 03                	add    (%rbx),%eax              -> %eax = arr[0] + arr[1]
    147b:	39 43 08             	cmp    %eax,0x8(%rbx)           -> if (%eax == arr[2])
    147e:	74 ed                	je     146d <phase_2+0x3e>
    1480:	eb e6                	jmp    1468 <phase_2+0x39>
    1482:	48 8b 44 24 18       	mov    0x18(%rsp),%rax
    1487:	64 48 33 04 25 28 00 	xor    %fs:0x28,%rax
    ...

在 147b 处，若 arr[2] = arr[0] + arr[1]，则函数继续跳转到 146d 处执行，这时 %rbx 的保存的地址由 arr[0] 变为 arr[1]，接下来判断 arr[3], arr[4]……直到 %rbx == arr[4]，也就是判断完了 arr[5] 后停止。因此可以得出，这 6 个数由前两项分别为 0 和 1 的斐波拉契数列组成。第二关答案为 0 1 1 2 3 5 。

bomb 3

在 phase_3 中：

1 2	14c1: e8 5a fc ff ff callq 1120 <__isoc99_sscanf@plt> 14c6: 83 f8 01 cmp $0x1,%eax

在 14c6 处打断点，输入多个数后，打印 %eax 中的值：

1 2	(gdb) i r eax eax 0x2 2

可以判断，在 phase_3 中，我们需要输入两个数。再观察以下代码：

1 2	14cb: 83 3c 24 07 cmpl $0x7,(%rsp) 14cf: 77 4b ja 151c <phase_3+0x7e>

我们输入的第一个数不能大于 7。再向下看：

14db:	48 63 04 82          	movslq (%rdx,%rax,4),%rax
14df:	48 01 d0             	add    %rdx,%rax
14e2:	ff e0                	jmpq   *%rax
14e4:	e8 32 05 00 00       	callq  1a1b <explode_bomb>
14e9:	eb e0                	jmp    14cb <phase_3+0x2d>
14eb:	b8 ab 00 00 00       	mov    $0xab,%eax
14f0:	eb 3b                	jmp    152d <phase_3+0x8f>
14f2:	b8 ea 01 00 00       	mov    $0x1ea,%eax
14f7:	eb 34                	jmp    152d <phase_3+0x8f>
...
...
152d:	39 44 24 04          	cmp    %eax,0x4(%rsp)
1531:	75 15                	jne    1548 <phase_3+0xaa>
1533:	48 8b 44 24 08       	mov    0x8(%rsp),%rax
1538:	64 48 33 04 25 28 00 	xor    %fs:0x28,%rax
153f:	00 00 
1541:	75 0c                	jne    154f <phase_3+0xb1>
1543:	48 83 c4 18          	add    $0x18,%rsp
1547:	c3                   	retq   
1548:	e8 ce 04 00 00       	callq  1a1b <explode_bomb>
154d:	eb e4                	jmp    1533 <phase_3+0x95>
154f:	e8 2c fb ff ff       	callq  1080 <__stack_chk_fail@plt>

可以看到这是 switch 的特征，根据 %rax 中的地址进行跳转，跳转后将数存入 %eax 中，再跳入 152d 处将数与输入的第二个参数进行比较，如果不相等则触发炸弹爆炸。
在这里我输入的第一个参数为 1，用 gdb 调试，查看 %rax 存储的地址：

1 2	(gdb) i r rax rax 0x80014eb 134223083

这里将跳转至 14eb 处，即输入的第二个参数应该是 0xab，才不会发生爆炸。因此，第三关答案为 1 171 (答案不唯一)。

bomb 4

和 phase_3 一样，接收两个参数：

1 2	15b9: 83 f8 02 cmp $0x2,%eax 15bc: 75 06 jne 15c4 <phase_4+0x33>

并且第一个参数不能大于 15：

1 2	15be: 83 3c 24 0e cmpl $0xe,(%rsp) 15c2: 76 05 jbe 15c9 <phase_4+0x38>

func4 的返回值必须要等于 3，并且第二个参数也要等于 3，负责触发炸弹爆炸：

15c9:	ba 0e 00 00 00       	mov    $0xe,%edx
15ce:	be 00 00 00 00       	mov    $0x0,%esi
15d3:	8b 3c 24             	mov    (%rsp),%edi
15d6:	e8 79 ff ff ff       	callq  1554 <func4>
15db:	83 f8 03             	cmp    $0x3,%eax
15de:	75 07                	jne    15e7 <phase_4+0x56>
15e0:	83 7c 24 04 03       	cmpl   $0x3,0x4(%rsp)
15e5:	74 05                	je     15ec <phase_4+0x5b>
15e7:	e8 2f 04 00 00       	callq  1a1b <explode_bomb>
15ec:	48 8b 44 24 08       	mov    0x8(%rsp),%rax
15f1:	64 48 33 04 25 28 00 	xor    %fs:0x28,%rax

再看看 func4：

0000000000001554 <func4>:
    1554:	48 83 ec 08          	sub    $0x8,%rsp
    1558:	89 d0                	mov    %edx,%eax
    155a:	29 f0                	sub    %esi,%eax
    155c:	89 c1                	mov    %eax,%ecx
    155e:	c1 e9 1f             	shr    $0x1f,%ecx
    1561:	01 c1                	add    %eax,%ecx
    1563:	d1 f9                	sar    %ecx
    1565:	01 f1                	add    %esi,%ecx
    1567:	39 f9                	cmp    %edi,%ecx
    1569:	7f 0c                	jg     1577 <func4+0x23>
    156b:	b8 00 00 00 00       	mov    $0x0,%eax
    1570:	7c 11                	jl     1583 <func4+0x2f>
    1572:	48 83 c4 08          	add    $0x8,%rsp
    1576:	c3                   	retq   
    1577:	8d 51 ff             	lea    -0x1(%rcx),%edx
    157a:	e8 d5 ff ff ff       	callq  1554 <func4>
    157f:	01 c0                	add    %eax,%eax
    1581:	eb ef                	jmp    1572 <func4+0x1e>
    1583:	8d 71 01             	lea    0x1(%rcx),%esi
    1586:	e8 c9 ff ff ff       	callq  1554 <func4>
    158b:	8d 44 00 01          	lea    0x1(%rax,%rax,1),%eax
    158f:	eb e1                	jmp    1572 <func4+0x1e>

说几个值得注意的点：

155a:	29 f0                	sub    %esi,%eax
155c:	89 c1                	mov    %eax,%ecx
155e:	c1 e9 1f             	shr    $0x1f,%ecx
1561:	01 c1                	add    %eax,%ecx
1563:	d1 f9                	sar    %ecx
1565:	01 f1                	add    %esi,%ecx

这几行的意思是，如果 %eax < %esi 的话，%ecx = (%eax - %esi + 1) / 2 + %esi = (%eax + %esi + 1) / 2，否则 %ecx = (%eax + %esi) / 2。接下来再看：

1567:	39 f9                	cmp    %edi,%ecx
1569:	7f 0c                	jg     1577 <func4+0x23>
156b:	b8 00 00 00 00       	mov    $0x0,%eax
1570:	7c 11                	jl     1583 <func4+0x2f>
1572:	48 83 c4 08          	add    $0x8,%rsp

只有当 %edi = %ecx 时，函数才会退出。最后再看：

1577:	8d 51 ff             	lea    -0x1(%rcx),%edx
157a:	e8 d5 ff ff ff       	callq  1554 <func4>
157f:	01 c0                	add    %eax,%eax
1581:	eb ef                	jmp    1572 <func4+0x1e>
1583:	8d 71 01             	lea    0x1(%rcx),%esi
1586:	e8 c9 ff ff ff       	callq  1554 <func4>
158b:	8d 44 00 01          	lea    0x1(%rax,%rax,1),%eax
158f:	eb e1                	jmp    1572 <func4+0x1e>

这段代码也就是修改 %rcx 的值，传递参数，递归调用。整个 func4 汇编代码用 Python 表示可以是：

def func4(a, c, d):
    if d < c:
        b = (d + c + 1) / 2
    else:
        b = (d + c) / 2
    
    if b < a:
        return func4(a, b+1, d)*2 + 1
    if b > a:
        return func4(a, c, b-1)*2
    else:
        return 0

因此，第四关答案为 13 3 或者 12 3。

bomb 5

与 phase_3 类似，首先我们知道输入的数至少有2个：

1
2
3

1629:	e8 f2 fa ff ff       	callq  1120 <__isoc99_sscanf@plt>
162e:	83 f8 01             	cmp    $0x1,%eax
1631:	7e 5a                	jle    168d <phase_5+0x87>

然后我们输入的一个参数的二进制后四位不能为1111(15)：

1633:	8b 04 24             	mov    (%rsp),%eax
1636:	83 e0 0f             	and    $0xf,%eax
1639:	89 04 24             	mov    %eax,(%rsp)
163c:	83 f8 0f             	cmp    $0xf,%eax
163f:	74 32                	je     1673 <phase_5+0x6d>

接下来分析数组，用 gdb 调试：

(gdb) p/x *(int *)($rsi)@100
$1 = {0xa, 0x2, 0xe, 0x7, 0x8, 0xc, 0xf, 0xb, 0x0, 0x4, 0x1, 0xd, 0x3, 0x9, 0x6, 0x5, 0x79206f53, ......}
(gdb) p *$rsi@16
$2 = {10, 2, 14, 7, 8, 12, 15, 11, 0, 4, 1, 13, 3, 9, 6, 5}

这个数组一共有 16 位，数组中的元素为 {10, 2, 14, 7, 8, 12, 15, 11, 0, 4, 1, 13, 3, 9, 6, 5}。
接下来的汇编代码表示一个循环，寄存器 %edx 初值定为 0，每次循环加 1，根据后面 cmp 0xf, %edx 可以得出，循环必须执行 15 次；同时ecx寄存器不断的累加数，每次把一个数的值存到 %eax 寄存器中，并且作为下次取值的索引，即对于每对索引 i 和值 v 而言，下一个 v' 位于索引 v 处，相当于构成了一个环形链表。另外，传入的第一个参数不能为 15，并且在遍历过程中，根据 cmp $0xf,%eax，%eax 也不能等于15。索引 15 对应的值为 5，因此，传入的第一个参数必须是 5，累加循环从索引 5 对应的值开始，这样才能保证能够循环 15 次，对 15 个遍历到的值进行累加，累加和为 (0 + 15) * 16 / 2 -5 = 115。
因此，第五关答案为 5 115。

bomb 6

phase_6 汇编代码太多，需要花费一定的时间。
首先，进入函数做一些初始化的工作后，根据 jmpq 177a <phase_6+0xe1> ，函数会跳转到 177a 处执行：

175d:	48 83 c3 01          	add    $0x1,%rbx
1761:	83 fb 05             	cmp    $0x5,%ebx
1764:	7f 0c                	jg     1772 <phase_6+0xd9>
1766:	41 8b 44 9d 00       	mov    0x0(%r13,%rbx,4),%eax
176b:	39 45 00             	cmp    %eax,0x0(%rbp)
176e:	75 ed                	jne    175d <phase_6+0xc4>
1770:	eb e6                	jmp    1758 <phase_6+0xbf>
1772:	49 83 c7 01          	add    $0x1,%r15
1776:	49 83 c6 04          	add    $0x4,%r14
177a:	4c 89 f5             	mov    %r14,%rbp
177d:	41 8b 06             	mov    (%r14),%eax
1780:	83 e8 01             	sub    $0x1,%eax
1783:	83 f8 05             	cmp    $0x5,%eax
1786:	0f 87 47 ff ff ff    	ja     16d3 <phase_6+0x3a>
178c:	49 83 ff 06          	cmp    $0x6,%r15
1790:	0f 84 47 ff ff ff    	je     16dd <phase_6+0x44>
1796:	4c 89 fb             	mov    %r15,%rbx
1799:	eb cb                	jmp    1766 <phase_6+0xcd>

从 177a 处开始看起，首先会判断传入的第一个参数减 1 后是否大于 5，也就是这里需要保证参数不能超过 6。之后跳入 1766 处执行，1766 -> 176e -> 175d -> 1766 构成了一个循环，判断当前参数的后面几个参数是否与当前参数相等，相等则炸弹爆炸。然后跳到 1772 处执行，判断第二个参数减 1 后是否大于 5……即整个这一部分代码是一个大循环，来保证 6 个参数不大于 6，并且各不相等。
接下来，函数会跳到 16dd 处执行：

16dd:	48 8d 74 24 20       	lea    0x20(%rsp),%rsi
16e2:	49 8d 7c 24 18       	lea    0x18(%r12),%rdi
16e7:	41 8b 0c 24          	mov    (%r12),%ecx
16eb:	b8 01 00 00 00       	mov    $0x1,%eax
16f0:	48 8d 15 19 3b 00 00 	lea    0x3b19(%rip),%rdx        # 5210 <node1>
16f7:	83 f9 01             	cmp    $0x1,%ecx
16fa:	7e 0b                	jle    1707 <phase_6+0x6e>
16fc:	48 8b 52 08          	mov    0x8(%rdx),%rdx
1700:	83 c0 01             	add    $0x1,%eax
1703:	39 c8                	cmp    %ecx,%eax
1705:	75 f5                	jne    16fc <phase_6+0x63>
1707:	48 89 16             	mov    %rdx,(%rsi)
170a:	49 83 c4 04          	add    $0x4,%r12
170e:	48 83 c6 08          	add    $0x8,%rsi
1712:	4c 39 e7             	cmp    %r12,%rdi
1715:	75 d0                	jne    16e7 <phase_6+0x4e>

我们输入的参数数组存放在 %r12 中，根据 lea 0x18(%r12),%rdi 可以得出，%rdi 存放了数组的结束地址。这段代码做的就是根据我们输入的参数将 node 按照顺序放入栈中：

for(int i = 0; i < 6; i++)
{
    %rdx = 0x3b19(%rip);
    for(int j = 0; j < arr[i]; j++)
        %rdx = addr + 0x8;
    %rsi = *%rdx;
}

再看之后的代码，将 %rax 指向 %rbx 下一个链表节点：

1717:	48 8b 5c 24 20       	mov    0x20(%rsp),%rbx
171c:	48 8b 44 24 28       	mov    0x28(%rsp),%rax
1721:	48 89 43 08          	mov    %rax,0x8(%rbx)
...

最后，比较链表节点中第一个字段值的大小，如果前一个节点值小于后一个节点值，炸弹爆炸：

179b:	48 8b 5b 08          	mov    0x8(%rbx),%rbx
179f:	83 ed 01             	sub    $0x1,%ebp
17a2:	74 11                	je     17b5 <phase_6+0x11c>
17a4:	48 8b 43 08          	mov    0x8(%rbx),%rax
17a8:	8b 00                	mov    (%rax),%eax
17aa:	39 03                	cmp    %eax,(%rbx)
17ac:	7d ed                	jge    179b <phase_6+0x102>
17ae:	e8 68 02 00 00       	callq  1a1b <explode_bomb>

在此我们知道数据是根据每个节点中的第一个数升序排列。
所以只需要查看初始链表存储的数据即可得出答案：

(gdb) i r rdx
rdx            0x8005210        134238736
(gdb) p *134238736
$1 = 581
(gdb) p *134238752
$2 = 563
(gdb) p *134238768
$3 = 687
(gdb) p *134238784
$4 = 154
(gdb) p *134238800
$5 = 170
(gdb) p *134238808
$6 = 134238480
(gdb) p *134238480
$7 = 454

链表节点的值为 581 563 687 154 170 454，由大到小排序的节点编号 3 1 2 6 5 4 即为第六关答案。

最终结果

hey-kong@LAPTOP-9010T96A:/mnt/c/ubuntu/csapp/bomb65$ ./bomb
Welcome to my fiendish little bomb. You have 6 phases with
which to blow yourself up. Have a nice day!
I am just a renegade hockey mom.
Phase 1 defused. How about the next one?
0 1 1 2 3 5
That's number 2.  Keep going!
1 171
Halfway there!
13 3
So you got that one.  Try this one.
5 115
Good work!  On to the next...
3 1 2 6 5 4
Congratulations! You've defused the bomb!

repo

My solutions to CSAPP labs

Gepostet vor 2021-01-187 minutes lesen (Über 1065 Wörter)

从 Row Cache 的 Get 来看 Rocksdb LRUCache

本文简单介绍 RocksDB 6.7.3 版本的 LRUCache。

Row Cache

Row Cache 对查找的 key 在 SST 中对应的 value 进行 cache。如果 row_cache 打开，在 TableCache::Get 函数中，会调用 CreateRowCacheKeyPrefix 和 GetFromRowCache 获取 row cache 的 key（fd_number + seq_no + user_key），在 GetFromRowCache 中，会调用 row_cache->Lookup，得到 row cache 缓存的 row_handle，构造 found_row_cache_entry 指针指向 value，利用 Cleannable 类的特性，可以通过减少一次对 value 内存拷贝的方式来获取最终的结果。

Status TableCache::Get(const ReadOptions& options,
                       const InternalKeyComparator& internal_comparator,
                       const FileMetaData& file_meta, const Slice& k,
                       GetContext* get_context,
                       const SliceTransform* prefix_extractor,
                       HistogramImpl* file_read_hist, bool skip_filters,
                       int level) {
  ...
  if (ioptions_.row_cache && !get_context->NeedToReadSequence()) {
    auto user_key = ExtractUserKey(k);
    CreateRowCacheKeyPrefix(options, fd, k, get_context, row_cache_key);
    done = GetFromRowCache(user_key, row_cache_key, row_cache_key.Size(),
                           get_context);
    if (!done) {
      row_cache_entry = &row_cache_entry_buffer;
    }
  }
  ...
}

void TableCache::CreateRowCacheKeyPrefix(const ReadOptions& options,
                                         const FileDescriptor& fd,
                                         const Slice& internal_key,
                                         GetContext* get_context,
                                         IterKey& row_cache_key) {
  uint64_t fd_number = fd.GetNumber();
  uint64_t seq_no = 0;
  ...
  AppendVarint64(&row_cache_key, fd_number);
  AppendVarint64(&row_cache_key, seq_no);
}

bool TableCache::GetFromRowCache(const Slice& user_key, IterKey& row_cache_key,
                                 size_t prefix_size, GetContext* get_context) {
  bool found = false;

  row_cache_key.TrimAppend(prefix_size, user_key.data(), user_key.size());
  if (auto row_handle =
          ioptions_.row_cache->Lookup(row_cache_key.GetUserKey())) {
    // Cleanable routine to release the cache entry
    Cleanable value_pinner;
    auto release_cache_entry_func = [](void* cache_to_clean,
                                       void* cache_handle) {
      ((Cache*)cache_to_clean)->Release((Cache::Handle*)cache_handle);
    };
    auto found_row_cache_entry =
        static_cast<const std::string*>(ioptions_.row_cache->Value(row_handle));
    // If it comes here value is located on the cache.
    // found_row_cache_entry points to the value on cache,
    // and value_pinner has cleanup procedure for the cached entry.
    // After replayGetContextLog() returns, get_context.pinnable_slice_
    // will point to cache entry buffer (or a copy based on that) and
    // cleanup routine under value_pinner will be delegated to
    // get_context.pinnable_slice_. Cache entry is released when
    // get_context.pinnable_slice_ is reset.
    value_pinner.RegisterCleanup(release_cache_entry_func,
                                 ioptions_.row_cache.get(), row_handle);
    replayGetContextLog(*found_row_cache_entry, user_key, get_context,
                        &value_pinner);
    RecordTick(ioptions_.statistics, ROW_CACHE_HIT);
    found = true;
  } else {
    RecordTick(ioptions_.statistics, ROW_CACHE_MISS);
  }
  return found;
}

LRUCache 类

Cache
定义了 Cache 的接口，包括 Insert, Lookup, Release 等操作。
ShardedCache
支持对 Cache 进行分桶，分桶数量为 2^num_shard_bits，每个桶的容量相等。分桶的依据是取 key 的 hash 值的高 num_shard_bits 位。
LRUCache
实现了 ShardedCache，维护了一个 LRUCacheShard 数组，一个 shard 就是一个桶。
CacheShard
定义了一个桶的接口，包括 Insert, Lookup, Release 等操作，Cache 的相关调用经过分桶处理后，都会调用指定桶的对应操作。
LRUCacheShard
实现了 CacheShard，维护了一个 LRU list 和 hash table，用来实现 LRU 策略，他们的成员类型都是 LRUHandle。
LRUHandle
保存 key 和 value 的单元，并且包含前向和后续指针，可以组成双向循环链表作为 LRU list。
LRUHandleTable
hash table 的实现，根据 key 再次做了分组处理，并且尽量保证每个桶中只有一个元素，元素类型为 LRUHandle。提供了Lookup, Insert, Remove操作。

Lookup

在 GetFromRowCache 中，会调用 row_cache->Lookup，这里实际调用的是 ShardedCache::Lookup

Cache::Handle* ShardedCache::Lookup(const Slice& key, Statistics* /*stats*/) {
  uint32_t hash = HashSlice(key);
  return GetShard(Shard(hash))->Lookup(key, hash);
}

获取哈希值，根据 hash 值的高 num_shard_bits 位获取 shard，再调用 LRUCacheShard::Lookup

Cache::Handle* LRUCacheShard::Lookup(const Slice& key, uint32_t hash) {
  MutexLock l(&mutex_);
  LRUHandle* e = table_.Lookup(key, hash);
  if (e != nullptr) {
    assert(e->InCache());
    if (!e->HasRefs()) {
      // The entry is in LRU since it's in hash and has no external references
      LRU_Remove(e);
    }
    e->Ref();
    e->SetHit();
  }
  return reinterpret_cast<Cache::Handle*>(e);
}

LRUCacheShard::Lookup 中又会调用 LRUHandleTable::Lookup，在 FindPointer 中，hash 到特定位置后，如果当前位置的 hash 和当前 hash 不一样，或者 key 不一样，并且指针也不为空，则继续向下找，直到找到

LRUHandle* LRUHandleTable::Lookup(const Slice& key, uint32_t hash) {
  return *FindPointer(key, hash);
}

LRUHandle** LRUHandleTable::FindPointer(const Slice& key, uint32_t hash) {
  LRUHandle** ptr = &list_[hash & (length_ - 1)];
  while (*ptr != nullptr && ((*ptr)->hash != hash || key != (*ptr)->key())) {
    ptr = &(*ptr)->next_hash;
  }
  return ptr;
}

总结

LRUCache 就是把多个 LRUCacheShard 组合起来，每个 LRUCacheShard 维护了一个 LRUHandle list 和 hash table，LRUHandleTable 用拉链法实现哈希表。通过对缓存的 Lookup 调用链分析可以看到具体的实现非常简练。

参考

Gepostet vor 2021-01-1714 minutes lesen (Über 2102 Wörter)

RocksDB Get 流程

本文对 RocksDB 6.7.3 版本的 Get 流程进行分析。

概述

(1) 获取当前的 SuperVersion。SuperVersion 用于管理 CF 的元数据，如当前版本号、内存中的 MemTable 和 Immutable MemTable、SST 文件信息等:

Struct SuperVersion {
    MemTable*            mem;
    MemTableListVersion* imm;
    Version*             current;
    ...
}

(2) 从内存读: 尝试从第一步 SuperVersion 中引用的 MemTable 以及Immutable MemTable 中获取对应的值

(3) 从持久化设备读: 首先通过 Table cache 获取到文件的元数据，如布隆过滤器(Bloom Filters)和数据块索引(Indexes)，如果 Block cache 中缓存了 SST 的数据块，如果命中那就直接读取成功，否则便需要从 SST 中读取数据块并插入到 Block cache

源码分析

DBImpl::Get

Status DBImpl::Get(const ReadOptions& read_options,
                   ColumnFamilyHandle* column_family, const Slice& key,
                   PinnableSlice* value) {
  GetImplOptions get_impl_options;
  get_impl_options.column_family = column_family;
  get_impl_options.value = value;
  return GetImpl(read_options, key, get_impl_options);
}

Rocksdb 的 Get 接口 DBImpl::Get 其实现主要靠 DBImpl::GetImpl 函数调用。

DBImpl::GetImpl

Status DBImpl::GetImpl(const ReadOptions& read_options, const Slice& key,
  ...
  SuperVersion* sv = GetAndRefSuperVersion(cfd);
  ...
  SequenceNumber snapshot;
  if (read_options.snapshot != nullptr) {
    if (get_impl_options.callback) {
      snapshot = get_impl_options.
    } else {
      snapshot =
          reinterpret_cast<const SnapshotImpl*>(read_options.snapshot)->number_;
    }
  } else {
    ....
    snapshot = last_seq_same_as_publish_seq_
                   ? versions_->LastSequence()
                   : versions_->LastPublishedSequence();
    ...
    }
  }
  ...
  if (!skip_memtable) {
    ...
      if (sv->mem->Get(lkey, get_impl_options.value->GetSelf(), &s,
                       &merge_context, &max_covering_tombstone_seq,
                       read_options, get_impl_options.callback,
                       get_impl_options.is_blob_index)) {
        ...
      } else if ((s.ok() || s.IsMergeInProgress()) &&
                 sv->imm->Get(lkey, get_impl_options.value->GetSelf(), &s,
                              &merge_context, &max_covering_tombstone_seq,
                              read_options, get_impl_options.callback,
                              get_impl_options.is_blob_index)) {
        ...
      }
    ...
  }
  if (!done) {
    sv->current->Get(
        read_options, lkey, get_impl_options.value, &s, &merge_context,
        &max_covering_tombstone_seq,
        get_impl_options.get_value ? get_impl_options.value_found : nullptr,
        nullptr, nullptr,
        get_impl_options.get_value ? get_impl_options.callback : nullptr,
        get_impl_options.get_value ? get_impl_options.is_blob_index : nullptr,
        get_impl_options.get_value);
    ...
  }
  ...
}

DBImpl::GetImpl 获取 SuperVersion 的信息，如果用户未指定 snapshot，需要获取当前的 snapshot。读取时不对 key 加锁，能读到什么数据完全取决于 Options 传入的 snapshot。

SuperVersion 中按照数据的新旧程度排序 MemTable -> MemTableListVersion -> Version，依次按序查找，如果在新的数据中找到符合 snapshot 规则的结果，就可以立即返回，完成本次查找。

MemTable::Get

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s,
                   MergeContext* merge_context,
                   SequenceNumber* max_covering_tombstone_seq,
                   SequenceNumber* seq, const ReadOptions& read_opts,
                   ReadCallback* callback, bool* is_blob_index, bool do_merge) {
  ...

  if (bloom_filter_ && !may_contain) {
    // iter is null if prefix bloom says the key does not exist
    PERF_COUNTER_ADD(bloom_memtable_miss_count, 1);
    *seq = kMaxSequenceNumber;
  } else {
    if (bloom_filter_) {
      PERF_COUNTER_ADD(bloom_memtable_hit_count, 1);
    }
    GetFromTable(key, *max_covering_tombstone_seq, do_merge, callback,
                 is_blob_index, value, s, merge_context, seq,
                 &found_final_value, &merge_in_progress);
  }

  ...
}

void MemTable::GetFromTable(const LookupKey& key,
                            SequenceNumber max_covering_tombstone_seq,
                            bool do_merge, ReadCallback* callback,
                            bool* is_blob_index, std::string* value, Status* s,
                            MergeContext* merge_context, SequenceNumber* seq,
                            bool* found_final_value, bool* merge_in_progress) {
  ...
  table_->Get(key, &saver, SaveValue);
  *seq = saver.seq;
}

利用 MemTableRep 的 Get 函数进行查找（以 SkipListRep 实现为例，在 skiplist 中进行查找，从 seek 到的位置开始向后遍历，遍历 entry 是否符合SaveValue 定义的规则）。SaveValue 函数查看当前 entry 是否还是当前查找的 key，如果不是则返回；查看当前 entry 的 snapshot 是否小于或等于需要查找的 snapshot，不符合则继续循环。如果 entry 的snapshot 符合上述条件，那么则跳出循环，返回查找结果。

MemTableListVersion::Get

bool Get(const LookupKey& key, std::string* value, Status* s,
        MergeContext* merge_context,
        SequenceNumber* max_covering_tombstone_seq,
        const ReadOptions& read_opts, ReadCallback* callback = nullptr,
        bool* is_blob_index = nullptr) {
    SequenceNumber seq;
    return Get(key, value, s, merge_context, max_covering_tombstone_seq, &seq,
                read_opts, callback, is_blob_index);
}

MemTableListVersion 用链表的形式保存了所有 Immutable memtable 的结构，查找时，按时间序依次查找于每一个 memtable，如果任何一个 memtable 查找到结果则立即返回，即返回最新的返回值。具体 memtable 查找见上述 MemTable::Get 接口。

Version::Get

void Version::Get(const ReadOptions& read_options, const LookupKey& k,
                  PinnableSlice* value, Status* status,
                  MergeContext* merge_context,
                  SequenceNumber* max_covering_tombstone_seq, bool* value_found,
                  bool* key_exists, SequenceNumber* seq, ReadCallback* callback,
                  bool* is_blob, bool do_merge) {
  ...
  FilePicker fp(
      storage_info_.files_, user_key, ikey, &storage_info_.level_files_brief_,
      storage_info_.num_non_empty_levels_, &storage_info_.file_indexer_,
      user_comparator(), internal_comparator());
  FdWithKeyRange* f = fp.GetNextFile();

  while (f != nullptr) {
    ...
    *status = table_cache_->Get(
        read_options, *internal_comparator(), *f->file_metadata, ikey,
        &get_context, mutable_cf_options_.prefix_extractor.get(),
        cfd_->internal_stats()->GetFileReadHist(fp.GetHitFileLevel()),
        IsFilterSkipped(static_cast<int>(fp.GetHitFileLevel()),
                        fp.IsHitFileLastInLevel()),
        fp.GetCurrentLevel());
    ...
    f = fp.GetNextFile();
  }
  ...
}

GetNextFile 函数会遍历所有的 level，然后再遍历每个 level 的所有的文件，这里会对 level 0 的文件做一个特殊处理，这是因为只有 level 0 的 SST 的 range 不是有序的，因此我们每次查找需要查找所有的文件，也就是会一个个的遍历；而在非 level 0，我们只需要按照二分查找来得到对应的文件即可，如果二分查找不存在，那么我就需要进入下一个 level 进行查找。

调用 TableCache::Get 遍历单个 SST 文件，如果查找到结果立即返回。

TableCache::Get

Status TableCache::Get(const ReadOptions& options,
                       const InternalKeyComparator& internal_comparator,
                       const FileMetaData& file_meta, const Slice& k,
                       GetContext* get_context,
                       const SliceTransform* prefix_extractor,
                       HistogramImpl* file_read_hist, bool skip_filters,
                       int level) {
  auto& fd = file_meta.fd;
  std::string* row_cache_entry = nullptr;
  bool done = false;
#ifndef ROCKSDB_LITE
  IterKey row_cache_key;
  std::string row_cache_entry_buffer;

  // Check row cache if enabled. Since row cache does not currently store
  // sequence numbers, we cannot use it if we need to fetch the sequence.
  if (ioptions_.row_cache && !get_context->NeedToReadSequence()) {
    auto user_key = ExtractUserKey(k);
    CreateRowCacheKeyPrefix(options, fd, k, get_context, row_cache_key);
    done = GetFromRowCache(user_key, row_cache_key, row_cache_key.Size(),
                           get_context);
    if (!done) {
      row_cache_entry = &row_cache_entry_buffer;
    }
  }
#endif  // ROCKSDB_LITE
  Status s;
  TableReader* t = fd.table_reader;
  Cache::Handle* handle = nullptr;
  if (!done && s.ok()) {
    if (t == nullptr) {
      s = FindTable(
          file_options_, internal_comparator, fd, &handle, prefix_extractor,
          options.read_tier == kBlockCacheTier /* no_io */,
          true /* record_read_stats */, file_read_hist, skip_filters, level);
      if (s.ok()) {
        t = GetTableReaderFromHandle(handle);
      }
    }
    SequenceNumber* max_covering_tombstone_seq =
        get_context->max_covering_tombstone_seq();
    if (s.ok() && max_covering_tombstone_seq != nullptr &&
        !options.ignore_range_deletions) {
      std::unique_ptr<FragmentedRangeTombstoneIterator> range_del_iter(
          t->NewRangeTombstoneIterator(options));
      if (range_del_iter != nullptr) {
        *max_covering_tombstone_seq = std::max(
            *max_covering_tombstone_seq,
            range_del_iter->MaxCoveringTombstoneSeqnum(ExtractUserKey(k)));
      }
    }
    if (s.ok()) {
      get_context->SetReplayLog(row_cache_entry);  // nullptr if no cache.
      s = t->Get(options, k, get_context, prefix_extractor, skip_filters);
      get_context->SetReplayLog(nullptr);
    } else if (options.read_tier == kBlockCacheTier && s.IsIncomplete()) {
      // Couldn't find Table in cache but treat as kFound if no_io set
      get_context->MarkKeyMayExist();
      s = Status::OK();
      done = true;
    }
  }

#ifndef ROCKSDB_LITE
  // Put the replay log in row cache only if something was found.
  if (!done && s.ok() && row_cache_entry && !row_cache_entry->empty()) {
    size_t charge =
        row_cache_key.Size() + row_cache_entry->size() + sizeof(std::string);
    void* row_ptr = new std::string(std::move(*row_cache_entry));
    ioptions_.row_cache->Insert(row_cache_key.GetUserKey(), row_ptr, charge,
                                &DeleteEntry<std::string>);
  }
#endif  // ROCKSDB_LITE

  if (handle != nullptr) {
    ReleaseHandle(handle);
  }
  return s;
}

Status TableCache::FindTable(const FileOptions& file_options,
                             const InternalKeyComparator& internal_comparator,
                             const FileDescriptor& fd, Cache::Handle** handle,
                             const SliceTransform* prefix_extractor,
                             const bool no_io, bool record_read_stats,
                             HistogramImpl* file_read_hist, bool skip_filters,
                             int level,
                             bool prefetch_index_and_filter_in_cache) {
  ...
    std::unique_ptr<TableReader> table_reader;
    s = GetTableReader(file_options, internal_comparator, fd,
                       false /* sequential mode */, record_read_stats,
                       file_read_hist, &table_reader, prefix_extractor,
                       skip_filters, level, prefetch_index_and_filter_in_cache);
    if (!s.ok()) {
      assert(table_reader == nullptr);
      RecordTick(ioptions_.statistics, NO_FILE_ERRORS);
      // We do not cache error results so that if the error is transient,
      // or somebody repairs the file, we recover automatically.
    } else {
      s = cache_->Insert(key, table_reader.get(), 1, &DeleteEntry<TableReader>,
                         handle);
      if (s.ok()) {
        // Release ownership of table reader.
        table_reader.release();
      }
  ...
  return s;
}

如果 row_cache 打开，首先它会计算 row cache 的 key，再在row cache 中进行一次查找，如果有对应的值则直接返回结果，否则则将会在对应的 SST 读取传递进来的 key。

调用 FindTable，进行对应 table_reader 的读取以及进行 Table cache。

接下来调用 t->Get，从 Block cache 或者 SST 中读取数据。

最后，如果 row_cache 打开，把读取的数据插入到 row cache 中。

BlockBasedTable::Get

for (iiter->Seek(key); iiter->Valid()&&!done; iiter->Next()) {
    ...
    NewDataBlockIterator(&biter);
    for(; biter.Valid; biter.Next()) {
        ...
        get_context->SaveValue(biter->Value());
    }
}

在 Table Cache 中，假设最终缓存的 table reader 是一个 BlockBasedTable 对象，调用 BlockBasedTable::Get。

首先，根据 Table 的元数据信息（布隆过滤器，数据块Index）查找 SST 内部的 Block。

调用 NewDataBlockIterator，若 Block 在 Block Cache 当中，直接返回对象地址，否则，发生磁盘IO，读取 SST 的 Block，构造 Block 对象并缓存其地址在 Block Cache 中。

找到 key 对应的 value，调用 get_context->SaveValue，直接将 Block 中的数据地址赋给用户传进来的 PinnableSlice* 中，减少了一次数据拷贝，并用引用计数避免 Block 被淘汰值被清除。