Git Pack 文件中对象长度的变长编码的编解码解析

Git 使用了一种称为 "变长编码"（Variable-Length Encoding）的方案来编码 Pack 文件中的对象大小。这种编码方案的主要优点是它可以用较少的字节来表示较小的数值，同时也能表示较大的数值。

以下是 Git 变长编码的主要规则：

每个编码的数值都是由一个或多个字节表示的。每个字节由两部分组成：一个 标志位 和一个 大小部分。
标志位 是每个字节的最高位。如果这个字节后面还有更多的字节，那么继续位就是 1，否则就是 0。
大小部分 是每个字节的低 7 位。它表示的是编码数值的一部分，最低的部分在前，最高的部分在后。
为了得到编码的数值，需要将所有字节的大小部分合并起来，形成一个二进制数。

这种编码方案使得 Git 能够以一种非常紧凑的方式存储对象的大小，特别是对于较小的对象，只需要一个字节就可以表示。同时，它也能表示非常大的对象，只需要增加更多的字节即可。

到这里的时候以为可以自己写出来编码的代码，后来发现还是进行更详细的分析：

Git Pack 文件使用变长编码来表示对象大小。这意味着对象大小不定长,可以用 1 个或更多字节来表示。
每个字节的最高位(第 8 位)用于表示是否还有后续字节。如果最高位为 1, 表示后面还有更多字节,如果最高位为 0, 表示这个字节是最后一个字节。
每个字节的低 7 位(第 7 位到第 0 位)用于表示对象大小的一部分。
要编码一个对象大小,我们先将其转为二进制数字。然后从低位开始,每 7 位一组,如果最后不足 7 位,高位补 0。
对每组 7 位,我们设置最高位为 1 或 0, 表示是否还有更多组。如果后面还有组, 最高位设置为 1, 如果这是最后一组,最高位设置为 0。
把这些组合起来,就得到了变长编码后的字节序列。
举个例子,要编码 130, 130 的二进制是 10000010。分组后得到 0000001 和 0000010 两组。设置最高位后分别得到 10000010 和 00000001, 转为 16 进制就是 0x82 和 0x01。
所以 130 的变长编码后的字节序列是 [0x82, 0x01]。 0x82 的最高位为 1,表示后面还有字节 0x01 的最高位为 0, 表示这是最后一个字节。

//!
//! This module provides functions to encode and decode the size of Git objects in a Git Pack file.
//!
//! In a Git Pack file, the size of each object is encoded in a variable-length format. Each byte of the encoded size
//! consists of a continuation bit and 7 bits of size. The continuation bit is the highest bit of the byte, and it is set
//! to 1 if there are more bytes of the size. The lower 7 bits of the byte represent a part of the size, with the lowest
//! part first.
//!
//! For example, the size 130 is encoded as two bytes: 10000001 00000001. The first byte has the continuation bit set to 1
//! and the size part set to 1, and the second byte has the continuation bit set to 0 and the size part set to 1. When
//! decoded, the size parts are combined to form the size: 1 + (1 << 7) = 130.
//!
//! The `decode` function takes a reader that implements the `Read` trait, and returns the decoded size. It reads bytes
//! from the reader until it finds a byte with the continuation bit set to 0, and combines the size parts of the bytes to
//! form the size.
//!
//! The `encode` function takes a writer that implements the `Write` trait and a size, and writes the encoded size to the
//! writer. It splits the size into parts of 7 bits, and writes bytes with the continuation bit and the size part to the
//! writer.
//!

use std::io::{Read, Write};

// Function to decode the size of a Git object from a reader
pub fn decode<R: Read>(mut reader: R) -> std::io::Result<usize> {
    // Initialize the size and shift variables
    let mut size = 0;
    let mut shift = 0;

    // Buffer to hold the current byte
    let mut buffer = [0; 1];

    // Loop over the bytes from the reader
    while let Ok(_) = reader.read_exact(&mut buffer) {
        // Get the current byte
        let byte = buffer[0];
        // Update the size by bitwise OR with the lower 7 bits of the byte, shifted left by the shift amount
        size |= ((byte & 0x7f) as usize) << shift;
        // If the highest bit of the byte is 0, break the loop
        if byte & 0x80 == 0 {
            break;
        }
        // Increase the shift amount by 7 for the next byte
        shift += 7;
    }

    // Return the decoded size
    Ok(size)
}

// Function to encode the size of a Git object and write it to a writer
pub fn encode<W: Write>(mut writer: W, mut size: usize) -> std::io::Result<()> {
    // Buffer to hold the current byte
    let mut buffer = [0u8; 1];

    // Loop until the size is 0
    while size > 0 {
        // Get the lower 7 bits of the size
        buffer[0] = (size & 0x7f) as u8;
        // Shift the size right by 7 bits
        size >>= 7;
        // If there are more bits, set the highest bit of the byte
        if size > 0 {
            buffer[0] |= 0x80;
        }
        // Write the byte to the writer
        writer.write_all(&buffer)?;
    }

    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    #[test]
    fn test_decode() {
        let data = [0x82, 0x01];
        let cursor = Cursor::new(data);
        let result = decode(cursor).unwrap();
        assert_eq!(result, 130);
    }

    #[test]
    fn test_encode() {
        let size = 130;
        let mut data = Vec::new();
        encode(&mut data, size).unwrap();
        assert_eq!(data, [0x82, 0x01]);
    }
}

对于解码，我们按照之前的逻辑再逆向分析：

当读入一个 byte 的二进制是 10000010，首位是 1 说明还需要读取后面的一个 byte 。这里把这个字节的后 7 位存下来，就是 0000010。
继续读下一个 byte 的二进制 00000001，首位是 0 说明长度读取已经结束了。把这个字节的后 7 位和前面的 7 为拼接在一起，形成 00000010000010
前面补 0 去掉得到 10000010 的二进制，转换为十进制得到 130

计算为长度后，之后流中 130 个 byte 就是这个对象的数据了。通过这种方式当 Git Pack 以流的方式传入，就可以按照顺序读取解析出所有的对象。当然 Delta Object 和 Reference Object 对象还需要更详细的解析，

希望酒后再失眠的时候能和 AI 机器人们聊清楚 Delta 和 Reference 两种对象解析。