URL-Encoding a Byte String?

Keywords: URL, string, byte, encoding. Last updated: 2023-10-16

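I'm writing a BitTorrent client. One of the steps requires the program to send an HTTP GET request to the tracker, and that request contains the SHA1 hash of part of the torrent file. I used Fiddler2 to intercept the request that Azureus sends to the tracker.

The hash Azureus sends is URL-encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR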

Before URL-encoding, the hash should look like this: d90c3ce39418f0c5d98358e03349262b608cbf52

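I noticed that it isn't as simple as placing a "%" sign before every two characters, so how do I encode this byte string to get the same result as Azureus?

Thanks in advance.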

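Actually, you can place a % sign before every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to be percent-encoded. Using %52 is equivalent.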
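As a minimal sketch of that equivalence, the snippet below escapes every byte as %XX and checks that the result decodes to the same bytes as the Azureus form. The helper name `percent_encode_all` is made up for illustration; `unquote_to_bytes` is from Python's standard `urllib.parse` module.

```python
from urllib.parse import unquote_to_bytes


def percent_encode_all(raw: bytes) -> str:
    """Escape every byte as %XX, including bytes that are URL-safe."""
    return ''.join(f'%{b:02X}' for b in raw)


# The string captured from Azureus, with safe characters left unescaped.
azureus_form = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'

# Recover the raw 20-byte hash, then re-encode it with every byte escaped.
raw = unquote_to_bytes(azureus_form)
fully_escaped = percent_encode_all(raw)

# Both spellings decode to the same bytes.
print(unquote_to_bytes(fully_escaped) == raw)
```

A decoder on the receiving side should treat both forms as the same value, which is why either one can be sent.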

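To go the other way, walk the string from left to right. If you encounter a %, output the next two characters, converting uppercase to lowercase. If you encounter anything else, output that character's ASCII code in hexadecimal, using lowercase letters.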

%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R

The ASCII code of X is 0x58, so it becomes 58. The ASCII code of 3 is 0x33.
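The left-to-right walk described above could be sketched in Python roughly as follows. The function name `esc_to_hex` is hypothetical, and the sketch assumes well-formed input in which every % is followed by two hex digits.

```python
def esc_to_hex(esc: str) -> str:
    """Convert a percent-encoded string to a lowercase hex string."""
    out = []
    i = 0
    while i < len(esc):
        if esc[i] == '%':
            # Escaped byte: copy the two hex digits, lowercased.
            out.append(esc[i + 1:i + 3].lower())
            i += 3
        else:
            # Literal character: emit its ASCII code as two hex digits.
            out.append(format(ord(esc[i]), '02x'))
            i += 1
    return ''.join(out)


print(esc_to_hex('%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'))
# d90c3ce39418f0c5d98358e03349262b608cbf52
```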

(I'm a little puzzled about why you're asking. Your question clearly shows that you already recognized this as URL encoding.)

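Although I'm well aware that the original question was about C++, it can sometimes be useful to see other solutions. So, for what it's worth (10 years later), here is an alternative solution implemented in Python 3.6+: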

import binascii
import urllib.parse


def hex_str_to_esc_str(s: str, *, encoding: str = 'Windows-1252') -> str:
    # Decode the hex string into a text string using the given encoding
    decoded = binascii.unhexlify(s).decode(encoding)
    # Percent-escape the string and return it
    return urllib.parse.quote(decoded, encoding=encoding)


def esc_str_to_hex_str(s: str, *, encoding: str = 'Windows-1252') -> str:
    # Unescape the percent-encoded string using the given encoding
    decoded = urllib.parse.unquote(s, encoding=encoding)
    # Encode back to bytes, hexlify, and return
    return decoded.encode(encoding).hex()

Two basic tests:

esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True

Remarks

Windows-1252 (also known as cp1252) came out as the default encoding as a result of the following test:

import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)

This gives a very strong hint:

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
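One caveat worth keeping in mind: a SHA1 digest is arbitrary binary data rather than text, so the detected encoding is only a heuristic, and Python's cp1252 codec cannot decode a few byte values (0x81, for instance). A possible variant, sketched here with the made-up names `encode_info_hash` and `decode_info_hash`, skips the text step and works on bytes directly, since `urllib.parse.quote` accepts `bytes` and `urllib.parse.unquote_to_bytes` returns them.

```python
import urllib.parse


def encode_info_hash(raw: bytes) -> str:
    """Percent-encode raw bytes, leaving URL-safe characters as they are."""
    # safe='' so that no reserved character (such as '/') is left unescaped.
    return urllib.parse.quote(raw, safe='')


def decode_info_hash(esc: str) -> bytes:
    """Reverse of encode_info_hash: percent-encoded string back to bytes."""
    return urllib.parse.unquote_to_bytes(esc)


hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'

print(encode_info_hash(bytes.fromhex(hex_str)) == esc_str)  # True
print(decode_info_hash(esc_str).hex() == hex_str)           # True
```

For this particular hash the output matches the Windows-1252 approach; the difference would only show up for digests that contain a byte the text codec rejects.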