Read from binary file to array: Preceding arbitrary numbers

Keywords: preceding, arbitrary, numbers, array, binary file, read. Updated: 2023-10-16

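I am trying to read from a binary file into a char array. When I print an array entry, an arbitrary number, a newline, and then the desired number are printed. I really cannot make sense of this. The first few bytes of the file are: 00 00 08 03 00 00 EA 60 00 00 00 1C 00 00 00 1C 00 00

My code: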

void MNISTreader::loadImagesAndLabelsToMemory(std::string imagesPath,
                                              std::string labelsPath) {
    std::ifstream is(imagesPath.c_str());
    char *data = new char[12];
    is.read(data, 12);
    std::cout << std::hex << (int)data[2] << std::endl;
    delete [] data;
    is.close();
}

For example, it prints:

fffffff9b
8

The 8 is correct. The preceding number varies from run to run. Where does this newline come from?

You asked about reading data from a binary file and storing it in a char[], and you showed us the following code from your question:

void MNISTreader::loadImagesAndLabelsToMemory(std::string imagesPath,
                                              std::string labelsPath) {
    std::ifstream is(imagesPath.c_str());
    char *data = new char[12];
    is.read(data, 12);
    std::cout << std::hex << (int)data[2] << std::endl;
    delete [] data;
    is.close();
}

And you want to know:

The preceding number varies from run to run. Where does this newline come from?

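Before that question can really be answered, you need to know the binary file, that is, how the file is structured internally. When you read data from a binary file, keep in mind that some program wrote the data to that file, and that the data was written in a structured format. The important point is that this format is unique to each family or type of binary file. Most binary files follow a common pattern: they contain a header, then possibly sub-headers, then clusters, packets, or chunks, or simply raw data after the header, while some binary files are purely raw data. You have to know how the file is laid out in memory.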

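What is the structure of the data?
- Is the first entry in the file a char (1 byte), an int (typically 4 bytes on mainstream 32-bit and 64-bit platforms, although the C++ standard does not fix its size), a float (4 bytes), a double (8 bytes), and so on?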

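Based on your code, you have a char array of size 12, and since a char is 1 byte in memory, you are requesting 12 bytes. The problem is that you are pulling in 12 consecutive bytes without knowing the file structure. How can you be sure whether the first byte was actually written as a char, as an unsigned char, or as part of an int?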

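Consider these two different binary file structures. Each is built from C++ structs that hold all the required data, and each is written out to a file in binary format.

A common header structure that both file structures will use: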

struct Header {
    // Size of Header
    std::string filepath;
    std::string filename;
    unsigned int pathSize;
    unsigned int filenameSize;
    unsigned int headerSize;
    unsigned int dataSizeInBytes;
};

Unique structure for File A

struct DataA {
    float width;
    float length;
    float height;
    float dummy;
};

Unique structure for File B

struct DataB {
    double length;
    double width;
};

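In memory, the file would look something like this:

The first few bytes are the path, the file name, and the stored sizes
- This can vary from file to file, depending on how many characters are used for the file path and the file name.
- After the strings, we know that the next 4 values are unsigned, so with a 4-byte unsigned int that is 4 bytes x 4 = 16 bytes in total.
- If the program that wrote the file used an 8-byte unsigned type instead, it would be 8 bytes x 4 = 32 bytes in total.
- If we know the architecture and compiler that wrote the file, we can get past this easily enough.
- Of these 4 unsigned values, the first two are the lengths of the path and the file name. These lengths could just as well be stored first in the file, ahead of the actual strings; the order could be reversed.
- Then come the next 2 unsigned values:
- The first is the full size of the header, which can be used to read in and skip over the header.
- The second tells you the size of the data to pull in next. The data could be stored in chunks, with a count of how many chunks there are, because it could be a series of identical data structures; for simplicity I left out chunks and counts and use a single-instance structure.
- This is where we find out how many bytes of data to extract.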

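Let's consider the two different binary files at the point where we are already past all the header information and are reading in the bytes to parse. We reach the size of the data in bytes: for FileA we have 4 floats = 16 bytes, and for FileB we have 2 doubles = 16 bytes. So now we know how to call the method that reads x amount of data of type y. Since y is a type and x is an amount, we can write this as y(x), as if y were a built-in type (int, float, double, char, and so on) and x a numeric initializer passed to its constructor.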

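Now suppose we are reading either one of these two files without knowing the data structure or how its information was stored in the file. The header tells us that the data is 16 bytes, but we do not know whether it was stored as 4 floats = 16 bytes or as 2 doubles = 16 bytes. Both structures are 16 bytes, but they hold a different number of values of different types.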

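To summarize: trying to read a binary file without knowing its data structure, and therefore without knowing how to parse it, really does become an X/Y Problem.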

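Now let's assume that you do know the file structure. To try to answer your question from above, you can run this small program and look at some of the results: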

#include <cstring>
#include <iostream>
#include <string>

int main() {
    // Using Two Strings (the backslashes must be escaped)
    std::string imagesPath("ImagesPath\\");
    std::string labelsPath("LabelsPath\\");
    // Concat Of Two Strings
    std::string full = imagesPath + labelsPath;
    // Display Of Both
    std::cout << full << std::endl;

    // Data Type Pointers
    char* cData = new char[12];
    unsigned char* ucData = new unsigned char[12];
    // Loop To Set Both Pointers To The String
    for (unsigned n = 0; n < 12; ++n) {
        cData[n] = full.at(n);
        ucData[n] = static_cast<unsigned char>(full.at(n));
    }
    // Display Of Both Strings By Character And Unsigned Character
    for (unsigned n = 0; n < 12; ++n) {
        std::cout << cData[n];
    }
    std::cout << std::endl;
    for (unsigned n = 0; n < 12; ++n) {
        std::cout << ucData[n];
    }
    std::cout << std::endl;
    // Both Yield The Same Result
    // Okay, let's release the memory behind these pointers.
    delete[] cData;
    delete[] ucData;
    cData = nullptr;
    ucData = nullptr;

    // Create Two Data Structures, 1 For Each Different File
    struct A {
        float length;
        float width;
        float height;
        float padding;
    };
    struct B {
        double length;
        double width;
    };
    // Constants For Our Data Structure Sizes
    const unsigned sizeOfA = sizeof(A);
    const unsigned sizeOfB = sizeof(B);
    // Create And Populate An Instance Of Each
    A a;
    a.length = 3.0f;
    a.width = 3.0f;
    a.height = 3.0f;
    a.padding = 0.0f;
    B b;
    b.length = 5.0;
    b.width = 5.0;

    // Let's First Use The `char[]` Method For Each Struct And Print Them,
    // but we need 16 bytes instead of the `12` from your problem.
    char* aData = new char[sizeOfA];  // FileA
    char* bData = new char[sizeOfB];  // FileB
    // Copy the raw bytes of each struct, as a binary file would hold them.
    // (Assigning a float to one char would store only a truncated value.)
    std::memcpy(aData, &a, sizeOfA);
    std::memcpy(bData, &b, sizeOfB);
    // Print Out The Result By Individual Bytes For A, Then For B
    for (unsigned n = 0; n < sizeOfA; ++n) {
        std::cout << static_cast<int>(aData[n]) << " ";
    }
    std::cout << std::endl;
    for (unsigned n = 0; n < sizeOfB; ++n) {
        std::cout << static_cast<int>(bData[n]) << " ";
    }
    std::cout << std::endl;
    // Let's Print Out Both Again, Reinterpreted As Their Appropriate Types.
    // Copying into typed arrays avoids the alignment and aliasing problems
    // of casting the char pointer itself.
    float aValues[4];
    double bValues[2];
    std::memcpy(aValues, aData, sizeof aValues);
    std::memcpy(bValues, bData, sizeof bValues);
    for (unsigned n = 0; n < 4; ++n) {
        std::cout << aValues[n] << " ";
    }
    std::cout << std::endl;
    for (unsigned n = 0; n < 2; ++n) {
        std::cout << bValues[n] << " ";
    }
    std::cout << std::endl;
    // Clean Up Memory
    delete[] aData;
    delete[] bData;
    aData = nullptr;
    bData = nullptr;

    // Even By Knowing The Appropriate Sizes We Can See A Difference
    // In The Stored Data Types. We Can Now Do The Same As Above
    // But With Unsigned Char & See If It Makes A Difference.
    unsigned char* ucAData = new unsigned char[sizeOfA];
    unsigned char* ucBData = new unsigned char[sizeOfB];
    std::memcpy(ucAData, &a, sizeOfA);
    std::memcpy(ucBData, &b, sizeOfB);
    // Print Out The Result By Individual Bytes For A, Then For B
    for (unsigned n = 0; n < sizeOfA; ++n) {
        std::cout << static_cast<int>(ucAData[n]) << " ";
    }
    std::cout << std::endl;
    for (unsigned n = 0; n < sizeOfB; ++n) {
        std::cout << static_cast<int>(ucBData[n]) << " ";
    }
    std::cout << std::endl;
    // Let's Print Out Both Again, Reinterpreted As Their Appropriate Types
    std::memcpy(aValues, ucAData, sizeof aValues);
    std::memcpy(bValues, ucBData, sizeof bValues);
    for (unsigned n = 0; n < 4; ++n) {
        std::cout << aValues[n] << " ";
    }
    std::cout << std::endl;
    for (unsigned n = 0; n < 2; ++n) {
        std::cout << bValues[n] << " ";
    }
    std::cout << std::endl;
    // Clean Up Memory
    delete[] ucAData;
    delete[] ucBData;
    ucAData = nullptr;
    ucBData = nullptr;

    // So even changing from `char` to `unsigned char` doesn't help here, even
    // when reinterpreting the bytes, because these 2 files are different from
    // one another. They have a unique signature. A family of files that one
    // specific application saves will all follow the same structure. You need
    // to know the structure of the binary file, how much data to pull in, and,
    // the big key word here, `what type` of data you are reading. Without
    // that, this becomes an (X/Y) Problem. This is the hard part about
    // parsing binaries: you need to know the file structure.
    char c = ' ';
    std::cin.get(c);
    return 0;
}

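After running the short program above, don't worry about what each value on the screen is; just look at the patterns and compare the two file structures. The program only shows that a 16-byte struct of floats is not the same as a 16-byte struct of doubles. So when we go back to your problem, where you are reading 12 individual consecutive bytes, the question becomes: what do those first 12 bytes represent? Are they three ints or three unsigned ints (assuming 4-byte ints), three floats, or some combination such as one double and one float? What is the actual data structure of the binary file you are reading?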

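Edit: in the small program I wrote, I forgot to add << std::hex << to the print statements. It could be added to each print of the indexed pointers, but that is not necessary, because the output only serves to show how the two data structures differ in memory and what their patterns look like.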