Web Scraping with Boost, Returning Hex not HTML

Updated: 2023-10-16

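I am building a web scraper that downloads the HTML of a web page, parses it, and displays the time in each US time zone. I took the sample code from Rosetta Code. However, that example uses Boost 1.46.1 on Windows, whereas I am using Boost 1.60.0 on Mac OS X. Below is the code I modified from the Rosetta Code example to get it to work.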

#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/regex.hpp>
#include <boost/asio.hpp>
#include <boost/system/config.hpp>
#include <boost/asio/ip/tcp.hpp>
using namespace std;

void GetTime()
{
    boost::asio::ip::tcp::iostream s("tycho.usno.navy.mil","http");
    cout << s << "\n";          //check to see what downloaded from URL
    if(!s){                     //if S = Null then nothing downloaded & connection not made
        cout << "Error! Not Connected." << endl;
        s << "Get /cgi-bin/timer.pl HTTP/1.0\r\n"
            << "host:tycho.usno.navy.mil\r\n"
            << "Acceot:*/*\r\n"
            << "Connection:close\r\r\n\r\n";//error information provided
    }
    int count = 0;
    for (string line; getline(s, line);){
        boost::smatch matches;
        if(boost::regex_search(line, matches, boost::regex("<BR>'(.+\\s+UTC)'<BR>"))){
            cout << matches[count];//parse the HTML, if there is a match save it in matches[count]
            cout << ">> Matches" << count << "\n";
            //++ count;
            break;
        }
        ++ count;
        cout << "End of For Loop.\n";//to check if the for loop ran
    }
    cout << "Finale Count: " << count << " End of Void GetTime.\n";//to check if the void was completed
}

Output:

0x7fff5fbff430
Final Count: 0 End of Void GetTime.
RUN FINISHED; exit value 0; real time: 20s; user: 0ms; system: 0ms

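Since the final count is 0, I conclude that the program never enters the for loop. Are the conditions of the for loop and the if statement correct for this application? Or is boost::asio::ip::tcp::iostream s("tycho.usno.navy.mil","http"); the line that requests the web page and stores the HTML in s?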

It looks like the error check in your first if statement is wrong: the request is only sent when the connection has already failed. When I change

if(!s) //if S = Null then nothing downloaded & connection not made
{
    cout << "Error! Not Connected." << endl;
    s << "Get /cgi-bin/timer.pl HTTP/1.0\r\n"
      << "host:tycho.usno.navy.mil\r\n"
      << "Acceot:*/*\r\n"
      << "Connection:close\r\r\n\r\n";//error information provided
}

to

if(!s)
{
    cout << "Error! Not Connected." << endl;
    return;
}
s << "Get /cgi-bin/timer.pl HTTP/1.0\r\n"
  << "host:tycho.usno.navy.mil\r\n"
  << "Accept:*/*\r\n"
  << "Connection:close\r\r\n\r\n";// send the HTTP request
I get the following output:

0x7ffdc0a5d730
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
End of For Loop.
Finale Count: 16 End of Void GetTime.

I think this is the expected output. Note that the word "Accept" in your request was also misspelled.
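For reference, the request and the parsing step can be sketched separately from the network code. This is only a minimal sketch under a few assumptions: build_request and extract_utc are hypothetical helper names, the sample line format is assumed (the real page may differ), and std::regex stands in for Boost.Regex so the snippet compiles without Boost. It writes the method as upper-case GET because HTTP method names are case-sensitive, drops the single quotes from the pattern, and reads capture group 1 rather than indexing the match by line number.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Build a minimal HTTP/1.0 GET request. Every header line ends with CRLF,
// and one empty line (a second CRLF) terminates the header block.
std::string build_request(const std::string& host, const std::string& path)
{
    std::ostringstream req;
    req << "GET " << path << " HTTP/1.0\r\n"
        << "Host: " << host << "\r\n"
        << "Accept: */*\r\n"
        << "Connection: close\r\n\r\n";
    return req.str();
}

// Return the timestamp that follows "<BR>" and ends in "UTC", or an empty
// string when the line has no such timestamp.
// Assumption: lines look like "<BR>Jan. 01, 12:00:00 UTC ...", with no
// quotes around the time.
std::string extract_utc(const std::string& line)
{
    static const std::regex pattern("<BR>(.+\\s+UTC)");
    std::smatch matches;
    if (std::regex_search(line, matches, pattern)) {
        return matches[1].str();  // group 1 is the captured timestamp
    }
    return std::string();
}
```

With the stream from the code above, the request could be sent with s << build_request("tycho.usno.navy.mil", "/cgi-bin/timer.pl"); and each line read by getline could be passed to extract_utc.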