Email Crawler in C++
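I have an assignment that I can't figure out. I want my function to take a line from an HTML file and extract an email address from it, then split that address into the full email, the username, and the domain. Then I want a third function to fetch the next email in the HTML file.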
void get_line_emails(ifstream &in_stream, ofstream &out_stream, string email[], string users[], string domain[])
{
int location, end;
string mail;
getline(in_stream, mail);
location = mail.find("mailto:");
end = mail.find(">");
mail = mail.substr(location, (end - 1));
cout << mail << endl;
}
void get_next_email(ifstream &in_stream, string mail)
{
getline(in_stream, mail);
int location = mail.find("mailto:");
int end = mail.find(">");
mail = mail.substr(location, (end - 1));
}
void split_email(string email[], string domain[], string users)
{
int count = 300;
string mail;
for (int i = 1; i < count; ++i) //For loop to input stream.
{
mail = email[i];
int location = mail.find("@");
int end = mail.find(">");
string domain[i] = mail.substr(location, (end - 1));
string users[i] = mail.substr(0, location);
}
}
When I run the program, I also get this error:
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 4294967295) > this->size() (which is 244)
Abort (core dumped)
In case it helps, here is my main function:
int main()
{
string email[1000];
string users[1000];
string domain[1000];
int count = 300;
string filename;
ifstream in_stream;
ofstream out_stream;
cout << "Enter input filename: " << endl;
cin >> filename; //Input of filename.
in_stream.open(filename.c_str()); //Opening the input file for population and other information.
if (in_stream.fail()) //Checking to see if file opens.
{
cout << "Error opening input/output files" << endl; //Telling user file isn't opening.
exit(1); //Exiting program.
}
out_stream.open("Emails.txt");//If it does not exist it will not be created. If it exists it will be overwritten.
out_stream << "Email " << right << setw(20) << "User " << right << setw(20) << "Domain" << endl;
out_stream << "_______________________________________________________________________________" << endl;
get_line_emails(in_stream, out_stream, email, users, domain);
//split_email(email, domain, users);
sort(email, users, domain, count);
in_stream.close(); //Closing the in stream.
out_stream.close(); //Closing the out stream.
cout << "A new file Emails has been created with the emails extracted. Thank you." << endl; //End message.
return 0;
}
Part of the HTML file I am feeding in:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <!-- Content Copyright Ohio University Server ID: 2-->
<!-- Page generated 2016-03-22 14:55:21 by CommonSpot Build 9.0.3.119 (2015-08-14 15:00:01) -->
<!-- JavaScript & DHTML Code Copyright © 1998-2015, PaperThin, Inc. All Rights Reserved. --> <head>
<meta name="Description" id="Description" content="Faculty" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="Keywords" id="Keywords" content="engineering" />
<meta name="Generator" id="Generator" content="CommonSpot Content Server Build 9.0.3.119" />
<link rel="stylesheet" href="/style/ouws_0111_allin1_nonav.css" type="text/css" />
<link rel="stylesheet" href="/engineering/upload/engineeringEV.css" type="text/css" />
<link rel="stylesheet" href="/engineering/upload/gridpak.css" type="text/css" />
<style type="text/css">
.mw { color:#000000;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none; }
a.mw:link {color:#000000;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none;}
a.mw:visited {color:#000000;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none;}
a.mw:hover {color:#0000FF;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none;}
</style> <script type="text/javascript">
<!--
var gMenuControlID = 0;
var menus_included = 0;
var jsDlgLoader = '/engineering/about/people/loader.cfm';
var jsSiteID = 1;
var jsSubSiteID = 6148;
var js_gvPageID = 2177477;
var jsPageID = 2177477;
var jsPageSetID = 0;
var jsPageType = 0;
var jsControlsWithRenderHandlers = ",1366057,1407941,1408984,1409120,1409220,1463564,1653027,1464282,1484855,1663987,1703445,1714178,1719109,1716274,1719109,1719109,1722161,1748941,1743237,1767756,1771704,1240950,1795856,1799077,1806233,1814378,1814378,1814378,36,1156323,958270,959997,36,1239784,1239535,1240103,1264495,1264559,1240832,1241026,1268776,1269019,1365662,1365798,1367666,1367112,1367146,1403322,1236239,1644435,1707482,36,1707482,1708185,1708185,1707846,1718301,1718356,1722082,1735273,1156092,1736675,1738340,1758445,1487747,1740183,1750814,1755341,36,4,1241075,1320447,1410344,1440455,1462605,1463564,1642797,1644920,1644955,1659254,1656252,1707459,1692320,1290294,1705469,1705596,1707846,1708163,1708367,1719109,1719109,1719109,1728460,1718356,1706218,1725200,1739433,1193755,1782561,1806244,1781609,1783821,1784445,1783821,1788664,1750814,1781533,1781788,1812661,1810778,1822088,1644219,39,36,36,438722,443887,523857,542895,36,867909,671210,733944,1074794,671213,671222,671225,671231,671234,1190981,1190914,1190943,1193755,1236239,1239497,1280404,1284325,860732,860741,1080236,671204,1237273,671216,671219,671228,671237,671207,1190973,1243855,1264544,1264564,1241172,1267910,1240840,1240849,1241220,1264699,1241365,1264571,1289737,8,1290184,1321465,1322500,1363024,1365670,1365954,1365998,1366014,2214456,2068897,1837521,1190931,1190931,2239453,1992371,1967400,1992371,1808005,1792195,1792195,1156323,1716646,1967400,1763595,1080236,1971121,1960374,1290151,2007514,2013290,2012663,2012302,2012026,2012663,2021773,1191128,426028,1808005,2108357,426028,36,36,36,2145522,2145522,2186158,1792195,1827509,1827486,1827486,1840641,1843869,1843869,1843879,1843879,1827509,1827486,635375,1190931,1853586,1854295,1854509,1854614,1855117,1855125,1859942,1232520,996841,999747,1074782,801933,1156092,1231112,1240950,1264518,1264536,1240828,1241280,1241033,1241322,1265043,1268750,1269805,1287352,1290231,1321501,1322534,1368599,1407796,1407917,1408156,1408447,1461409,1463586,1466072,1660460,17
04499,1701618,1704211,1701596,1707383,1706218,1713783,1713443,1715100,1716646,1714352,1723376,1706218,1717134,1717134,1759841,1740127,1740183,1737868,1755222,1763595,1750814,1812661,1784600,860732,1785700,1786558,1786640,1788366,1788803,1787835,1758851,1802116,1802116,1802116,1802116,1810778,1870892,1827509,1854528,1859942,1859942,1870780,1865837,1905202,1905202,1750814,1243855,1763595,1806295,1806280,860741,1893429,1893243,1893429,1898989,1913110,1915322,1921065,1871293,1872541,1900928,1708367,1874008,1827509,1808005,1948002,1708367,1859942,1827509,1243851,1959041,1243851,1746007,1243851,1243851,1967400,1967400,1191128,1780116,1960374,1960374,1780116,1827486,1156092,1153939,36,1827486,1859942,1974908,1156092,1156323,1763595,1080236,1763595,1854295,1854641,1865837,1867230,1867211,1869328,738180,8,1191128,1808005,1967400,1156323,2104541,2058309,2013290,2047047,2068897,2010928,2087246,2010928,2104541,2104541,2104578,2115265,1708185,2120941,426028,2129783,1663761,2166426,2068897,1967400,1967400,1967400,2068897,1808005,1716646,1833649,1827509,2010085,36,2167570,2068897,1706218,1156092,2012337,2186146,1191128,2191212,1190931,1156323,1716646,2012663,2508370,1992371,1080236,2280950,1808005,36,36,1156323,1808005,1819898,1191128,1243855,2281280,2013290,2239453,1837521,1156323,1644219,1849105,1849105,2376567,2381406,1808005,1808005,1156092,2552104,2552104,2281280,1805958,1967400,2068897,2390125,1808005,2444428,2459222,2013290,2568057,2508370,1661786,1763595,2349059,2349059,2438289,1708367,2120941,2508370,2120951,2596819,1156323,1191128,2239453,2367160,2012337,2451225,1808005,2615851,1808005,1849105,55,55,2734901,1191128,55,55,2012663,2734829,1967400,1967400,1996683,1992371,2013290,2018337,2012337,2018364,1156092,1363024,1967400,1888191,1888191,1805958,1967400,2057362,39,1153939,1708185,2010085,2010085,2010085,2079659,2079659,2010928,2010928,2087246,1808005,36,1190931,2369360,2380491,1808005,2120941,1153939,1708367,2511867,2540778,1704499,1787140,1758479,1716646,1827486,223945
3,1808005,1808005,1080236,2451225,2120941,1808005,";
var jsDefaultRenderHandlerProps = ",,";
var jsAuthorizedControls = ",65684,62081,62169,62236,62658,67860,70371,70560,70645,70911,71567,71570,71579,71582,71585,71588,71630,71645,73051,73055,73135,73175,73177,73179,73181,73183,73185,75593,75596,75598,75600,75602,75604,75943,77337,77339,77367,77369,77371,77397,77399,77401,77403,77406,77408,77423,77425,77429,77431,77433,77435,77454,77456,77458,77460,77462,77464,77524,77526,77528,77530,77533,77535,77564,77566,77569,77572,77579,77581,77755,77759,77771,77940,78254,78304,78759,81449,81447,81452,81454,86430,95027,110992,112176,114559,122476,122590,122592,122594,122998,123000,123002,123004,123010,123012,123014,123016,123113,123115,123117,123119,123121,123123,123125,123127,123129,123131,123133,123135,123137,123139,123141,123143,123193,123217,123219,123221,543,1784,1786,1791,1829,1901,1903,3434,3062,10165,17470,19113,17964,17975,20458,18450,19246,20461,20532,20535,20631,22975,22976,29043,29065,29198,29497,29894,32565,37812,42989,50270,50283,51427,51770,51940,51987,52309,52306,52325,52338,52440,52727,52935,53585,53717,54936,55739,56170,57624,70375,57659,58549,60274,60859,65324,65375,65378,65630,341266,341268,341270,343681,344120,344123,344125,344127,344129,344131,344133,1155418,344136,344142,344918,344920,346066,349254,349260,353078,353096,353249,353368,353500,353518,356036,356519,356527,356534,359303,359315,359619,365645,365647,365651,372637,372642,373892,409046,385136,402687,408565,416225,423380,423445,423634,423934,424407,424503,426545,425757,425785,426028,426263,433478,438722,440105,440778,441424,441447,441488,441530,441743,441914,441917,441920,441923,442181,442184,442228,442231,442767,443887,444519,444536,448085,446524,447856,448121,450241,450489,450583,451031,123223,123225,123227,123229,123231,123233,123235,123237,123239,123241,123243,123245,123247,123249,133712,138458,138462,138472,138493,140917,152719,152941,155012,174553,176272,182475,185313,185545,185572,185600,185653,189527,189717,189912,189915,209638,190014,209612,209640,210772,233752,233754,240835,242005,
245048,245061,246392,247905,253143,255217,258368,258370,258448,259352,259507,259535,259540,259557,259597,270079,272462,272484,273374,275946,276171,281359,281731,281886,285356,285362,285364,289279,290246,293573,293580,293990,306206,306372,307096,307117,1409047,1410292,1410344,1440455,1462692,1462605,1463206,1463358,1463363,1463559,1463575,1466067,1466072,1466949,565361,577664,577666,580782,580785,586106,593209,631308,631375,671204,671207,671210,671213,671216,671219,671222,671225,671228,630659,630928,631186,631230,671231,703507,703512,872630,872675,951724,1070639,1070773,1071579,1074782,1074794,1116648,1118602,1153954,1153962,310170,319781,325794,326607,326613,331241,331243,331248,338287,338305,338307,338805,340095,340098,341260,341264,523857,523883,540187,541324,542748,542895,543075,543442,543531,545031,545034,545925,550439,550694,551327,551342,551843,551848,554801,557468,563421,563522,564335,564350,564362,565392,565403,565430,565440,565460,578908,580751,589443,589691,589825,631522,631342,671234,704390,704500,730405,733189,733195,733931,733944,735045,721050,721061,720116,803640,807230,860741,867909,869754,878921,872399,911315,951437,952815,952921,954983,956036,958270,960899,960901,960903,960912,960914,960916,959997,990601,993320,996841,999438,999472,999741,999747,999871,1034551,1034553,1035679,1035681,1070829,1080236,1111202,1112587,1112594,1116088,1117180,566481,567951,635375,671237,705089,708277,738180,738270,738274,756640,808480,993241,993247,993326,998452,999162,1034549,1034793,1034795,1118837,1121340,1150407,1152064,1153928,1153933,1153939,1153948,1154637,1156092,1156320,753746,754822,754960,755002,755412,755426,755453,801854,801933,802037,802071,802077,802080,802083,802087,802091,802417,802525,804060,860732,753752,754885,753748,754422,802568,451785,453349,452911,452935,454345,454916,464533,465324,476013,469286,469308,470126,472222,476011,476015,489860,478066,482338,482852,492048,486517,489015,489681,492017,492050,492052,498151,516411,516413,516415,516417,516419
,516422,1935063,1939712,1992371,1996683,2010928,2012302,2012840,2013290,2021773,2047047,2058309,2079659,2104541,2108357,2115265,2120941,2120951,2135749,2145522,2157693,2157775,1193061
<a href="http://www.youtube.com/user/OhioUnivRussCollege"><img border="0" alt="YouTube" title="YouTube" src="/engineering/images/icon_youtube.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="http://www.linkedin.com/groups?home=&gid=3000035&trk=anet_ug_hm"><img border="0" alt="LinkedIn" title="LinkedIn" src="/engineering/images/icon_linkedin.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="http://www.facebook.com/ohio.engineering"><img border="0" alt="Facebook" title="Facebook" src="/engineering/images/icon_fb.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="https://twitter.com/russcollege"><img border="0" alt="Twitter" title="Twitter" src="/engineering/images/icon_twitter.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="http://instagram.com/russcollege"><img border="0" alt="Instagram" title="Instagram" src="/engineering/images/russ_instagram.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
</div></div></div><div id="cs_control_2398199" class="cs_control CS_Element_Custom"></div></div></div><div id="cs_control_2142700" class="contentWrap col row"><div title="" id="CS_Element_2177477_2142700"><div id="cs_control_2142767" class="cs_control col pageTitle">
<!-- Portal Content -->
<div class="content-element">
<h2>Faculty</h2>
<p></p>
<br />
</div>
<!-- Portal Content -->
</div><div id="cs_control_2142762" class="mainContent col"><div title="" id="CS_Element_2177477_2142762"><div id="cs_control_2142772" class="cs_control CS_Element_Custom">
<!-- Portal Content -->
<div class="content-element">
<p> </p>
</div>
<!-- Portal Content -->
</div><div id="cs_control_2177314" class="cs_control">
<style type="text/css">
/* This fixes some issues with the anchor links from the A-Z bar at the top */
.group a[name]
{
position: absolute;
}
</style>
<div id="staffAlpha">
<ul class="azList">
<li class="children "><a href="#A">A</a></li>
<li class="children "><a href="#B">B</a></li>
<li class="children "><a href="#C">C</a></li>
<li class="children "><a href="#D">D</a></li>
<li class="children "><a href="#E">E</a></li>
<li class="children "><a href="#F">F</a></li>
<li class="children "><a href="#G">G</a></li>
<li class="children "><a href="#H">H</a></li>
<li class="children "><a href="#I">I</a></li>
<li class="children "><a href="#J">J</a></li>
<li class="children "><a href="#K">K</a></li>
<li class="children "><a href="#L">L</a></li>
<li class="children "><a href="#M">M</a></li>
<li class="children "><a href="#N">N</a></li>
<li class="children "><a href="#O">O</a></li>
<li class="children "><a href="#P">P</a></li>
<li>Q</li>
<li class="children "><a href="#R">R</a></li>
<li class="children "><a href="#S">S</a></li>
<li class="children "><a href="#T">T</a></li>
<li class="children "><a href="#U">U</a></li>
<li class="children "><a href="#V">V</a></li>
<li class="children "><a href="#W">W</a></li>
<li class="children "><a href="#X">X</a></li>
<li class="children "><a href="#Y">Y</a></li>
<li class="children last"><a href="#Z">Z</a></li>
</ul>
<div id="azContent">
<div class="group">
<a id="A" name="A"></a>
<h3 class="letter">A</h3>
<a href="profiles.cfm?profile=abukamai">Nasseef Abukamail</a><br />
Electrical Engineering and Computer Science <br />
Associate Lecturer <br />
<a href="mailto:abukamai@ohio.edu">abukamai@ohio.edu</a> <br />
740.593.1229
<div><br />
</div><a href="profiles.cfm?profile=alam">Khairul Alam</a><br />
Mechanical Engineering, Center for Advanced Materials Processing, ESP Lab <br />
Professor <br />
<a href="mailto:alam@ohio.edu">alam@ohio.edu</a> <br />
740.593.1558
<div><br />
</div><a href="profiles.cfm?profile=alim1">Muhammad Ali</a><br />
Biomedical Engineering, Mechanical Engineering, ESP Lab <br />
Associate Professor <br />
<a href="mailto:alim1@ohio.edu">alim1@ohio.edu</a> <br />
740.593.1389
<div><br />
</div><a href="profiles.cfm?profile=arch">Deak Arch</a><br />
Aviation <br />
Associate Professor, Assistant Chair <br />
<a href="mailto:arch@ohio.edu">arch@ohio.edu</a> <br />
740.597.2688
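Divide the problem into tasks. You have four tasks, and they should be handled individually. Don't move on to the next task until you know the current one does exactly what you want. Working on several tasks at once widens the scope of the problem, and the growth is not merely geometric. Bugs tend to interact with other bugs. A bug in task 1 may make a bug in task 2 look different, leading you to debug the wrong symptom.
Consider giving each task its own function or, if the task is complex, its own file. That way each task can easily be tested on its own. Why? Suppose you change the code for task 1 and want to know whether you broke it. Sure, you could test the whole program, but what if you broke two things? And if you want to test the splitter logic against a few hundred addresses to make sure you handle all the weird edge cases, you can call the splitter function with those few hundred strings rather than having to invent a complicated file.
Task 1: Read the file line by line
This comes first because until you can do this, you can't do much else.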
std::string line;
while (std::getline(in_stream, line))
{
// output line to compare with source
}
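This will read a file until it cannot be read any more: end of file, corrupted data, some joker yanking the USB drive mid-read, or any number of other problems. How do you test it? A simple way is to read the file line by line from the stream and print each line to the console. This is a fairly large file, though, and the eye is poor at comparing large amounts of text, so write all the lines you receive to an output file and then diff the two files. If they match, you win. Move on to task 2. If not, debug.
Task 2: Find "mailto"
This takes a line from task 1 and looks for "mailto".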
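The copy-and-diff check described above could be sketched roughly as follows. The helper names copy_lines and check_copy_lines are made up for this illustration and do not appear in the question; the function takes generic streams so that it can be exercised with string streams before it is pointed at real files.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the question): copies every line from `in`
// to `out` and returns how many lines were read.
std::size_t copy_lines(std::istream &in, std::ostream &out)
{
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line))
    {
        out << line << '\n'; // getline strips the newline, so put it back
        ++count;
    }
    return count;
}

// Quick check that needs no files: string streams stand in for file streams.
void check_copy_lines()
{
    std::istringstream in("first\nsecond\n");
    std::ostringstream out;
    assert(copy_lines(in, out) == 2);
    assert(out.str() == "first\nsecond\n");
}
```

In the real program you would pass the opened ifstream and an ofstream and then diff the two files. One caveat with this sketch: if the input's last line has no trailing newline, the copy gains one, and a diff tool may report that single difference.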
size_t loc = line.find("mailto:");
if (loc != std::string::npos)
{
std::cout << "found: " << line << std::endl;
}
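This one is easier to test, so we can get away with the Mk 1 eyeball, or Notepad and Ctrl+F, to confirm that all of the mailto lines were printed.
Task 3: Isolate the address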
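In task 2 you found a line containing "mailto". Now you have to isolate the address on that line. You have the starting position from task 2, so you can extract the string between the ":" after "mailto" and the next double quote ("). I won't spend much time here because this is the point of the assignment. If I do too much, I pass the course, not you. Basically, though, this is a find and a substr, similar to what OP has in their question.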
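For checking your own attempt, here is one possible sketch of the find-and-substr step. It is not the answerer's code, and extract_address is a name invented for this example. It guards against the two things that produce the std::out_of_range in the question: find returns std::string::npos when nothing matches, and the second argument of substr is a length, not an end position.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the address that follows "mailto:" on `line`,
// or an empty string if the line has no complete mailto link.
std::string extract_address(const std::string &line)
{
    const std::string marker = "mailto:";
    std::string::size_type start = line.find(marker);
    if (start == std::string::npos)
        return ""; // no mailto on this line
    start += marker.size(); // skip past "mailto:"
    std::string::size_type stop = line.find('"', start);
    if (stop == std::string::npos)
        return ""; // closing quote missing
    return line.substr(start, stop - start); // second argument is a length
}

// Small self-check using a line shaped like the ones in the sample HTML.
void check_extract_address()
{
    assert(extract_address("<a href=\"mailto:arch@ohio.edu\">arch@ohio.edu</a> <br />")
           == "arch@ohio.edu");
    assert(extract_address("<h3 class=\"letter\">A</h3>").empty());
}
```

This sketch assumes at most one mailto link per line, which holds for the sample HTML shown in the question but may not hold for other pages.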
Task 4: Split the address from task 3
This is more work with find and substr to isolate the parts of the address.
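A minimal sketch of such a splitter, assuming the address has the simple form user@domain. The names split_address and check_split_address are hypothetical. Keeping the splitter as a stand-alone function is what makes it possible to test it with plain strings, without an HTML file.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: splits "user@domain" at the first '@'.
// Returns false and leaves the outputs untouched if there is no '@'.
bool split_address(const std::string &address, std::string &user, std::string &domain)
{
    std::string::size_type at = address.find('@');
    if (at == std::string::npos)
        return false;
    user = address.substr(0, at);    // everything before '@'
    domain = address.substr(at + 1); // everything after '@'
    return true;
}

// Test the splitter directly with strings, no input file required.
void check_split_address()
{
    std::string user, domain;
    assert(split_address("arch@ohio.edu", user, domain));
    assert(user == "arch" && domain == "ohio.edu");
    assert(!split_address("no-at-sign", user, domain));
}
```

More edge cases (an empty user, several '@' characters, trailing whitespace) can be added to the check function as further one-line assertions.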
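You need to create a loop and test each line until you find one that contains the string "mailto:".
Here is some sample code to give you an idea of how to do that: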
std::ifstream ifs("test.txt");
std::string line; // general buffer
// read each line
while(std::getline(ifs, line))
{
// try to find "mailto:"
std::string::size_type pos = line.find("mailto:");
// ignore if not found
if(pos == std::string::npos)
continue;
// we found it! extract address from line here
// remember that pos holds the start of the information
// ...
}