QRegExp for HTML Image Tags

Keywords: QRegExp, HTML, images. Updated: 2023-10-16

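First off, I understand that using regexes on HTML is a bad idea. I am only using it to pull information out of <img> tags, so I don't care about nesting and the like.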

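That said, I am trying to get the src URL of every image in a web page, but I only seem to get the first result. Is the problem my regex, or the way I am using it? My regex skills are a bit rusty, so I may be missing something obvious.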

QRegExp imgTagRegex("(<img.*>)+", Qt::CaseInsensitive); //Grab the entire <img> tag
imgTagRegex.setMinimal(true);
imgTagRegex.indexIn(pDocument);
QStringList imgTagList = imgTagRegex.capturedTexts();
imgTagList.removeFirst();   //the first is always the total captured text
foreach (QString imgTag, imgTagList) //now we want to get the source URL
{
    QRegExp urlRegex("src=\"(.*)\"", Qt::CaseInsensitive);
    urlRegex.setMinimal(true);
    urlRegex.indexIn(imgTag);
    QStringList resultList = urlRegex.capturedTexts();
    resultList.removeFirst();
    imageUrls.append(resultList.first());
}

By the time I reach the foreach loop, imgTagList contains only one string. For the "Cats in ancient Egypt" Wikipedia page, it contains:

<img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/13/Egypte_louvre_058.jpg/220px-Egypte_louvre_058.jpg" width="220" height="407" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/13/Egypte_louvre_058.jpg/330px-Egypte_louvre_058.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/1/13/Egypte_louvre_058.jpg 2x" />

That is what I want, but I know there are more image tags on the page. Any idea why I only get the first one back?

Update

With Sebastian Lange's help, I got this far:

QRegExp imgTagRegex("<img.*src=\"(.*)\".*>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlMatches;
QStringList imgMatches;
int offset = 0;
while(offset >= 0)
{
    offset = imgTagRegex.indexIn(pDocument, offset);
    offset += imgTagRegex.matchedLength();
    QString imgTag = imgTagRegex.cap(0);
    if (!imgTag.isEmpty())
        imgMatches.append(imgTag); // Should hold complete img tag
    QString url = imgTagRegex.cap(1);
    if (!url.isEmpty())
    {
        url = url.split("\"").first(); //ehhh....
        if (!urlMatches.contains(url))
            urlMatches.append(url); // Should hold only src property
    }
}

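The split at the end is a hack to strip the non-src parts of the <img> tag, because it seems I cannot capture only the data inside the src="..." segment. It works, but only because I could not find the proper way to do it. I also added a few things to standardize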

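QRegExp normally gives you only a single match. The list returned by capturedTexts() holds all the captures of that one match: element 0 is the whole match, followed by one entry per pair of capturing parentheses in the pattern. To get every match, you have to call indexIn() repeatedly and advance the offset each time, something like this: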

QRegExp imgTagRegex("\\<img[^\\>]*src\\s*=\\s*\"([^\"]*)\"[^\\>]*\\>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlmatches;
QStringList imgmatches;
int offset = 0;
while( (offset = imgTagRegex.indexIn(pDocument, offset)) != -1){
    offset += imgTagRegex.matchedLength();
    imgmatches.append(imgTagRegex.cap(0)); // Should hold complete img tag
    urlmatches.append(imgTagRegex.cap(1)); // Should hold only src property
}
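
For readers who want to try the same idea without a Qt build, here is a minimal sketch that swaps QRegExp for the standard library's std::regex. It is not the code from this answer: the helper name collectImageTags is hypothetical, and std::sregex_iterator plays the role of the offset loop above by resuming the search after each match.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Qt or of the answer above):
// returns one (whole <img> tag, src value) pair per match in html.
std::vector<std::pair<std::string, std::string>>
collectImageTags(const std::string& html)
{
    // Same pattern as above, without the redundant escapes on < and >.
    // The raw string literal uses the custom delimiter "re" because the
    // pattern itself contains the sequence )" .
    static const std::regex imgTag(
        R"re(<img[^>]*src\s*=\s*"([^"]*)"[^>]*>)re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::pair<std::string, std::string>> result;

    // Each step of the iterator is one match; group 0 is the whole tag,
    // group 1 is the text captured between the quotes after src=.
    for (std::sregex_iterator it(html.begin(), html.end(), imgTag), end;
         it != end; ++it) {
        result.emplace_back((*it)[0].str(), (*it)[1].str());
    }
    return result;
}
```

Calling collectImageTags on a document with several <img> tags should yield one pair per tag, which is the behaviour the single indexIn() call in the question could not provide.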

EDIT: Changed the capturing regular expression to "\\<img[^\\>]*src=\"([^\"]*)\"[^\\>]*\\>".
EDIT2: Allowed optional whitespace around the = of the src attribute: "\\<img[^\\>]*src\\s*=\\s*\"([^\"]*)\"[^\\>]*\\>"
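
The effect of the two edits can be checked in isolation. The sketch below again uses std::regex rather than QRegExp, so it only illustrates the pattern changes and says nothing about QRegExp's own minimal-matching mode. The helper name firstCapture is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of the first match of
// pattern in text, or an empty string when there is no match.
std::string firstCapture(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    std::smatch m;
    return std::regex_search(text, m, re) ? m[1].str() : std::string();
}
```

Given the tag `<img src = "a.jpg" alt="x">`, a greedy capture such as `src\s*=\s*"(.*)"` runs on to the last quote in the tag and returns `a.jpg" alt="x`, while the negated class in `src\s*=\s*"([^"]*)"` stops at the first closing quote and returns `a.jpg`. Without the `\s*` parts, `src="([^"]*)"` does not match this tag at all because of the spaces around the equals sign.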