79200392

Date: 2024-11-18 14:55:44
Score: 3

It seems that @JasonTrue's answer no longer works because of the "//body//text()" XPath.

Accessing the document node's child nodes and then filtering out the empty text entries may be the way to go.

using System;
using System.Linq;
using System.Web; // HttpUtility

public static string StripInnerText(string html)
{
    // Nothing to strip from a null or empty input.
    if (string.IsNullOrEmpty(html))
        return null;

    HtmlAgilityPack.HtmlDocument doc = new();
    doc.LoadHtml(html);

    // Take the text of each top-level node, skip whitespace-only entries,
    // and trim what is left.
    var texts = doc.DocumentNode.ChildNodes
        .Select(node => node.InnerText)
        .Where(text => !string.IsNullOrWhiteSpace(text))
        .Select(text => text.Trim());

    var output = string.Join(Environment.NewLine, texts);

    // Decode HTML entities such as &amp; into plain characters.
    return HttpUtility.HtmlDecode(output);
}

Test it with the following fiddle: https://dotnetfiddle.net/NQC2Y5

Sorry for posting a new answer; I don't have 50 reputation at the moment, and this question and all the answers here were so useful to me that I felt I had a duty to contribute.

Reasons:
  • RegEx Blacklisted phrase (1.5): I don't have 50 reputation
  • RegEx Blacklisted phrase (0.5): Sorry for posting
  • Long answer (-0.5):
  • Has code block (-0.5):
  • User mentioned (1): @JasonTrue
  • Low reputation (1):
Posted by: srjheam