Strip HTML tags in Swift

There are many cases where we need to strip the markup from an HTML source and keep only the text. People have solved this in many ways, but I came across a simple one: create an NSAttributedString from the HTML, with NSDocumentTypeDocumentAttribute set to NSHTMLTextDocumentType, and read its string property.

Let’s start with the following HTML:

let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. "

Below is a working example:

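// Convert the HTML string to UTF-8 data for the parser.
let htmlStringData = htmlString.data(using: String.Encoding.utf8)!

// Tell NSAttributedString to treat the data as UTF-8 encoded HTML.
// (Swift 3 constant names; Swift 4 and later spell these keys
// .documentType and .characterEncoding.)
let options: [String: Any] = [
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]

// Parse the HTML. try! crashes if parsing fails, so prefer do/catch for untrusted input.
let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)

// The plain text, with all tags removed.
let stringWithoutHTMLTags = attributedString.string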
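
The same steps can be wrapped in a small helper that avoids the force unwrap and `try!`. This is only a sketch: the function name `plainText(fromHTML:)` is hypothetical and not from the original answer, it assumes Swift 4 or later (where the option keys are spelled `.documentType` and `.characterEncoding`), and it assumes a platform with UIKit or AppKit, since that is where the HTML importer lives.

```swift
import Foundation
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

// Hypothetical helper (not from the original post): returns the plain text
// of an HTML fragment, or nil if the input cannot be encoded or parsed.
// Intended to be called from the main thread.
func plainText(fromHTML html: String) -> String? {
    #if canImport(UIKit) || canImport(AppKit)
    // Guard instead of force-unwrapping the encoded data.
    guard let data = html.data(using: .utf8) else { return nil }

    // Swift 4+ spelling of the Swift 3 constants used in the example above.
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]

    // try? turns a parsing error into nil rather than a crash.
    let attributed = try? NSAttributedString(data: data,
                                             options: options,
                                             documentAttributes: nil)
    return attributed?.string
    #else
    // Assumption: without UIKit/AppKit there is no HTML importer, so give up.
    return nil
    #endif
}
```

Calling `plainText(fromHTML: htmlString)` then yields an optional string, so the caller decides what to do when parsing fails instead of crashing.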

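This approach gets slow on long HTML input, so the parsing can be moved to a background queue. Note that Apple's documentation for this initializer advises against calling the HTML importer from a background thread, so treat the following as something to test carefully: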

let htmlStringData = htmlString.data(using: String.Encoding.utf8)!

let options: [String: Any] = [
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]

DispatchQueue.global(qos: .userInitiated).async {

    // perform in background thread
    let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)

    DispatchQueue.main.async {
        // handle text in main thread
        let stringWithoutHTMLTags = attributedString.string
    }

}

 

Original post: https://hashaam.com/2017/06/11/strip-html-tags-in-swift/

Source: Stack Overflow
