Strip HTML tags in Swift

There are many cases where we need to extract plain text from an HTML source. People have solved this in many different ways, but I came across a simple one: create an NSAttributedString from the HTML data with NSDocumentTypeDocumentAttribute set to NSHTMLTextDocumentType, then read its string property. (In Swift 4 and later, these two names are spelled .documentType and .html.)

Let’s start with the following HTML:

let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. "

Below is a working example:

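// Convert the HTML string to UTF-8 data for the importer.
let htmlStringData = htmlString.data(using: .utf8)!

// Swift 3 spelling. In Swift 4 and later the dictionary type is
// [NSAttributedString.DocumentReadingOptionKey: Any] and the entries are
// .documentType: NSAttributedString.DocumentType.html and .characterEncoding.
let options: [String: Any] = [
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]

// try! crashes if the data cannot be parsed; use do/catch where that is a risk.
let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)

// The string property holds the text with all markup removed.
let stringWithoutHTMLTags = attributedString.string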
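The same steps can be wrapped in a reusable helper that returns nil instead of crashing when the conversion fails. This is only a sketch: the name strippingHTMLTags() is hypothetical and not from the original post, it uses the Swift 4 and later option keys, and it assumes an Apple platform where UIKit or AppKit is available.

```swift
import Foundation
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

extension String {
    /// Hypothetical helper (not from the original post): returns the plain
    /// text of an HTML fragment, or nil if the conversion fails.
    func strippingHTMLTags() -> String? {
        // Building Data from the UTF-8 view cannot fail, so no force unwrap.
        let data = Data(self.utf8)

        // Swift 4+ spellings of NSDocumentTypeDocumentAttribute and
        // NSCharacterEncodingDocumentAttribute.
        let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]

        // try? turns a thrown error into nil instead of crashing like try!.
        let attributed = try? NSAttributedString(data: data,
                                                 options: options,
                                                 documentAttributes: nil)
        return attributed?.string
    }
}
```

A call site could then fall back to the raw input when stripping fails, for example: let plainText = htmlString.strippingHTMLTags() ?? htmlString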

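This conversion gets slower as the HTML source gets longer. It can be moved onto a background queue as shown below, but be aware that Apple's documentation for this initializer says the HTML importer should not be called from a background thread, where it may time out, so test this on your target OS versions: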

let htmlStringData = htmlString.data(using: String.Encoding.utf8)!

let options: [String: Any] = [
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]

DispatchQueue.global(qos: .userInitiated).async {
    // Parse the HTML off the main thread so the UI stays responsive.
    let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)

    DispatchQueue.main.async {
        // Back on the main thread: read the plain text and use it here,
        // for example to update a label.
        let stringWithoutHTMLTags = attributedString.string
    }
}
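Because Apple's documentation advises against calling the HTML importer from a background thread, a more conservative variant keeps the conversion on the main queue but defers it, so the caller returns immediately and receives the result through a completion handler. This is a sketch under assumptions: stripHTMLTags(from:completion:) is a hypothetical name, it uses the Swift 4 and later option keys, and it does not make the conversion itself any faster.

```swift
import Foundation
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

/// Hypothetical helper (not from the original post): strips HTML on the main
/// queue and passes the plain text, or nil on failure, to `completion`.
func stripHTMLTags(from html: String, completion: @escaping (String?) -> Void) {
    let data = Data(html.utf8)
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]

    // async returns immediately; the conversion runs on a later turn of the
    // main queue, which is where the HTML importer is expected to be called.
    DispatchQueue.main.async {
        let attributed = try? NSAttributedString(data: data,
                                                 options: options,
                                                 documentAttributes: nil)
        completion(attributed?.string)
    }
}
```

A caller would pass a closure that consumes the text, for example: stripHTMLTags(from: htmlString) { text in print(text ?? "conversion failed") }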

 

Source: Stack Overflow
