There are many cases where you need to strip the markup from an HTML source and keep only the text. People have solved this in many different ways, but I came across a simple one: create an NSAttributedString from the HTML data, setting NSDocumentTypeDocumentAttribute to NSHTMLTextDocumentType, and then read its string property.
Let’s start with the following HTML:
let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. "
Below is a working example:
let htmlStringData = htmlString.data(using: String.Encoding.utf8)!
let options: [String: Any] = [
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]
let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let stringWithoutHTMLTags = attributedString.string
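The snippet above uses the Swift 3 constant names and force-unwraps both the data conversion and the parsing step. As a rough sketch, assuming Swift 4 or later on an Apple platform (UIKit or AppKit available), the same conversion can be written with the NSAttributedString.DocumentReadingOptionKey options and without crashing on failure. The helper name plainText(fromHTML:) is hypothetical and not part of the original answer.

```swift
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

/// Hypothetical helper (not from the original post): returns the plain text
/// of an HTML fragment, or nil if the conversion fails.
/// Intended to be called on the main thread.
func plainText(fromHTML html: String) -> String? {
    // String.data(using:) returns an optional, so avoid force-unwrapping it.
    guard let data = html.data(using: .utf8) else { return nil }

    // Swift 4+ spelling of the document-type and encoding options.
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]

    // try? turns a thrown parsing error into nil instead of crashing.
    let attributed = try? NSAttributedString(data: data,
                                             options: options,
                                             documentAttributes: nil)
    return attributed?.string
}
```

With the htmlString from earlier, plainText(fromHTML: htmlString) would give the text without the anchor tags, or nil if the input could not be parsed. The importer may add trailing whitespace or newlines, so trimming the result is often worthwhile.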
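This conversion gets slow on a long HTML source. A common suggestion is to offload it to a background queue, as in the snippet below. Be aware, though, that Apple's documentation for this initializer says the HTML importer should not be called from a background thread, because it tries to synchronize with the main thread and can time out. Treat the background version with caution and test it on the OS versions you target: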
let htmlStringData = htmlString.data(using: String.Encoding.utf8)!
let options: [String: Any] = [
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]
DispatchQueue.global(qos: .userInitiated).async {
    // perform in background thread
    let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
    DispatchQueue.main.async {
        // handle text in main thread
        let stringWithoutHTMLTags = attributedString.string
    }
}
Source: Stack Overflow