There are many cases where we need to extract plain text from an HTML source. Many people have solved this in their own way, but I have come across a simple solution: create an NSAttributedString from the HTML data, with NSDocumentTypeDocumentAttribute set to NSHTMLTextDocumentType, and then read its string property.
Let’s start with the following HTML:
let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. "
Below is a working example:
// Swift 4 and later: these keys replace NSDocumentTypeDocumentAttribute and NSHTMLTextDocumentType.
let htmlStringData = Data(htmlString.utf8)
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]
// try! crashes if the import fails; fine for a quick example, but prefer do/catch in real code.
let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let stringWithoutHTMLTags = attributedString.string
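If you need this in more than one place, the same steps can be wrapped in a small function that returns nil instead of crashing when the import fails. This is only a minimal sketch: the name plainText(fromHTML:) is my own and not part of any framework, and it assumes Swift 4 or later on iOS or macOS.

```swift
import Foundation
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

/// Hypothetical helper (not a framework API): returns the plain text
/// of an HTML string, or nil if the HTML importer throws.
func plainText(fromHTML html: String) -> String? {
    // Encoding a Swift String as UTF-8 cannot fail, so no force unwrap is needed.
    let data = Data(html.utf8)
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]
    do {
        let attributed = try NSAttributedString(data: data, options: options, documentAttributes: nil)
        return attributed.string
    } catch {
        // Report the failure to the caller instead of crashing like try! would.
        return nil
    }
}
```

Calling plainText(fromHTML: htmlString) on the main thread should then give the text without the tags, or nil if something went wrong.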
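The conversion gets slow on a long HTML source, so it is tempting to offload it to a background queue. Be careful, though: Apple's documentation for this initializer warns against calling the HTML importer from a background thread, because it tries to synchronize with the main thread and can time out. Here is the background version, which you should test carefully before relying on it: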
// Swift 4 and later: these keys replace NSDocumentTypeDocumentAttribute and NSHTMLTextDocumentType.
let htmlStringData = Data(htmlString.utf8)
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]
DispatchQueue.global(qos: .userInitiated).async {
    // perform in background thread
    // (Apple's documentation advises against running the HTML importer off the main thread, so test this carefully)
    let attributedString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
    DispatchQueue.main.async {
        // handle text in main thread
        let stringWithoutHTMLTags = attributedString.string
    }
}
Source: Stack Overflow