This scenario is quite common:
We want to do a linguistic, full-text search on text data, stored in a database.
The only problem: the text data is HTML formatted.
Therefore a target text like
<strong>f</strong>oo e.g. will not be matched by the pattern
So how can we implement an "HTML insensitive" search with the help of the Microsoft SQL Server?
While searching for a solution I could not find a complete guide or working sample of how to do this even though it's quite easy.
The trick is to create a full-text index using an HTML filter.
Here is a quick summary.
Checking full-text search installation
To check if full-text search is installed, we can run:
If this query returns 1, full-text search is installed.
If you are running SQL Server Express LocalDB, you have to switch to SQL Server Express because full-text search cannot be enabled for LocalDB.
Full-text search is included in the SQL Server Express with Advanced Services Edition.
Checking filter installation
To check if the HTML filter is installed, we can run:
SELECT * FROM sys.fulltext_document_types WHERE document_type = '.html'
This query should return something like this:
.html E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL12.SQLEXPRESS\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation
Otherwise the HTML filters are missing and have to be installed.
Implementing full-text search
Preparing the table
Let's say we want to implement full-text search for the table
Content contain HTML formatted text:
CREATE TABLE Posts ( [Id] INT NOT NULL, [Title] NVARCHAR(MAX) NOT NULL, [Content] NVARCHAR(MAX) NOT NULL, CONSTRAINT Pk_Posts PRIMARY KEY (Id) )
First (after we backup all the data) we have to change the type of
ALTER TABLE Posts DROP COLUMN [Title] ALTER TABLE Posts DROP COLUMN [Content] ALTER TABLE Posts ADD [Title] VARBINARY(MAX) NOT NULL ALTER TABLE Posts ADD [Content] VARBINARY(MAX) NOT NULL
Then we add a column
FileExtension to specify the right filter for the full-text search, in this case
ALTER TABLE Posts Add FileExtension As '.html'
Now we can (re-) populate the table with data.
INSERT INTO Posts VALUES (1, CONVERT(VARBINARY(MAX), 'Some title'), CONVERT(VARBINARY(MAX), '<p><strong>f</strong>oo</p>'))
Creating the index
A full-text catalog and index can be created like this:
CREATE FULLTEXT CATALOG [posts_catalog] CREATE FULLTEXT INDEX ON Posts ( [Title] TYPE COLUMN FileExtension, [Content] TYPE COLUMN FileExtension ) KEY INDEX Pk_Posts ON posts_catalog
Testing the search
SELECT Id, CONVERT(VARCHAR(MAX), [Title]) AS [Title], CONVERT(VARCHAR(MAX), [Content]) AS [Content] FROM Posts WHERE FREETEXT (*,'foo')
This produces this result: