How to implement a full-text search on HTML documents with Microsoft SQL Server

This scenario is quite common:

We want to do a linguistic, full-text search on text data, stored in a database.

The only problem: the text data is HTML formatted.

Therefore a target text like <strong>f</strong>oo e.g. will not be matched by the pattern foo.

So how can we implement an “HTML insensitive” search with the help of the Microsoft SQL Server?

While searching for a solution I could not find a complete guide or working sample of how to do this even though it’s quite easy.

The trick is to create a full-text index using an HTML filter.

Here is a quick summary.

Checking full-text search installation

To check if full-text search is installed, we can run:

SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled')

If this query returns 1, full-text search is installed.

If you are running SQL Server Express LocalDB, you have to switch to SQL Server Express because full-text search cannot be enabled for LocalDB.

Full-text search is included in the SQL Server Express with Advanced Services Edition.

Checking filter installation

To check if the HTML filter is installed, we can run:

SELECT * FROM sys.fulltext_document_types WHERE document_type = '.html'

This query should return something like this:

.html   E0CA5340-4534-11CF-B952-00AA0051FE20    C:\Program Files\Microsoft SQL Server\MSSQL12.SQLEXPRESS\MSSQL\Binn\nlhtml.dll  12.0.6828.0 Microsoft Corporation

Otherwise the HTML filters are missing and have to be installed.

Implementing full-text search

Preparing the table

Let’s say we want to implement full-text search for the table Posts where Title and Content contain HTML formatted text:

CREATE TABLE Posts
(
    [Id] INT NOT NULL,
    [Title] NVARCHAR(MAX) NOT NULL,
    [Content] NVARCHAR(MAX) NOT NULL,
    CONSTRAINT Pk_Posts PRIMARY KEY (Id)
)

First (after we backup all the data) we have to change the type of Title and Content to VARBINARY(MAX).

ALTER TABLE Posts
DROP COLUMN [Title]

ALTER TABLE Posts
DROP COLUMN [Content]

ALTER TABLE Posts
ADD [Title] VARBINARY(MAX) NOT NULL

ALTER TABLE Posts
ADD [Content] VARBINARY(MAX) NOT NULL

Then we add a column FileExtension to specify the right filter for the full-text search, in this case .html.

ALTER TABLE Posts
Add FileExtension As '.html'

Inserting data

Now we can (re-) populate the table with data.

INSERT INTO Posts
VALUES (1, CONVERT(VARBINARY(MAX), 'Some title'), CONVERT(VARBINARY(MAX), '<p><strong>f</strong>oo</p>'))

Creating the index

A full-text catalog and index can be created like this:

CREATE FULLTEXT CATALOG [posts_catalog]

CREATE FULLTEXT INDEX ON Posts
(
    [Title] TYPE COLUMN FileExtension, [Content] TYPE COLUMN FileExtension
)
    KEY INDEX Pk_Posts ON posts_catalog

Testing the search

SELECT
    Id,
    CONVERT(VARCHAR(MAX), [Title]) AS [Title],
    CONVERT(VARCHAR(MAX), [Content]) AS [Content]
FROM Posts
WHERE FREETEXT (*,'foo')

This produces this result:

alt-text

Resources