asp:CaseStudy
Cybergroup Selects dtSearch
dtSearch's Text Retrieval Engine Powers Web-based Business
Intelligence Mining Library
By Greg Bean
Cybergroup's client asked Cybergroup to develop a Web-based business
intelligence mining library with a Web-based search that seamlessly combines
the client's structured SQL database and its separate document collection.
Project Requirements and Background
Cybergroup's client realized that database information, although
critical to its business intelligence, represented only a small portion of all
its corporate information. By the client's estimate, its corporate database
contained a mere 20% of business-decision information, while the remaining 80%
could be found in other sources: Web site pages, Microsoft Office documents,
PDFs, etc. The client needed a single search that would cover both the SQL
database and the file repository and return unified results from both sources.
To ensure that a search of the combined database and
document repository retrieves all relevant information, the client required
not only basic search functionality, such as word and phrase searching, but
also advanced search features. The client wanted stemming, fuzzy searching to
catch misspelled words, and phonic searching. The client also wanted concept
searching, including synonym expansion using both pre-defined thesaurus terms
and a user-defined thesaurus/synonym list.
For sorting search results, the client wanted a variety of
advanced relevancy ranking options. Finally, for ease of browsing search
results, the client specified that the search must return retrieved SQL
database entries and documents with highlighted hits, preferably with a
WYSIWYG display of Web-ready formats such as HTML, PDF, and XML that also
shows the highlighted hits.
For ongoing digital library management, the client needed Cybergroup to
develop a solution that lets multiple contributors upload documents to the Web
library. Upon document check-in, the client also needed a mechanism to add
metadata about the document to the client's main SQL database.
Solution Overview
To meet all of the above requirements for the project's
search functionality, Cybergroup chose the dtSearch Text Retrieval Engine for
Win & .NET by dtSearch Corp. (http://www.dtsearch.com).
A single dtSearch index could include both the SQL database and the separate
document repository, and support searching with all of the above advanced
search features, ranking capabilities, and hit-highlighted display options.
To use these built-in capabilities, Cybergroup needed to
write custom VB.NET code to carry along certain fields from the database that
would be associated with each document and stored in the searchable index.
Cybergroup also needed to write a custom ASP.NET-based server control using the
dtSearch Engine APIs. Cybergroup called this application its "dtResults
Control"; screenshots and a detailed description of Cybergroup's dtResults
Control follow.
Figure 1: Cybergroup's dtResults Control.
Figure 2: Cybergroup's dtResults Control.
Cybergroup's Description of dtResults Control
As with any .NET control, a developer can drag and drop the
dtResults Control right into a development environment. Cybergroup implemented
the dtResults Control by inheriting from the DataGrid control, leveraging the
existing power of the DataGrid. Cybergroup chose the DataGrid as a foundation
for its server control because it offers built-in paging and a robust
programming model.
The following code is from Cybergroup's sample application.
It runs when the user enters a search term or phrase and clicks the Search
button:
Private Sub GetResults()
    'Setting the location of the index
    SearchResultList1.IndexPath = "c:\dbconnectorindex"

    'Mapping virtual path of documents to physical path
    Dim rptd As New SearchResultList.SearchResultList.ResultPathTranslationDictionary
    rptd.Add("c:\testdocs", "./testdocs")

    'Setting various search settings
    SearchResultList1.RelativePathTranslations = rptd
    SearchResultList1.SortCaseInsensitive = cbCaseInsensitive.Checked
    SearchResultList1.SortAscending() = ddAscendingFlag.SelectedValue
    SearchResultList1.SearchType = ddSearchType.SelectedValue
    SearchResultList1.SortType = ddSort.SelectedValue
    SearchResultList1.Stemming = cbStemming.Checked
    If cbFuzzyness.Checked = True Then
        SearchResultList1.Fuzzy = True
        SearchResultList1.FuzzLevel = ddFuzzyness.SelectedValue
    Else
        SearchResultList1.Fuzzy = False
    End If
    SearchResultList1.Phonic = cbPhonic.Checked
    SearchResultList1.Synonyms = cbSynonyms.Checked

    'Defining dtSearch custom fields to be displayed
    Dim cfn As String() = {"SupplierID", "CompanyName", "Region"}
    SearchResultList1.CustomFieldNames = cfn
    Dim cffn As String() = {"Supplier ID #", "Company Name"}
    SearchResultList1.CustomFieldFriendlyNames = cffn

    If chkSearchWithin.Checked = True Then
        SearchResultList1.SearchWithin = True
        SearchResultList1.PreviousSearchFilter = Session("psf")
    End If

    'Executing the search and binding the results
    SearchResultList1.GetResults(tbSearch.Text)

    'Storing the "previous search filter" to be used later if user clicks
    '"Search Within Results"
    Session("psf") = SearchResultList1.PreviousSearchFilter

    Literal1.Text = "Search: <B><I>" & tbSearch.Text & "</I></B> returned: " & _
        CType(SearchResultList1.DataSource, DataTable).Rows.Count & " results"
End Sub
The following gives a flavor of the design and functionality
behind Cybergroup's dtResults Control.
Using the GetResults method of the dtResults Control,
Cybergroup reduced the task of creating the search and results display to one
line of code in the simplest case. We can execute a search and then display
results by passing the search string the user entered on the search form, as in
this example:
SearchResultList1.GetResults(tbSearch.Text) 'ONLY ONE LINE OF CODE
Of course, a developer can also leverage the power of the dtResults
Control through its properties. Take, for example, the SortType property. Simply
put, the SortType property lets the developer sort the information in the
results display. Let's say the developer wants the most recently
modified documents to appear first in the results display. The developer would
set the SortType property to "date" and the Ascending property to "false"; for
example:
SearchResultList1.SortType = "date"
SearchResultList1.Ascending = false
Internally, the control checks the SortType value against a canned set of
strings such as "date", "hits", and "title", and checks the Ascending
variable. The control then produces a hex value containing dtSearch flags,
encoded as its sort function expects. The binary manipulations are abstracted
away, however, and the developer can even bind the variables to checkboxes or
dropdown lists with single lines of code.
Here's the code in the dtResults Control for the SortType
property:
Dim flags As New dtengine.SortType
If Not (sortf = 0) Then
    flags = sortf
ElseIf sortt = "hits" Then
    flags = dtengine.SortType.stSortByHits
ElseIf sortt = "index" Then
    flags = dtengine.SortType.stSortByIndex
ElseIf sortt = "date" Then
    flags = dtengine.SortType.stSortByDate
ElseIf sortt = "timeofday" Then
    flags = dtengine.SortType.stSortByTime
ElseIf sortt = "title" Then
    flags = dtengine.SortType.stSortByTitle
ElseIf sortt = "name" Then
    flags = dtengine.SortType.stSortByName
ElseIf sortt = "filetype" Then
    flags = dtengine.SortType.stSortByType
ElseIf sortt = "size" Then
    flags = dtengine.SortType.stSortBySize
Else
    flags = dtengine.SortType.stSortByUserField
End If
If sascend Then
    flags += dtengine.SortType.stSortAscending
End If
If cinsens Then
    flags += dtengine.SortType.stSortCaseInsensitive
End If
res.Sort(flags, sortt)
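The flag-building pattern above can be tried in isolation. The following is only a sketch: DemoSortFlags is a hypothetical stand-in for dtengine.SortType, its numeric values are invented for illustration, and BuildSortFlags is not part of the dtResults Control. It combines the options with Or rather than +=, which is a common way to set bit flags because it cannot set the same bit twice.

```vbnet
Imports System

Module SortFlagSketch

    ' Hypothetical stand-in for dtengine.SortType. The numeric values are
    ' invented for this sketch and are NOT the real dtSearch constants.
    <Flags()>
    Enum DemoSortFlags
        None = 0
        SortAscending = &H1
        SortCaseInsensitive = &H2
        SortByHits = &H10
        SortByDate = &H20
        SortByName = &H40
        SortByUserField = &H80
    End Enum

    ' Translate a friendly sort name plus two Boolean options into one flag value.
    Function BuildSortFlags(sortName As String,
                            sortAscending As Boolean,
                            caseInsensitive As Boolean) As DemoSortFlags
        Dim flags As DemoSortFlags
        Select Case sortName.ToLowerInvariant()
            Case "hits"
                flags = DemoSortFlags.SortByHits
            Case "date"
                flags = DemoSortFlags.SortByDate
            Case "name"
                flags = DemoSortFlags.SortByName
            Case Else
                ' Any other string is treated as the name of a custom field.
                flags = DemoSortFlags.SortByUserField
        End Select
        If sortAscending Then flags = flags Or DemoSortFlags.SortAscending
        If caseInsensitive Then flags = flags Or DemoSortFlags.SortCaseInsensitive
        Return flags
    End Function

    Sub Main()
        ' Most recently modified first, ignoring case.
        Dim flags As DemoSortFlags = BuildSortFlags("date", False, True)
        Console.WriteLine("{0} (0x{1:X})", flags, CInt(flags))
    End Sub

End Module
```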
Critically important to our project is the ability to
extract custom field data from the dtSearch index. Custom fields are columns
that we have extracted from the database during the indexing process and now
wish to present in a search results display.
Through the use of the CustomFieldNames and the CustomFieldFriendlyNames
properties, a developer can easily and attractively display database
information in the results display.
The CustomFieldNames property is a string array of the
names of custom fields (i.e., database columns) in the index that the developer
wishes to include in the results. The strings in it should appear exactly as
they do in the index. For example, {"SupplierID", "CompanyName", "Region"}.
The CustomFieldFriendlyNames property is a string array
of the field names that the developer would like to have appear in the control.
This allows a high degree of customization in results presentation. Rather than
display cryptic database column names, the developer can display understandable
labels. These names are matched to the actual custom fields by their position
in the array, relative to the CustomFieldNames property above. If the array is
longer than CustomFieldNames, the extra entries at the end are discarded. If it
is shorter, the names of the remaining custom fields default to their actual
names. For example, {"ID # of Supplier", "Supplier Name", "Supplier's
Region"}.
To return the custom field information in the results
display, the developer would simply set the properties as in the following
example:
Dim cfn As String() = {"SupplierID", "CompanyName", "Region"}
SearchResultList1.CustomFieldNames = cfn
Dim cffn As String() = {"Supplier ID #", "Company Name"}
SearchResultList1.CustomFieldFriendlyNames = cffn
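The positional matching between the two arrays can be sketched as follows. ResolveLabels is a hypothetical helper written for this illustration, not the control's actual code; it assumes only the rules described above (extra friendly names are ignored, and missing ones fall back to the field name).

```vbnet
Imports System

Module FriendlyNameSketch

    ' Hypothetical helper: pair each custom field name with its display
    ' label by array position.
    Function ResolveLabels(fieldNames As String(), friendlyNames As String()) As String()
        Dim labels(fieldNames.Length - 1) As String
        For i As Integer = 0 To fieldNames.Length - 1
            If friendlyNames IsNot Nothing AndAlso i < friendlyNames.Length Then
                labels(i) = friendlyNames(i)   ' a friendly name exists at this position
            Else
                labels(i) = fieldNames(i)      ' fall back to the actual field name
            End If
        Next
        ' Entries in friendlyNames beyond fieldNames.Length are never read.
        Return labels
    End Function

    Sub Main()
        Dim cfn As String() = {"SupplierID", "CompanyName", "Region"}
        Dim cffn As String() = {"Supplier ID #", "Company Name"}
        For Each label As String In ResolveLabels(cfn, cffn)
            Console.WriteLine(label)
        Next
    End Sub

End Module
```

With the arrays from the example, this prints "Supplier ID #", "Company Name", and "Region": the third column keeps its index name because no friendly name was supplied for it.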
Following is a complete list of the dtResults Control
properties and methods:
Ascending: If true, the results will be sorted in ascending order by whatever
criterion is specified in SortType. If false, results are sorted in descending
order. Defaults to false.
CustomFieldNames: This string array represents the names of custom fields in
the index that the developer chooses to include in the results. The strings in
it should appear exactly as they do in the index; for example, {"SupplierID",
"CompanyName", "Region"}.
CustomFieldFriendlyNames: This string array represents the names of the fields
that the developer wants to appear in the control. These names are matched to
actual custom fields by their position in the array, relative to
CustomFieldNames. If the array is longer than CustomFieldNames, the extra
entries at the end are discarded. If it is shorter, the names of the remaining
custom fields default to their actual names. For example, {"ID # of Supplier",
"Supplier Name", "Supplier's Region"}.
Fuzzy and FuzzLevel: These control the tolerance of the search; for example,
searching for "alphabet" with Fuzzy = True and FuzzLevel = 1 would also search
for "alphaqet" or "albhabet". Searching for "alphabet" with Fuzzy on and
FuzzLevel at 3 would also find "alpkaqet".
IndexPath: This is the location of the dtSearch index files to use for
searching. If it is not set, then SearchResults will look for an IndexPath key
in Web.config.
Phonic: Controls phonic searching; for example, with Phonic = True, searching
for "Smith" would also find "Smythe".
PreviousSearchFilter: This allows the developer to create "Search Within
Results" functionality, in conjunction with the SearchWithin property,
described below. This property should be saved to a session variable after the
initial search, and restored from it when the user triggers a "Search Within
Results".
RelativePathTranslations: A SearchResultList.ResultPathTranslationDictionary
that maps the absolute paths of documents stored in the dtSearch index to
relative paths. This allows a URL to be generated for the link to the document,
given only an absolute path on the server. For example, one might include the
following in an initialization method:
Dim rptd As New SearchResultList.SearchResultList.ResultPathTranslationDictionary
rptd.Add("c:/Inetpub/website/search/documents", "documents")
rptd.Add("c:/Inetpub/website/tutorials", "../tutorials")
SearchResultList1.RelativePathTranslations = rptd
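To show what such a translation table makes possible, here is a minimal sketch of prefix-based path translation. It uses a plain Dictionary rather than the control's ResultPathTranslationDictionary, and ToRelativeUrl and the file name intro.pdf are hypothetical; the control's own lookup logic may differ.

```vbnet
Imports System
Imports System.Collections.Generic

Module PathTranslationSketch

    ' Hypothetical helper, not part of the dtResults Control: replace a known
    ' absolute root with its relative counterpart to build a link.
    Function ToRelativeUrl(absolutePath As String,
                           translations As Dictionary(Of String, String)) As String
        ' Normalize Windows separators so both styles compare equally.
        Dim normalized As String = absolutePath.Replace("\"c, "/"c)
        For Each pair As KeyValuePair(Of String, String) In translations
            Dim root As String = pair.Key.Replace("\"c, "/"c).TrimEnd("/"c)
            ' Match whole folder names only, ignoring case as Windows paths do.
            If normalized.StartsWith(root & "/", StringComparison.OrdinalIgnoreCase) Then
                Return pair.Value.TrimEnd("/"c) & normalized.Substring(root.Length)
            End If
        Next
        Return Nothing ' no translation registered for this path
    End Function

    Sub Main()
        Dim translations As New Dictionary(Of String, String)
        translations.Add("c:/Inetpub/website/search/documents", "documents")
        translations.Add("c:/Inetpub/website/tutorials", "../tutorials")
        Console.WriteLine(ToRelativeUrl("C:\Inetpub\website\tutorials\intro.pdf", translations))
    End Sub

End Module
```

For the sample path, the sketch prints ../tutorials/intro.pdf, which a results page could use directly as a link target.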
SearchType: A string. Valid values are "allwords", "anywords", "phrase", and
"boolean". In the "allwords" setting, dtSearch will search for any document
containing each word in the search, in any order or proximity. In the
"anywords" setting, dtSearch will search for any documents containing any of
the words in the search query, not necessarily all of them in the same
document. In the "phrase" setting, dtSearch will treat the entire search query
like a single word, and search for documents containing the exact query. In the
"boolean" setting, the user can use Boolean logic to specify a query. dtSearch
provides the following guidance:
- tart apple pie - the entire phrase must be present
- apple pie and pear tart - both phrases must be present
- apple pie or pear tart - either phrase must be present
- apple pie and not pear tart - only apple pie must be present
- apple w/5 pear - apple must occur within 5 words of pear
- apple not w/27 pear - apple must not occur within 27 words of pear
- subject contains apple pie - finds apple pie in a subject field
- use parentheses if the query contains more than one connector
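One way to picture the first three SearchType settings is as shorthand for Boolean requests like those in the guidance above. The sketch below is an assumption made for illustration only: BuildRequest is hypothetical, and the dtResults Control may instead pass the search type to the dtSearch Engine without rewriting the query.

```vbnet
Imports System

Module SearchTypeSketch

    ' Hypothetical helper: express a plain word list as a Boolean request
    ' using the "and"/"or" connectors shown in the guidance above.
    Function BuildRequest(query As String, searchType As String) As String
        Dim words As String() =
            query.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        Select Case searchType.ToLowerInvariant()
            Case "allwords"
                Return String.Join(" and ", words)          ' every word required
            Case "anywords"
                Return String.Join(" or ", words)           ' any one word suffices
            Case "phrase"
                Return """" & String.Join(" ", words) & """" ' quoted exact phrase
            Case Else
                Return query                                ' "boolean": pass through
        End Select
    End Function

    Sub Main()
        Console.WriteLine(BuildRequest("apple pear tart", "allwords"))
        Console.WriteLine(BuildRequest("apple pear tart", "anywords"))
        Console.WriteLine(BuildRequest("apple w/5 pear", "boolean"))
    End Sub

End Module
```

The first two calls print "apple and pear and tart" and "apple or pear or tart"; the Boolean request is returned unchanged.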
SearchWithin: If this property is set to True, and the PreviousSearchFilter
property is set to a value obtained from it after a previous GetResults call,
then the results of the current search will be a subset of the results of the
previous search.
SortType: A string. Meaningful values are "hits", "date", "name", and "size".
If set to "hits", the documents containing the most occurrences of the search
query, or the highest score, will appear on top. If set to "date", the most
recently modified documents will appear on top. If set to "name", the documents
will be sorted in alphabetical order of their title. If set to "size", the
documents with the largest file sizes will appear on top. If the property has a
different value than any of these, it is assumed to be the name of a custom
field in the index by which to sort.
Stemming: Controls the word stemming capability of dtSearch. For example, if
Stemming = True, searches for "apply", "applying", "applier", or "applies" are
all equivalent.
Synonyms: Uses an English thesaurus to search for synonyms of the search query
in addition to the search query itself.
GetResults(SearchText As String): Simply put, this function runs a search for
the query string passed, using the settings determined by the control's
properties, and displays the results in a human-readable format, with 10
results per page and a pager control. Until this method is called, the control
is invisible to the user.
Greg Bean is President of Cybergroup, Inc., a developer of advanced Internet
and intranet developer search tools in Baltimore, MD. E-mail him at
gbean@cybergroup.com.
dtSearch
dtSearch offers over a decade of experience in text search
and retrieval. Large enterprises typically use dtSearch products for general
information retrieval, Internet and intranet site searching, access to
technical documentation, and embedding in applications for distribution.
dtSearch is also on the US Government's GSA Schedule. The company has
distributors worldwide, including coverage on six continents. For more
information, visit http://www.dtsearch.com.