sparey diesel |
"sparey diesel" |
Ticking the checkbox for all terms will enable AND logic.
This means that a page must contain all of the terms you specify in order
to qualify. Note that even with AND logic, unless you use phrases as shown
above, you are only guaranteed that returned pages contain all the terms
somewhere, without any regard to order or proximity. So using the first
example shown above with the "all" checkbox ticked will return the pages
that contain both of the words sparey and diesel
someplace, at least once each, with no regard to where they appear.
Naturally, AND searches will take longer.
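For the curious, the difference between the default "any term" search, the AND search and a phrase search can be sketched in a few lines of Python. This is only an illustration, not the site's Perl script: the function name page_matches and the sample pages are invented for the example.

```python
def page_matches(text, terms, require_all=False):
    """Return True if the page text satisfies the query.

    require_all=False: any one term is enough (the default search).
    require_all=True:  every term must appear somewhere (AND logic).
    Order and proximity of the terms are ignored in both cases.
    """
    haystack = text.lower()
    hits = [term.lower() in haystack for term in terms]
    return all(hits) if require_all else any(hits)


# Made-up pages, purely for illustration.
pages = {
    "a.html": "Notes on running a Sparey engine as a diesel.",
    "b.html": "A page that mentions Sparey only.",
    "c.html": "A page about diesel fuel mixtures.",
    "d.html": "Nothing relevant here.",
}

terms = ["sparey", "diesel"]
or_hits = sorted(n for n, t in pages.items() if page_matches(t, terms))
and_hits = sorted(
    n for n, t in pages.items() if page_matches(t, terms, require_all=True)
)
# A quoted phrase is treated as one term, so the words must be adjacent.
phrase_hits = sorted(
    n for n, t in pages.items() if page_matches(t, ["sparey diesel"])
)

print(or_hits)
print(and_hits)
print(phrase_hits)
```

Note that a.html passes the AND test even though the two words are several words apart, while the phrase search finds nothing because no page contains the two words side by side.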
In order to give back results as fast as possible, the pages found are
simply sorted alphabetically. However, you have the option to "rank" the
results by checking the second checkbox. This causes the listed result
pages to be ordered using a technique known as
"term weighted Inverse Document Frequency" (IDF).
Briefly, this technique assumes that terms that do not appear very
often on any page will be more interesting than those that appear
frequently on lots of pages, hence these terms are "weighted" to rank
their page(s) higher. So a page with a few interesting words may be
ranked higher than one with lots of common words.
Each page is allocated a "score": for each search term, the number of times
the term occurs on the page is multiplied by the IDF value for that term, and
these products are added up.
The number displayed for each result is a percentage value
derived by dividing the page's score by the maximum score
recorded for all of the pages found. Hence the one at the top will be 100%
and so on down. I think you can guess that this takes longer--and a ranked,
AND logic query will take longest of all.
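As a rough illustration of the scoring just described, here is a minimal Python sketch. Be warned that the exact IDF formula used by the site is not given here, so the form log(1 + N/df) below is an assumption, as are the function name rank_pages and the sample pages.

```python
import math


def rank_pages(pages, terms):
    """Score pages as sum(term count * IDF) and scale to a percentage."""
    total = len(pages)
    lowered = {name: text.lower() for name, text in pages.items()}

    idf = {}
    for term in terms:
        t = term.lower()
        # df = number of pages on which the term appears at all
        df = sum(1 for text in lowered.values() if t in text)
        # ASSUMED IDF form: rarer terms (small df) get a larger weight.
        idf[t] = math.log(1 + total / df) if df else 0.0

    scores = {}
    for name, text in lowered.items():
        score = sum(text.count(t) * weight for t, weight in idf.items())
        if score > 0:
            scores[name] = score
    if not scores:
        return []

    best = max(scores.values())
    # Highest score first; ties fall back to alphabetical order.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, round(100 * s / best)) for name, s in ordered]


# Made-up pages: "sparey" is rare, "diesel" is common.
pages = {
    "a.html": "sparey sparey",
    "b.html": "diesel diesel diesel",
    "c.html": "diesel",
    "d.html": "diesel fuel",
}

ranked = rank_pages(pages, ["sparey", "diesel"])
for name, percent in ranked:
    print(f"{percent:3d}%  {name}")
```

Here a.html comes out on top at 100% with only two occurrences of the rare word, ahead of b.html with three occurrences of the common one, which is the whole point of the weighting.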
Searches are not case sensitive, so the terms Sparey and
sparey (or SPAREY) will all return the same result.
Apart from rampaging feature-itis, I can see no real benefit in providing
an option for a case sensitive search. Please let me know if you can
provide a scenario where this would help.
By default, term and phrase searches are made without consideration
for word boundaries, so a search for carb would match things like
carb, carburettor, carbon and bicarbonate!
Checking the "Force whole word match" box will restrict matches to whole
word boundaries (surprise). The cost, time-wise, is not a lot more. I've
made the default "match partials" to increase the chance of hits. Tick the
box if you're being deluged with false positives.
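The difference between partial and whole-word matching (and the case insensitivity mentioned earlier) can be sketched with Python's re module. This is an illustration under my own assumptions rather than the site's Perl code; build_pattern is a made-up helper name.

```python
import re


def build_pattern(term, whole_word=False):
    """Compile a case-insensitive pattern for one search term."""
    pattern = re.escape(term)  # treat the term as literal text
    if whole_word:
        # \b marks a word boundary. This works for terms that start and
        # end with a letter or digit; terms with leading or trailing
        # punctuation would need extra care.
        pattern = r"\b" + pattern + r"\b"
    return re.compile(pattern, re.IGNORECASE)


words = ["carb", "carburettor", "carbon", "bicarbonate", "Carb"]

partial = [w for w in words if build_pattern("carb").search(w)]
whole = [w for w in words if build_pattern("carb", whole_word=True).search(w)]

print(partial)
print(whole)
```

The partial search matches all five words, while the whole-word search keeps only "carb" and "Carb".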
For the techos, I must confess that the term parsing is rather simplistic
and I imagine it will be possible to break it in several ways. Under the
hood, the work is being done by a CGI Perl script using regular expressions
(REs). If you don't know an RE from a hole in the road, consider yourself
fortunate, and possibly still relatively sane. If you do, you can enter one
as a term and really cause chaos with my poor script! (Although several of
the RE metacharacters
are escaped--someone searching for an "ETA .29" really wants ".29", not any
character followed by "29", right?)
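To see why the escaping matters, here is a small Python sketch of the ".29" case. Note one difference from the script described above: Python's re.escape escapes every metacharacter, whereas the site's script escapes only several of them, so treat this as an illustration of the idea only. The sample strings are invented.

```python
import re

term = ".29"

raw = re.compile(term)                 # "." means "any character"
escaped = re.compile(re.escape(term))  # "." means a literal full stop

wanted = "an ETA .29 in good condition"
unwanted = "first flown in 1929"

# The unescaped pattern also matches "929" inside "1929".
raw_false_hit = raw.search(unwanted) is not None
# The escaped pattern only matches a real ".29".
escaped_false_hit = escaped.search(unwanted) is not None
escaped_true_hit = escaped.search(wanted) is not None

print(raw_false_hit, escaped_false_hit, escaped_true_hit)
```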
Currently, this site is hosted on an old Solaris box with an old Perl
implementation. Tests have shown a modern Linux server runs the search an
order of magnitude faster--I suspect that file globbing and lack of support
for pre-compiled REs are the culprits. Sad, but that's life.
Finally, if you manage to break the script in some way, indicated by
an Apache error page that mumbles about an
"internal server error"
and suggests that you contact the webmaster,
please contact me--not our poor,
besieged Web Mistress (or you'll be sorry).
Ranked Searches
Case Sensitivity
Word Matching
Under the Hood
Oops, I broke it!