Ralph Rönnquist 25cbcb24c1 deb-ized 5 months ago

README.adoc

bigrep

This is a text search tool that uses "bigram matching". Here this means that both the search string and the text files are treated as sequences of overlapping pairs of characters. A match allows the search string bigrams to be separated or dropped, with a "badness points" cost attached to each such action.
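As an illustration of "overlapping pairs of characters", the following is a minimal sketch of how a string breaks down into bigrams. The helper names `count_bigrams` and `get_bigram` are hypothetical and are not taken from bigrep's sources.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of overlapping bigrams in s.
 * A string of n characters has n - 1 bigrams (none if n < 2). */
size_t count_bigrams(const char *s)
{
    size_t n = strlen(s);
    return n < 2 ? 0 : n - 1;
}

/* Hypothetical helper: copy the i-th bigram of s into out as a
 * NUL-terminated string. Returns 1 on success, 0 if i is out of range. */
int get_bigram(const char *s, size_t i, char out[3])
{
    if (i >= count_bigrams(s))
        return 0;
    out[0] = s[i];
    out[1] = s[i + 1];
    out[2] = '\0';
    return 1;
}
```

Under this reading, "hello" yields the four bigrams "he", "el", "ll" and "lo", each overlapping the next by one character.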

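For example, if the search string is "hello" and the target text line is "also here we are mellow", bigrep renders the following match options. Each line shows the total cost followed by the placed search bigrams, each with its zero-based offset in the text line: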

*** 12: he:5 el:18 ll:19 lo:20
*** 13: he:5 el:18 ll:19
*** 14: he:5 ll:19 lo:20
*** 14: he:5 el:18
*** 14: he:5 el:18 lo:20
*** 15: he:5 ll:19
*** 16: he:5 lo:20

The first result above places all search bigrams in the text line at a total cost of 12 "badness points", which is due to the displacement of the "el" bigram. The second result has the same placements but drops the last search bigram, which raises the total cost to 13. And so forth.

The default cost parameter settings are:

drop_cost = 1; // Cost of dropping another bigram
space_cost = 0; // Cost of dropping a bigram starting with space
displace_cost = 1; // Cost of displacing bigram other than the first
threshold_cost = 20; // Threshold for keeping option

displace_cost is a factor that is multiplied by the amount of displacement.
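The costs in the "hello" example can be reproduced with a simple cost model. The sketch below is inferred from the listed results, not from bigrep's sources. It assumes that each placed bigram after the first pays `displace_cost` times the number of text positions skipped since the previously placed bigram, and that each dropped bigram pays `drop_cost` (or `space_cost` if it starts with a space). The `struct costs` type and the `option_cost` function are hypothetical names. The threshold check is left out because the README does not say whether the comparison is inclusive.

```c
#include <stddef.h>

/* Cost parameters; the defaults below are the ones listed in this README. */
struct costs {
    int drop_cost;      /* dropping a bigram */
    int space_cost;     /* dropping a bigram that starts with a space */
    int displace_cost;  /* factor applied to the amount of displacement */
};

const struct costs default_costs = { 1, 0, 1 };

/* Assumed model, inferred from the example output.
 * search: the search string; bigram i starts at search[i].
 * pos[i]: text offset where bigram i was placed, or -1 if dropped.
 * n:      number of search bigrams. */
int option_cost(const char *search, const int *pos, size_t n,
                const struct costs *c)
{
    int total = 0;
    int prev = -1; /* offset of the last placed bigram; -1 means none yet */

    for (size_t i = 0; i < n; i++) {
        if (pos[i] < 0) {
            total += (search[i] == ' ') ? c->space_cost : c->drop_cost;
        } else {
            if (prev >= 0) /* the first placed bigram is never charged */
                total += c->displace_cost * (pos[i] - prev - 1);
            prev = pos[i];
        }
    }
    return total;
}
```

With the default settings, the placement `he:5 el:18 ll:19 lo:20` costs 1 × (18 − 5 − 1) = 12, and `he:5 ll:19 lo:20` costs 1 for the dropped "el" plus 1 × (19 − 5 − 1) = 13 for the displaced "ll", giving 14. Both agree with the example.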