David Hardcastle

In the near future I intend to make some of the software and data sources from Enigma available on this page. For now I have just added some lemma/inflection mapping tables.

Lemma Tables

These lemma/inflection tables were derived from the CUVPlus lexicon, which is freely available from the Oxford Text Archive. I extracted them using a guess/check algorithm (described in my thesis) which generates a candidate list of lemmas for each word in the lexicon and then uses the lexicon itself to filter them. For example, the input putting/VVG must be reduced to a base lexical verb form, so the system tries removing the suffix to form putt/VVB, and removing the suffix and reducing the doubled consonant to form put/VVB. In this case both lemmas are in the lexicon and so both are mapped to putting/VVG. Given the input letting/VVG the same algorithm returns lett/VVB and let/VVB, but only the latter is listed in the lexicon and so only this mapping is added. I constructed a large list of exceptional cases and irreducible forms using a mixture of heuristic algorithms and hand annotation. The exceptions are included in the main lemma/inflection mapping files and the irreducible forms are contained in a separate file. Details of the formatting are in a README file in the zip.
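The guess/check idea can be sketched roughly as follows. This is only a minimal illustration for the -ing case described above, not the algorithm from the thesis: the three-entry lexicon, the function names candidate_lemmas and lemmas_for, and the consonant test are all assumptions made for the example.

```python
# Toy lexicon of (word, tag) pairs standing in for CUVPlus (assumed data).
LEXICON = {("put", "VVB"), ("putt", "VVB"), ("let", "VVB")}


def candidate_lemmas(word, suffix="ing"):
    """Guess step: propose base forms by stripping the suffix."""
    if not word.endswith(suffix) or len(word) <= len(suffix):
        return []
    stem = word[: -len(suffix)]
    candidates = [stem]
    # Also try reducing a doubled final consonant (putt -> put).
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        candidates.append(stem[:-1])
    return candidates


def lemmas_for(word, lexicon=LEXICON, base_tag="VVB"):
    """Check step: keep only the candidates attested in the lexicon."""
    return [c for c in candidate_lemmas(word) if (c, base_tag) in lexicon]


print(lemmas_for("putting"))  # both putt and put are in the toy lexicon
print(lemmas_for("letting"))  # lett is not listed, so only let survives
```

Exceptional cases and irreducible forms would need to be handled separately, as they are in the tables themselves.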

 

BNC TEI titles and keywords

This file contains the keyword terms and title information from the TEI header for each file in the BNC World Edition. It is encoded as a tab-delimited text file: the first item is the filename, the second is a comma-delimited list of keywords and the last is the title. The file is useful if you want to cut subcorpora from the BNC for analysis. Right-click the link below to download, or follow the link to view the file in your browser.
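As a rough sketch of how the file might be used to select a subcorpus, the following reads the three tab-delimited fields and returns the filenames whose keyword list contains a given term. The function names and the sample rows are invented for illustration and are not real BNC entries.

```python
def parse_line(line):
    """Split one row into filename, keyword list and title."""
    filename, keywords, title = line.rstrip("\r\n").split("\t", 2)
    return {
        "filename": filename,
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "title": title,
    }


def files_with_keyword(lines, keyword):
    """Return filenames whose keyword list contains the term (case-insensitive)."""
    wanted = keyword.lower()
    matches = []
    for line in lines:
        if line.count("\t") < 2:  # skip blank or malformed rows
            continue
        record = parse_line(line)
        if wanted in (k.lower() for k in record["keywords"]):
            matches.append(record["filename"])
    return matches


# Invented sample rows in the described layout (placeholders only).
SAMPLE = [
    "X01\thealth, medicine\tExample title one\n",
    "X02\tgardening\tExample title two\n",
    "X03\tMedicine, law\tExample title three\n",
]

print(files_with_keyword(SAMPLE, "medicine"))
```

With the real file, the same function could be called on an open file handle instead of the SAMPLE list.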

 
