# TextCat

     /\_/\
    ( . . )
    =\_v_/=

This is a PHP port of the TextCat language guesser utility.

Please see also the [original Perl
version](http://odur.let.rug.nl/~vannoord/TextCat/), and an [updated
Perl version](https://github.com/Trey314159/TextCat).

## Contents

The package contains the classifier class itself and two tools: one for
classifying texts and one for generating the ngram database. The code
now assumes the text encoding is UTF-8, since it's easier to extract
ngrams this way. Also, everybody uses UTF-8 now and I, for one, welcome
our new UTF-8-encoded overlords.

### Classifier

The classifier is the script `catus.php`, which can be run as:

    echo "Bonjour tout le monde, ceci est un texte en français" | php catus.php -d LM

or

    php catus.php -d LM -l "Bonjour tout le monde, ceci est un texte en français"

The output is the list of candidate languages, separated by `OR`, e.g.:

    fr OR ro

Please note that the provided collection of language models includes a
model for Oriya (ଓଡ଼ିଆ), which has the language code `or`, so results
like `or OR sco OR ro OR nl` are possible. In such output, the lowercase
`or` is a language code and the uppercase `OR` is the separator.
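
If you consume this output from another script, a case-sensitive split
keeps the Oriya code `or` apart from the `OR` separator. The following
is a minimal sketch, assuming the separator is exactly ` OR ` as in the
examples above; `parseCatusOutput` is a hypothetical helper and not part
of TextCat.

```php
<?php
// Hypothetical helper (not part of TextCat): split one line of
// catus.php output into language codes.
function parseCatusOutput(string $line): array
{
    $line = trim($line);
    if ($line === '') {
        return [];
    }
    // explode() is case-sensitive, so the separator " OR " never
    // matches the lowercase language code "or".
    return explode(' OR ', $line);
}

print_r(parseCatusOutput('or OR sco OR ro OR nl'));
```

The sketch does not handle any other messages the script might print.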

### Generator

To generate the language model database from a set of texts, use the
script `felis.php`. It can be run as:

    php felis.php INPUTDIR OUTPUTDIR

It reads texts from `INPUTDIR` and generates ngram files in
`OUTPUTDIR`. The files in `INPUTDIR` are assumed to have names like
`LANGUAGE.txt`, e.g. `english.txt`, `german.txt`, `klingon.txt`, etc.

If you are working with sizable corpora (e.g., millions of characters),
you should set `$minFreq` in `TextCat.php` to a reasonably small value,
like `10`, to trim the very long tail of infrequent ngrams before they
are sorted. This reduces the CPU and memory requirements for generating
the language models. When *evaluating* texts, `$minFreq` should be set
back to `0` unless your input texts are fairly large.
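
To illustrate why trimming before sorting helps, here is a rough sketch
of ngram counting with a minimum-frequency cutoff. It is not the code in
`TextCat.php`: the tokenization (whitespace split, `_` padding, ngrams
of 1 to 5 characters), the function names, and the inclusive cutoff are
all assumptions made for the example.

```php
<?php
// Illustrative only: NOT TextCat's actual code. The tokenization
// (whitespace split, "_" padding, ngrams of 1 to 5 characters) is assumed.
function countNgrams(string $text, int $maxLen = 5): array
{
    $counts = [];
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Split into UTF-8 characters without needing the mbstring extension.
        $chars = preg_split('//u', '_' . $word . '_', -1, PREG_SPLIT_NO_EMPTY);
        $n = count($chars);
        for ($len = 1; $len <= $maxLen; $len++) {
            for ($i = 0; $i + $len <= $n; $i++) {
                $ngram = implode('', array_slice($chars, $i, $len));
                $counts[$ngram] = ($counts[$ngram] ?? 0) + 1;
            }
        }
    }
    return $counts;
}

function rankNgrams(array $counts, int $minFreq = 0): array
{
    if ($minFreq > 0) {
        // Trim the long tail first so that the sort touches fewer entries.
        // Assumption: the cutoff is inclusive (counts >= $minFreq are kept).
        $counts = array_filter($counts, fn ($c) => $c >= $minFreq);
    }
    arsort($counts); // most frequent ngrams first
    return $counts;
}

$counts = countNgrams('la la land');
print_r(rankNgrams($counts, 3));
```

On a real corpus most distinct ngrams occur only a handful of times, so
filtering first leaves far fewer entries to sort.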

## Models

The package comes with a default language model database in the `LM`
directory and a query-based language model database in the `LM-query`
directory. However, model performance depends heavily on the text
corpus the models are applied to, as well as on specific modifications
(e.g., capitalization, diacritics). Currently the library does not modify
or normalize either training texts or classified texts in any way, so
custom language models may be more appropriate for specific
applications.

Model names use [Wikipedia language
codes](https://en.wikipedia.org/wiki/List_of_Wikipedias), which are
often but not guaranteed to be the same as [ISO 639 language
codes](https://en.wikipedia.org/wiki/ISO_639).

When detecting languages, you will generally get better results when you
can limit the number of language models in use, especially with very
short texts. For example, if there is virtually no chance that your text
could be in Irish Gaelic, including the Irish Gaelic language model
(`ga`) only increases the likelihood of mis-identification. This is
particularly true for closely related languages (e.g., the Romance
languages, or English/`en` and Scots/`sco`).

Limiting the number of language models used also generally improves
performance. You can copy your desired language models into a new
directory (and use `-d` with `catus.php`) or specify your desired
languages on the command line (use `-c` with `catus.php`).

You can also combine models in multiple directories (e.g., to use the
query-based models with a fallback to Wiki-Text-based models) with a
comma-separated list of directories (use `-d` with `catus.php`).
Directories are scanned in order, and only the first model found with a
particular name will be used.
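
The "first model found wins" rule can be pictured with the sketch below.
It is not TextCat's actual loader: the function name `resolveModels` and
the `.lm` file extension are assumptions made for illustration.

```php
<?php
// Illustrative only: not TextCat's actual loader. The ".lm" file extension
// and the function name are assumptions made for this sketch.
function resolveModels(array $dirs): array
{
    $models = [];
    foreach ($dirs as $dir) {
        // glob() may return false on error; treat that as "no files".
        $files = glob(rtrim($dir, '/') . '/*.lm') ?: [];
        foreach ($files as $path) {
            $name = basename($path, '.lm');
            // Keep only the first model found for each name, so earlier
            // directories take precedence over later ones.
            if (!isset($models[$name])) {
                $models[$name] = $path;
            }
        }
    }
    return $models;
}

// Mirrors a comma-separated `-d` value such as "LM-query,LM".
$models = resolveModels(explode(',', 'LM-query,LM'));
print_r($models);
```

With this ordering, a language present in both directories resolves to
its query-based model, and a language present only in `LM` falls back to
the Wiki-Text model.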

### Wiki-Text models

Each of the 70 language models in `LM` is based on text extracted from
randomly chosen articles from the Wikipedia for that language. The
languages included were chosen based on a number of criteria, including
the number of native speakers of the language, the number of queries to
the various wiki projects in the language (not just Wikipedia), the list
of languages supported by the original TextCat, and the size of Wikipedia
in the language (i.e., the size of the collection from which to draw a
training corpus).

The training corpus for each language was originally made up of ~2.7M to
~2.8M characters, excluding markup. The texts were then lightly
preprocessed, as follows. HTML tags were removed.
Lines were sorted and `uniq`-ed (so that Wikipedia idiosyncrasies, like
"References", "See Also", and "This article is a stub", are not
over-represented, and so that articles randomly selected more than once
were reduced to one copy). For corpora in Latin character sets, lines
containing no Latin characters were removed. For corpora in non-Latin
character sets, lines containing only Latin characters, numbers, and
punctuation were removed. This character-set-based filtering removed
anywhere from dozens to thousands of lines from the various corpora. For
corpora in multiple character sets (e.g., Serbo-Croatian/`sh`,
Serbian/`sr`, Turkmen/`tk`), no such character-set-based filtering was
done. The final size of the training corpora ranged from ~1.8M to ~2.8M
characters.
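
If you want to prepare a custom corpus along similar lines, the steps
above might look roughly like the following sketch. It is a
reconstruction from the description, not the script that was actually
used. The function name, the `$script` values, the regular expressions,
and the trimming and dropping of empty lines are all assumptions.

```php
<?php
// Illustrative only: a rough reconstruction of the described steps, not the
// script actually used. Function name and regular expressions are assumptions.
function preprocessCorpus(string $raw, string $script): array
{
    // 1. Remove HTML tags.
    $lines = explode("\n", strip_tags($raw));

    // 2. Sort lines and drop duplicates (like `sort | uniq`).
    $lines = array_unique(array_map('trim', $lines));
    sort($lines, SORT_STRING);

    // 3. Character-set-based filtering.
    $kept = [];
    foreach ($lines as $line) {
        if ($line === '') {
            continue; // assumption: empty lines carry no training signal
        }
        $hasLatin = preg_match('/\p{Latin}/u', $line) === 1;
        $onlyLatinEtc = preg_match('/^[\p{Latin}\p{N}\p{P}\s]+$/u', $line) === 1;
        if ($script === 'latin' && !$hasLatin) {
            continue; // Latin corpus: drop lines with no Latin characters
        }
        if ($script === 'non-latin' && $onlyLatinEtc) {
            continue; // non-Latin corpus: drop Latin/number/punctuation-only lines
        }
        $kept[] = $line; // any other $script value (e.g. 'mixed') skips filtering
    }
    return $kept;
}

$raw = "<p>Hello world</p>\nReferences\nReferences\nПривет мир\n12345";
print_r(preprocessCorpus($raw, 'non-latin'));
```

Passing `'mixed'` corresponds to the multiple-character-set case, where
only tag removal and de-duplication apply.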

These models have not been thoroughly tested and are provided as-is. We
may add new models or remove poorly performing models in the future.

Each of these models has 10,000 ngrams. The best number of ngrams to use
for language identification is application-dependent. For larger texts
(e.g., containing hundreds of words per sample), significantly smaller
ngram sets may be best. You can set the number to be used by changing
`$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m` with
`catus.php`.

### Wiki Query models

The 30 language models in `LM-query` are based on query data from
Wikipedia, which is less formal (e.g., fewer diacritics are used in
languages that have them) and has a different distribution of words than
general text. The original set of languages considered was based on the
number of queries across all wiki projects for a particular week. The
text has been preprocessed and many queries were removed from the
training sets according to a process similar to that used on the
Wiki-Text models above.

In general, query data is much messier than Wiki-Text (including junk
text and queries in unexpected languages), but the overall performance on
query strings, at least for English Wikipedia, is better.

The final set of models provided is based in part on their performance
on English Wikipedia queries (the first target for language ID using
TextCat). For more details see our [initial
report](https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat)
on TextCat. More languages will be
added in the future based on additional performance evaluations.

Each of these models has 10,000 ngrams. The best number of ngrams to use
for language identification is application-dependent. For larger texts
(e.g., containing hundreds of words per sample), significantly smaller
ngram sets may be best. For short query strings seen on English
Wikipedia, a model size of 3000 to 9000 ngrams has worked best, depending
on other parameter settings. You can set the number to be used by
changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m`
with `catus.php`.


[![Build Status](https://travis-ci.org/smalyshev/textcat.svg?branch=master)](https://travis-ci.org/smalyshev/textcat)