Adding full support for a language touches many different parts of the spaCy library. This guide explains how to fit everything together, and points you to the specific workflows for each component.
WORKING ON SPACY'S SOURCE
To add a new language to spaCy, you'll need to modify the library's code. The easiest way to do this is to clone the repository (https://github.com/explosion/spaCy) and build spaCy from source. For more information on this, see the installation guide. Unlike spaCy's core, which is mostly written in Cython, all language data is stored in regular Python files. This means that you won't have to rebuild anything in between: you can simply make edits and reload spaCy to test them.
Obviously, there are lots of ways you can organise your code when you implement your own language data. This guide will focus on how it's done within spaCy. For full language support, you'll need to create a Language subclass, define custom language data, like a stop list and tokenizer exceptions, and test the new tokenizer. Once the language is set up, you can build the vocabulary, including word frequencies, Brown clusters and word vectors. Finally, you can train the tagger and parser, and save the model to a directory.
For some languages, you may also want to develop a solution for lemmatization and morphological analysis.
Language data
Every language is different, and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific, usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organised in simple Python files. This makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be generalised across languages, for example, rules for basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like " and ”. This helps the models make more accurate predictions. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass, for example, English or German.
from spacy.lang.en import English
from spacy.lang.de import German
nlp_en = English() # includes English data
nlp_de = German() # includes German data

Stop words stop_words.py
List of the most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop.
Tokenizer exceptions tokenizer_exceptions.py
Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". Translator's note: standard Chinese does not seem to have this case.
Norm exceptions norm_exceptions.py
Special-case rules for normalising tokens to improve the model's predictions, for example on American vs. British spelling.
Punctuation rules punctuation.py
Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.
Character classes char_classes.py
Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons.
Lexical attributes lex_attrs.py
Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred".
Syntax iterators syntax_iterators.py
Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks.
Lemmatizer lemmatizer.py
Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". Translator's note: this handles the tenses and plural forms of English; Chinese has no such inflection.
Tag map tag_map.py
Dictionary mapping strings in your tag set to Universal Dependencies tags.
Morph rules morph_rules.py
Exception rules for morphological analysis of irregular words like personal pronouns.
The individual components expose variables that can be imported within a language module, and added to the language's Defaults. Some components, like the punctuation rules, usually don't need much customisation and can simply be imported from the global rules. Others, like the tokenizer and norm exceptions, are very specific and will make a big difference to spaCy's performance on the particular language and to training a language model.

SHOULD I EVER UPDATE THE GLOBAL DATA?
Reusable language data is collected as atomic pieces in the root of the spacy.lang package. Often, when a new language is added, you'll find a pattern or symbol that's missing. Even if it isn't common in other languages, it might be best to add it to the shared language data, unless it has some conflicting interpretation. For instance, we don't expect to see guillemet quotation symbols (« and ») in English text. But if we do see them, we'd probably prefer the tokenizer to split them off.
FOR LANGUAGES WITH NON-LATIN CHARACTERS
In order for the tokenizer to split suffixes, prefixes and infixes, spaCy needs to know the language's character set. If the language you're adding uses non-Latin characters, you might need to add the required character classes to the global char_classes.py. spaCy uses the regex library to keep this simple and readable. If the language requires very specific punctuation rules, you should consider overwriting the default regular expressions with your own in the language's Defaults.
Translator's note: for Chinese, the full-width punctuation marks need to be defined.
The Language subclass
Language-specific code and resources should be organised into a subpackage of spaCy, named according to the language's ISO code. For instance, code and resources specific to Spanish are placed into a directory spacy/lang/es, which can be imported as spacy.lang.es. For Chinese, the directory would be spacy/lang/zh, importable as spacy.lang.zh.
To get started, you can use our templates for the most important files. Here's what the class template looks like:
__INIT__.PY (EXCERPT)
# import language-specific data
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
# create Defaults class in the module scope (necessary for pickling!)
class XxxxxDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'xx' # language ISO code
    # optional: replace flags with custom functions, e.g. like_num()
    lex_attr_getters.update(LEX_ATTRS)
    # merge base exceptions and custom tokenizer exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
# create actual Language class
class Xxxxx(Language):
    lang = 'xx' # language ISO code
    Defaults = XxxxxDefaults # override defaults
# set default export – this allows the language class to be lazy-loaded
__all__ = ['Xxxxx']
WHY LAZY-LOADING?
Some languages contain large volumes of custom data, like lemmatizer lookup tables, or complex regular expressions that are expensive to compute. As of spaCy v2.0, Language classes are not imported on initialisation and are only loaded when you import them directly, or load a model that requires a language to be loaded. To lazy-load languages in your application, you can use the util.get_lang_class() helper function with the two-letter language code as its argument.
Stop words
A "stop list" is a classic trick from the early days of information retrieval, when search was largely about keyword presence and absence. It is still sometimes useful today to filter out common words from a bag-of-words model. To improve readability, STOP_WORDS are separated by spaces and newlines, and added as a multiline string.
WHAT DOES SPACY CONSIDER A STOP WORD?
There's no particularly principled logic behind what words should be added to the stop list. Make a list that you think might be useful to people and is likely to be unsurprising. As a rule of thumb, words that are very rare are unlikely to be useful stop words.
Translator's note: for Chinese, consider starting from the stop word lists published by Fudan University or the Harbin Institute of Technology, which are fairly mature. What matters most is how the list is defined and used, as in this example:
EXAMPLE
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by
""".split())
Translator's note: the words inside the triple quotes are the stop words. For Chinese, put the Chinese stop word list there instead.
IMPORTANT NOTE
When adding stop words from an online source, always include the link in a comment. Make sure to proofread and double-check the words carefully. A lot of the lists available online have been passed around for years and often contain mistakes, like Unicode errors or random words that were once added for a specific use case, but don't actually qualify.
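Part of the proofreading advice above can be automated. The following is a minimal sketch and not part of spaCy: the helper name check_stop_words and the three checks it runs (duplicates, the Unicode replacement character, mixed case) are assumptions chosen for illustration, and it does not replace reading the list by hand.

```python
from collections import Counter


def check_stop_words(raw):
    """Return (word, problem) pairs for a whitespace-separated stop list."""
    counts = Counter(raw.split())
    problems = []
    for word in sorted(counts):
        if counts[word] > 1:
            problems.append((word, "duplicate"))
        if "\ufffd" in word:
            # U+FFFD usually means the source file was decoded incorrectly
            problems.append((word, "unicode replacement character"))
        if word != word.lower():
            problems.append((word, "not lowercase"))
    return problems


# deliberately flawed sample list
raw_list = """
a about above above The caf\ufffd
"""
issues = check_stop_words(raw_list)
for word, problem in issues:
    print(word, problem)
```

Running a check like this before committing a list copied from the web catches the mechanical errors, which leaves the manual review free to focus on whether each word really qualifies as a stop word.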
Tokenizer exceptions
spaCy's tokenization algorithm lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule is applied, giving the defined sequence of tokens. You can also attach attributes to the subtokens covered by your special case, such as the subtokens' LEMMA or TAG.
IMPORTANT NOTE
If an exception consists of more than one token, the ORTH values combined always need to match the original string. The way the original string is split up can be pretty arbitrary sometimes, for example "gonna" is split into "gon" (lemma "go") and "na" (lemma "to"). Because of how the tokenizer works, it's currently not possible to split single-letter strings into multiple tokens.
Translator's note: Chinese does not appear to have contractions of this kind.
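The rule that the combined ORTH values must reproduce the original string is easy to check mechanically. Here is a small sketch under stated assumptions: ORTH and LEMMA are plain strings standing in for spaCy's attribute symbols, and find_orth_mismatches is a hypothetical helper, not a spaCy function.

```python
ORTH = "ORTH"    # stand-in for spaCy's ORTH symbol
LEMMA = "LEMMA"  # stand-in for spaCy's LEMMA symbol


def find_orth_mismatches(exceptions):
    """Return the keys whose subtoken ORTH values do not join back to the key."""
    mismatches = []
    for string, subtokens in exceptions.items():
        joined = "".join(token[ORTH] for token in subtokens)
        if joined != string:
            mismatches.append(string)
    return mismatches


exceptions = {
    "gonna": [{ORTH: "gon", LEMMA: "go"}, {ORTH: "na", LEMMA: "to"}],
    "cannot": [{ORTH: "can"}, {ORTH: "not"}],
    # wrong on purpose: "do" + "not" does not spell "don't"
    "don't": [{ORTH: "do"}, {ORTH: "not"}],
}
bad = find_orth_mismatches(exceptions)
print(bad)
```

The faulty "don't" entry would be written correctly as "do" plus "n't", with the lemma or norm "not" attached as an attribute so that the surface text is left unchanged.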
Unambiguous abbreviations, like month names or locations in English, should be added to exceptions with a lemma assigned, for example {ORTH: "Jan.", LEMMA: "January"}. Since the exceptions are added in Python, you can use custom logic to generate them more efficiently and make your data less verbose. How you do this ultimately depends on the language. Translator's note: a Chinese implementation would map such a form to 一月, or could simply ignore cases like "Jan.". Here's an example of how exceptions for time formats like "1a.m." and "1am" are generated in the English tokenizer_exceptions.py:
from ...symbols import ORTH, LEMMA

# use short, internal variable for readability
_exc = {}
for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        # always keep an eye on string interpolation!
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]
# only declare this at the bottom
TOKENIZER_EXCEPTIONS = _exc
GENERATING TOKENIZER EXCEPTIONS
Keep in mind that generating exceptions only makes sense if there's a clearly defined and finite number of them, like common contractions in English. This is not always the case. In Spanish, for instance, infinitive or imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In cases like this, spaCy shouldn't be generating exceptions for all verbs. Instead, this will be handled at a later stage during lemmatization.
When adding the tokenizer exceptions to the Defaults, you can use the update_exc() helper function to merge them with the global base exceptions (including one-letter abbreviations and emoticons). The function performs a basic check to make sure exceptions are provided in the correct format. It can take any number of exceptions dicts as its arguments, and will update and overwrite the exceptions in this order. For example, if your language's tokenizer exceptions include a custom tokenization pattern for "a.", it will overwrite the base exceptions with the language's custom one.
EXAMPLE
from ...symbols import ORTH, LEMMA
from ...util import update_exc

BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
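To make the merge order concrete, here is a simplified stand-in for the behaviour described above. It is a sketch, not spaCy's actual update_exc() implementation: merge_exceptions is a hypothetical name, ORTH and LEMMA are plain strings instead of spaCy symbols, and the format check is reduced to the ORTH-concatenation rule.

```python
ORTH = "ORTH"    # stand-in for spaCy's ORTH symbol
LEMMA = "LEMMA"  # stand-in for spaCy's LEMMA symbol


def merge_exceptions(base, *additions):
    """Merge exception dicts left to right; later dicts overwrite earlier ones."""
    merged = dict(base)  # copy, so the shared base data is not mutated
    for addition in additions:
        for string, subtokens in addition.items():
            joined = "".join(token[ORTH] for token in subtokens)
            if joined != string:
                raise ValueError("ORTH values %r do not match %r" % (joined, string))
        merged.update(addition)
    return merged


base = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
custom = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
merged = merge_exceptions(base, custom)
print(merged)
```

Copying the base dict first matters in practice: the base exceptions are shared by every language, so a merge that mutated them in place would leak one language's overrides into all the others.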
ABOUT SPACY'S CUSTOM PRONOUN LEMMA
Unlike verbs and common nouns, there's no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should we normalize person as well, giving "it", or maybe "he"? spaCy's solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.
Norm exceptions
In addition to ORTH or LEMMA, tokenizer exceptions can also set a NORM attribute. This is useful to specify a normalised version of the token, for example, the norm of "n't" is "not". By default, a token's norm equals its lowercase text. If the lowercase spelling of a word exists, norms should always be in lowercase.
Translator's note: the only upper/lower case distinction in Chinese appears to be in numerals, and it is unclear whether that should be treated as a norm.
NORMS VS. LEMMAS
doc = nlp(u"I'm gonna realise")
norms = [token.norm_ for token in doc]
lemmas = [token.lemma_ for token in doc]
assert norms == ['i', 'am', 'going', 'to', 'realize']
assert lemmas == ['i', 'be', 'go', 'to', 'realise']
spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words, for example "realize" and "realise", or "thx" and "thanks". Translator's note: this is where alphabetic and logographic scripts differ. A Chinese analogue might be 謝了, 謝謝了, 謝謝您了 and 太謝謝您了, and it is an open question whether Chinese needs this.
Similarly, spaCy also includes global base norms (https://github.com/explosion/spaCy/blob/master/spacy/lang/norm_exceptions.py) for normalising different styles of quotation marks and currency symbols. Even though $ and € are very different, spaCy normalises them both to $. This way, they'll always be seen as similar, no matter how common they were in the training data.
Norm exceptions can be provided as a simple dictionary. For more examples, see the English norm_exceptions.py.
EXAMPLE
NORM_EXCEPTIONS = {
    "cos": "because",
    "fav": "favorite",
    "accessorise": "accessorize",
    "accessorised": "accessorized"
}
To add the custom norm exceptions lookup table, you can use the add_lookups() helper function. It takes the default attribute getter function as its first argument, plus a variable list of dictionaries. If a string's norm is found in one of the dictionaries, that value is used. Otherwise, the default function is called and the token is assigned its default norm.
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                     NORM_EXCEPTIONS, BASE_NORMS)
The order of the dictionaries is also the lookup order, so if your language's norm exceptions overwrite any of the global exceptions, they should be added first. Also note that the tokenizer exceptions will always have priority over the attribute getters.
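The lookup order can be illustrated with a small stand-in. This is a sketch of the behaviour described above, not spaCy's source: add_lookups_sketch is a hypothetical name, str.lower plays the role of the default norm getter, and the overlapping "thx" entry in the base table is invented purely to show which table wins.

```python
def add_lookups_sketch(default_func, *lookups):
    """Return a getter that consults each lookup dict in order, then the default."""
    def get_attr(string):
        for lookup in lookups:
            if string in lookup:
                return lookup[string]
        return default_func(string)
    return get_attr


language_norms = {"cos": "because", "thx": "thanks"}
# the "thx" entry here is made up so that the two tables overlap
base_norms = {"\u20ac": "$", "thx": "thx"}

# language-specific table first, so it takes precedence over the base table
get_norm = add_lookups_sketch(str.lower, language_norms, base_norms)
results = [get_norm(s) for s in ["thx", "\u20ac", "cos", "Hello"]]
print(results)
```

Swapping the two tables in the call would make the base entry for "thx" win, which is why language-specific exceptions that are meant to override global ones go first.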
Lexical attributes
spaCy provides a range of Token attributes that return useful information on that token, for example, whether it's uppercase or lowercase, a left or right punctuation mark, or whether it resembles a number or email address. Most of these functions, like is_lower or like_url, should be language-independent. Others, like like_num (which includes both digits and number words), require some customisation.
BEST PRACTICES
English number words are pretty simple, because even large numbers consist of individual tokens, and we can get away with splitting and matching strings against a list. In other languages, like German, "two hundred and thirty-four" is one word, and thus one token. Here, it's best to match a string against a list of number word fragments (instead of a technically almost infinite list of possible number words). Translator's note: the same principle should apply to Chinese.
Here is the English lexical attributes definition as an example:
LEX_ATTRS.PY
from ...attrs import LIKE_NUM

_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
              'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
              'gajillion', 'bazillion']

def like_num(text):
    # strip thousands separators and decimal points before the digit check
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # simple fractions like "3/4"
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False

LEX_ATTRS = {
    LIKE_NUM: like_num
}
By updating the default lexical attributes with a custom LEX_ATTRS dictionary in the language's defaults via lex_attr_getters.update(LEX_ATTRS), only the new custom functions are overwritten.
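For languages that write compound numbers as a single word, the fragment-matching idea from the best practices note can be sketched as follows. This is an illustration under assumptions, not spaCy's German implementation: the fragment list is deliberately incomplete (forms like "sechzehn" are not covered), the function name like_num_de_sketch is hypothetical, and the connector "und" is only accepted between two number fragments.

```python
_NUM_FRAGMENTS = ("null", "ein", "eins", "zwei", "drei", "vier", "fünf",
                  "sechs", "sieben", "acht", "neun", "zehn", "elf", "zwölf",
                  "zwanzig", "dreißig", "vierzig", "fünfzig", "sechzig",
                  "siebzig", "achtzig", "neunzig", "hundert", "tausend")
_CONNECTOR = "und"


def like_num_de_sketch(text):
    """True if text is digits or can be split entirely into number fragments."""
    text = text.lower()
    if text.isdigit():
        return True
    n = len(text)
    if n == 0:
        return False
    # reachable[i]: text[:i] is a sequence of fragments ending in a number fragment
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        starts = [i]
        # a connector may only follow a fragment, never open the word
        if i > 0 and text.startswith(_CONNECTOR, i):
            starts.append(i + len(_CONNECTOR))
        for start in starts:
            for fragment in _NUM_FRAGMENTS:
                if text.startswith(fragment, start):
                    reachable[start + len(fragment)] = True
    return reachable[n]


print(like_num_de_sketch("zweihundertvierunddreißig"))
print(like_num_de_sketch("hundertwasser"))
```

The position-by-position scan is used instead of simply stripping "und" from the string, because "hundert" itself contains "und" and naive removal would corrupt it.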
Syntax iterators
Syntax iterators are functions that compute views of a Doc object based on its syntax. At the moment, this data is only used for extracting noun chunks, which are available as the Doc.noun_chunks property. Because base noun phrases work differently across languages, the rules to compute them are part of the individual language's data. If a language does not include a noun chunks iterator, the property won't be available. For examples, see the existing syntax iterators:
NOUN CHUNKS EXAMPLE
doc = nlp(u'A phrase with another phrase occurs.')
chunks = list(doc.noun_chunks)
assert chunks[0].text == "A phrase"
assert chunks[1].text == "another phrase"

Lemmatizer
As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually the quickest and easiest way to get started. The data is stored in a dictionary mapping a string to its lemma. To determine a token's lemma, spaCy simply looks it up in the table. Here's an example from the Spanish language data:
LANG/ES/LEMMATIZER.PY (EXCERPT)
LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
    "ababan": "abar",
    "ababanes": "ababán",
    "ababas": "abar",
    "ababoles": "ababol",
    "ababábites": "ababábite"
}
To provide a lookup lemmatizer for your language, import the lookup table and add it to the Language class as lemma_lookup:
lemma_lookup = dict(LOOKUP)
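The lookup itself amounts to a dictionary access. The sketch below shows the idea in plain Python; the function name lookup_lemma is hypothetical, and returning the unchanged string for words missing from the table is an assumption about the fallback, not something stated above.

```python
LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababoles": "ababol",
}


def lookup_lemma(string, table=LOOKUP):
    """Return the lemma from the table, or the string itself if it is not listed."""
    return table.get(string, string)


lemmas = [lookup_lemma(word) for word in ["ababa", "ababoles", "casa"]]
print(lemmas)
```

A table like this is fast and easy to build, but it cannot disambiguate forms whose lemma depends on the part of speech, which is where rule-based lemmatization becomes useful.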
Tag map
Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it's useful to have custom tagging schemes, it's also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.
The keys of the tag map should be strings in your tag set. The values should be a dictionary. The dictionary must have an entry POS whose value is one of the Universal Dependencies tags. Optionally, you can also include morphological features or other token attributes in the tag map. This allows you to do simple rule-based morphological analysis.
EXAMPLE
from ..symbols import POS, NOUN, VERB, DET

TAG_MAP = {
    "NNS": {POS: NOUN, "Number": "plur"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT": {POS: DET}
}
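Since every tag map entry must carry a POS value from the Universal Dependencies set, a quick consistency check can catch mistakes before training. This is a sketch under assumptions: POS and the tag names are plain strings standing in for spaCy's symbols, and find_invalid_tags is a hypothetical helper rather than a spaCy function.

```python
POS = "POS"  # stand-in for spaCy's POS symbol

# the 17 universal part-of-speech tags
UNIVERSAL_POS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def find_invalid_tags(tag_map):
    """Return the tags whose entry lacks a POS value from the universal set."""
    invalid = []
    for tag, attrs in tag_map.items():
        if attrs.get(POS) not in UNIVERSAL_POS:
            invalid.append(tag)
    return sorted(invalid)


tag_map = {
    "NNS": {POS: "NOUN", "Number": "plur"},
    "DT": {POS: "DET"},
    "XYZ": {"Number": "sing"},  # wrong on purpose: no POS entry
    "VBG": {POS: "GERUND"},     # wrong on purpose: not a universal tag
}
invalid = find_invalid_tags(tag_map)
print(invalid)
```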
Morph rules
The morphology rules let you set token attributes such as lemmas, keyed by the extended part-of-speech tag and token text. The morphological features and their possible values are language-specific and based on the Universal Dependencies scheme.
EXAMPLE
from ..symbols import LEMMA

MORPH_RULES = {
    "VBZ": {
        "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
        "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
        "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
    }
}
In the example above, "am" is assigned the lemma "be" and the features VerbForm=Fin, Person=One, Tense=Pres and Mood=Ind.
IMPORTANT NOTE
The morphological attributes are currently not all used by spaCy. Full integration is still being developed. In the meantime, it can still be useful to add them, especially if the language you're adding includes important distinctions and special cases. This ensures that as soon as full support is introduced, your language will be able to assign all possible attributes.
Testing the language
Before using the new language or submitting a pull request to spaCy, you should make sure it works as expected. This is especially important if you've added custom regular expressions for token matching or punctuation, since you don't want to be causing regressions.
SPACY'S TEST SUITE
spaCy uses the pytest framework for testing. For more details on how the tests are structured and best practices for writing your own tests, see our tests documentation (https://github.com/explosion/spaCy/blob/master/spacy/tests).
The easiest way to test your new tokenizer is to run the language-independent "tokenizer sanity" tests located in tests/tokenizer. This will test for basic behaviours like punctuation splitting, URL matching and correct handling of whitespace. Translator's note: it is unclear how relevant whitespace handling is for Chinese. In the conftest.py, add the new language ID to the list of _languages:
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'it', 'nb',
              'nl', 'pl', 'pt', 'sv', 'xx'] # new language here
GLOBAL TOKENIZER TEST EXAMPLE
# use fixture by adding it as an argument
def test_with_all_languages(tokenizer):
    # will be performed on ALL language tokenizers
    tokens = tokenizer(u'Some text here.')
The language will now be included in the tokenizer test fixture, which is used by the basic tokenizer tests. If you want to add your own tests that should be run over all languages, you can use this fixture as an argument of your test function.
Writing language-specific tests
It's recommended to always add at least some tests with examples specific to the language. Language tests should be located in tests/lang in a directory named after the language ID. You'll also need to create a fixture for your tokenizer in the conftest.py. Always use the get_lang_class() helper function within the fixture, instead of importing the class at the top of the file. This will load the language data only when it's needed. (Otherwise, all data would be loaded every time you run a test.)
@pytest.fixture
def en_tokenizer():
    return util.get_lang_class('en').Defaults.create_tokenizer()
When adding test cases, always parametrize them. This will make it easier for others to add more test cases without having to modify the test itself. You can also add parameter tuples, for example, a test sentence and its expected length, or a list of expected tokens. Here's an example of an English tokenizer test for combinations of punctuation and abbreviations:
EXAMPLE TEST
@pytest.mark.parametrize('text,length', [
    ("The U.S. Army likes Shock and Awe.", 8),
    ("U.N. regulations are not a part of their concern.", 10),
    ("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length
Training
spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features, and makes it easy to use information from unlabelled text samples in your models. Specifically, you'll usually want to collect word frequencies, and train word vectors. To generate the word frequencies from a large, raw corpus, you can use the word_freqs.py script from the spaCy developer resources.
Note that your corpus should not be preprocessed (i.e. you need punctuation, for example). The word frequencies should be generated as a tab-separated file with three columns:
1. The number of times the word occurred in your language sample.
2. The number of distinct documents the word occurred in.
3. The word itself.
ES_WORD_FREQS.TXT
6361109	111	Aunque
23598543	111	aunque
10097056	111	claro
193454	111	aro
7711123	111	viene
12812323	111	mal
23414636	111	momento
2014580	111	felicidad
233865	111	repleto
15527	111	eto
235565	111	deliciosos
17259079	111	buena
71155	111	Anímate
37705	111	anímate
33155	111	cuéntanos
2389171	111	cuál
961576	111	típico
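The three-column format can be produced with a couple of counters. The following is a minimal sketch and not the word_freqs.py script: the function names are hypothetical, and str.split is only a placeholder tokenizer so that the example runs without spaCy. For real counts, pass in your language's spaCy tokenizer instead.

```python
from collections import Counter


def count_word_freqs(docs, tokenize):
    """Count total occurrences and document occurrences for every token."""
    word_counts = Counter()
    doc_counts = Counter()
    for text in docs:
        tokens = tokenize(text)
        word_counts.update(tokens)
        # a set, so each word counts at most once per document
        doc_counts.update(set(tokens))
    return word_counts, doc_counts


def format_freqs(word_counts, doc_counts):
    """Format rows as: frequency <TAB> document count <TAB> word."""
    return ["%d\t%d\t%s" % (freq, doc_counts[word], word)
            for word, freq in word_counts.most_common()]


docs = ["Aunque viene mal", "aunque claro , claro"]
word_counts, doc_counts = count_word_freqs(docs, str.split)
lines = format_freqs(word_counts, doc_counts)
print("\n".join(lines))
```

Note that "Aunque" and "aunque" are counted separately, as in the sample file above, and that punctuation is kept as a token because the corpus should not be preprocessed.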
BROWN CLUSTERS
Additionally, you can use distributional similarity features provided by the Brown clustering algorithm. You should train a model with between 500 and 1000 clusters. A minimum frequency threshold of 10 usually works well.
You should make sure you use the spaCy tokenizer for your language to segment the text for your word frequencies. This will ensure that the frequencies refer to the same segmentation standards you'll be using at run-time. For instance, spaCy's English tokenizer segments "can't" into two tokens. If we segmented the text by whitespace to produce the frequency counts, we'd have incorrect frequency counts for the tokens "ca" and "n't".
Training the word vectors
Word2vec and related algorithms let you train useful word similarity models from unlabelled text. This is a key part of using deep learning for NLP with limited labelled data. The vectors are also useful by themselves, as they power the .similarity() methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match. You can use our word vectors training script (https://github.com/explosion/spacy-dev-resources/blob/master/training/word_vectors.py), which pre-processes the text with your language-specific tokenizer and trains the model using Gensim (https://radimrehurek.com/gensim/). The vectors.bin file should consist of one word and vector per line.
Training the tagger and parser
You can now train the model using a corpus for your language annotated with Universal Dependencies. If your corpus uses the CoNLL-U format, i.e. files with the extension .conllu, you can use the convert command to convert it to spaCy's JSON format for training. Once you have your UD corpus transformed into JSON, you can train your model using spaCy's train command.
For more details and examples of how to train the tagger and dependency parser, see the usage guide on training.