NER with word vectors.
Word vectors are particularly useful for terms which aren’t well represented in your labelled training data. If you’re doing named entity recognition, for instance, there will always be lots of names that you don’t have examples of. Imagine your training data happens to contain some examples of the term “Microsoft”, but no examples of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company, but it will see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.
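The same effect is easy to see with any vectors-equipped spaCy model. Here is a minimal sketch using the stock en_core_web_md vectors rather than the custom ones trained below (it assumes that package is installed):

import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
microsoft = nlp('Microsoft')
symantec = nlp('Symantec')
# High cosine similarity: both terms live in the "company" region of vector space,
# even if only one of them ever appears in the labelled training data
print(microsoft.similarity(symantec))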
Input data: all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied, but no lemmatization or n-grams.
As my RAM was limited to 16 GB, I found the data-streaming technique especially useful in my case. (Helpful introductions to training word vectors with Gensim: https://rare-technologies.com/word2vec-tutorial/ , https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/)
import os
import logging
from gensim.models import Word2Vec

# Display the progress of training word vectors
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

dirname = 'D:/10_k'  # directory where 10-K/Q filings are downloaded

class MySentences(object):
    # Memory-friendly iterator: streams one tokenized line at a time
    # instead of loading the whole corpus into RAM
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            for qname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])):
                for fname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname)):
                    if fname.endswith(".txt"):
                        for line in open(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname, fname), 'r', encoding='utf-8'):
                            yield line.split()

sentences_ = MySentences(dirname)
# sg=0 (CBOW) allows faster training; in Gensim 4+ the size parameter is named vector_size
model = Word2Vec(min_count=10, size=200, window=5, workers=5, sg=0)
model.build_vocab(sentences_)
# To test multiple parameter settings later, it is more convenient to build the vocabulary once and save it:
# model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
# model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

sentences_ = MySentences(dirname)  # recreate the stream, since it was consumed by build_vocab
model.train(sentences_, total_examples=model.corpus_count, epochs=10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)  # save the keyed vectors from the model
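Once training finishes, it is worth sanity-checking the vectors before handing them to spaCy. A minimal sketch, assuming the keyed-vector path above and that the query term survives the min_count=10 cutoff (the exact token casing depends on your preprocessing):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)
# Nearest neighbours should be other company-like terms from the filings
print(kv.most_similar('microsoft', topn=5))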
Now, finally, build the spaCy model using the keyed vectors constructed above.
Type the following into a CMD window:
python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed_vector_location>
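To confirm the vectors made it into the new model, you can load it and compare tokens. A minimal sketch, where 'your_spacy_model_name' stands in for whatever name you passed to init-model (the init-model command is spaCy v2's CLI):

import spacy

nlp = spacy.load('your_spacy_model_name')
print(nlp.vocab.vectors.shape)  # (number_of_vectors, 200), given size=200 above
doc = nlp('Microsoft Symantec')
# Similarity now comes from the custom 10-K/10-Q vectors
print(doc[0].similarity(doc[1]))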