Training NER[0] – Training custom Word Vectors from 10-K/Q filings

NER with word vectors.

Word vectors are particularly useful for terms that aren’t well represented in your labelled training data. In named entity recognition, for example, there will always be many names you have no examples of. Imagine your training data happens to contain some examples of the term “Microsoft” but none of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. The model still won’t see examples of “Symantec” labelled as a company, but it will see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.
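
To make the idea of "similar contexts, similar vectors" concrete, here is a minimal sketch of cosine similarity, the usual way closeness between word vectors is measured. The three-dimensional vectors are invented for illustration only; vectors trained on the filings would have 200 dimensions and different values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model
toy_vectors = {
    "microsoft": [0.9, 0.1, 0.3],
    "symantec":  [0.8, 0.2, 0.4],
    "quarterly": [0.1, 0.9, 0.2],
}

sim_company = cosine_similarity(toy_vectors["microsoft"], toy_vectors["symantec"])
sim_other = cosine_similarity(toy_vectors["microsoft"], toy_vectors["quarterly"])

# With these toy values the two company names are closer to each other
# than a company name is to an unrelated term.
print(sim_company > sim_other)
```

An entity recognizer that receives such vectors as features can treat an unseen name like a labelled one when their vectors are close.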


Input data: all 10-K and 10-Q filings available from the SEC, with basic preprocessing applied but without lemmatization or n-gram detection.



Because my RAM was limited to 16 GB, streaming the data from disk instead of loading it all at once was useful in my case.
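
One detail of streaming matters for Word2Vec: gensim passes over the corpus more than once (once to build the vocabulary, then once per training epoch), so the corpus has to be a restartable iterable rather than a one-shot generator. A minimal sketch of the difference, using a hypothetical in-memory list of lines as a stand-in for the filings on disk:

```python
class StreamedLines(object):
    """Restartable iterable: every `for` loop starts a fresh pass."""
    def __init__(self, lines):
        self.lines = lines  # stand-in for files on disk

    def __iter__(self):
        for line in self.lines:
            yield line.split()

def one_shot(lines):
    """Plain generator: can only be consumed once."""
    for line in lines:
        yield line.split()

# Hypothetical sample lines, for illustration only
raw = ["net revenue increased", "risk factors include competition"]

restartable = StreamedLines(raw)
first_pass = list(restartable)
second_pass = list(restartable)   # same content again

gen = one_shot(raw)
gen_first = list(gen)
gen_second = list(gen)            # empty: the generator is exhausted
```

This is why the corpus reader is written as a class with `__iter__` rather than as a bare generator function.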

import os
import sys
import re
from gensim.models import Word2Vec

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO) # display the progress of training word vectors

dirname = 'D:/10_k' # directory where 10-K/Q filings are downloaded
class MySentences(object):
    """Stream tokenized lines from disk so the corpus never has to fit in RAM."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            # layout: <dirname>/<f_dir>/EDGAR/10-X_C/<f_dir[-4:]>/<qname>/<fname>.txt
            base_dir = os.path.join(self.dirname, f_dir, 'EDGAR', '10-X_C', f_dir[-4:])
            for qname in os.listdir(base_dir):
                for fname in os.listdir(os.path.join(base_dir, qname)):
                    if fname.endswith(".txt"):
                        # the with-block closes each file once it has been read
                        with open(os.path.join(base_dir, qname, fname), 'r', encoding = 'utf-8') as f:
                            for line in f:
                                yield line.split()  # one whitespace-tokenized line per "sentence"
sentences_ = MySentences(dirname)
model = Word2Vec(min_count = 10, size = 200) # gensim >= 4.0 renamed `size` to `vector_size`
model.build_vocab(sentences_)
# to test multiple parameters later, it is much more convenient to build the vocabulary once and save it
model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

sentences_ = MySentences(dirname)
model.window = 5
model.workers = 5
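# the vector size (200) is already fixed by the constructor above
model.sg = 0 # assumed setting (CBOW, gensim's default); allows faster training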
model.train(sentences_, total_examples = model.corpus_count, epochs = 10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary = False) # save keyed vector from the model
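
With `binary = False`, `save_word2vec_format` writes a plain-text file: a header line holding the vocabulary size and the vector dimension, followed by one line per word with the word and its components separated by spaces. A minimal sketch of reading that layout by hand, which can be handy for checking the file before passing it to spaCy. The function name and the sample content are made up for illustration, and the sample is far smaller than the real file:

```python
import io

def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {word: [floats]} dict."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the file")
    return vectors

# Hypothetical two-word file with 3 dimensions (the real one has 200)
sample = io.StringIO("2 3\nrevenue 0.1 0.2 0.3\nliabilities -0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(sorted(vectors))
```

For the real file, replace the `io.StringIO` object with `open(path, 'r', encoding='utf-8')`.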

Finally, build the spaCy model from the keyed vectors saved in the previous step.

Type into CMD window:
python -m spacy init-model en your_spacy_model_name --vectors-loc keyed_vector_location



[NK Provocation Index][0] Intro

  • North Korea’s military provocations and nuclear threats are likely to hamper Korean economic growth
  • Possible channel: increased risk leads investment and saving to decrease
  • X (N.K. provocation) --> X' (investment, saving (consumption) rate) --> Y (economic growth)
  • Identification 1: Measure the degree of N.K. provocation by the number of articles belonging to the ‘Provocation/Nuclear threats’ topic (LDA topic model)
  • Identification 2: Causality? A VAR may be helpful
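
Identification 1 can be sketched as a simple count: once each article has a dominant topic from the LDA model, the index for a month is the number of articles whose dominant topic is the provocation topic. The article records, the topic id, and the function name below are all hypothetical, and the LDA step itself is assumed to have been run already.

```python
from collections import Counter

PROVOCATION_TOPIC = 3  # hypothetical id of the 'Provocation/Nuclear threats' topic

def monthly_provocation_index(articles, topic_id=PROVOCATION_TOPIC):
    """Count articles per 'YYYY-MM' whose dominant topic equals topic_id."""
    counts = Counter()
    for date, dominant_topic in articles:
        if dominant_topic == topic_id:
            counts[date[:7]] += 1  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return dict(counts)

# Made-up (date, dominant topic) pairs, for illustration only
articles = [
    ("2017-08-29", 3),
    ("2017-08-30", 1),
    ("2017-09-03", 3),
    ("2017-09-04", 3),
]
index = monthly_provocation_index(articles)
print(index)
```

The resulting monthly series is the kind of variable that could then enter a VAR alongside investment, saving, and growth.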