Training NER[0] – Training custom Word Vectors from 10-K/Q filings

NER with word vectors.

Word vectors are particularly useful for terms which aren’t well represented in your labelled training data. If you’re doing named entity recognition, there will always be lots of names that you don’t have examples of. For instance, imagine your training data happens to contain some examples of the term “Microsoft”, but it doesn’t contain any examples of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company. However, it’ll see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.
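As an aside, that inference can be seen directly in the trained vectors. Here is a minimal sketch with Gensim, assuming the custom vectors trained later in this post have already been saved to the path used below:

from gensim.models import KeyedVectors

# Load the keyed vectors saved at the end of the training script below
wv = KeyedVectors.load_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)

# "Symantec" may never be labelled in the NER data, but its nearest neighbours
# in vector space should be other company-like terms, which is exactly what the
# entity recogniser exploits (tokens keep their original casing here, since the
# preprocessing below does not lowercase).
print(wv.most_similar('Symantec', topn=5))
print(wv.similarity('Symantec', 'Microsoft'))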


INPUT data: all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied, except for lemmatization and n-grams.



As my RAM was limited to 16 GB, I found data streaming useful in my case. (Helpful introductions on training word vectors using Gensim: , )

import os
import sys
import re
from gensim.models import Word2Vec

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)  # display the progress of training word vectors

dirname = 'D:/10_k' # directory where 10-K/Q filings are downloaded
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname            
    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            for qname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])):
                #print(os.path.join(dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:]))
                for fname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname)):
                    if fname.endswith(".txt"):
                        for line in open(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname, fname), 'r', encoding = 'utf-8'):
                            yield line.split()
sentences_ = MySentences(dirname)
model = Word2Vec(min_count = 10, size = 200)
model.build_vocab(sentences_)  # to test multiple parameters later, it is more convenient to first build the vocabulary and save it
model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

sentences_ = MySentences(dirname)
model.window = 5
model.workers = 5
model.sg = 0  # use CBOW, which allows faster training
model.train(sentences_, total_examples = model.corpus_count, epochs = 10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary = False) # save keyed vector from the model

Now, finally, build a spaCy model using the keyed vectors constructed above.

Type into a CMD window:
python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed_vectors_location>
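To sanity-check the import, load the new model and inspect its vector table; a minimal sketch (the model name is the placeholder from the command above):

import spacy

nlp = spacy.load('your_spacy_model_name')      # path to the init-model output
print(nlp.vocab.vectors.shape)                 # e.g. (n_words, 200)
print(nlp.vocab['Symantec'].has_vector)        # True if the term was in the corpus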



[NK Provocation Index 0] Intro

  • North Korea’s military provocations and nuclear threats are likely to hamper Korean economic growth
  • Possible channel: increased risk leads investment and saving to decrease
  • X (N.K. provocation) -> X' (investment, saving/consumption rate) -> Y (economic growth)
  • Identification 1: measuring the degree of N.K. provocation by the number of articles belonging to the ‘Provocation/Nuclear threats’ topic (LDA topic model; see the sketch after this list)
  • Identification 2: causality? A VAR model may be helpful
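A minimal sketch of Identification 1 with Gensim’s LDA, using a toy corpus in place of the real news archive; which topic id corresponds to ‘Provocation/Nuclear threats’ is an assumption here and would be verified by inspecting lda.print_topics():

from gensim import corpora
from gensim.models import LdaModel

# Toy tokenised articles standing in for the real news corpus
articles = [
    ['missile', 'launch', 'nuclear', 'test', 'provocation'],
    ['market', 'investment', 'saving', 'growth', 'rate'],
    ['nuclear', 'threat', 'missile', 'border', 'provocation'],
]
dictionary = corpora.Dictionary(articles)
corpus = [dictionary.doc2bow(doc) for doc in articles]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)

provocation_topic = 0  # assumption: check lda.print_topics() to find the right id
n_provocation = sum(
    1 for bow in corpus
    if max(lda.get_document_topics(bow), key=lambda t: t[1])[0] == provocation_topic
)
print('articles in the provocation topic:', n_provocation)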

[NKPR 0] Building Caffe on Windows (for an Anaconda environment)

I recall that installing Caffe on Windows was one of the hardest steps in this project.


  • However, some (small) problems arise depending on the environment one has. For me, installing VS 2015 raised an error (“a setup package is either missing or damaged”), and no reliable fix for this problem exists on the web. (I spent two days repeatedly uninstalling the whole VS 2015 and reinstalling it.)


  • In addition, building PyCaffe requires Python 3.5, while I had been using Python 3.6 (Anaconda) for my previous work. Since I did not want to change my working environment, I installed PyCaffe using an Anaconda environment set to Python 3.5. A few settings should be modified before installing.


  1. Create a new environment for Python 3.5 (e.g. conda create -n py35 python=3.5.0 anaconda)
  2. Before using cmd, activate the Anaconda environment (e.g. conda activate py35)
  3. When modifying caffe\caffe\scripts\build_win.cmd according to the video above, set the CONDA_ROOT variable to the location of the Python 3.5 environment’s conda
  4. Now follow the video!
  5. Done!

[USVC] Drawing Supply Chain 2 – US Listed Domestic Firms


<Histogram of firm out-nodes, clockwise from top left: 2000, 2005, 2010, 2015>


  • The total sample number decreased from 1998 to 2016: {1998: 9062, 1999: 8906, 2000: 8512, 2001: 8167, 2002: 7692, 2003: 7447, 2004: 7498, 2005: 7015, 2006: 6690, 2007: 6676, 2008: 7448, 2009: 7792, 2010: 7427, 2011: 7223, 2012: 6943, 2013: 6783, 2014: 6640, 2015: 6231, 2016: 5850}
  • The number of edges (links between firms), however, decreased even further over the same period
  • The average shortest path length over all possible linkages: {2000: 1.638, 2005: 1.531, 2010: 1.284, 2015: 1.322} (see the sketch after this list)
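The post does not show the computation, but once the supplier-customer pairs are extracted, these statistics follow directly from networkx; a minimal sketch with placeholder edges:

import networkx as nx

# Placeholder edges; the real input is the extracted supplier -> customer pairs
edges = [('Supplier A', 'Walmart Inc'), ('Supplier B', 'Walmart Inc'),
         ('Supplier B', 'AT&T Corp'), ('Walmart Inc', 'AT&T Corp')]
G = nx.DiGraph(edges)
print('nodes:', G.number_of_nodes(), 'edges:', G.number_of_edges())

# Average shortest path length over all reachable ordered pairs; the full
# supply-chain graph is not strongly connected, so averaging only reachable
# pairs avoids the exception nx.average_shortest_path_length would raise.
lengths = [d for _, dists in nx.all_pairs_shortest_path_length(G)
           for d in dists.values() if d > 0]
print('average shortest path length:', sum(lengths) / len(lengths))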


Possible explanations

  • The (trained natural-language) model may be over-fitted to the early 2000s
  • Supply chains among U.S. firms might actually be shrinking due to economic uncertainty


10 Firms with the most in-nodes

(year 2000) : [['Walmart Inc', 0.027717626678215677], ['Lucent Technologies Inc', 0.026634906886097875], ['Hewlett Packard Enterprise Co', 0.023170203551320916], ['AT&T Corp', 0.018189692507579038], ['Ford Inc', 0.0173235166738848], ['Cisco Systems Inc', 0.01602425292334344], ['Siemens AG', 0.013858813339107838], ['Boeing Corp', 0.013642269380684278], ['Intel Corp', 0.012126461671719359], ['Target Inc', 0.01169337375487224]]

(year 2015) : [['Walmart Inc', 0.020942408376963352], ['AT&T Corp', 0.010732984293193719], ['Ford Inc', 0.010209424083769635], ['Shell Oil Co', 0.009947643979057593], ['Target Inc', 0.00968586387434555], ['Home Depot Inc.', 0.009162303664921467], ['Cisco Inc', 0.008638743455497384], ['Microsoft Corp', 0.008638743455497384]]
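The scores attached to each firm look like networkx in-degree centrality values; that is an assumption, but a minimal sketch of producing such a ranking would be:

import networkx as nx

# Placeholder supplier -> customer edges; an edge into a firm means a supplier
# named it as a major customer, so in-degree counts how often it is named
G = nx.DiGraph([('Supplier A', 'Walmart Inc'), ('Supplier B', 'Walmart Inc'),
                ('Supplier A', 'AT&T Corp')])

centrality = nx.in_degree_centrality(G)  # in-degree / (number of nodes - 1)
top10 = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)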




I found that customer information is stated in two forms.

One is sentence form and the other is table form.

Hence, I started by dividing the 10-K text into two categories, text and tables (using HTML tags).

The methods to deal with them, however, are similar: TEXT CLASSIFICATION.

*GOAL 1 (sentence form):

Classifying sentences by whether they are relevant to the customer information or not.

Example Sentences:

Net sales to the Company’s three major customers, Staples, Inc., Office Max, and United Stationers, Inc., represented approximately 43% in 2004, 46% in 2003 and 46% in 2002.

For fiscal 2003, Fujitsu accounted for approximately 31 percent of our consolidated accounts receivable and approximately 13 percent of our consolidated gross sales.

In 2004, Matyep in Mexico represented 11.0% of our consolidated revenues and Burlington Resources Inc. represented 10.1%.

Fleetwood was the Company’s largest customer in 2004, representing approximately 31% of total sales.

I hoped that there would be some rules or sentence structures that could cover all the customer information in 10-Ks. I tried manually finding those rules and ended up with 24 kinds of sentences. Although they help me find every sentence containing customer information listed in the Compustat data (used as a reference point throughout my research), some of the sentences filtered by those 24 rules have nothing to do with revenue information.

To get rid of those irrelevant sentences, I adopted machine learning techniques.
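The post does not name the classifier, so as one minimal baseline (an assumption, not the original setup), a TF-IDF bag-of-words model with logistic regression can filter such sentences; the tiny training set here is illustrative only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample standing in for the hand-annotated gold standard
sentences = [
    "Fujitsu accounted for approximately 31 percent of our consolidated gross sales.",
    "Fleetwood was the Company's largest customer, representing 31% of total sales.",
    "The Company leases office space in Austin, Texas.",
    "Our revolving credit facility matures in 2006.",
]
labels = [1, 1, 0, 0]  # 1 = sentence carries major-customer revenue information

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["Net sales to our three major customers represented 43% in 2004."]))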

*Annotation (GOLD STANDARD):

[USVC 1] Scraping 10-K disclosure data from SEC

  • I think the site below shows a more precise way to obtain the 10-K data.



  • I did this early on, when I had just started learning Python, so I simply took the CIK number set I had, plugged each CIK number into the 10-K request in turn, and downloaded the filings.
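A minimal sketch of that brute-force loop; the EDGAR browse URL pattern, the User-Agent header, and the CIK list are assumptions for illustration, not the original script:

import requests

CIKS = ['0000320193', '0000789019']  # placeholder CIK number set

for cik in CIKS:
    # EDGAR company-browse endpoint, filtered to 10-K filings
    url = ('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany'
           f'&CIK={cik}&type=10-K&dateb=&owner=include&count=40')
    resp = requests.get(url, headers={'User-Agent': 'research contact@example.com'})
    resp.raise_for_status()
    # the returned index page would then be parsed for filing document links
    print(cik, len(resp.text))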