Training NER[0] – Training custom Word Vectors from 10-K/Q filings

NER with word vectors.

Word vectors are particularly useful for terms that aren’t well represented in your labelled training data. For instance, if you’re doing named entity recognition, there will always be lots of names you don’t have examples of. Imagine your training data happens to contain some examples of the term “Microsoft” but none of the term “Symantec”. In your raw text sample there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t have seen “Symantec” labelled as a company, but it will see that “Symantec” has a word vector close to those of other company terms, so it can make the inference.
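The intuition above rests on vector similarity: two words used in similar contexts end up with nearby vectors. A minimal sketch with made-up toy vectors (real word vectors have hundreds of dimensions, and these numbers are purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "word vectors" (made up for illustration only)
vectors = {
    "Microsoft": [0.9, 0.8, 0.1, 0.0],
    "Symantec":  [0.8, 0.9, 0.2, 0.1],
    "banana":    [0.0, 0.1, 0.9, 0.8],
}

# "Symantec" sits close to "Microsoft", far from "banana" --
# which is what lets the NER model generalize to unseen company names.
print(cosine(vectors["Microsoft"], vectors["Symantec"]))  # high (~0.99)
print(cosine(vectors["Microsoft"], vectors["banana"]))    # low  (~0.12)
```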


INPUT data : all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied (no lemmatization or n-grams).
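The exact preprocessing pipeline isn't spelled out above; a minimal sketch of what "basic preprocessing" might look like (lowercasing, stripping punctuation, whitespace tokenization, and deliberately no lemmatization or n-gram merging), with the cleaning rules being my own assumption:

```python
import re

def preprocess(text):
    """Minimal cleaning: lowercase, replace non-alphanumerics with spaces, tokenize.
    Intentionally skips lemmatization and n-gram merging, matching the setup above."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and stray symbols
    return text.split()

print(preprocess("Net sales to Staples, Inc. rose 43% in 2004."))
# -> ['net', 'sales', 'to', 'staples', 'inc', 'rose', '43', 'in', '2004']
```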



As my RAM was limited to 16 GB, I found the data-streaming technique especially useful in my case. (Helpful introductions on training word vectors using Gensim: )

import os
import logging
from gensim.models import Word2Vec

# Display the progress of training word vectors
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

dirname = 'D:/10_k' # directory where 10-K/Q filings are downloaded

class MySentences(object):
    """Stream tokenized lines from every filing so the corpus never has to fit in RAM."""
    def __init__(self, dirname):
        self.dirname = dirname
    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            for qname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])):
                for fname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname)):
                    if fname.endswith(".txt"):
                        for line in open(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname, fname), 'r', encoding='utf-8'):
                            yield line.split()

sentences_ = MySentences(dirname)
model = Word2Vec(min_count=10, size=200)
# To test multiple parameters later, it is more convenient to build the vocabulary once and save it
model.build_vocab(sentences_)
model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

sentences_ = MySentences(dirname)
model.window = 5
model.workers = 5
model.sg = 0 # CBOW (sg = 0) allows faster training
model.train(sentences_, total_examples=model.corpus_count, epochs=10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False) # save the keyed vectors from the model

Finally, build the spaCy model using the keyed vectors constructed above.

Type into a CMD window:

python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed-vector-location>




[USVC] Drawing Supply Chain 2 – US Listed Domestic Firms


<Histogram of firm out nodes, clockwise from left top: 2000, 2005, 2010, 2015>


  • The total sample number decreased from 1998 to 2016: {1998: 9062, 1999: 8906, 2000: 8512, 2001: 8167, 2002: 7692, 2003: 7447, 2004: 7498, 2005: 7015, 2006: 6690, 2007: 6676, 2008: 7448, 2009: 7792, 2010: 7427, 2011: 7223, 2012: 6943, 2013: 6783, 2014: 6640, 2015: 6231, 2016: 5850}
  • The number of edges (linked firms), however, decreased even further over the same period
  • The average shortest path length over all possible linkages: {2000: 1.638, 2005: 1.531, 2010: 1.284, 2015: 1.322}
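The average shortest path statistic above can be computed with a plain BFS over the directed supplier-to-customer edges. A sketch on a made-up toy graph, averaging over reachable ordered pairs (a common convention; whether the figures above use exactly this convention is my assumption):

```python
from collections import deque

def avg_shortest_path(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS from each node)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                       # standard breadth-first search
            u = q.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for v, d in dist.items():
            if v != src:
                total += d
                pairs += 1
    return total / pairs

# Toy supplier -> customer edges (made up)
adj = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
print(avg_shortest_path(adj))  # 8 / 6 = 1.333...
```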


Possible explanations

  • The (trained natural-language) model may be over-fitted to the early 2000s
  • Supply-chain links among U.S. firms might actually be decreasing due to economic uncertainty


10 Firms with the most in-nodes

(year 2000) : [['Walmart Inc', 0.027717626678215677], ['Lucent Technologies Inc', 0.026634906886097875], ['Hewlett Packard Enterprise Co', 0.023170203551320916], ['AT&T Corp', 0.018189692507579038], ['Ford Inc', 0.0173235166738848], ['Cisco Systems Inc', 0.01602425292334344], ['Siemens AG', 0.013858813339107838], ['Boeing Corp', 0.013642269380684278], ['Intel Corp', 0.012126461671719359], ['Target Inc', 0.01169337375487224]]

(year 2015) : [['Walmart Inc', 0.020942408376963352], ['AT&T Corp', 0.010732984293193719], ['Ford Inc', 0.010209424083769635], ['Shell Oil Co', 0.009947643979057593], ['Target Inc', 0.00968586387434555], ['Home Depot Inc.', 0.009162303664921467], ['Cisco Inc', 0.008638743455497384], ['Microsoft Corp', 0.008638743455497384]]
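The scores next to each firm are in-degree centralities: incoming edges divided by (n − 1), the normalization NetworkX's `in_degree_centrality` uses. A pure-Python check on a made-up toy graph:

```python
def in_degree_centrality(edges, nodes):
    """In-degree divided by (n - 1) -- the normalization networkx applies."""
    indeg = {node: 0 for node in nodes}
    for src, dst in edges:
        indeg[dst] += 1
    n = len(nodes)
    return {node: d / (n - 1) for node, d in indeg.items()}

# Toy supplier -> customer edges: three suppliers report Walmart as a major customer
nodes = ["Walmart", "S1", "S2", "S3"]
edges = [("S1", "Walmart"), ("S2", "Walmart"), ("S3", "Walmart")]
cent = in_degree_centrality(edges, nodes)
print(cent["Walmart"])  # 3 / 3 = 1.0
```

So Walmart's 0.0277 in the year-2000 list means roughly 2.8% of all other sample firms reported it as a major customer.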



[USVC] Drawing Supply Chain 1 – Small Sample

— Title : [Python; NetworkX] Supply Chain analysis
— Key word : networkx, Node, Edge, Centrality, Supply Chain, Value Chain


  • About 200 major firms listed in the Compustat data
  • The data set will soon encompass all firms with a CIK code
  • Customer information extracted from 10-k disclosure data


  • Drawn with the basic networkx drawing tool (nx.draw())
  • year : drawn for 2000, 2005, 2010, and 2015 in order
  • Size of node : in_degree_centrality
  • Color of node : out_degree_centrality


Sample Code :

(Reference :

import networkx as nx
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

def draw_G(G, year):
    """Draw the supply-chain graph: node size = in-degree centrality, node color = out-degree centrality."""
    oc = nx.out_degree_centrality(G)
    for key in oc.keys():
        oc[key] = oc[key] * 10  # rescale so color differences are visible
    nx.set_node_attributes(G, name='cent', values=oc)
    ic = nx.in_degree_centrality(G)
    nx.set_node_attributes(G, name='in', values=ic)
    node_size = [float(G.nodes[v]['in']) * 20000 + 1 for v in G]  # G.node[v] in networkx < 2.0
    node_color = [float(G.nodes[v]['cent']) for v in G]
    pos = nx.spring_layout(G, k=30, iterations=8)
    nodes = nx.draw_networkx_nodes(G, pos, node_size=node_size, node_color=node_color, alpha=0.5)
    edges = nx.draw_networkx_edges(G, pos, edge_color='black', arrows=True, width=0.3)
    nx.draw_networkx_labels(G, pos, font_size=5)
    plt.text(0, -1.2, 'Node color is out_degree_centrality', fontsize=7)
    plt.title('Compustat firms Supply Chain (year : ' + str(year) + ')', fontsize=12)
    cbar = plt.colorbar(mappable=nodes, cax=None, ax=None, fraction=0.015, pad=0.04)
    cbar.set_clim(0, 1)
    plt.savefig(str(year) + 'Supply Chain.png', dpi=1000)


  • Longest path :


I found that customer information is stated in two forms.

One is the sentence type and the other is the table type.

Hence, I started by dividing the 10-k text into two categories, text and table (using HTML tags).
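A minimal sketch of that split, assuming the filing is raw HTML and that everything between `<table>` and `</table>` counts as table content. This is a regex illustration of my own; real filings are messy enough that a proper HTML parser would be safer:

```python
import re

def split_text_and_tables(html):
    """Return (text_tokens, table_blocks) by cutting out <table>...</table> spans."""
    table_re = r"<table.*?>.*?</table>"
    tables = re.findall(table_re, html, flags=re.S | re.I)
    text = re.sub(table_re, " ", html, flags=re.S | re.I)  # remove tables from the text
    text = re.sub(r"<[^>]+>", " ", text)                   # strip remaining tags
    return text.split(), tables

doc = ("<p>Fujitsu accounted for 31 percent of sales.</p>"
       "<TABLE><tr><td>Fujitsu</td><td>31%</td></tr></TABLE>")
words, tables = split_text_and_tables(doc)
print(len(tables), words[:3])  # 1 ['Fujitsu', 'accounted', 'for']
```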

The methods for dealing with them, however, are similar: TEXT CLASSIFICATION

*GOAL 1 (sentence form):

Classifying sentences by whether or not they are relevant to customer information.

Example Sentences:

Net sales to the Company’s three major customers, Staples, Inc., Office Max, and United Stationers, Inc., represented approximately 43% in 2004, 46% in 2003 and 46% in 2002.

For fiscal 2003, Fujitsu accounted for approximately 31 percent of our consolidated accounts receivable and approximately 13 percent of our consolidated gross sales.

In 2004, Matyep in Mexico represented 11.0% of our consolidated revenues and Burlington Resources Inc. represented 10.1%.

Fleetwood was the Company’s largest customer in 2004, representing approximately 31% of total sales.

I hoped there would be some rules or sentence structures that could cover all the customer information in 10-k filings. I tried manually finding those rules and ended up with 24 kinds of sentences. Although they helped me find every sentence containing customer information listed in the Compustat data (used as a reference point throughout my research), some of the sentences captured by those 24 rules have nothing to do with revenue information.
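One such rule might look like the following. This is my own illustrative pattern, not one of the author's 24: a capitalized name followed by "accounted for" or "represented/representing" and a percentage, matched against sentences like the examples above.

```python
import re

# Illustrative rule (my own): capitalized name + accounted/represented + percentage
rule = re.compile(
    r"([A-Z][\w&. ]+?)\s+(?:accounted for|represented|representing)\s+"
    r"(?:approximately\s+)?(\d+(?:\.\d+)?)\s*(?:%|percent)"
)

hit = rule.search("For fiscal 2003, Fujitsu accounted for approximately 13 percent "
                  "of our consolidated gross sales.")
miss = rule.search("Our headquarters represented a significant investment.")
print(hit.groups())  # ('Fujitsu', '13')
print(miss)          # None -- no percentage, so the rule rejects it
```

As the text notes, hand-written rules like this over-generate, which is what motivates the machine-learning filter described next.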

To get rid of those irrelevant sentences, I adopted machine-learning techniques.

*Annotation (GOLD-STANDARD):









[USVC 1] Scraping 10-k disclosure DATA from SEC

  • I think the site below shows a more precise way to obtain 10-k data.



  • I did this early on while learning Python, so I simply plugged each CIK number from the set I had into the 10-k-with-CIK-number URL in turn and downloaded the filings.
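That CIK-substitution approach can be sketched as building the EDGAR company-browse URL per CIK (the endpoint below is SEC EDGAR's public browse URL; the example CIKs and the fact that the download step is omitted are my own simplifications):

```python
# Build the EDGAR browse URL for each CIK; fetching the pages is omitted here.
BASE = ("https://www.sec.gov/cgi-bin/browse-edgar"
        "?action=getcompany&CIK={cik}&type=10-K&dateb=&owner=include&count=40")

ciks = ["0000320193", "0000789019"]  # example CIKs (Apple, Microsoft)
urls = [BASE.format(cik=c) for c in ciks]
print(urls[0])
```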





[Ongoing Project] Drawing Supply Chain from SEC 10-k DATA (using python)

I boldly wrote the title in English, but for now the content is in Korean…

  • Goal : drawing the supply chain among firms in the U.S.


  • Data : the 10-k filings that U.S. firms report to the SEC

To manage firms' revenue information more transparently and to flag the risk posed by revenue concentration, a firm is required to report in its 10-k any customer (firm) that accounts for more than 10% of its total revenue.

I cannot recall the relevant regulation at the moment, so I leave it blank here. (As I remember, it took effect in the early 2000s.)

  • Motivation : understanding the dynamics between firms