I found that customer information appears in 10-K filings in two forms.
One is sentences and the other is tables.
Hence, I started by dividing the 10-K text into two categories, text and tables, using the HTML tags.
The method for handling both, however, is similar: TEXT CLASSIFICATION
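The split into text and tables described above can be sketched as follows. This is only a minimal illustration using Python's standard `html.parser`; the names `FilingSplitter` and `split_filing` are hypothetical, and it assumes the filing marks tables with ordinary `<table>`, `<tr>`, `<td>` and `<th>` tags, which real 10-K filings do not always do cleanly.

```python
from html.parser import HTMLParser


class FilingSplitter(HTMLParser):
    """Separate running text from table rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # > 0 while the parser is inside a <table>
        self.text_chunks = []  # text found outside any table
        self.tables = []       # each table is a list of rows of cell strings
        self._row = None       # row being built, or None
        self._cell = None      # cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
            if self.table_depth == 1:
                self.tables.append([])  # nested tables fold into the outer one
        elif self.table_depth and tag == "tr":
            self._row = []
        elif self.table_depth and tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append(" ".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse whitespace
        if not data:
            return
        if self.table_depth == 0:
            self.text_chunks.append(data)
        elif self._cell is not None:
            self._cell.append(data)


def split_filing(html):
    """Return (plain_text, tables) for one HTML document."""
    parser = FilingSplitter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_chunks), parser.tables


# Made-up sample input, not taken from a real filing.
sample = (
    "<p>Fleetwood was the largest customer in 2004.</p>"
    "<table><tr><th>Customer</th><th>Share of sales</th></tr>"
    "<tr><td>Fleetwood</td><td>31%</td></tr></table>"
)
text, tables = split_filing(sample)
```

Here `text` holds the sentence material and `tables` holds the rows, so each can then be passed to its own classifier.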
*GOAL 1 (sentence form):
Classify each sentence as relevant or not relevant to customer information. Examples of relevant sentences:
Net sales to the Company’s three major customers, Staples, Inc., Office Max, and United Stationers, Inc., represented approximately 43% in 2004, 46% in 2003 and 46% in 2002.
For fiscal 2003, Fujitsu accounted for approximately 31 percent of our consolidated accounts receivable and approximately 13 percent of our consolidated gross sales.
In 2004, Matyep in Mexico represented 11.0% of our consolidated revenues and Burlington Resources Inc. represented 10.1%.
Fleetwood was the Company’s largest customer in 2004, representing approximately 31% of total sales.
I hoped that there would be some rules or sentence structures that could cover all the customer information in a 10-K. I tried finding those rules manually and ended up with 24 kinds of sentences. Together they find every sentence that contains customer information listed in the Compustat data (which I used as the reference point throughout my research), but some of the sentences matched by those 24 rules have nothing to do with the revenue information.
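To make the idea of a sentence rule concrete, here is a rough sketch of what one such rule could look like as a regular-expression filter. It is not one of the 24 rules, which are not listed here; the pattern names, the trigger words and the function `is_candidate` are all assumptions chosen for illustration. The last sentence in the example is made up to show how a rule like this also catches sentences unrelated to customers.

```python
import re

# Hypothetical rule: a percentage plus a trigger word in the same sentence.
PERCENT = re.compile(r"\d+(?:\.\d+)?\s*(?:%|percent\b)", re.IGNORECASE)
TRIGGER = re.compile(
    r"\b(?:customers?|accounted for|represent(?:ed|ing|s)?)\b",
    re.IGNORECASE,
)


def is_candidate(sentence):
    """Return True if the sentence matches the illustrative rule."""
    return bool(PERCENT.search(sentence) and TRIGGER.search(sentence))


relevant = (
    "Fleetwood was the Company’s largest customer in 2004, "
    "representing approximately 31% of total sales."
)
# Made-up sentence: matches the rule but says nothing about customers.
irrelevant = "International operations represented 12% of total assets."

print(is_candidate(relevant))    # True
print(is_candidate(irrelevant))  # True, a false positive
```

A rule this loose keeps recall high, which is why a second filtering step is still needed for the false positives.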
To get rid of those irrelevant sentences, I adopted machine learning techniques.