Is this even doable?

riho

New member
So I want to write a script in PHP that parses SEC filings on www.sec.gov.

The problem with the filings is that each company uses a different structure and layout for its filings. Some are in HTML and some are in plaintext.
But some keywords in the text are always the same, like "Net Income" and "Total current assets".

Here are some sample links:
http://www.sec.gov/Archives/edgar/data/40730/000095012407001502/0000950124-07-001502.txt
http://www.sec.gov/Archives/edgar/data/1050797/0000893877-99-000199.txt
 
If every filing is different, you may have a problem. But if there is some consistency, you can write a text parser that handles the different formats. You need to identify the patterns that let you recognize each field. Worst case, you may have to store a confidence level with each parse result and manually verify the low-confidence entries. The confidence level could be based on the number of fields that were parsed successfully.
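
Here is a rough sketch of that idea, to make it concrete. It is written in Python only for brevity; the same regular expressions should carry over to PHP's preg_match. The field list, the number pattern, and the way confidence is computed (fraction of fields found) are all my own assumptions, not anything the SEC defines, and real filings will need more patterns than this (for example "Net income (loss)").

```python
import re
from html import unescape

# Hypothetical field list: the two labels mentioned in the question.
FIELDS = ["Net Income", "Total current assets"]


def strip_html(text):
    """Crudely drop tags and decode entities so HTML and plaintext look alike."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def parse_filing(text, fields=FIELDS):
    """Return (values, confidence) for one filing.

    confidence is the fraction of requested fields that were found,
    so 1.0 means every field matched and 0.0 means none did.
    """
    text = strip_html(text)
    values = {}
    for label in fields:
        # Label, then filler (dots, spaces, "$") on the same line, then a
        # number. A leading "(" is treated as an accounting-style negative.
        pattern = (
            re.escape(label)
            + r"[^\d\n(]*(\(?)\$?[ \t]*(\d[\d,]*(?:\.\d+)?)"
        )
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            value = float(match.group(2).replace(",", ""))
            if match.group(1):
                value = -value
            values[label] = value
    confidence = len(values) / len(fields)
    return values, confidence
```

You would then store the confidence next to each result and send anything below some cutoff you choose (say, anything under 1.0) to manual review.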

You could also try to talk the government into providing this information as an XML stream!
 