Is this even doable?

riho · Nov 21, 2007

So I want to write a script in PHP that parses SEC filings from www.sec.gov.

The problem with the filings is that each company uses a different structure and layout for its filings. Some are in HTML and some are in plain text.
But some keywords in the text are always the same, like "Net Income" and "Total current assets".

Here are some sample links:
http://www.sec.gov/Archives/edgar/data/40730/000095012407001502/0000950124-07-001502.txt
http://www.sec.gov/Archives/edgar/data/1050797/0000893877-99-000199.txt

SD-[Inc] · Nov 21, 2007

If every filing is different you may have a problem. But if there is some consistency, you can create a text parser that handles the different formats. You need to identify the patterns that let you recognize each text field. Worst case, you may have to save a confidence level with each parse output and manually verify the low-confidence entries. The confidence level could be based on the number of successfully parsed fields.
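
A rough sketch of that idea, in case it helps. It is written in Python for brevity; the same regular expressions should port to PHP's preg_match. The field list, the function names and the "fraction of fields found" confidence score are all my own assumptions for illustration, not anything defined by the SEC, and real filings will need more patterns than this.

```python
import html
import re

# Hypothetical list of labels to look for; a real parser would need many more.
FIELDS = ["Net income", "Total current assets"]

TAG = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Drop HTML tags and decode entities so HTML and plain-text filings look alike."""
    return html.unescape(TAG.sub(" ", text))


def find_value(plain, label):
    """Return the first number following `label`, or None if nothing matches."""
    # Allow any run of whitespace between the words of the label.
    label_re = r"\s+".join(re.escape(word) for word in label.split())
    # Skip up to 40 filler characters (dots, spaces, "$"), then read a number.
    # An opening parenthesis is treated as the accounting sign for a negative.
    pattern = label_re + r"[^0-9(]{0,40}(\(?)\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)"
    match = re.search(pattern, plain, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(2).replace(",", ""))
    return -value if match.group(1) else value


def parse_filing(text, fields=FIELDS):
    """Return (values, confidence); confidence is the fraction of fields found."""
    plain = strip_markup(text)
    values = {}
    for label in fields:
        value = find_value(plain, label)
        if value is not None:
            values[label] = value
    return values, len(values) / len(fields)
```

You would store the confidence next to each parsed filing and send anything below a threshold you choose (for example, anything under 1.0) to manual review. Note that this takes the first number after the label, so a label that also appears in a heading or a per-share line can produce a wrong value; that is exactly the kind of case the manual check is for.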

You could also try to talk the government into providing this information as an XML stream!
