@anshoomehra
Last active April 25, 2025 02:19
How to Parse 10-K Report from EDGAR (SEC)
@monashjg

monashjg commented Jun 6, 2023

May I know how to remove the footer text, "Apple Inc. | 2018 Form 10-K |", as well as the page numbers, from the generated text?
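
One possible cleanup step, as a minimal sketch: strip the recurring footer (and any page number that follows it) from the extracted text with a regex. This assumes the footer always reads exactly "Apple Inc. | 2018 Form 10-K |"; the pattern would need adjusting for other filers.

import re

# Toy sample of extracted text containing the recurring footer and a page number.
raw_text = "...competition and market risks. Apple Inc. | 2018 Form 10-K | 7 The Company also..."

# Remove the footer together with any page number that immediately follows it.
footer_pattern = re.compile(r'Apple Inc\.\s*\|\s*2018 Form 10-K\s*\|\s*\d*')
clean_text = footer_pattern.sub('', raw_text)
print(clean_text)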

@AlessandroVentisei

Thanks for this! I've followed the steps to get historical numeric data and made a free API in case anyone else wants the data for training AI, etc.
https://rapidapi.com/alexventisei2/api/sec-api2

@thegallier

I think the line below assumes the same number of entries for every item, which is not necessarily the case; for NYT, for example, there are more Item 1A matches than Item 1B matches, and the approach does not work. I would also add re.IGNORECASE to the re.compile call.

pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
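
A minimal sketch of both suggestions, assuming the notebook's general approach of regexing the item headings into a DataFrame (the heading pattern and sample text below are illustrative, not the exact gist code):

import re
import pandas as pd

# Toy stand-in for the filing text; in the notebook this is the raw 10-K document.
raw_10k = (">Item 1A. Risk Factors ... >ITEM 1A. (section body) ... "
           ">Item 1B. Unresolved Staff Comments ... >Item 7. MD&A ... >Item 8. Financial Statements ...")

# Heading regex with re.IGNORECASE added, as suggested.
regex = re.compile(r'>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1}', re.IGNORECASE)

test_df = pd.DataFrame(
    [(m.group(), m.start(), m.end()) for m in regex.finditer(raw_10k)],
    columns=['item', 'start', 'end'],
)
test_df['item'] = test_df['item'].str.upper().str.extract(r'(1A|1B|7A|7|8)', expand=False)

# Unequal match counts per item are the warning sign described above; check them
# before trusting drop_duplicates(keep='last') to pick the section start.
print(test_df.groupby('item')['start'].count())

pos_dat = (test_df.sort_values('start', ascending=True)
                  .drop_duplicates(subset=['item'], keep='last')
                  .set_index('item'))
print(pos_dat)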

@VadarVillage

This was very helpful, thank you for taking the time to post this

@niravsatani24

Amazing! Thanks for sharing.

@rabsher

rabsher commented Dec 4, 2023

I have the HTML URL, but I don't know how to get the .txt URL of the 10-K filing; after that I am able to use the notebook code above.

Can anyone help me, please?
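
A minimal sketch of one way to derive it, assuming the standard EDGAR Archives layout (.../Archives/edgar/data/<CIK>/<accession-without-dashes>/<document>.htm), where the complete submission text file sits in the same folder as <accession-with-dashes>.txt:

from urllib.parse import urlparse

def txt_url_from_html_url(html_url: str) -> str:
    """Build the complete-submission .txt URL from an EDGAR document (.htm) URL."""
    parts = urlparse(html_url).path.strip('/').split('/')
    accession = parts[-2]  # accession number without dashes, e.g. 000157199624000036
    dashed = f"{accession[:10]}-{accession[10:12]}-{accession[12:]}"
    return f"{html_url.rsplit('/', 1)[0]}/{dashed}.txt"

# Example with the Dell filing URL mentioned later in this thread:
print(txt_url_from_html_url(
    "https://www.sec.gov/Archives/edgar/data/1571996/000157199624000036/dell-20240202.htm"))
# -> https://www.sec.gov/Archives/edgar/data/1571996/000157199624000036/0001571996-24-000036.txt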

@versatile712

Jesus, you saved my life!

@Tarun3679

I just tried this, and it does not seem to return anything for the example above?

@rabsher

rabsher commented Mar 24, 2025

I just tried this, and it does not seem to return anything for the example above?

import requests

url = "https://www.sec.gov/Archives/edgar/data/1571996/000157199624000036/dell-20240202.htm"  # must be .htm
headers = {
    "User-Agent": 'get it from sec website',  # required by the SEC
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.sec.gov'
}
response = requests.get(url, headers=headers)  # was file_url, which is undefined; use url
html_content = response.text.replace('\xa0', ' ')  # normalize non-breaking spaces

You can use this code to fetch a 10-K filing. Once you have the HTML, you can write your own regex function to parse specific content from it, or get the complete 10-K filing as text.
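
For example, a minimal sketch of that parsing step, assuming BeautifulSoup is installed and that the item headings appear as plain "Item 1A." / "Item 1B." text (real filings vary, so the pattern usually needs tuning):

import re
from bs4 import BeautifulSoup

# html_content comes from the requests snippet above.
text = BeautifulSoup(html_content, 'html.parser').get_text(separator=' ')
text = re.sub(r'\s+', ' ', text)  # collapse whitespace

# Rough cut: everything between the "Item 1A." heading and the "Item 1B." heading.
match = re.search(r'Item\s+1A\..*?(?=Item\s+1B\.)', text, re.IGNORECASE | re.DOTALL)
risk_factors = match.group() if match else ''
print(risk_factors[:500])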

@Tarun3679

Does anyone know of a similar script to retrieve 10-Q filings?

@john-friedman

@Tarun3679
https://github.com/john-friedman/datamule-python

from datamule import Portfolio

portfolio = Portfolio('10q')
portfolio.download_submissions(submission_type='10-Q', ticker='MSFT')

for document in portfolio.document_type('10-Q'):
  document.parse()
  print(document.data)
