2024-12-15
As a machine learning nerd, I’ve always dreamed of outsmarting the market and proving my genius. (Spoiler: it’s harder than it looks.) Without a unique edge, beating the market is nearly impossible. That’s where Ontario’s geological data comes in. In recent months I came across a source of data through one of DocuPanda's Enterprise users, VerAI Discoveries. They use machine learning to predict where to drill for minerals, and as part of their collaboration with DocuPanda they are standardizing vast amounts of documents that are publicly available via the Canadian stock exchange. Cue the light bulb moment: what if I could finally make some money trading the market? Here is a mountain of largely untapped data - perhaps it can be leveraged to predict the stock prices of mining companies.
OK, so the source of data can be found here: https://www.hub.geologyontario.mines.gov.on.ca
It's a publicly available database of mineral exploration data in Ontario, Canada. It's a huge unstructured dataset, but that's part of our edge! Using DocuPanda, we can make sense of this giant, messy pile of handwritten data. Why is this even here? Turns out, the Canadian government requires mining companies to submit their findings to this public database. Lucky for us, their bureaucracy is our treasure trove. Let’s get cracking!
By browsing the website, we find that it lets you download a file called OGSDataListing.kml - a KML file that contains the locations of all the mineral exploration sites in Ontario. If we install Google Earth, we can view a map of all the sites:
Now, by clicking on any one of these squares, we get a webpage displayed inside Google Earth, with a link to an Excel file containing metadata on all the reports and filings for that region. This is a great start: if we can somehow grab all these Excel files, we can use them as a launching point for systematically scraping all the data. So how do we scrape these Excels? Obviously I don't plan to manually click on every rectangle in Canada and download them one by one. There are like... thousands. So here's where ChatGPT comes in.
I open my trusty IDE, where I have multiple LLMs set up via their APIs. I personally find this more effective than copy-pasting into the playgrounds or web interfaces, because I can add entire files to the prompt much more easily. So I start by explaining what the situation is, what we're trying to do, and loading the KML file into the prompt. I basically ask the AI hivemind (in this case, my go-to LLM is either o1-preview or Claude 3.5 Sonnet) how I can scrape all these Excel files from the website.
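For the curious, a call like this is roughly what I mean - a minimal sketch using the OpenAI Python client, where the prompt wording and file path are just placeholders:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paste the KML straight into the prompt (in practice you may want to truncate it
# if the file is too large for the model's context window)
kml_text = open("OGSDataListing.kml", encoding="utf-8").read()

prompt = (
    "I want to scrape all the Excel files linked from the Ontario OGS website. "
    "Here is the KML listing file I downloaded - write a Python script that "
    "extracts the KMZ URLs from it:\n\n" + kml_text
)

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)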
And lo and behold, it solves it. It gives me a script that knows how to load this obscure KML file and make sense of it. Check out this snippet:
import xml.etree.ElementTree as et

def extract_kmz_urls(kml_file):
    # Parse the KML (it's just XML) and register its namespace
    tree = et.parse(kml_file)
    root = tree.getroot()
    ns = {'kml': 'http://www.opengis.net/kml/2.2'}
    # Find all NetworkLink/Link/href elements that point to kmz files
    urls = []
    for href in root.findall('.//kml:NetworkLink/kml:Link/kml:href', ns):
        if href.text and href.text.endswith('.kmz'):
            urls.append(href.text)
    return urls
Do I have any clue what et.parse is? Of course not. But that's the point - with LLM assistants, I don't really need to know everything. I simply run the script, and it works (more or less) on the first try. When something inevitably breaks, I just copy-paste the error back into ChatGPT, and voilà - problem solved. The only remaining problem is that it takes forever. So I ask Claude to make my script parallel, and it does - giving me something like this:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

# Create output directories
output_dir = params.Paths.data_dir.joinpath("mining", "OGSData")
output_dir.mkdir(parents=True, exist_ok=True)

# Get KMZ URLs from main KML file
kmz_urls = extract_kmz_urls(Paths.resources_dir.joinpath("mining", "OGSDataListing.kml"))
print(f"Found {len(kmz_urls)} KMZ files")

# Process each KMZ file
for kmz_url in kmz_urls:
    path_parts = Path(kmz_url).parts
    source_name = path_parts[path_parts.index('OGSDataListing')]
    source_output_dir = output_dir.joinpath(source_name)
    source_output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Processing {kmz_url}")
    html_urls = extract_html_urls_from_kmz(kmz_url)
    print(f"Found {len(html_urls)} HTML pages")

    # Parallelize the download process for each HTML page
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(download_excel_files, html_url, source_output_dir)
                   for html_url in html_urls]
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Error processing HTML page: {e}")
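The script leans on two helpers I haven't shown here, extract_html_urls_from_kmz and download_excel_files. The exact parsing is specific to how the OGS pages are laid out, but the general idea is something like the sketch below (a KMZ is just a zipped KML, and each linked HTML page has anchors pointing at Excel files - the selectors here are assumptions, not the exact code ChatGPT produced):

import io
import zipfile
import requests
import xml.etree.ElementTree as et
from urllib.parse import urljoin
from pathlib import Path
from bs4 import BeautifulSoup

def extract_html_urls_from_kmz(kmz_url):
    """Download a KMZ, unzip the KML inside, and pull out the HTML page links."""
    resp = requests.get(kmz_url, timeout=60)
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        # A KMZ usually contains a single .kml document at its root
        kml_name = next(n for n in zf.namelist() if n.endswith('.kml'))
        root = et.fromstring(zf.read(kml_name))
    ns = {'kml': 'http://www.opengis.net/kml/2.2'}
    urls = []
    # Assume the per-site pages are linked from each Placemark's description blob
    for desc in root.findall('.//kml:Placemark/kml:description', ns):
        if desc.text:
            soup = BeautifulSoup(desc.text, 'html.parser')
            urls += [a['href'] for a in soup.find_all('a', href=True)
                     if a['href'].lower().endswith(('.htm', '.html'))]
    return urls

def download_excel_files(html_url, output_dir):
    """Fetch one HTML page and save every Excel file it links to."""
    resp = requests.get(html_url, timeout=60)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        if a['href'].lower().endswith(('.xls', '.xlsx')):
            file_url = urljoin(html_url, a['href'])
            target = Path(output_dir) / Path(file_url).name
            target.write_bytes(requests.get(file_url, timeout=60).content)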
So within just a few minutes, my laptop is diligently downloading all the Excel files from the website. I sit back and have some coffee while over five thousand files are downloaded. There's something satisfying about watching your script systematically collect tons of files that would have taken weeks to download manually. It's very neat that ChatGPT can unlock coding abilities that previously felt somewhat hard (or at least would have required a lot of trial and error plus Googling), but are now largely just a few prompts away.
Now that we have all these Excel files, let's open a few to see what is in there:
Some of the Excel sheets contain mostly junk, but three seem to have substantial data:
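If you'd rather poke at the workbooks programmatically than open them one by one, a quick way to check which tabs actually hold data is to list each file's sheets and row counts - a small sketch assuming pandas and openpyxl are installed, and that the files landed in an OGSData folder:

from pathlib import Path
import pandas as pd

# Peek at a handful of downloaded workbooks and report what each tab contains
for xlsx_path in sorted(Path("OGSData").rglob("*.xlsx"))[:5]:
    workbook = pd.ExcelFile(xlsx_path)
    print(xlsx_path.name)
    for sheet in workbook.sheet_names:
        df = workbook.parse(sheet)
        print(f"  {sheet}: {len(df)} rows x {len(df.columns)} columns")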
OK, so we have thousands of Excel files, each Excel file has multiple tabs, and each tab has a bunch of data. In short, the Excel files are our treasure maps, pointing us to web pages where we can scrape metadata or download the linked PDFs. This is the kind of thing that used to take real know-how to do, but now with ChatGPT, we can short-circuit the process and get moving in no time. We will conclude the data scraping in the next episode, and then move on to the fun stuff - exploring and standardizing the data with DocuPanda!
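To give a flavor of that next step, here is a rough sketch of how we might harvest those links out of the spreadsheets with pandas - the folder name is a stand-in, and in reality the layout varies from file to file, so the real version needs more care:

from pathlib import Path
import pandas as pd

def collect_report_links(xlsx_path):
    """Scan every tab of one workbook and collect anything that looks like a URL."""
    links = []
    workbook = pd.ExcelFile(xlsx_path)
    for sheet in workbook.sheet_names:
        df = workbook.parse(sheet, dtype=str)
        for column in df.columns:
            links += [v for v in df[column].dropna() if v.startswith("http")]
    return links

all_links = []
for xlsx_path in Path("OGSData").rglob("*.xlsx"):
    all_links += collect_report_links(xlsx_path)
print(f"Collected {len(all_links)} candidate report links")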
Curious how we’ll turn this mountain of PDFs into stock market gold? Continue to the next episode: Web Scraping OGS.