DIMMiN Notes
My working notes related to the development of dimmin.com.
2025-10-01-Wednesday
- Finished applying the data validation hierarchy idea; now working with a composite dataset called `composite_raw.csv` that I can use to start building my ML-ready dataset
- Normalized `score`, `rank`, `year`, `country` (certainly needs work), and `farm_size_square_meters` (the rows I checked seemed OK, but it could also use a validation check)
- Started extracting the different varieties from each row. Decided to stick with 12 "canonical" coffees since we don't want to explode to 70 different varieties when we apply One-Hot Encoding.
- Tried multi-shot calling on ollama's API to see if a consensus could be reached about which coffee varieties were present for a given lot. Little success so far, but the code is established for other rows if they need to be cleaned.
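The canonicalization and multi-shot consensus steps can be sketched roughly like this (the canonical list, threshold, and function names are hypothetical stand-ins; in the real pipeline each vote would come from one ollama `/api/generate` call):

```python
from collections import Counter

# Hypothetical subset of the canonical set; the real list of 12 canonical
# coffees lives in the project's cleaning code.
CANONICAL = {"Geisha", "Bourbon", "Caturra", "Catuai", "Typica", "Pacamara"}

def canonicalize(raw_variety: str) -> list[str]:
    """Map a free-text variety string onto the canonical names it mentions."""
    text = raw_variety.lower()
    return sorted(v for v in CANONICAL if v.lower() in text)

def consensus(votes: list[list[str]], threshold: float = 0.5) -> list[str]:
    """Multi-shot consensus: keep a variety only if a majority of the
    model's repeated answers include it."""
    counts = Counter(v for vote in votes for v in set(vote))
    return sorted(v for v, n in counts.items() if n / len(votes) > threshold)

def one_hot(varieties: list[str]) -> dict[str, int]:
    """One-hot encode against the canonical set (12 columns in the real data)."""
    return {f"variety_{v}": int(v in varieties) for v in sorted(CANONICAL)}
```

A strict majority threshold keeps a single hallucinated answer from adding a variety, at the cost of dropping varieties the model only finds intermittently.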
2025-09-30-Tuesday
- Re-defined the goals of the data cleaning step for this analysis (mainly establishing `competition` pages as a backup and re-orienting my focus on the `lot`-level data)
- Extracted relevant data from existing `lot` pages in `parse_lot_data.py`
- Combined this relevant data with `competition` data based on my competition data parser, applying a Left Join on the `lot`'s `url` field
- Established the start of a data validation hierarchy where missing data is filled in using data that is known, in a specific order (`LLM consensus` < `auction_table` < `score_table` << `lot`)
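The fill order above can be expressed with pandas `combine_first`, where each lower-trust frame only fills values that are still missing (a minimal sketch; the frames, index, and column are hypothetical, and the real sources come from the parsers):

```python
import pandas as pd

# Hypothetical frames indexed by the lot's url field.
lot     = pd.DataFrame({"score": [90.0, None, None]}, index=["u1", "u2", "u3"])
score_t = pd.DataFrame({"score": [89.5, 88.0, None]}, index=["u1", "u2", "u3"])
auction = pd.DataFrame({"score": [80.0, 80.0, 87.0]}, index=["u1", "u2", "u3"])

# Highest-trust source first; the chain encodes
# LLM consensus < auction_table < score_table << lot.
filled = lot.combine_first(score_t).combine_first(auction)
```

Because `combine_first` never overwrites non-null values, `lot` data always wins and the LLM consensus tier would only ever patch holes nothing else can fill.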
2025-09-29-Monday
- Found a quick way to merge the most relevant auction and score data at the `competition` level (however, it might be useful to keep in mind that this data will be supplementary to the data provided in the `lot` page)
- Tried using the ollama API to extract data from the raw HTML of a `competition` page in an attempt to side-step this whole data cleaning process. Didn't work.
- Found that I'm missing some pages (like this one) on my local machine even though I grepped the `20210921_coe_spider.log` logs and found that a GET request was successfully made. This points to additional data that still needs to be collected.
2025-09-26-Friday
- Developed a function to identify primary tables (tables that link between `competition` pages and `lot`s)
- Found out I was missing most of the lot competition data (a total of 240 `lot`s) from 2020
- Also found out that there are some `competition` pages that don't have any `lot` links (indicating that I don't actually have the data I need to use as input for those coffees)
- Parsed the Wayback Machine for missing `competition`/`lot` data from 2020 and found some of it
- Found out that the Wayback Machine has a useful REST API. Tried working with Claude to build some middleware for the `coe` Scrapy Spider so that `404` or unknown links are sent to the most recent snapshot of the HTML page from the Wayback Machine. If I can get this to work, it could be a really useful little backup plan for any web scraping project.
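The relevant endpoint is the Wayback Machine's availability API (`https://archive.org/wayback/available?url=...`), which returns JSON describing the closest archived snapshot. A stdlib sketch of the two pure pieces — building the query and parsing the payload — with the Scrapy wiring left as a comment (function names are hypothetical):

```python
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

def availability_url(page_url: str) -> str:
    """Build the availability-API query for a dead or missing URL."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"

def closest_snapshot(payload: dict):
    """Pull the closest archived snapshot URL out of the API's JSON payload,
    or None if nothing is archived. In a Scrapy downloader middleware,
    process_response would re-request this URL whenever the live site 404s."""
    snap = payload.get("archived_snapshots", {}).get("closest", {})
    return snap.get("url") if snap.get("available") else None
```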
2025-09-25-Thursday
- Found out how to call the local ollama API to run local models at scale using `Leviathan`
- Fixed a corrupted `competition` HTML file for Guatemala 2013
- Found that the Brazil 2025 and Peru 2025 `competition` pages do not have any tables of data yet
- Identified all unique feature names across my different `competition` tables, then used Claude to create a mapping between the relevant features I need from the competition page and those in the JSON schema I developed
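Applying such a mapping is just a dict lookup per header; a sketch with a hypothetical excerpt of the map (the real one covers every unique feature name found across the competition tables):

```python
# Hypothetical header -> schema-field excerpt; unknown headers pass through
# unchanged so they surface for review instead of silently disappearing.
HEADER_MAP = {
    "Farm Size (m2)": "farm_size_square_meters",
    "Size (sq. meters)": "farm_size_square_meters",
    "Score": "score",
    "Rank": "rank",
}

def normalize_row(row: dict) -> dict:
    """Rename known table headers to their JSON-schema field names."""
    return {HEADER_MAP.get(k, k): v for k, v in row.items()}
```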
2025-09-24-Wednesday
- Adjusted offline pipeline to handle files by directory rather than CSS selector
- Created a database schema based on existing data that can be used as input to my algorithm
- Found a weird bug where the `file` feature does not seem to consist of only `file:///` paths (this may have more to do with Python Pandas or LibreOffice than my JSON structure, though)
- Created a prompt I can send to an LLM that takes a coffee description as input and produces output matching the schema (this could change the emphasis / direction of CupUp - An Analysis of Optimal Coffee, adding a section oriented towards LLMs for data cleaning)
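A quick check for that bug is to isolate the `file` values that break the expected pattern so the offending rows can be inspected by hand (a trivial sketch; the function name is hypothetical):

```python
def bad_file_paths(values: list[str]) -> list[str]:
    """Return values of the `file` feature that are not file:/// paths."""
    return [v for v in values if not v.startswith("file:///")]
```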
2025-09-23-Tuesday
- Rebuilt my coe Scrapy Spider so that original response URLs are saved as comments
- Re-ran the coe spider to establish a local directory structure that mimics that of the site (for better offline processing)
- Built a component of the offline pipeline to extract data from each of the competition pages so I can cross-reference data at the individual / lot level
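The core of that extraction step — pulling row/cell text out of saved competition HTML — can be done with the stdlib alone; a minimal sketch (the real pipeline's selectors and post-processing may differ):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every table row in a saved page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

def extract_rows(html: str) -> list:
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows
```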
2025-09-22-Monday
- Extracted links from Scrapy Spider log files to prevent duplicate queries
- Collected data on 5,612 different lots of coffee from the Cup of Excellence website
- Built a pipeline for both the `/farm-directory/` and `/listing/` pages to store semi-structured, JSON-formatted data
- Created the first semi-structured dataset of COE data
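Scrapy logs each fetched page as `... DEBUG: Crawled (200) <GET https://...> (referer: ...)`, so the duplicate-prevention step reduces to a regex over the log; a sketch (the filtering hook it would feed, e.g. `start_requests`, is an assumption):

```python
import re

# Matches lines like:
#   2021-09-21 12:00:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/a> (referer: None)
CRAWLED = re.compile(r"Crawled \((\d{3})\) <GET ([^>]+)>")

def seen_urls(log_text: str) -> set:
    """Collect every successfully crawled (2xx) URL from a spider log so a
    re-run can skip those requests."""
    return {url for status, url in CRAWLED.findall(log_text)
            if status.startswith("2")}
```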
2025-09-21-Sunday
- Fixed the Blog Post's `Post` Django model so that it doesn't send email notifications in the dev environment, and so that `Post.publish_date` is updated when `is_active` is toggled from `False` to `True` (resolving issue 120 and issue 133)
- Started working on the next Blog Post, CupUp - An Analysis of Optimal Coffee
- Built a quick web scraper using Scrapy to gather all of the data I could from the Cup of Excellence competition, collecting ~97% of all coffee lots and their associated competition data