DIMMiN Notes

My working notes related to the development of dimmin.com.

2025-10-01-Wednesday

  • Finished applying the data validation hierarchy idea; I'm now working with a composite dataset called composite_raw.csv that I can use to start building my ML-ready dataset
  • Normalized score, rank, year, country (which certainly needs more work), and farm_size_square_meters (the rows I checked seemed OK, but it could also use a validation check)
  • Started extracting the different varieties from each row. Decided to stick with 12 "canonical" coffee varieties since we don't want to explode to 70 different columns when we apply one-hot encoding.
  • Tried multi-shot calling on Ollama's API to see if a consensus could be reached about which coffee varieties were present for a given lot. Little success so far, but the code is in place for other rows if they need cleaning.
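
The one-hot step above can be sketched roughly like this, assuming pandas; the variety names and helper functions are hypothetical stand-ins for the real canonical list in the cleaning code:

```python
import pandas as pd

# Hypothetical canonical list -- the real 12 names live in the cleaning script.
CANONICAL_VARIETIES = [
    "Geisha", "Bourbon", "Caturra", "Catuai", "Typica", "Pacamara",
    "SL28", "Castillo", "Maragogipe", "Pacas", "Mundo Novo", "Other",
]

def extract_varieties(raw) -> list:
    """Map a free-text variety string onto the canonical list."""
    if not isinstance(raw, str):
        return []
    found = [v for v in CANONICAL_VARIETIES if v.lower() in raw.lower()]
    return found or ["Other"]  # anything unrecognized falls into the catch-all

def one_hot_varieties(df: pd.DataFrame, col: str = "variety") -> pd.DataFrame:
    """Add one indicator column per canonical variety (multi-label one-hot)."""
    for v in CANONICAL_VARIETIES:
        df[f"variety_{v}"] = df[col].map(lambda s, v=v: int(v in extract_varieties(s)))
    return df
```

Because a single lot can list several varieties, this is multi-label one-hot rather than `pd.get_dummies` on a single categorical column.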

2025-09-30-Tuesday

  • Re-defined the goals of the data cleaning step for this analysis (mainly establishing competition pages as a backup and re-orienting my focus on the lot-level data)
  • Extracted relevant data from existing lot pages in parse_lot_data.py
  • Combined this relevant data with the output of my competition data parser, applying a left join on the lot's url field
  • Established the start of a data validation hierarchy where missing data is filled in using data that is known in a specific order (LLM consensus < auction_table < score_table << lot)
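
The fill order can be sketched with pandas' combine_first, passing sources from least- to most-trusted so that lot-level values win every conflict (the function name and frames here are illustrative, not the actual pipeline code):

```python
import pandas as pd

def fill_by_priority(frames: list, key: str = "url") -> pd.DataFrame:
    """Combine sources ordered least- to most-trusted; later frames overwrite
    earlier ones wherever they have a non-null value.

    Mirrors the hierarchy LLM consensus < auction_table < score_table << lot:
    pass the frames in that order and lot-level data wins every conflict.
    """
    indexed = [f.set_index(key) for f in frames]
    result = indexed[0]
    for f in indexed[1:]:
        # combine_first keeps f's non-null values, falling back to `result`
        result = f.combine_first(result)
    return result.reset_index()
```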

2025-09-29-Monday

  • Found a quick way to merge the most relevant auction and score data at the competition level (however it might be useful to keep in mind that this data will be supplementary to the data provided in the lot page)
  • Tried using the Ollama API to extract data from the raw HTML of a competition page in an attempt to side-step this whole data cleaning process. Didn't work.
  • Found that I'm missing some pages (like this one) on my local machine even though I grepped the 20210921_coe_spider.log logs and found that a GET request was successfully made. This points to additional data that still needs to be collected.
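
A rough sketch of that log-vs-disk check, assuming Scrapy's default "Crawled (200) <GET url>" log line and an offline tree that mirrors URL paths (both assumptions; the real save layout may differ):

```python
import re
from pathlib import Path

# Matches Scrapy's default "Crawled (200) <GET https://...>" log lines
GET_200 = re.compile(r"\(200\)\s+<GET\s+(https?://[^>\s]+)>")

def find_missing_pages(log_path: str, html_root: str) -> set:
    """URLs that got a successful GET in the spider log but have no file on disk.

    Assumes the offline tree mirrors the site's URL paths under html_root
    (the actual mapping lives in the spider's save logic).
    """
    logged = set()
    for line in Path(log_path).read_text(errors="ignore").splitlines():
        m = GET_200.search(line)
        if m:
            logged.add(m.group(1))
    missing = set()
    for url in logged:
        rel = url.split("://", 1)[1].rstrip("/")
        local = Path(html_root) / rel
        if not local.exists() and not (local / "index.html").exists():
            missing.add(url)
    return missing
```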

2025-09-26-Friday

  • Developed a function to identify primary tables (tables that link between competition pages and lots)
  • Found out I was missing most of the lot competition data (a total of 240 lots) from 2020
  • Also found out that there are some competition pages that don't have any lot links (indicating that I don't actually have the data I need to use as input for those coffees)
  • Parsed the Wayback Machine for missing competition / lot data from 2020 and found some of it.
  • Found out that the Wayback Machine has a useful REST API. Tried working with Claude to build middleware for the coe Scrapy spider so that 404 or unknown links are redirected to the most recent Wayback Machine snapshot of the HTML page. If I can get this to work, it could be a really useful little backup plan for any web scraping project.
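
The fallback idea could be sketched like this; the availability endpoint is the Wayback Machine's real API, but the middleware class, meta key, and helper names are hypothetical and untested:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Wayback Machine availability API: returns the closest archived snapshot
WAYBACK_AVAILABLE = "https://archive.org/wayback/available?url={}"

def parse_availability(data: dict):
    """Pull the snapshot URL out of an availability-API response, if any."""
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None

def latest_snapshot_url(url: str):
    """Query the availability API for the most recent snapshot of `url`."""
    with urlopen(WAYBACK_AVAILABLE.format(quote(url, safe=""))) as resp:
        return parse_availability(json.load(resp))

class WaybackFallbackMiddleware:
    """Hypothetical Scrapy downloader middleware: retry 404s via the archive."""

    def process_response(self, request, response, spider):
        if response.status == 404 and not request.meta.get("wayback"):
            snapshot = latest_snapshot_url(request.url)
            if snapshot:
                # mark the retry so a 404 from the archive doesn't loop
                return request.replace(url=snapshot,
                                       meta={**request.meta, "wayback": True})
        return response
```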

2025-09-25-Thursday

  • Found out how to call the local ollama API to run local models at scale using Leviathan
  • Fixed a corrupted competition HTML file for Guatemala 2013
  • Found that the Brazil 2025 and Peru 2025 competitions do not have any data tables yet
  • Identified all unique feature names across my different competition tables, then used Claude to create a mapping between the relevant features I need from the competition pages and those in the JSON schema I developed.
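
Calling the local Ollama API boils down to a POST against its /api/generate endpoint; the model name below is a placeholder:

```python
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Non-streaming generate request body for Ollama's REST API."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Call the local Ollama server and return the model's text response."""
    req = Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["response"]
```

Running this in a loop over rows (or several times per row for the multi-shot consensus idea) is just repeated calls to `generate`.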

2025-09-24-Wednesday

  • Adjusted offline pipeline to handle files by directory rather than CSS selector
  • Created a database schema based on existing data that can be used as input to my algorithm
  • Found a weird bug where the file feature does not seem to consist of only file:/// paths (this may have more to do with Python Pandas or LibreOffice than with my JSON structure, though)
  • Created a prompt I can send to an LLM that takes a coffee description as input and produces output matching the schema (this could change the emphasis / direction of CupUp - An Analysis of Optimal Coffee, adding a section oriented towards LLMs for data cleaning).
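
A quick diagnostic for the file:/// bug might look like this (a sketch, not the actual check):

```python
def non_file_uris(values) -> list:
    """Return entries from the `file` column that are not file:/// URIs,
    to pin down which rows (and which tool) introduced the odd values."""
    return [v for v in values
            if not (isinstance(v, str) and v.startswith("file:///"))]
```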

2025-09-23-Tuesday

  • Rebuilt my coe Scrapy Spider so that original response URLs are saved as comments
  • Re-ran the coe spider to establish a local directory structure that mimics that of the site (for better offline processing)
  • Built a component of the offline pipeline to extract data from each of the competition pages so I can cross-reference data at the individual / lot level
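
Reading a saved URL back out of a page could look roughly like this, assuming the spider writes it as a bare HTML comment (the exact comment format is an assumption):

```python
import re
from pathlib import Path

# Assumed comment format: <!-- https://original/url -->
URL_COMMENT = re.compile(r"<!--\s*(https?://[^\s>]+)\s*-->")

def original_url(html_path: str):
    """Recover the original response URL saved as an HTML comment,
    or None if the file has no such comment."""
    text = Path(html_path).read_text(errors="ignore")
    m = URL_COMMENT.search(text)
    return m.group(1) if m else None
```

Keeping the original URL in the file is what makes the lot-level cross-referencing possible once everything is offline.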

2025-09-22-Monday

  • Extracted links from Scrapy Spider log files to prevent duplicate queries
  • Collected data on 5,612 different lots of coffee from the Cup of Excellence website
  • Built a pipeline for both the /farm-directory/ and /listing/ pages to store semi-structured JSON-formatted data
  • Created first semi-structured dataset of COE data
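
Extracting the already-requested links from the logs can be sketched as below, assuming Scrapy's standard "<GET url>" log lines:

```python
import re
from pathlib import Path

GET_RE = re.compile(r"<GET (https?://[^>\s]+)>")

def seen_urls(*log_paths: str) -> set:
    """Collect every URL the spider already requested, from its log files,
    so a re-run can skip them instead of re-querying the site."""
    seen = set()
    for path in log_paths:
        for m in GET_RE.finditer(Path(path).read_text(errors="ignore")):
            seen.add(m.group(1))
    return seen
```

The resulting set can be checked in the spider before yielding a request, turning the logs into a cheap de-duplication cache.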

2025-09-21-Sunday

2025-09-20-Saturday
