
Capstone Project - AI Yearbook Digitization

Description of Capstone

For my capstone project, I chose to help design and build an automated data extraction and analysis tool that collects and organizes large sets of raw information from past yearbooks at Charlotte Latin and turns it into a preservable digital database. The goal of the project is to reduce the manual work required to access data from decades of yearbooks and to create a digital database that preserves the information over time.

I chose this project because it allows me to apply the theoretical coding I have been learning to a real-world application. It enables me to use my abilities to create a useful tool for my school and community.

A successful project would:

  • Automatically extract relevant data from past yearbook PDFs
  • Clean and standardize the extracted data
  • Output the results in a structured format ready for analysis
  • Increase efficiency by reducing manual labor
  • Create a searchable AI database with the extracted data

Ultimately, success for this project means creating a tool that saves time, improves accuracy, and can realistically continue to be used in future years, providing an efficient way to keep adding data to a searchable digital database.

Tools Used

Software Tools:

  • Python – language used to build the extraction and processing logic
  • VS Code – editor for writing and debugging scripts
  • GitHub – version control and file management
  • Claude AI – AI assistant for code development and debugging

Workflow of Extraction:

Step 1: Configuration Setup

A YAML file is created for each yearbook to hold its metadata: page ranges for each section (e.g., clubs, sports) and expected name counts for validation. This YAML file serves as a guide for the AI throughout the entire extraction process.

Example MetaData File:

yaml
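A minimal sketch of what such a per-yearbook config might contain (the section names, page ranges, and counts below are illustrative, not taken from an actual file):

```yaml
# Hypothetical metadata file for one yearbook (all values are made up)
year: 1975
sections:
  seniors:
    pages: [12, 34]        # inclusive PDF page range for this section
    expected_names: 142    # used to validate the extraction count
  sports:
    pages: [88, 104]
    expected_names: 310
  clubs:
    pages: [105, 120]
    expected_names: 96
```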

Step 2: Parallel Extractions

The system uses Python multiprocessing to extract multiple yearbook pages simultaneously, with each parallel script extracting a different section and page range at the same time to increase speed and efficiency. These scripts target different sections of the yearbook, such as lower school rosters versus sports.

Example of Different Sections of Parallel Scripts:

parallel
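The section scripts themselves aren't reproduced here; a minimal sketch of running several hypothetical section extractors at once with Python's multiprocessing (the job list and function are stand-ins, not the actual pipeline code):

```python
from multiprocessing import Pool

# Hypothetical job list: (section name, pages) pairs taken from the YAML config.
JOBS = [("seniors", [12, 13]), ("sports", [88, 89]), ("clubs", [105])]

def extract_section(job):
    """Stand-in for a real section extractor: each section (seniors,
    sports, clubs, ...) has its own script in the actual pipeline."""
    section, pages = job
    # ... the real script would send each page to the AI extraction API ...
    return section, [f"record-from-page-{p}" for p in pages]

if __name__ == "__main__":
    # Run one worker per section so the sections are extracted in parallel.
    with Pool(processes=len(JOBS)) as pool:
        results = dict(pool.map(extract_section, JOBS))
    print(sorted(results))
```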

Step 3: Orchestration

A universal orchestration script runs all extraction scripts in parallel, guided by the pre-determined page ranges in the YAML file. It extracts the information and outputs it into a unified database for review. Name splitting is also run, so that full names are automatically parsed into components (first, middle, last, suffix).
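The name-splitting step can be sketched roughly as follows (a simplified version; the real splitter also flags low-confidence parses for later review):

```python
# Common generational suffixes to peel off before splitting (illustrative list)
SUFFIXES = {"Jr.", "Sr.", "II", "III", "IV"}

def split_name(full_name):
    """Split a full name into (first, middle, last, suffix) components."""
    parts = full_name.split()
    suffix = parts.pop() if parts and parts[-1] in SUFFIXES else ""
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1]) if len(parts) > 2 else ""
    return first, middle, last, suffix
```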

Step 4: YAML File Review

Extracted results are validated by comparing the number of detected records against the expected counts in the YAML file. Pages with mismatches are logged, while successful outputs are written to JSON files as the final product.
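The count check in this step can be sketched as a small comparison function (hypothetical; the real pipeline also logs which pages mismatched):

```python
def validate_counts(extracted, expected):
    """Compare per-section record counts against the YAML expectations.
    Returns the sections whose counts mismatch, for logging/manual review."""
    mismatches = {}
    for section, count in expected.items():
        found = len(extracted.get(section, []))
        if found != count:
            mismatches[section] = {"expected": count, "found": found}
    return mismatches
```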

Step 5: Matching Pipeline

The system automatically matches names from the sports and activities sections to person records using different matching strategies. This keeps records of the different activities and clubs students have participated in.
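The source doesn't list the exact strategies; a sketch of the cascading idea, assuming three illustrative passes (exact match, normalized match, then last name plus first initial):

```python
def normalize(name):
    """Lowercase, strip periods, and collapse whitespace for fuzzy comparison."""
    return " ".join(name.lower().replace(".", "").split())

def match_name(roster_name, people):
    """Try matching strategies in order of strictness; return the match or None."""
    # Pass 1: exact string match
    for person in people:
        if person == roster_name:
            return person
    # Pass 2: normalized match ("John A. Smith" vs "john a smith")
    target = normalize(roster_name)
    for person in people:
        if normalize(person) == target:
            return person
    # Pass 3: same last name and same first initial ("J. Smith")
    t_parts = target.split()
    for person in people:
        p_parts = normalize(person).split()
        if (p_parts and t_parts
                and p_parts[-1] == t_parts[-1]
                and p_parts[0][:1] == t_parts[0][:1]):
            return person
    return None
```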

Step 6: Manual Review

Remaining unmatched names are exported to a CSV for human review in a spreadsheet. This allows fast bulk corrections for nicknames and abbreviations without building a tedious custom script. Human review is also needed for the matching pipeline: it is hard to catch everything automatically, so manual review catches the small amount that the AI extraction misses.

Step 7: Apply Corrections

Manual corrections from the CSV are validated and merged back into the database with the previously approved data. The final output is then ready for use.
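The merge step might look roughly like this, assuming the review CSV has `record_id`, `field`, and `value` columns (those column names are hypothetical):

```python
import csv
import io

def apply_corrections(records, csv_text):
    """Merge reviewer corrections from the CSV back into the records dict.
    Rows with unknown record IDs or fields are skipped (left for re-review)."""
    applied = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        rec = records.get(row["record_id"])
        if rec is not None and row["field"] in rec:
            rec[row["field"]] = row["value"]
            applied += 1
    return applied
```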

Example of Student in JSON Final Output:

student_output
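The exact output schema isn't reproduced here; a hypothetical record illustrating the kinds of fields described above (every value below is made up):

```json
{
  "id": "cls1980-0427",
  "full_name": "John Allen Smith Jr.",
  "first": "John",
  "middle": "Allen",
  "last": "Smith",
  "suffix": "Jr.",
  "year": 1980,
  "section": "seniors",
  "page": 34,
  "activities": ["Varsity Soccer", "Chess Club"],
  "photo": "photos/cls1980-0427.jpg"
}
```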

Struggles

One of the main challenges I faced was validating and correcting the YAML configuration files that control the entire extraction workflow. The initial YAML files were partially generated using AI, but when I manually reviewed the 1975 yearbook, I found that many page numbers, name counts, and even names were incorrect. In several sections, the AI had generated names that did not exist in the yearbook at all, while sports sections were only occasionally accurate. To fix this, I went page by page through the PDF and compared it against the YAML file. Instead of tracking individual names, I ensured that page ranges and expected person counts were correct, since those are critical for validation, whereas the specific names were less important.

Photo Extraction Feature

In addition to extracting text data from yearbooks, I developed a photo extraction feature that automatically extracts individual portrait photos from yearbook pages and links them to the corresponding person records. This allows the digital database to include not just names and information, but also the actual photos of students from each yearbook.

The photo extraction system uses DPT-2 (Document Parsing Transformer), an AI model that can analyze document pages and identify distinct visual elements like figures, text blocks, and images. The system sends each yearbook page to the DPT-2 API, which analyzes the page and returns “chunks” representing different elements it detects. Each chunk includes a bounding box with normalized coordinates (0-1) indicating where on the page that element is located.

The system then filters the chunks to find only figure/image elements, which represent the portrait photos on the page. Simultaneously, it uses the DPT-2 extract function with a custom schema to pull out the student names that appear as labels under each portrait. The schema specifically instructs the AI to only extract names that label actual photos, not names from senior quotes or attributions.

The detected photo bounding boxes are matched to the extracted names using spatial positioning. Both photos and names are sorted by their position on the page (top-to-bottom, left-to-right), and then paired together in order. Using PyMuPDF, the system crops each individual photo from the PDF page using the bounding box coordinates, adds a small padding, and saves it as a JPEG file with a unique ID that links it to the person record.
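The spatial pairing step can be sketched as follows (the function name and row tolerance are illustrative, not the actual implementation; the PyMuPDF crop at the end is shown only as a commented outline):

```python
def pair_photos_to_names(boxes, names, row_tolerance=0.05):
    """Pair photo bounding boxes with names by reading order.
    Boxes are (x0, y0, x1, y1) in normalized 0-1 page coordinates and
    are sorted top-to-bottom, left-to-right; names are assumed to already
    be in reading order, so the two lists are zipped index-by-index."""
    # Bucket boxes into coarse rows so small vertical jitter between
    # photos in the same row doesn't scramble the left-to-right order.
    ordered = sorted(boxes, key=lambda b: (round(b[1] / row_tolerance), b[0]))
    return list(zip(names, ordered))

# Cropping each paired box with PyMuPDF (fitz) would then look roughly like:
#   page = fitz.open(pdf_path)[page_index]
#   rect = fitz.Rect(x0 * page.rect.width, y0 * page.rect.height,
#                    x1 * page.rect.width, y1 * page.rect.height)
#   page.get_pixmap(clip=rect).save(f"photos/{person_id}.jpg")
```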

Daily Journal

12/10: Today, I went through the final 1975 yearbook PDF to compare the page numbers and name counts with the 1975 YAML file. I noticed that the names, name counts, and page numbers were incorrect the majority of the time (this was the AI's first attempt). While it sometimes got the page numbers right (more often for sports groups) and some names correct, many of the names were made up and not in the yearbook at all (this happened a lot for the class pictures and names).

12/11: I continued working on the 1975 YAML and finished fixing it this class. Instead of tracking names, the main focus was getting the page number and count of names right.

12/12: Today, I started checking the 1977 YAML file. For this one, the librarians also had a name count with page numbers. After checking, the majority of the name counts were correct; the only issue was that, for each page, the librarian had counted duplicate names on the same page.

12/15: Today, I modified the 1979 yearbook so the PDF page numbers align with the actual page numbers of the yearbook. Additionally, I checked the librarian's count in the 1979 YAML file.

1/5: Updated my GitHub page for the capstone project to reflect the current state of the project and prepare for the new semester.

1/6: Worked on status report including design specs, bill of materials, and functional page documentation for the capstone review.

1/7: Created a task analysis breaking down what still needs to be done for the project. Identified remaining yearbooks to process (1980, 1982, 1989) and new features to implement like photo extraction.

1/8: Met with Mr. Dubick to get approval on the project plan and next steps for the semester.

1/9: Created a Gantt chart to visualize the project timeline and milestones for completing the remaining yearbook extractions.

1/12: Learned about PCB milling with Gerber files as part of the engineering curriculum.

1/14: Caught Jacob up to speed on the project status, explaining the extraction pipeline and what work remained to be done on the 1980s yearbooks.

1/15: Created the YAML configuration file for the 1980 yearbook. This involved setting up all the page ranges for different sections (seniors, roster grades 1-11, sports, activities, teachers, administrators) and expected name counts for validation.

1/20: Began the extraction process for the 1980 yearbook. Ran the orchestrator script to kick off parallel extraction of all sections defined in the YAML file.

1/21: Added middle school sports and activities extraction scripts to handle sections that weren’t covered by the existing HS/LS extractors. Also updated the teacher and trustee extractors to handle the 1980 format variations.

1/22: Spent the session testing and validating the extraction scripts. Ran test extractions on sample pages to verify the output format and catch any edge cases before running the full extraction.

1/23: Completed the 1980 extraction with the new middle school scripts. All sections extracted successfully and the outputs were merged into the unified person records database.

1/26: Reviewed and documented the extraction outputs from 1980. Checked the record counts against expected values and noted any sections that needed manual review.

1/28: Added the 1980 yearbook name splitting output. Ran the universal name splitter to parse all full names into first/middle/last/suffix components for the 971 person records.

1/29: Completed the 1980 matching pipeline. The system matched names from sports and activities sections to person records, identifying who participated in which extracurriculars.

2/2: Reviewed the matching pipeline results for 1980. Analyzed the unmatched names to understand why they didn’t match and what manual corrections would be needed.

2/3: Manual review of name splitting corrections. Went through flagged records where the name splitter had low confidence and fixed parsing errors (like suffixes being parsed as middle names).

2/4: Applied 66 manual corrections to the 1980 yearbook data. These corrections fixed matching errors, name parsing issues, and a few missing records that were caught during review.

2/5: Finished the 1980 yearbook extraction - all 971 records complete with sports/activities matched and corrections applied. Began setting up the 1982 yearbook extraction by creating the YAML config.

2/6: Added 1982 yearbook extraction results. Ran the full orchestrator pipeline and got initial extraction of 1,189 person records from the 1982 yearbook.

2/10: Added 1982 name splitting and matching pipeline results. The matching pipeline identified 77 unmatched names that would need manual review.

2/11: Created the DPT-1 to DPT-2 migration plan. The Landing AI API was updating to a new version (DPT-2) with different bounding box formats, so I documented the changes needed to update our extraction scripts. Completed the 1982 extraction pipeline.

2/18: Built the automated page classifier (CP1) - a new script that automatically analyzes yearbook PDF pages and classifies them by content type. Also improved the matching pipeline with 8 new matching strategies, achieving a 40% reduction in unmatched names compared to the original approach.

2/19: Started photo extraction development - a new feature to automatically extract portrait photos from yearbook pages. Ran the first fully automated pipeline on CLS1989 (using CP1 page classifier through matching), demonstrating the new automation. Completed 1982 matching updates.

2/20: Tested and tried to implement the photo extractor for DPT-2 bounding box API for grades 1-11 roster pages. The challenge was detecting individual photo boundaries in the grid layouts where photos are arranged in tight rows.

2/23: Fixed the photo extractor for the DPT-2 API. Updated the bounding box parsing to work with the new API response format. Continued work on the fully automated pipeline for CLS1989 and completed 1982 matching updates.

2/24: Tried to have the photo extractor work with bounding boxes and aspect ratio constraints for width-to-length borders. The goal was to filter out non-portrait detections by enforcing portrait-like aspect ratios.

2/26: Tried to implement fixes with white bounding borders along with DPT-2. The approach was to detect the white borders between photos in roster grids to help separate individual portraits.

2/27: Switched focus to trying for senior photos instead of roster grids. Senior portrait pages have larger, more separated photos that are easier to detect reliably.

3/3: Implemented fixes for senior photos including white bounding detection, auto-calibration for different page layouts, and improved photo bounding boxes. Successfully extracted 137 photos from the 1982 senior section.

3/4: Found the best place to integrate the photo extractor into the overall pipeline. Decided to keep it as a separate optional step that runs after the main extraction since not all yearbooks need photo extraction.

3/5: Created the presentation slides for the capstone evaluation, summarizing the project goals, workflow, accomplishments, and challenges overcome.

3/6: Presentation and evaluations day. Presented the AI Yearbook Digitization project to evaluators, demonstrating the extraction pipeline and showing the results from processing 10 yearbooks with 7,750+ person records.
