This program is part of an exploratory project to evaluate the quality of LLM-generated feedback in assisting with assignment grading and enhancing student learning. This program processes either the code sections, text sections, or images of a student's submission to programming assignments, based on the provided arguments. It generates output into a markdown file or standard output.
The large language models used and implementation logic vary depending on whether the selected scope is 'image', 'code' or 'text'.
For the code scope, the program takes three files:
- An assignment's solution file
- A student's submission file
- A test file
For the text scope, the program takes two files:
- An assignment's solution file
- A student's submission file
For the image scope, the program takes up to two files, depending on the prompt used:
- A student's submission file
- (Optional) An assignment's solution file
- Handles image, text and code scopes.
- Reads pre-defined prompts specified in JSON files.
- Uses an argument parser for structured command-line input.
- Supports various Large Language Models to evaluate student assignment submissions.
- Saves response output in Markdown format with a predefined template or prints to stdout.
| Argument | Description | Required | 
|---|---|---|
| --submission_type | Type of submission (from arg_options.FileType) | ❌ | 
| --prompt | Pre-defined prompt name or file path to custom prompt file | ❌ ** | 
| --prompt_text | String prompt | ❌ ** | 
| --scope | Processing scope (image, code, or text) | ✅ | 
| --submission | Submission file path | ✅ | 
| --question | Specific question to evaluate | ❌ | 
| --model | Model type (from arg_options.Models) | ✅ | 
| --output | File path for where to record the output | ❌ | 
| --solution | File path for the solution file | ❌ | 
| --test_output | File path for the file containing the results from tests | ❌ | 
| --submission_image | File path for the submission image file | ❌ | 
| --solution_image | File path for the solution image file | ❌ | 
| --system_prompt | Pre-defined system prompt name or file path to custom system prompt | ❌ | 
| --llama_mode | How to invoke deepSeek-v3 (choices in arg_options.LlamaMode) | ❌ | 
| --output_template | Output template file (from arg_options.OutputTemplate) | ❌ | 
| --json_schema | File path to json file for schema for structured output | ❌ | 
| --marking_instructions | File path to marking instructions/rubric | ❌ | 
| --model_options | Comma-separated key-value pairs of model options and their values | ❌ | 
| ** One of either --prompt or --prompt_text must be selected. If both are provided, --prompt_text will be appended to the contents of the file specified by --prompt. | 
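
To illustrate the `--model_options` format from the table above, the sketch below turns a comma-separated list of key=value pairs into a dictionary. The helper names `parse_model_options` and `_coerce`, and the conversion of numeric-looking values to numbers, are assumptions made for illustration; the package may handle this argument differently.

```python
def parse_model_options(raw: str) -> dict:
    """Parse 'key=value,key=value' into a dict (hypothetical helper)."""
    options = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty entries such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected key=value, got: {pair!r}")
        options[key.strip()] = _coerce(value.strip())
    return options


def _coerce(value: str):
    """Assumption: numeric-looking values become int or float; others stay strings."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


print(parse_model_options("max_tokens=1200,temperature=0.4,top_p=0.92"))
# prints {'max_tokens': 1200, 'temperature': 0.4, 'top_p': 0.92}
```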
The program supports three scopes: code, text, and image. Depending on which is selected, the program supports different models and prompts tailored to that scope.
If the "code" scope is selected, the program will identify student errors in the code sections of the assignment by comparing them to the solution code. With --scope code, the --question option can also be specified to analyze the code for a particular question rather than the entire file. Currently, a question number can be specified only if the file type is Jupyter notebook. To use the --question option, the question code in both the solution and submission files must be delimited by '## Task {#}'. See the File Formatting Assumptions section.
If the "text" scope is selected, the program will identify student errors in the written responses of the assignment, comparing them to the solution's rubric for written responses. If the 'text' scope is chosen, then 'pdf' must be chosen for the submission type.
If the "image" scope is selected, the program will identify issues in submission images, optionally comparing them to reference solutions. Question numbers can be specified by adding the tag markus_question_name: <question name> to the metadata for the code cell that generates the submission image. The previous cell's markdown content will be used as the question's context.
The program automatically detects submission type based on file extensions in the assignment directory:
- Files ending with _submission.ipynb → Jupyter notebook
- Files ending with _submission.py → Python file
- Files ending with _submission.pdf → PDF document
The user can also explicitly specify the submission type using the --submission_type argument if auto-detection is not suitable.
Currently, jupyter notebook, pdf, and python assignments are supported.
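
The suffix-based detection described above can be sketched as follows. This is a minimal illustration only: the function name `detect_submission_type` and the type labels `"jupyter"`, `"python"`, and `"pdf"` are assumptions, and the real values live in arg_options.FileType.

```python
from pathlib import Path
from typing import Optional

# Suffixes come from the list above; the type labels are illustrative assumptions.
SUFFIX_TO_TYPE = {
    "_submission.ipynb": "jupyter",
    "_submission.py": "python",
    "_submission.pdf": "pdf",
}


def detect_submission_type(path: str) -> Optional[str]:
    """Return a type label based on the file name suffix, or None if nothing matches."""
    name = Path(path).name
    for suffix, file_type in SUFFIX_TO_TYPE.items():
        if name.endswith(suffix):
            return file_type
    return None  # caller would fall back to an explicit --submission_type


print(detect_submission_type("test_submissions/cnn_example/cnn_submission.py"))
# prints python
```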
The --prompt argument accepts either pre-defined prompt names or custom file paths:
To use pre-defined prompts, specify the prompt name (without extension). Pre-defined prompts are stored as markdown (.md) files in the ai_feedback/data/prompts/user/ directory.
To use custom prompt files, specify the file path to your custom prompt. The file should be a markdown (.md) file.
Prompt files can contain template placeholders with the following structure:
Consider this question:
{context}
{submission_image}
Do the graphs in the attached image solve the problem? Do not include an example solution.

Prompt files are now stored as markdown (.md) files in the ai_feedback/data/prompts/user/ directory. Each prompt can contain template placeholders that will be automatically replaced with relevant content.
Prompt Naming Conventions:
- Prompts to be used when --scope code is selected are prefixed with code_{}.md
- Prompts to be used when --scope image is selected are prefixed with image_{}.md
- Prompts to be used when --scope text is selected are prefixed with text_{}.md
Scope validation (prefix matching) only applies to pre-defined prompts. Custom prompt files can be used with any scope.
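
A minimal sketch of the prefix check described above, assuming a prompt counts as custom when it looks like a file path (contains a path separator or ends in .md). The function name `check_prompt_scope` and that detection rule are hypothetical, not taken from the package.

```python
VALID_SCOPES = ("code", "image", "text")


def check_prompt_scope(prompt: str, scope: str) -> None:
    """Raise ValueError if a pre-defined prompt name does not match the scope."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"Unknown scope: {scope!r}")
    # Assumption: custom prompts are given as file paths and skip validation.
    is_custom = "/" in prompt or prompt.endswith(".md")
    if is_custom:
        return
    if not prompt.startswith(f"{scope}_"):
        raise ValueError(
            f"Prompt {prompt!r} does not start with '{scope}_' "
            f"and cannot be used with --scope {scope}"
        )


check_prompt_scope("code_table", "code")            # pre-defined, prefix matches
check_prompt_scope("my_prompts/custom.md", "image")  # custom path, not validated
```

Under these assumptions, calling `check_prompt_scope("image_analyze", "code")` raises a `ValueError`.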
All prompts are treated as templates that can contain special placeholder blocks. The following template placeholders are automatically replaced:
- {context} - Question context
- {file_references} - List of files being analyzed with descriptions
- {file_contents} - Full contents of files with line numbers
- {submission_image} - Student submission image
- {solution_image} - Reference solution image
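
As a rough illustration of placeholder substitution, the sketch below replaces each known placeholder in a single pass and leaves any other braces untouched, so prompts containing code or JSON are not disturbed. The function name `render_prompt` and the sample values are assumptions; the package's own substitution logic may differ.

```python
import re

# Matches {name} where name is a simple identifier, e.g. {context}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_prompt(template: str, values: dict) -> str:
    """Replace placeholders that have a value; keep unknown ones as written."""
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


template = "Consider this question:\n{context}\n\nFiles:\n{file_references}"
print(render_prompt(template, {
    "context": "Plot the monthly rainfall totals.",
    "file_references": "- student_submission.ipynb (student submission)",
}))
```

A single regex pass is used here instead of `str.format` so that literal braces in a prompt do not raise errors, and so that text inserted for one placeholder is never rescanned for another.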
| Prompt Name | Description | 
|---|---|
| code_explanation.md | Outputs paragraph explanation of errors. | 
| code_hint.md | Outputs short hints on what errors are. | 
| code_lines.md | Outputs only code lines where errors are caused. | 
| code_table.md | Outputs a table which shows the question requirement, the student’s attempt, and potential issue. | 
| code_template.md | Outputs a template format specified to include error type, description, solution. | 
| code_annotation.md | Outputs a json object of a list of annotation objects to display student errors on MarkUs. This is intended for markus integration usage. | 
| Prompt Name | Description | 
|---|---|
| image_analyze.md | Outputs whether the submission image answers the question provided by the context. | 
| image_analyze_annotations.md | Outputs whether the submission image answers the question provided by the context as a list of JSON objects, each with a description of the issue and a location on the image. Intended for MarkUs integration usage. | 
| image_compare.md | Outputs table comparing style elements between submission and solution graphs. | 
| image_style.md | Outputs table checking the style elements in a submission graph. | 
| image_style_annotations.md | Outputs evaluations of style elements in a submission graph as a list of JSON objects, each with a description of the issue and a location on the image. Intended for MarkUs integration usage. | 
| Prompt Name | Description | 
|---|---|
| text_pdf_analyze.md | Outputs whether the submission written response matches all the criteria specified in the solution. | 
Additionally, the user can pass in a string through the --prompt_text argument. This string is appended to the prompt if --prompt is used, or used as the only prompt if --prompt is not used.
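
The combination rule for --prompt and --prompt_text can be sketched like this. The function name `build_user_prompt` and the newline used as a separator are assumptions for illustration only.

```python
from typing import Optional


def build_user_prompt(prompt_file_text: Optional[str], prompt_text: Optional[str]) -> str:
    """Combine the prompt file contents and the --prompt_text string."""
    if prompt_file_text is None and prompt_text is None:
        raise ValueError("Provide --prompt, --prompt_text, or both.")
    # The file contents come first; --prompt_text is appended when both are given.
    parts = [part for part in (prompt_file_text, prompt_text) if part is not None]
    return "\n".join(parts)  # assumption: newline separator


print(build_user_prompt("Identify the errors in the code.", "Focus on readability."))
```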
The --system_prompt argument accepts either pre-defined system prompt names or custom file paths:
To use pre-defined system prompts, specify the system prompt name (without extension). Pre-defined system prompts are stored as markdown (.md) files in the ai_feedback/data/prompts/system/ directory.
To use custom system prompt files, specify the file path to your custom system prompt. The file should be a markdown (.md) file.
System prompts define the AI model's behavior, tone, and approach to providing feedback. They are used to set the context and personality of the AI assistant.
The --marking_instructions argument accepts a file path to a text file containing rubric or marking instructions. If the prompt template contains a {marking_instructions} placeholder, the contents of the file will be inserted at that location in the prompt.
The models used can be seen under the ai_feedback/models folder.
- Model Name: gpt-4-turbo
- System Prompt: Behaviour of model is set with INSTRUCTIONS prompt from helpers/constants.py.
- Features:
- Assistant: Uses the OpenAI Assistant Beta Feature, allowing customized model for specific tasks.
- Vector Store: The model creates and manages a vector store for data retrieval.
- Tools Used: Supports file_search for retrieving information from uploaded files.
- Cleanup: Uploaded files and models are deleted after processing, in order to manage API resources.
 
- OpenAI Assistants Documentation
- Uses the same model as above but doesn't use the vector store functionality. Uploads files as part of the prompt.
Note: If you wish to use OpenAI models, you must specify your API key in an .env file. Create a .env file in your project directory and add your API key:
OPENAI_API_KEY=your_api_key_here
- Model Name: claude-3.7-sonnet
- System Prompt: Behaviour of model is set with INSTRUCTIONS prompt from helpers/constants.py.
- Claude Documentation
Note: If you wish to use the Claude model, you must specify your API key in an .env file. Create a .env file in your project directory and add your API key:
CLAUDE_API_KEY=your_api_key_here
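
For illustration, a small guard like the one below can fail early with a clear message when a key is missing. It assumes the values from the .env file have already been loaded into the process environment (for example by a dotenv loader); the helper name `require_api_key` is hypothetical.

```python
import os


def require_api_key(name: str) -> str:
    """Return the named API key from the environment or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return key
```

For example, `require_api_key("OPENAI_API_KEY")` or `require_api_key("CLAUDE_API_KEY")` could be called before a request is sent.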
Various models were also tested and run locally on the Teach CS Bigmouth server by using Ollama. Listed below are the models that were used to test out the project:
Models:
- deepSeek-R1:70B Documentation
- codellama:latest Documentation
- llama3.2-vision:90b Documentation
  - This model supports at most one image attachment.
- llava:34b Documentation
- When --output filepath is given, the script will:
  - Load the template for the output based on --output_template (options defined in ai_feedback/helpers/arg_options.OutputTemplate)
  - Format it with the provided arguments and processing results.
  - Save it under filepath
- When the --output argument is not given, the prompt used and generated response will be sent to stdout in the format selected by --output_template.
- When the --output_template argument is not given, it defaults to response_only, which contains only the response from the model.
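
The output behaviour above might be sketched as follows. The template strings here are invented placeholders (the real templates are defined by the package), and `emit_output` is a hypothetical name.

```python
import sys
from pathlib import Path
from typing import Optional

# Hypothetical template bodies; only the names response_only and verbose
# appear in this README.
TEMPLATES = {
    "response_only": "{response}",
    "verbose": "## Prompt\n{prompt}\n\n## Response\n{response}",
}


def emit_output(prompt: str, response: str, output: Optional[str] = None,
                output_template: str = "response_only") -> str:
    """Format the result and write it to a file, or to stdout if no path is given."""
    text = TEMPLATES[output_template].format(prompt=prompt, response=response)
    if output is None:
        sys.stdout.write(text + "\n")
    else:
        Path(output).write_text(text, encoding="utf-8")
    return text


emit_output("Find the bug.", "Line 3 uses = instead of ==.")
# writes only the response to stdout, since response_only is the default
```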
Use --question "<section name>" to extract a specific section from a submission. Behavior depends on file type.
- PDF: looks up section names in the PDF’s Table of Contents (TOC).
- Text / Markdown / Code (.txt, .md, .ipynb, .qmd): looks for Markdown-style headings (#, ##, ###, …). For code, write the heading in a comment line (e.g., ### Question 1 in Python). The extractor returns all content that belongs to that heading (up until the next heading at the same or higher level).
Matching is case-insensitive and normalizes smart quotes, dashes, and extra whitespace.
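
A minimal sketch of heading-based extraction with the normalization described above, for plain text input only (PDF TOC lookup is not covered). The names `normalize` and `extract_section` are assumptions, and this simplified version treats any line starting with one to six `#` characters followed by a space as a heading, so ordinary `# comment` lines in code would also count.

```python
import re
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

# Map smart quotes and dashes to their plain ASCII equivalents.
PLAIN = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalize(title: str) -> str:
    """Case-fold, replace smart punctuation, and collapse whitespace."""
    return " ".join(title.translate(PLAIN).split()).casefold()


def extract_section(text: str, name: str) -> Optional[str]:
    """Return the heading line and its content, or None if no heading matches."""
    target = normalize(name)
    collected = None
    level = 0
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            depth = len(match.group(1))
            if collected is not None and depth <= level:
                break  # a heading at the same or higher level ends the section
            if collected is None and normalize(match.group(2)) == target:
                collected, level = [line], depth
                continue
        if collected is not None:
            collected.append(line)
    return "\n".join(collected) if collected is not None else None


doc = "# Intro\nhello\n## Question 1\nanswer\n### Part a\nmore\n## Question 2\nother"
print(extract_section(doc, "question   1"))
```

In this example the result includes the nested `### Part a` subsection and stops before `## Question 2`.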
- Any subdirectory of /test_submissions can be run locally. More examples can be added to this directory in a similar fashion.
To test the program using the GGR274 files, we assume that the test assignment files follow a specific directory structure. Currently, this program has been tested using Homework 5 of the GGR274 class at the University of Toronto.
Within the test_submissions/ggr274_homework5 directory, mock submissions are contained in separate subdirectories, test_submissions/ggr274_homework5/test#. The following naming convention is used for the files:
- Homework_5_solution.ipynb – Instructor-provided solution file
- student_submission.ipynb – Student's submission file
- test#_error_output.txt – Error trace file for the corresponding test case
Each test folder contains variations of student_submission.ipynb with different errors.
To ensure proper extraction and evaluation of student responses, the following format is assumed for Homework_5_solution.ipynb and student_submission.ipynb:
- Each task must be clearly delimited using markdown headers in the format '## Task {#}'. This allows the program to isolate specific questions when using the --question argument, ensuring the model only evaluates errors related to the specified question.
- Each file must start with '## Introduction'. This section serves as the general assignment instructions and is not included in error evaluation.
Mock student submissions are stored in ggr274_homework5/image_test#. The following naming convention is used for the files:
- solution.ipynb – Instructor-provided solution file
- student_submission.ipynb – Student's submission file
To grade a specific question using the --question argument, add the tag markus_question_name: <question name> to the metadata for the code cell that generates an image to be graded. The previous cell's markdown content will be used as the question's context.
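
The metadata lookup described above could look roughly like the sketch below, which reads the notebook as plain JSON. It assumes the tag is stored as a top-level key in the cell's metadata object; the function names are hypothetical and the package's own reader may differ.

```python
import json
from typing import Optional


def find_question_context(notebook_json: str, question_name: str) -> Optional[str]:
    """Return the markdown text of the cell preceding the tagged code cell.

    Returns None if no code cell carries the requested markus_question_name,
    and an empty string if the tagged cell has no markdown cell before it.
    """
    cells = json.loads(notebook_json)["cells"]
    for index, cell in enumerate(cells):
        tagged = cell.get("metadata", {}).get("markus_question_name")
        if cell.get("cell_type") == "code" and tagged == question_name:
            if index > 0 and cells[index - 1].get("cell_type") == "markdown":
                return _source_text(cells[index - 1])
            return ""
    return None


def _source_text(cell: dict) -> str:
    """Notebook cell sources may be a single string or a list of line strings."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


notebook = json.dumps({"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["Plot rainfall ", "by month."]},
    {"cell_type": "code", "metadata": {"markus_question_name": "Question 5b"},
     "source": "plt.plot(x, y)"},
]})
print(find_question_context(notebook, "Question 5b"))
# prints Plot rainfall by month.
```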
In order to run this package locally:
Ensure you have the environment variables set up (see Models section above).
When you are in a terminal in the repo, run:
pip install -e .
Run the program:
python -m ai_feedback \
  --submission_type <file_type> \
  --prompt <prompt_name> \
  --scope <image|code|text> \
  --submission <submission_file_path> \
  --solution <solution_file_path> \
  --test_output <test_output_path> \
  --submission_image <image_file_path> \
  --solution_image <image_file_path> \
  --question <question_number> \
  --model <model_name> \
  --output <file_path_to> \
  --output_template <file_name> \
  --system_prompt <prompt_file_path> \
  --llama_mode <server|cli>
- See the Arguments section for the different command line argument options, or run this command to see help messages and available choices:
python -m ai_feedback -h

python -m ai_feedback --prompt code_lines --scope code --submission test_submissions/cnn_example/cnn_submission --solution test_submissions/cnn_example/cnn_solution.py --model openai

python -m ai_feedback --prompt_text "Evaluate the student's code readability." --scope code --submission test_submissions/cnn_example/cnn_submission.py --model openai

python -m ai_feedback --prompt text_pdf_analyze --scope text --submission test_submissions/pdf_example/student_pdf_submission.pdf --model openai

python -m ai_feedback --prompt code_table \
  --scope code --submission test_submissions/ggr274_homework5/test1/student_submission.ipynb --question 1 --model deepSeek-R1:70B

python -m ai_feedback --prompt image_analyze --scope image --solution ./test_submissions/ggr274_homework5/image_test2/student_submission.ipynb --submission_image test_submissions/ggr274_homework5/image_test2/student_submission.png --question "Question 5b" --model llama3.2-vision:90b

python -m ai_feedback --prompt code_lines --scope code --solution ./test_submissions/bfs_example/bfs_solution.py --submission test_submissions/bfs_example/bfs_submission.py --model remote --output test_file --output_template verbose

python3 -m ai_feedback --prompt code_table --scope code \
        --submission test_submissions/ggr274_homework5/test1/student_submission.ipynb \
        --solution test_submissions/ggr274_homework5/test1/Homework_5_solution.ipynb \
        --model deepSeek-v3 --llama_mode server

python3 -m ai_feedback --prompt code_table --scope code \
        --submission test_submissions/ggr274_homework5/test1/student_submission.ipynb \
        --solution test_submissions/ggr274_homework5/test1/Homework_5_solution.ipynb \
        --model deepSeek-v3 --llama_mode cli

python -m ai_feedback --prompt code_annotations --scope code --submission test_submissions/cnn_example/cnn_submission --solution test_submissions/cnn_example/cnn_solution.py --model openai --json_schema ai_feedback/data/schema/code_annotation_schema.json

python -m ai_feedback --prompt ai_feedback/data/prompts/user/code_overall.md --scope code --submission test_submissions/csc108/correct_submission/correct_submission.py --solution test_submissions/csc108/solution.py --model codellama:latest

python3 -m ai_feedback --prompt code_table --scope code --submission ../ai-autograding-feedback-eval/test_submissions/108/hard_coding_submission.py --model openai-vector --submission_type python --model_options "max_tokens=1200,temperature=0.4,top_p=0.92"

In order to run this project on Bigmouth:
- SSH into teach.cs
- SSH into bigmouth (access permission required)
ssh bigmouth
- Ensure you're in the project directory
- Start Ollama
ollama start
- Ensure models specified in repo are downloaded
ollama list
- Run the script according to the Package Usage section above.
This Python package can be used as a dependency in the MarkUs Autotester to display LLM-generated feedback as overall comments and test outputs, and as annotations on the submission file. After following the instructions below to set up the Autotester, these comments and annotations should appear automatically in the MarkUs UI once 'Run Tests' is pressed.
- /markus_test_scripts contains scripts which can be uploaded to the autotester in order to generate LLM Feedback
- Currently, only openAI and Claude models are supported.
- Within these llm script files, the models and prompts used can be changed by editing the command line arguments, through the run_llm() function.
Files:
- python_tester_llm_code.py: Runs LLM on any code assignment (solution file, submission file) uploaded to the autotester. First, creates general feedback and displays it as overall comments and test output (can use any prompt and model). Second, feeds the output of the first LLM response into the model again, asking it to create annotations for the student's mistakes. (Be sure to change the submission file import name.)
- llm_helpers.py: contains helper functions needed to run llm scripts.
- python_tester_llm_pdf.py: Runs LLM on any pdf assignment (solution file and submission file) uploaded to the autotester. Creates general feedback about whether the student's written responses match the instructor's feedback. Displayed in test outputs and overall comments.
- custom_tester_llm_code.sh: Runs LLM on assignments (solution file, submission file, test output file) uploaded to the custom autotester. Currently, supports jupyter notebook files uploaded. Can specify prompt and model used in the script. Displays in overall comments and in test outputs. Can optionally uncomment the annotations section to display annotations, however the annotations will display on the .txt version of the file uploaded by the student, not the .ipynb file.
- Ensure the student has submitted a submission file.
- Ensure the instructor has submitted a solution file, llm_helpers.py (located in /markus_test_scripts), and python_tester_llm_code.py (located in /markus_test_scripts). Instructor can also upload another pytest file which can be run as its own test group.
- Ensure the submission import statement in python_tester_llm_code.py matches the name of the student's submission file name.
- Create a Python Autotester Test Group to run the LLM File.
- In the Package Requirements section of the Test Group Settings for the LLM file, put:
git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback
Along with any other packages that the submission or solution file uses.
- Ensure the Timeout is set to 120 seconds or longer.
- Ensure the MarkUs Autotester Docker container has the API keys in an .env file and that they are specified in the docker compose file.
- Do the same as the code scope, but ensure that the student submission and instructor solution are .pdf files with the same naming assumption. Also, ensure that python_tester_llm_pdf.py is uploaded as the test script.
- In the Autotest settings of the assignment, click Add Tester and select the ai option.
- Fill in all required arguments for the AI tester.
- Upload any related files (e.g., JSON schema files, custom prompts, or configuration files).
- Ensure the MarkUs Autotester Docker container has the API keys defined in an .env file and that these variables are specified in the docker-compose.yml file.
- Ensure the Timeout is set to 120 seconds or longer.
- Look at the /test_submissions/cnn_example directory for the following files
- Instructor uploads: cnn_solution.py, cnn_test.py, llm_helpers.py, python_tester_llm_code.py files
- Separate test groups for cnn_test.py and python_tester_llm_code.py
- cnn_test.py Autotester package requirements: torch numpy
- python_tester_llm_code.py Autotester package requirements: git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback numpy torch
- Student uploads: cnn_submission.pdf
- Look at the /test_submissions/bfs_example directory for the following files
- Instructor uploads: bfs_solution.py, test_bfs.py, llm_helpers.py, python_tester_llm_code.py files
- Separate test groups for test_bfs.py and python_tester_llm_code.py
- python_tester_llm_code.py Autotester package requirements: git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback
- Student uploads: bfs_submission.pdf
- Look at the /test_submissions/pdf_example directory for the following files
- Instructor uploads: instructor_pdf_solution.pdf, llm_helpers.py, python_tester_llm_pdf.py files
- Autotester package requirements: git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback
- Student uploads: student_pdf_submission.pdf
- Ensure the student has submitted a submission file.
- Ensure the instructor has submitted a solution file and custom_tester_llm_code.sh (located in /markus_test_scripts). Instructor can also upload another script used to run its own test group. (See below for GGR274 Example.)
- In the Markus Autotesting terminal:
docker exec -it -u 0 markus-autotesting-server-1 /bin/bash
Then as the root user, install the package:
/home/docker/.autotesting/scripts/defaultvenv/bin/pip install git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback
Also pip install other packages that the submission or solution file uses.
- Create a Custom Autotester Test Group to run the LLM script file.
- Ensure the Timeout is set to 120 seconds or longer.
- Ensure the MarkUs Autotester Docker container has the API keys in an .env file and that they are specified in the docker compose file.
- Look at the /test_submissions/ggr274_hw5_custom_tester directory for the following files
- Instructor uploads: Homework_5_solution.ipynb, test_hw5.py, test_output.txt, custom_tester_llm_code.sh, run_hw5_test.sh
- Two separate test groups: one for run_hw5_test.sh, and one for custom_tester_llm_code.sh
- Student uploads: test1_submission.ipynb, test1_submission.txt
NOTE: if the LLM Test Group appears to be blank/does not turn green, try increasing the timeout.
- custom_tester_llm_code.sh: Runs LLM on any assignment (solution file, submission file, test output file) uploaded to the autotester. Can specify prompt and model used in the script. Displays in overall comments and in test outputs.
To install project dependencies, including development dependencies:
$ pip install -e .[dev]
To install pre-commit hooks:
$ pre-commit install
To run the test suite:
$ pytest