This tool systematically extracts content from OpenSCAD documentation sources to create high-quality training data for fine-tuning an LLM that specializes in 3D modeling with OpenSCAD.
OpenSCAD is a powerful programming-based 3D modeling tool that uses its own scripting language. This project aims to create a comprehensive dataset from OpenSCAD's online documentation, including:
- The OpenSCAD cheat sheet
- The OpenSCAD user manual (Wikibooks)
- Code examples with their contextual explanations
- Command syntax and usage guidelines
The extracted data is structured in a way that's suitable for LLM fine-tuning, enabling the model to:
- Understand OpenSCAD syntax and commands
- Explain OpenSCAD concepts
- Generate code examples for 3D modeling tasks
- Assist with OpenSCAD programming challenges
Features:
- Complete Documentation Coverage: Extracts content from all essential sections of the OpenSCAD documentation.
- Context-Aware Code Examples: Associates code examples with their explanatory context.
- Structured Data Output: Organizes the data in a structured JSON format suitable for training.
- Training Example Generation: Creates question-answer pairs for more effective fine-tuning.
Prerequisites:
- Node.js (v14 or higher)
- npm or yarn
- Clone this repository:
    git clone https://github.com/yourusername/openscad-documentation-scraper.git
    cd openscad-documentation-scraper
- Install dependencies:
    npm install
Run the scraper:
    node scraper.js
This will:
- Extract content from the OpenSCAD cheat sheet
- Extract content from all key sections of the OpenSCAD user manual
- Save the structured data to the output directory
- Generate training examples for LLM fine-tuning
The scraper generates the following output files:
- openscad_training_data.json: Complete dataset with all extracted content
- openscad_cheatsheet.json: Extracted content from the cheat sheet
- openscad_usermanual.json: Extracted content from the user manual
- openscad_training_examples.json: Generated question-answer pairs for training
The extracted data follows this structure:
{
"metadata": {
"description": "OpenSCAD Training Data for fine-tuning LLM",
"version": "1.0",
"date": "2025-03-16",
"source": "..."
},
"cheatSheet": {
"syntax": [
{ "name": "command", "description": "explanation", "links": [...] }
],
"primitives2D": [...],
"primitives3D": [...],
"transformations": [...],
// Other categories...
},
"userManual": {
"general": {
"title": "General",
"url": "page_url",
"introduction": "intro_text",
"content": {
"section_title": "section_content"
},
"codeExamples": [
{ "code": "code_sample", "context": "explanation" }
]
},
// Additional sections...
}
}
The generated training examples are structured as question-answer pairs:
[
{
"query": "What is the syntax for cube in OpenSCAD?",
"response": "The syntax for cube in OpenSCAD is: `cube(size = [x,y,z], center = true/false);`"
},
{
"query": "Explain Primitive Solids in OpenSCAD",
"response": "Primitive solids are the basic 3D shapes in OpenSCAD that..."
},
// More examples...
]
You can customize the scraper by modifying:
- pagesToScrape array: Add or remove pages to scrape
- processCheatSheetData function: Change how cheat sheet data is categorized
- createTrainingExamples function: Customize the generated training examples
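As an illustration of the last customization point, here is a minimal sketch of what a customized createTrainingExamples could look like. It assumes the function receives an object shaped like the data structure shown earlier (cheatSheet categories as arrays of entries, userManual pages with title and introduction). The question templates and the sample input are hypothetical, and the real function in scraper.js may have a different signature.

```javascript
// Hypothetical sketch; the actual createTrainingExamples in scraper.js may differ.
function createTrainingExamples(data) {
  const examples = [];

  // One question per cheat sheet entry, across every category.
  for (const entries of Object.values(data.cheatSheet || {})) {
    if (!Array.isArray(entries)) continue;
    for (const entry of entries) {
      if (!entry.name || !entry.description) continue; // skip incomplete entries
      examples.push({
        query: `What does ${entry.name} do in OpenSCAD?`,
        response: entry.description,
      });
    }
  }

  // One question per user manual page that has a title and an introduction.
  for (const page of Object.values(data.userManual || {})) {
    if (!page.title || !page.introduction) continue;
    examples.push({
      query: `Explain ${page.title} in OpenSCAD`,
      response: page.introduction,
    });
  }

  return examples;
}

// Placeholder input that mirrors the documented structure.
const sampleData = {
  cheatSheet: {
    primitives3D: [{ name: 'cube', description: 'explanation', links: [] }],
  },
  userManual: {
    general: { title: 'General', url: 'page_url', introduction: 'intro_text' },
  },
};

console.log(createTrainingExamples(sampleData));
```

Keeping the templates in one place like this makes it easy to add further question styles without touching the extraction code.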
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- OpenSCAD documentation contributors
- Wikibooks OpenSCAD User Manual authors
Contributions are welcome! Please feel free to submit a Pull Request.
This document explains how to use the verification script to ensure your scraped OpenSCAD documentation is complete.
The verification script compares your scraped data with the content from the OpenSCAD User Manual print version to identify any missing information. It:
- Extracts content from the print version of the OpenSCAD documentation
- Compares it with your scraped data
- Identifies missing sections, incomplete content, and missing code examples
- Generates supplementary content to fill in the gaps
- Creates enhanced training examples using the complete data
- Make sure you have Node.js installed
- Install the required dependencies:
    npm install puppeteer
  (fs and path are built into Node.js and do not need to be installed.)
- Place the verification script in the same directory as your scraped data
- Make sure your scraped data is saved as output/openscad_training_data.json
- Run the script:
    node verification-script.js
The script generates the following files in the verification_output directory:
- comparison_results.json - Detailed comparison of scraped vs. print version content
- verification_report.md - Human-readable report highlighting issues
- enhanced_training_data.json - Your scraped data with missing content filled in
- enhanced_usermanual.json - Just the enhanced user manual portion
- enhanced_training_examples.json - Training examples generated from the enhanced data
The script focuses on verifying these key sections:
- Matrix - Vector of vectors explanation
- Objects - Object data structure documentation
- Retrieving a value from an object - Object property access methods
- Iterating over object members - How to loop through object properties
- Getting input - OpenSCAD's input capabilities and limitations
- Vector operators - Operations on vectors
- concat - Concatenation function documentation
- len - Length function documentation
- Special variables - Documentation on $fa, $fs, $fn, etc.
Missing sections are sections found in the print version but completely absent from your scraped data. The script will:
- List all missing sections
- Show the content from the print version
- Add this content to the enhanced data
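The missing-section step can be pictured with the following minimal sketch. It assumes, purely for illustration, that both the scraped data and the print version are available as plain objects mapping section titles to text. The function name addMissingSections and the sample data are hypothetical, and the verification script itself may organize this differently.

```javascript
// Hypothetical sketch of the missing-section step; not the script's actual code.
function addMissingSections(scrapedSections, printSections) {
  // A section is "missing" when its title has no entry in the scraped data.
  const missing = Object.keys(printSections).filter(
    (title) => !Object.prototype.hasOwnProperty.call(scrapedSections, title)
  );

  // Copy the scraped data, then fill the gaps with the print version content.
  const enhanced = { ...scrapedSections };
  for (const title of missing) {
    enhanced[title] = printSections[title];
  }

  return { missing, enhanced };
}

// Placeholder section texts for illustration only.
const scrapedDemo = { len: 'scraped text about len' };
const printDemo = {
  len: 'print text about len',
  concat: 'print text about concat',
};

console.log(addMissingSections(scrapedDemo, printDemo));
```

Sections that already exist in the scraped data are left untouched by this step.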
Incomplete sections are sections that exist in your scraped data but are significantly shorter than in the print version (less than 70% of the print version length). The script will:
- Show both your scraped content and the print version content
- Replace your content with the print version in the enhanced data
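A minimal sketch of the 70% length check follows. As before, it assumes sections are held as plain objects mapping titles to text and that length is measured in characters. Both are assumptions for illustration, and replaceIncompleteSections is a hypothetical name rather than a function from the script.

```javascript
// Hypothetical sketch of the incomplete-section check; not the script's actual code.
const MIN_LENGTH_RATIO = 0.7; // threshold stated in the text above

function replaceIncompleteSections(scrapedSections, printSections) {
  const incomplete = [];
  const enhanced = { ...scrapedSections };

  for (const [title, scrapedText] of Object.entries(scrapedSections)) {
    const printText = printSections[title];
    if (typeof printText !== 'string') continue; // no print counterpart to compare with

    // Flag the section when the scraped text is under 70% of the print length.
    if (scrapedText.length < MIN_LENGTH_RATIO * printText.length) {
      incomplete.push(title);
      enhanced[title] = printText; // prefer the fuller print version
    }
  }

  return { incomplete, enhanced };
}

// Placeholder texts: "short" is 50% of the print length, "fine" is 90%.
const scrapedLengths = { short: 'x'.repeat(50), fine: 'x'.repeat(90) };
const printLengths = { short: 'y'.repeat(100), fine: 'y'.repeat(100) };

console.log(replaceIncompleteSections(scrapedLengths, printLengths).incomplete);
```

A pure length comparison is a coarse signal: it cannot tell a concise but complete section from a truncated one, which is why the report shows both versions side by side.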
Missing code examples are code examples found in the print version but absent from your scraped data. The script will:
- List all missing code examples
- Add them to the enhanced data
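One way to detect missing code examples is sketched below. It assumes code examples use the { code, context } shape from the data structure shown earlier, and that two examples count as the same when their code matches after collapsing whitespace. That matching rule is an assumption made for this sketch, and the function names are hypothetical.

```javascript
// Hypothetical sketch of the code-example comparison; not the script's actual code.

// Collapse runs of whitespace so formatting differences do not hide a match.
function normalizeCode(code) {
  return code.replace(/\s+/g, ' ').trim();
}

function findMissingCodeExamples(scrapedExamples, printExamples) {
  const seen = new Set(scrapedExamples.map((ex) => normalizeCode(ex.code)));
  // Keep only the print examples whose normalized code was never scraped.
  return printExamples.filter((ex) => !seen.has(normalizeCode(ex.code)));
}

// Placeholder examples for illustration only.
const scrapedExamplesDemo = [
  { code: 'cube([1, 2, 3]);', context: 'explanation' },
];
const printExamplesDemo = [
  { code: 'cube([1, 2, 3]);', context: 'explanation' },
  { code: 'sphere(r = 5);', context: 'explanation' },
];

console.log(findMissingCodeExamples(scrapedExamplesDemo, printExamplesDemo));
```

The returned examples can then be appended to the codeExamples array of the matching section in the enhanced data.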
After running the verification, use the enhanced data files for your LLM training:
- enhanced_training_data.json - Complete dataset with all missing content filled in
- enhanced_training_examples.json - Ready-to-use training examples for your LLM
This way, your LLM is trained on data that includes the sections and code examples the verification found missing from the original scrape.