Conversation

@adriandarian (Contributor) commented Oct 11, 2025

This PR introduces two major improvements to the deepwiki-open project:

  1. Intelligent Chunking for Large Repository Wiki Generation

    • Implements a more robust and scalable chunking strategy for processing large repositories when generating wikis.
    • The new chunking logic ensures that content is split in a way that preserves semantic coherence and maximizes the utility of downstream LLM-based summarization and structuring.
    • This reduces context loss and improves the readability and usefulness of generated wiki pages.
    • Includes tests and benchmarks for chunking performance and output quality on repositories of various sizes.
  2. Add XML Structure Prompt

    • Adds a new prompt template for generating wiki content in XML format.
    • The XML structure enables easier parsing, integration with external tools, and supports advanced formatting and metadata inclusion.
    • The prompt is designed to guide LLMs to output well-structured, valid XML that matches the needs of downstream consumers.

Details

  • Refactored chunking logic in the wiki generation modules to support adaptive chunk sizes based on file type, content density, and configurable limits.
  • Enhanced error handling and logging for edge cases encountered during chunking (e.g., extremely large files, non-text content).
  • Added a new prompt template in the prompt library, targeting XML output with customizable schema definitions.
  • Updated documentation to describe the new chunking strategy and usage of the XML prompt.
  • Includes unit and integration tests for both chunking and XML prompt features.
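The adaptive chunk sizing described above is not shown in this thread; as a rough illustration of the idea (all names, limits, and file-type weights here are hypothetical, not taken from the PR), per-file-type chunking might look like:

```python
# Hypothetical sketch of adaptive chunking by file type. The limit
# table and function names are illustrative, not the PR's actual code.
from pathlib import Path

# Assumed per-extension size limits (characters), not real config values.
CHUNK_LIMITS = {".md": 8000, ".py": 6000, ".json": 4000}
DEFAULT_LIMIT = 5000

def chunk_text(text: str, limit: int) -> list[str]:
    """Split text into pieces of roughly at most `limit` characters,
    breaking on blank lines so sections stay intact. A single block
    larger than the limit is kept whole in this simplified sketch."""
    chunks, current, size = [], [], 0
    for block in text.split("\n\n"):
        block_len = len(block) + 2
        if size + block_len > limit and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += block_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def chunk_file(path: str, content: str) -> list[str]:
    """Pick a chunk limit based on the file's extension."""
    limit = CHUNK_LIMITS.get(Path(path).suffix, DEFAULT_LIMIT)
    return chunk_text(content, limit)
```

The real implementation also weighs content density and configurable limits per the description above; this sketch only shows the file-type dimension.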

Motivation

  • Large repositories often exceed context limits of LLMs, resulting in incomplete or fragmented wikis. The new chunking logic ensures coherent summaries and complete coverage.
  • XML output structure is requested by downstream teams for automated parsing and integration with documentation tooling.

Impact

  • Improved wiki generation quality and consistency for large codebases.
  • Greater flexibility in output format (plain text or XML) to support diverse integration needs.

How to Test

  • Run the updated wiki generation script on a large repository and review chunked outputs.
  • Use the XML prompt template to generate wiki pages and validate XML structure using standard parsers.
  • Review added tests and documentation.
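For the XML validation step, Python's standard-library parser is enough. The element names below are invented for illustration, since the PR's actual schema is not shown in this thread:

```python
# Validate generated wiki XML with the standard library.
# <wiki_structure>, <pages>, and <page> are hypothetical element names.
import xml.etree.ElementTree as ET

sample = """<wiki_structure>
  <title>Example Project</title>
  <pages>
    <page id="p1"><title>Overview</title></page>
    <page id="p2"><title>Architecture</title></page>
  </pages>
</wiki_structure>"""

root = ET.fromstring(sample)  # raises ParseError if the XML is invalid
titles = [p.findtext("title") for p in root.iter("page")]
print(titles)  # ['Overview', 'Architecture']
```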

Closes Issues:

Reviewer Notes:
Please pay particular attention to chunking edge cases and XML schema compliance.

Summary of Changes

Hello @adriandarian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capability to generate wiki documentation for large repositories. It introduces an intelligent file chunking mechanism that allows the API to process vast codebases more efficiently by breaking them down into smaller, manageable parts. Concurrently, it integrates a new XML-formatted prompt to ensure structured and consistent wiki output. These changes collectively enable a more robust and scalable approach to repository analysis and wiki generation, particularly for extensive projects.

Highlights

  • Intelligent File Chunking System: A new system has been implemented in api/api.py to intelligently chunk large repositories. This includes functions to exclude directories and files based on patterns, collect all relevant files, group them by directory, and create manageable chunks for processing. This is crucial for handling very large codebases efficiently.
  • Enhanced Repository Structure API: The /local_repo/structure endpoint in api/api.py has been significantly updated. It now supports optional chunk_size and return_chunks parameters, allowing it to return either a single file tree for smaller repositories or a structured, chunked representation for larger ones, leveraging the new chunking system.
  • XML Wiki Structure Prompt: A dedicated system prompt (WIKI_STRUCTURE_SYSTEM_PROMPT) has been added to api/prompts.py. This prompt is specifically designed to guide the language model to generate wiki outlines in a strict, valid XML format, ensuring structured and machine-readable output for wiki generation tasks.
  • Wiki Structure Generation Integration: The handle_websocket_chat function in api/websocket_wiki.py now detects requests for wiki structure generation. When such a request is identified, it applies the new XML system prompt and handles the accumulation and cleaning of the model's full XML response, including stripping markdown code blocks, before sending it back.
  • Chunked Wiki Generation Workflow: New asynchronous functions (process_wiki_chunk, merge_wiki_structures, handle_chunked_wiki_generation) have been introduced in api/websocket_wiki.py. These functions establish an end-to-end workflow for generating wikis from large repositories by processing individual file chunks, merging the partial results, and providing real-time progress updates via WebSocket.
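The response-cleaning step mentioned in the highlights (stripping a markdown code fence from the model's accumulated XML) can be sketched as follows; this is an assumption about the shape of that cleanup, not the PR's actual code:

```python
import re

# Hypothetical helper mirroring the cleanup described above: remove a
# ```xml ... ``` wrapper that the model may emit around its XML output.
def strip_code_fences(text: str) -> str:
    text = text.strip()
    match = re.match(r"^```(?:xml)?\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text
```

Unfenced responses pass through unchanged, so the helper is safe to apply unconditionally before XML parsing.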
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces an intelligent chunking system for processing large repositories to generate wikis. It adds new logic for file collection, filtering, and chunking, and updates the /local_repo/structure endpoint to leverage this system. A new XML-based prompt for wiki structure generation is also included, with corresponding updates to the WebSocket handler to process these requests. My review focuses on improving the robustness of the chunking logic, increasing efficiency by removing redundant operations, and enhancing code quality by addressing debug artifacts, local imports, and duplicated code. While the chunking infrastructure is a solid start, the functions that process these chunks are currently placeholders and still need to be implemented.

- Refactor `collect_all_files` to return README content alongside file paths.
- Introduce `handle_response_stream` to streamline response processing for different providers.
- Update WebSocket handling to utilize the new response handling function, reducing code duplication.
- Improve logging for better traceability during file collection and response streaming.
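One possible shape for the suggested `handle_response_stream` helper (a hypothetical signature; the real function and its per-provider handling are not shown in this thread):

```python
# Hypothetical sketch: accumulate streamed tokens into one string while
# forwarding each token to a callback (e.g. a WebSocket send). Assumes
# the stream yields string tokens; provider-specific chunk shapes would
# need normalizing before this point.
async def handle_response_stream(stream, on_token) -> str:
    full = []
    async for token in stream:
        full.append(token)
        await on_token(token)
    return "".join(full)
```

Centralizing the loop like this is what lets the WebSocket handler drop its per-provider copies of the same accumulate-and-forward logic.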
@1119302165

good job!

@1119302165

Have you considered introducing AST-based chunking, like https://developers.llamaindex.ai/python/framework-api-reference/node_parsers/code/ ?

adriandarian and others added 3 commits October 13, 2025 23:52
- Added ASTChunker class for semantic chunking of code files.
- Integrated AST chunking with existing adalflow pipeline via ASTTextSplitter.
- Created configuration for AST chunking in embedder.ast.json.
- Updated data pipeline to support AST chunking based on configuration.
- Developed enable_ast.py script to toggle AST chunking on and off.
- Enhanced logging for chunking statistics and errors.
- Added support for various programming languages in AST chunking.
- Updated docker-compose to allow enabling AST chunking during build.
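For Python sources, the core idea behind AST-based chunking can be sketched with the standard `ast` module. This is a minimal illustration only; the PR's `ASTChunker` covers multiple languages and edge cases (for example, decorators, which this sketch leaves attached to the preceding chunk):

```python
# Minimal sketch of AST-based chunking for Python sources only.
import ast

def ast_chunks(source: str) -> list[str]:
    """Split a module at top-level definitions so each chunk is a
    semantically complete unit (function, class, or surrounding code)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, prev_end = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            start = node.lineno - 1  # note: decorators sit above this line
            if start > prev_end:
                chunks.append("\n".join(lines[prev_end:start]))
            chunks.append("\n".join(lines[start:node.end_lineno]))
            prev_end = node.end_lineno
    if prev_end < len(lines):
        chunks.append("\n".join(lines[prev_end:]))
    return [c for c in chunks if c.strip()]
```

Because each chunk is a whole definition, embeddings and summaries see complete units of meaning rather than arbitrary character-window splits.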
@adriandarian
Contributor Author

> Have you considered introducing AST-based chunking, like developers.llamaindex.ai/python/framework-api-reference/node_parsers/code ?

Hadn't considered it before, but I like the idea, so here's an update with a docker-compose flag to toggle AST chunking on/off.
