Conversation

@adriandarian (Contributor) commented Oct 11, 2025

This PR introduces two major improvements to the deepwiki-open project:

  1. Intelligent Chunking for Large Repository Wiki Generation

    • Implements a more robust and scalable chunking strategy for processing large repositories when generating wikis.
    • The new chunking logic ensures that content is split in a way that preserves semantic coherence and maximizes the utility of downstream LLM-based summarization and structuring.
    • This reduces context loss and improves the readability and usefulness of generated wiki pages.
    • Includes tests and benchmarks for chunking performance and output quality on repositories of various sizes.
  2. Add XML Structure Prompt

    • Adds a new prompt template for generating wiki content in XML format.
    • The XML structure enables easier parsing, integration with external tools, and supports advanced formatting and metadata inclusion.
    • The prompt is designed to guide LLMs to output well-structured, valid XML that matches the needs of downstream consumers.

Details

  • Refactored chunking logic in the wiki generation modules to support adaptive chunk sizes based on file type, content density, and configurable limits.
  • Enhanced error handling and logging for edge cases encountered during chunking (e.g., extremely large files, non-text content).
  • Added a new prompt template in the prompt library, targeting XML output with customizable schema definitions.
  • Updated documentation to describe the new chunking strategy and usage of the XML prompt.
  • Includes unit and integration tests for both chunking and XML prompt features.
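The adaptive chunk sizing described above is not shown in this thread; as a rough illustration of the idea (all names, limits, and file-type weights here are hypothetical, not taken from the PR), per-file-type chunking might look like:

```python
# Hypothetical sketch of adaptive chunking by file type. The limit
# table and function names are illustrative, not the PR's actual code.
from pathlib import Path

# Assumed per-extension size limits (characters), not real config values.
CHUNK_LIMITS = {".md": 8000, ".py": 6000, ".json": 4000}
DEFAULT_LIMIT = 5000

def chunk_text(text: str, limit: int) -> list[str]:
    """Split text into pieces of roughly at most `limit` characters,
    breaking on blank lines so sections stay intact. A single block
    larger than the limit is kept whole in this simplified sketch."""
    chunks, current, size = [], [], 0
    for block in text.split("\n\n"):
        block_len = len(block) + 2
        if size + block_len > limit and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += block_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def chunk_file(path: str, content: str) -> list[str]:
    """Pick a chunk limit based on the file's extension."""
    limit = CHUNK_LIMITS.get(Path(path).suffix, DEFAULT_LIMIT)
    return chunk_text(content, limit)
```

The real implementation also weighs content density and configurable limits per the description above; this sketch only shows the file-type dimension.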

Motivation

  • Large repositories often exceed context limits of LLMs, resulting in incomplete or fragmented wikis. The new chunking logic ensures coherent summaries and complete coverage.
  • XML output structure is requested by downstream teams for automated parsing and integration with documentation tooling.

Impact

  • Improved wiki generation quality and consistency for large codebases.
  • Greater flexibility in output format (plain text or XML) to support diverse integration needs.

How to Test

  • Run the updated wiki generation script on a large repository and review chunked outputs.
  • Use the XML prompt template to generate wiki pages and validate XML structure using standard parsers.
  • Review added tests and documentation.
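For the XML validation step, Python's standard-library parser is enough. The element names below are invented for illustration, since the PR's actual schema is not shown in this thread:

```python
# Validate generated wiki XML with the standard library.
# <wiki_structure>, <pages>, and <page> are hypothetical element names.
import xml.etree.ElementTree as ET

sample = """<wiki_structure>
  <title>Example Project</title>
  <pages>
    <page id="p1"><title>Overview</title></page>
    <page id="p2"><title>Architecture</title></page>
  </pages>
</wiki_structure>"""

root = ET.fromstring(sample)  # raises ParseError if the XML is invalid
titles = [p.findtext("title") for p in root.iter("page")]
print(titles)  # ['Overview', 'Architecture']
```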

Closes Issues:

Reviewer Notes:
Please pay particular attention to chunking edge cases and XML schema compliance.

Summary of Changes

Hello @adriandarian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capability to generate wiki documentation for large repositories. It introduces an intelligent file chunking mechanism that allows the API to process vast codebases more efficiently by breaking them down into smaller, manageable parts. Concurrently, it integrates a new XML-formatted prompt to ensure structured and consistent wiki output. These changes collectively enable a more robust and scalable approach to repository analysis and wiki generation, particularly for extensive projects.

Highlights

  • Intelligent File Chunking System: A new system has been implemented in api/api.py to intelligently chunk large repositories. This includes functions to exclude directories and files based on patterns, collect all relevant files, group them by directory, and create manageable chunks for processing. This is crucial for handling very large codebases efficiently.
  • Enhanced Repository Structure API: The /local_repo/structure endpoint in api/api.py has been significantly updated. It now supports optional chunk_size and return_chunks parameters, allowing it to return either a single file tree for smaller repositories or a structured, chunked representation for larger ones, leveraging the new chunking system.
  • XML Wiki Structure Prompt: A dedicated system prompt (WIKI_STRUCTURE_SYSTEM_PROMPT) has been added to api/prompts.py. This prompt is specifically designed to guide the language model to generate wiki outlines in a strict, valid XML format, ensuring structured and machine-readable output for wiki generation tasks.
  • Wiki Structure Generation Integration: The handle_websocket_chat function in api/websocket_wiki.py now detects requests for wiki structure generation. When such a request is identified, it applies the new XML system prompt and handles the accumulation and cleaning of the model's full XML response, including stripping markdown code blocks, before sending it back.
  • Chunked Wiki Generation Workflow: New asynchronous functions (process_wiki_chunk, merge_wiki_structures, handle_chunked_wiki_generation) have been introduced in api/websocket_wiki.py. These functions establish an end-to-end workflow for generating wikis from large repositories by processing individual file chunks, merging the partial results, and providing real-time progress updates via WebSocket.
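The response-cleaning step mentioned in the highlights (stripping a markdown code fence from the model's accumulated XML) can be sketched as follows; this is an assumption about the shape of that cleanup, not the PR's actual code:

```python
import re

# Hypothetical helper mirroring the cleanup described above: remove a
# ```xml ... ``` wrapper that the model may emit around its XML output.
def strip_code_fences(text: str) -> str:
    text = text.strip()
    match = re.match(r"^```(?:xml)?\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text
```

Unfenced responses pass through unchanged, so the helper is safe to apply unconditionally before XML parsing.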
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces an intelligent chunking system for processing large repositories to generate wikis. It adds new logic for file collection, filtering, and chunking, and updates the /local_repo/structure endpoint to leverage this system. A new XML-based prompt for wiki structure generation is also included, with corresponding updates to the WebSocket handler to process these requests. My review focuses on improving the robustness of the chunking logic, increasing efficiency by removing redundant operations, and enhancing code quality by addressing debug artifacts, local imports, and duplicated code. While the chunking infrastructure is a solid start, the functions that process these chunks are currently placeholders and still need to be implemented.

- Refactor `collect_all_files` to return README content alongside file paths.
- Introduce `handle_response_stream` to streamline response processing for different providers.
- Update WebSocket handling to utilize the new response handling function, reducing code duplication.
- Improve logging for better traceability during file collection and response streaming.
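One possible shape for the suggested `handle_response_stream` helper (a hypothetical signature; the real function and its per-provider handling are not shown in this thread):

```python
# Hypothetical sketch: accumulate streamed tokens into one string while
# forwarding each token to a callback (e.g. a WebSocket send). Assumes
# the stream yields string tokens; provider-specific chunk shapes would
# need normalizing before this point.
async def handle_response_stream(stream, on_token) -> str:
    full = []
    async for token in stream:
        full.append(token)
        await on_token(token)
    return "".join(full)
```

Centralizing the loop like this is what lets the WebSocket handler drop its per-provider copies of the same accumulate-and-forward logic.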
@1119302165

good job!

@1119302165

Have you considered introducing AST-based chunking, like https://developers.llamaindex.ai/python/framework-api-reference/node_parsers/code/ ?

adriandarian and others added 3 commits October 13, 2025 23:52
- Added ASTChunker class for semantic chunking of code files.
- Integrated AST chunking with existing adalflow pipeline via ASTTextSplitter.
- Created configuration for AST chunking in embedder.ast.json.
- Updated data pipeline to support AST chunking based on configuration.
- Developed enable_ast.py script to toggle AST chunking on and off.
- Enhanced logging for chunking statistics and errors.
- Added support for various programming languages in AST chunking.
- Updated docker-compose to allow enabling AST chunking during build.
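For Python sources, the core idea behind AST-based chunking can be sketched with the standard `ast` module. This is a minimal illustration only; the PR's `ASTChunker` covers multiple languages and edge cases (for example, decorators, which this sketch leaves attached to the preceding chunk):

```python
# Minimal sketch of AST-based chunking for Python sources only.
import ast

def ast_chunks(source: str) -> list[str]:
    """Split a module at top-level definitions so each chunk is a
    semantically complete unit (function, class, or surrounding code)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, prev_end = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            start = node.lineno - 1  # note: decorators sit above this line
            if start > prev_end:
                chunks.append("\n".join(lines[prev_end:start]))
            chunks.append("\n".join(lines[start:node.end_lineno]))
            prev_end = node.end_lineno
    if prev_end < len(lines):
        chunks.append("\n".join(lines[prev_end:]))
    return [c for c in chunks if c.strip()]
```

Because each chunk is a whole definition, embeddings and summaries see complete units of meaning rather than arbitrary character-window splits.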
@adriandarian
Contributor Author

> Have you considered introducing AST-based chunking, like developers.llamaindex.ai/python/framework-api-reference/node_parsers/code ?

Hadn't considered it before, but I like the idea, so here's an update with a docker-compose flag to toggle AST chunking on/off.
