Material Ingestor – GitHub #1154

kennethrioja · 2025-10-06T12:34:56Z

Summary of changes

We now have a GitHub ingestion method for materials. Here is the list of the fetched metadata fields (n = 14):
- title, url, description, keywords, licence, status, doi, version, date {created|published|modified}, contributors, resource type, prerequisites
GitHub rate-limits its API calls, so we cache the JSON response for 7 days (see TTL) before re-updating the material. Here is the list of the API calls (n = 4 per GitHub repo):
- Repository : GET /repos/{owner}/{repo}
- Content : GET /repos/{owner}/{repo}/contents/{path}
- Releases : GET /repos/{owner}/{repo}/releases
- Repo contributors : GET /repos/{owner}/{repo}/contributors
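
The caching idea above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper names `github_api_paths` and `fetch_with_ttl`, the `README.md` default for the content path, and the plain Hash used as a store are all assumptions made for the example. In a Rails application the same effect is usually obtained with `Rails.cache.fetch(key, expires_in: 7.days)`.

```ruby
# Hypothetical helper: builds the four REST paths listed above for one repo.
# The default content path (README.md) is an assumption for this sketch.
def github_api_paths(owner, repo, content_path = 'README.md')
  base = "/repos/#{owner}/#{repo}"
  {
    repository: base,
    content: "#{base}/contents/#{content_path}",
    releases: "#{base}/releases",
    contributors: "#{base}/contributors"
  }
end

# Hypothetical helper: returns the cached value for `key` while it is younger
# than `ttl` seconds; otherwise runs the block (the API call), stores the
# result with a timestamp, and returns it.
def fetch_with_ttl(store, key, ttl: 7 * 24 * 60 * 60, now: Time.now)
  entry = store[key]
  return entry[:value] if entry && (now - entry[:stored_at]) < ttl

  value = yield
  store[key] = { value: value, stored_at: now }
  value
end
```

With this shape, each of the four paths is requested at most once per repository per 7-day window, which keeps the number of calls well under the API rate limit for a modest list of repositories.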

Motivation and context

For CERN, and especially the HEP Software Foundation (HSF), most of the training materials were GitHub repos or pages, and very few had Bioschemas markup (except the Carpentries one). I wanted more detailed entries for the HSF Training Center, hence this GitHub ingestor.

Checklist

I have read and followed the CONTRIBUTING guide.
I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree to license it to the TeSS codebase under the BSD license.

lib/ingestors/concerns/github_ingestor_read_helpers.rb

+        uri = URI(url)
+        return nil unless uri.host =~ /github\.com|github\.io/i
+
+        if uri.host.end_with?('github.io')


lib/ingestors/concerns/github_ingestor_read_helpers.rb

+
+        if uri.host.end_with?('github.io')
+          github_api_from_io(uri)
+        elsif uri.host.end_with?('github.com')


kennethrioja · 2025-10-06T12:37:28Z

lib/ingestors/concerns/github_ingestor_material_helpers.rb

+# frozen_string_literal: true
+
+module Ingestors
+  module Concerns


I've created 'Concerns' for the sake of linting with RuboCop and passing the ABC metrics.

This way we do not have 200+ lines in the GithubIngestor class.

kennethrioja · 2025-10-06T12:39:41Z

lib/ingestors/concerns/github_ingestor_material_helpers.rb

+        # return doi if doi
+
+        # doi = fetch_doi_from_file(full_name, 'CITATION.md')
+        # return doi if doi


I left these in case someone wants to use these two other files to fetch the DOI. Most of the time, if a DOI is provided, it is in README.md through a badge.
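
To illustrate what reading a DOI from a README badge can look like, here is a minimal sketch. It is not the implementation in this PR: `extract_doi_from_readme` is a hypothetical helper, and the two patterns (a `doi.org` link and a Zenodo badge image URL) are assumptions about the most common badge shapes.

```ruby
# Hypothetical helper: returns the first DOI found in README text, or nil.
# It prefers an explicit doi.org link and falls back to a Zenodo badge URL.
def extract_doi_from_readme(readme_text)
  patterns = [
    # e.g. https://doi.org/10.5281/zenodo.1234567
    %r{https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/[^\s)\]"'<>]+)}i,
    # e.g. https://zenodo.org/badge/DOI/10.5281/zenodo.1234567.svg
    %r{zenodo\.org/badge/DOI/(10\.\d{4,9}/[^\s)\]"'<>]+?)\.svg}i
  ]
  patterns.each do |pattern|
    match = readme_text.match(pattern)
    return match[1] if match
  end
  nil
end
```

The same function could be pointed at the text of other files if the commented-out lookups were re-enabled. One known limitation of this sketch is that a DOI link written in running prose and followed directly by a full stop would capture that trailing character.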

kennethrioja · 2025-10-06T12:42:05Z

lib/ingestors/github_ingestor.rb

+        category: :materials,
+        user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0'
+      }
+    end


Not really sure about those, I copied this one from bioschemas_ingestor.rb

kennethrioja · 2025-10-06T12:47:25Z

lib/ingestors/concerns/github_ingestor_material_helpers.rb

+
+      def resolve_url(repo_data)
+        homepage_nil_or_empty = repo_data['homepage'].nil? || repo_data['homepage'].empty?
+        url = homepage_nil_or_empty ? repo_data['html_url'] : get_redirected_url(repo_data['homepage'])


get_redirected_url is a method I added to the Ingestor class to follow links (through a 30X response or a meta[http-equiv="Refresh"] tag), as the open_url method was not sufficient to follow all redirections.

I stumbled upon a Jupyter notebook that automatically redirected to a path other than the github.io page itself (e.g., instead of visiting me.github.io/myrepo/ it directly redirected to me.github.io/myrepo/introduction/).

This is the GitHub page: https://se-for-sci.github.io/
It automatically redirects to: https://se-for-sci.github.io/content/intro.html
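
The two redirect mechanisms described above can be sketched like this. It is an illustration under stated assumptions, not the actual get_redirected_url: the names `http_fetch`, `meta_refresh_target` and `follow_redirects`, the hop limit of 5, and the regex-based parsing of the meta tag (rather than an HTML parser) are all choices made for the example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical default fetcher: one plain GET without automatic redirects.
def http_fetch(url)
  res = Net::HTTP.get_response(URI(url))
  { status: res.code.to_i, location: res['location'], body: res.body.to_s }
end

# Returns the absolute target of a <meta http-equiv="refresh"> tag, or nil.
def meta_refresh_target(html, base_url)
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    next unless tag =~ /http-equiv\s*=\s*["']?refresh/i

    match = tag.match(/content\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*([^"'>\s]+)/i)
    return URI.join(base_url, match[1]).to_s if match
  end
  nil
end

# Follows 30X Location headers and meta refresh tags for up to `limit` hops
# and returns the last URL reached. A block can replace the default fetcher.
def follow_redirects(url, limit: 5, &fetcher)
  fetcher ||= method(:http_fetch)
  limit.times do
    res = fetcher.call(url)
    target =
      if (300..399).cover?(res[:status]) && res[:location]
        URI.join(url, res[:location]).to_s
      elsif res[:status] == 200
        meta_refresh_target(res[:body], url)
      end
    return url if target.nil? || target == url

    url = target
  end
  url
end
```

Passing the fetcher as a block keeps the redirect logic testable without network access, and the hop limit guards against redirect loops.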

kennethrioja added 3 commits October 6, 2025 14:10

feat(github-ingestor): source can be a sitemap of|or github.{io|com}

ec6bbe3

test(github-ingestor): added

ceff245

refactor(github-ingestor): rubocop lint ok

9bfbc67

github-advanced-security bot found potential problems Oct 6, 2025

PhilReedData mentioned this pull request Oct 8, 2025

Write guide to ingestion via Google Sheet ElixirTeSS/docs#10

Open
