kennethrioja
Contributor

Summary of changes

  • This adds a GitHub ingestion method for materials. It fetches 14 metadata fields:
    • title, url, description, keywords, licence, status, doi, version, date {created|published|modified}, contributors, resource type, prerequisites
  • GitHub rate-limits its API, so the JSON responses are cached for 7 days (see TTL) before the material is updated again. Each GitHub repo requires 4 API calls.
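
The 7-day caching described above can be sketched as a small in-memory cache with a time-to-live. This is only an illustration under assumptions: `TtlCache`, its `clock` parameter, and the placeholder key are hypothetical names, not the cache this PR actually uses.

```ruby
# Minimal in-memory cache with a time-to-live (TTL).
# Hypothetical helper for illustration; not the cache used in this PR.
class TtlCache
  SEVEN_DAYS = 7 * 24 * 60 * 60 # TTL in seconds

  # clock is injectable so that expiry can be exercised without waiting.
  def initialize(ttl: SEVEN_DAYS, clock: -> { Time.now })
    @ttl = ttl
    @clock = clock
    @store = {}
  end

  # Returns the cached value for key. Runs the block and stores its result
  # only when the entry is missing or older than the TTL.
  def fetch(key)
    now = @clock.call
    entry = @store[key]
    return entry[:value] if entry && (now - entry[:stored_at]) < @ttl

    value = yield
    @store[key] = { value: value, stored_at: now }
    value
  end
end

# Usage: the block (standing in for an API call) runs once per TTL window.
cache = TtlCache.new
api_calls = 0
2.times do
  cache.fetch('repos/owner/repo') do # placeholder key
    api_calls += 1
    { 'name' => 'repo' }
  end
end
```

With a 7-day window, repeated ingestion runs within a week reuse the stored JSON instead of spending the API quota again.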

Motivation and context

  • For CERN, and especially the HEP Software Foundation, most of the training materials were GitHub repos or pages, and very few had Bioschemas markup (apart from the Carpentries one). I wanted richer entries for the HSF Training Center, hence this GitHub ingestor.

Checklist

  • I have read and followed the CONTRIBUTING guide.
  • I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree
    to license it to the TeSS codebase under the
    BSD license.

uri = URI(url)
return nil unless uri.host =~ /github\.com|github\.io/i

if uri.host.end_with?('github.io')

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

'github.io' may be preceded by an arbitrary host name.

if uri.host.end_with?('github.io')
github_api_from_io(uri)
elsif uri.host.end_with?('github.com')

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

'github.com' may be preceded by an arbitrary host name.
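
One possible way to address both alerts is to accept a host only when it equals the domain or ends with a dot followed by the domain. The sketch below assumes that approach; `host_within?` and `github_kind` are hypothetical names and not part of this PR.

```ruby
require 'uri'

# Hypothetical helper: true only when host is exactly the domain
# or a subdomain of it (so 'evilgithub.io' is rejected).
def host_within?(host, domain)
  return false if host.nil?

  host = host.downcase
  host == domain || host.end_with?(".#{domain}")
end

# Classifies a URL as :github_io, :github_com, or nil for anything else.
def github_kind(url)
  host = URI.parse(url).host
  return :github_io if host_within?(host, 'github.io')
  return :github_com if host_within?(host, 'github.com')

  nil
rescue URI::InvalidURIError
  nil
end
```

The leading dot in the suffix check is what keeps a lookalike host such as `evilgithub.io` or `github.com.evil.example` from passing.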
# frozen_string_literal: true

module Ingestors
module Concerns
Contributor Author

@kennethrioja kennethrioja Oct 6, 2025

I created the 'Concerns' modules to satisfy RuboCop linting and pass its ABC metrics.

This way the GithubIngestor class does not grow past 200 lines.

# return doi if doi

# doi = fetch_doi_from_file(full_name, 'CITATION.md')
# return doi if doi
Contributor Author

I left these commented out in case someone wants to fetch the DOI from these two other files. Most of the time, if a DOI is provided, it is in README.md as a badge.
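
As an illustration of reading a DOI from a README badge, the sketch below looks for the first doi.org link and returns the DOI it points to. It assumes the badge links to doi.org; `DOI_LINK`, `extract_doi`, and the sample DOI are hypothetical and do not reproduce the PR's own extraction code.

```ruby
# Matches a doi.org link and captures the DOI itself.
# Assumption: the badge in README.md links to https://doi.org/<doi>.
DOI_LINK = %r{https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/[^\s)\]"'<>]+)}i

# Returns the first DOI linked in the README text, or nil when none is found.
def extract_doi(readme_text)
  match = DOI_LINK.match(readme_text.to_s)
  match && match[1]
end

# Usage with a made-up Zenodo-style badge (placeholder DOI).
readme = '[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1234567.svg)]' \
         '(https://doi.org/10.5281/zenodo.1234567)'
doi = extract_doi(readme)
```

The same function could be pointed at the text of another file, such as CITATION.md, if the commented-out lookups were re-enabled.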

category: :materials,
user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0'
}
end
Contributor Author

I'm not really sure about these; I copied them from bioschemas_ingestor.rb


def resolve_url(repo_data)
homepage_nil_or_empty = repo_data['homepage'].nil? || repo_data['homepage'].empty?
url = homepage_nil_or_empty ? repo_data['html_url'] : get_redirected_url(repo_data['homepage'])
Contributor Author

get_redirected_url is a method I added to the Ingestor class to follow links (through a 30X response or a meta[http-equiv="Refresh"] tag), because the open_url method did not follow all redirections.

I came across a Jupyter notebook site that automatically redirected to a path other than the github.io root (e.g., instead of serving me.github.io/myrepo/, it redirected straight to me.github.io/myrepo/introduction/).

Contributor Author

This is the GitHub page: https://se-for-sci.github.io/
It automatically redirects to: https://se-for-sci.github.io/content/intro.html
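
The meta-refresh half of such redirect following could look roughly like the sketch below, which assumes the page expresses the redirect as a meta refresh tag. `meta_refresh_target` is a hypothetical name, not the PR's get_redirected_url. It uses a regular expression to stay dependency-free, where an HTML parser would be more robust, and it leaves 30X handling (following the Location header) aside.

```ruby
require 'uri'

# Hypothetical helper: returns the absolute target of a
# <meta http-equiv="refresh"> tag found in html, or nil when there is none.
def meta_refresh_target(html, base_url)
  # Find the first meta tag whose http-equiv attribute is "refresh".
  tag = html[/<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/i]
  return nil unless tag

  # Pull the url=... part out of the content attribute.
  target = tag[/url\s*=\s*["']?([^"'>\s;]+)/i, 1]
  return nil unless target

  # Resolve relative targets against the page that was fetched.
  URI.join(base_url, target).to_s
end

# Usage with an assumed page body for the example above.
html = '<html><head><meta http-equiv="Refresh" content="0; url=content/intro.html"></head></html>'
resolved = meta_refresh_target(html, 'https://se-for-sci.github.io/')
```

Resolving against the fetched page's URL matters because refresh targets are often relative paths, as in the assumed example.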
