davidstap
Collaborator

@davidstap davidstap commented Aug 1, 2025

This is long overdue, but this (draft) PR implements functionality to display plenary videos (keynotes, panels, business meetings, etc.) on ACL Anthology event pages and creates dedicated landing pages for each talk.

Overview

Previously, plenary videos were recorded in the XML files but not displayed on the website (see #4309). This PR adds:

  • Display of talks on event pages
  • Individual landing pages for each talk with video players
  • BibTeX citation support for talks
  • Integration with existing ACL Anthology infrastructure

Implementation Details

1. Data Export Updates (bin/create_hugo_data.py)

  • Modified export_events(): Added talk data to event JSON exports, including talk metadata, speakers, and video URLs
  • Added export_talks(): New function that creates individual talk data files in build/data/talks/ directory
  • Talk ID format: {event-id}.talk-{number} (e.g., acl-2023.talk-1)

2. Hugo Templates

  • Created hugo/layouts/talks/single.html: Individual talk landing pages with:
    • Embedded video player for available videos
    • Speaker information and talk metadata
    • BibTeX citation format
    • Links back to parent event
  • Updated hugo/layouts/events/single.html: Added "Talks & Presentations" section between existing "Links" and "Volumes" sections, showing:
    • Talk titles with links to individual pages
    • Video availability indicators

3. Hugo Content Generation

  • Created hugo/content/talks/_content.gotmpl: Template for automatic talk page generation
  • Created hugo/content/talks/_index.md: Index page for talks section
  • URL structure: Individual talks accessible at /talks/{talk-id}/

4. Data Model Integration

  • Used existing Talk and Event classes
  • Used existing NameSpecification and EventFileReference infrastructure
  • Maintained consistency with paper video handling patterns

To-dos (improvements):

  • On plenary pages, speaker names should be linked to author pages, similarly to how that's handled on paper pages.
  • The talk ID used in the URL currently does not match the video file name (e.g., acl-2022.talk-2 vs. https://aclanthology.org/2022.acl-keynote.2.mp4); this should be fixed.
  • Close the outdated PR "[WIP] Display plenary talks on event page" (#3603), which aimed to do the same as this PR.

@davidstap davidstap requested a review from mjpost August 1, 2025 13:32
@davidstap davidstap self-assigned this Aug 1, 2025
@davidstap davidstap added this to the 2025Q3 milestone Aug 1, 2025
@davidstap
Collaborator Author

davidstap commented Aug 1, 2025

Curious about your thoughts @mjpost ! :-)

See e.g. ACL 2022, which has plenaries in the XML: https://preview.aclanthology.org/display_plenaries/events/acl-2022/

Member

@mbollmann mbollmann left a comment

Looks super cool so far, thank you!

I gave the build script changes a first look and have some comments.

Comment on lines 501 to 512
# Generate talk ID from video filename if available, otherwise use default pattern
if "video" in talk.attachments and talk.attachments["video"].name:
    # Extract talk ID from video filename like "2022.acl-keynote.2.mp4"
    video_name = talk.attachments["video"].name
    if video_name.endswith(".mp4"):
        # Remove .mp4 extension to get the talk ID
        talk_id = video_name[:-4]
    else:
        talk_id = video_name
else:
    # Fallback to sequential numbering if no video
    talk_id = f"{event.id}.talk-{idx}"
Member

Hmm, I'm not sure I like creating these IDs within the build script. It means that it's not possible to find the talk via this ID through our Python library.

Another issue is that the talk ID will change if we change the filename for some reason, which is not how other IDs work.

One idea would be to make the ID explicit in the XML and expose it through the Python library.

@davidstap
Collaborator Author

Thanks for your comments @mbollmann! I'll try to make the required changes, will let you know if I could use more input.

@mbollmann
Member

We might want to give the ID question a bit more thought perhaps, since it would be the first time we have IDs that look like paper IDs but refer to non-paper things. E.g. should they be globally unique, could they clash with paper IDs? I know it doesn't matter for the URLs since they use /talks/ but I do think if these function as IDs they should be handled by the Python library, and then it becomes a question how to handle these. I'll think more about it :)

@davidstap
Collaborator Author

davidstap commented Aug 1, 2025

We might want to give the ID question a bit more thought perhaps, since it would be the first time we have IDs that look like paper IDs but refer to non-paper things. E.g. should they be globally unique, could they clash with paper IDs? I know it doesn't matter for the URLs since they use /talks/ but I do think if these function as IDs they should be handled by the Python library, and then it becomes a question how to handle these. I'll think more about it :)

My current assumption is that they should be globally unique to prevent potential clashes with presentation videos, since the mp4 plenary files are currently stored in the same folder as the paper presentation mp4s. (Of course, that could be changed easily.)

@davidstap davidstap marked this pull request as draft August 1, 2025 15:03
@mjpost
Member

mjpost commented Aug 6, 2025

Hi @davidstap, sorry I took so long to get to this.

My first reaction: this looks fantastic (e.g., 2021.acl-keynote.1). I'm really excited about this; it will be really great to have this completely new feature, and it looks very good.

I do think we need to find a way to encode the IDs analogously to the way we encode paper IDs. To be explicit, in the XML, the full Anthology ID can be reconstructed based on the hierarchical assembly of the collection ID, volume ID, and paper ID. It seems like mirroring this in the <talk> structure would be helpful. Maybe the simplest way to do this would be to assign an id attribute to every talk, and place them within a <volume> block (inside <event>, which would stay nameless). Or we could just treat the <event> block as analogous to a volume? We could infer a volume from each talk, e.g., <talk> implicitly denotes a <talks> volume that is associated with the event. This would lose us the keynote portion of the Anthology ID, but that might be fine, since we'll have other types of recordings, including panels and so on.

I don't have a solution here but am just writing "out loud" in hopes that we can come up with both a workable ID system and maybe a taxonomy of video types.

@mjpost
Member

mjpost commented Aug 6, 2025

A few more thoughts:

  • I think having these posted publicly, citable, and easily accessible is going to have a large impact. ACL spends a lot of money to have these recorded and up to now a key part of the conference experience has just disappeared. This is really going to be great.

  • Currently, we list the videos bullet-point style in the top-level event block.

    image

    I wonder if we should group these into a volume-style category and display them visually the same way that papers are displayed, that is, with the title, authors, and a few handy buttons for getting the bib and the video, like this:

    image
  • We would treat them as a separate volume. I'm not sure what the correct name is. "Plenaries" is wrong since they're not all plenaries. "Talks" is not right, since a panel isn't really a talk. Maybe we just want to do "Recordings" as the volume name?

  • At the same time, I think allowing variation in the Anthology ID is useful. So even if we group them under the event page in a "recordings" block, allowing IDs like 2021.acl-plenary.1 and 2021.acl-business.1 is helpful, say if you download the files. These can be semantically distinguished in a way that papers can't, and we should use that.

  • We might accomplish this by replacing <talk> with <recording> and allowing a richer id attribute that includes the volume, e.g., <recording id="business.1">. Would this work? In all other ways, this would be analogous to a <paper> block (i.e., having a title, authors, etc.).

@mjpost mjpost requested a review from nschneid August 6, 2025 12:21
@mbollmann
Member

mbollmann commented Aug 6, 2025

I think it should be clear from an ID where to find the item referred to by it. Currently, this is the case: an ID like 2025.acl-long.1 parses into ['2025.acl', 'long', '1'] which refers to a <collection id="2025.acl"> that contains a <volume id="long"> and <paper id="1">. With most of the ID suggestions so far, this assumption would be violated — it wouldn't be clear from the ID whether to look for a <volume> or a <talk>, and we'd have to start looking in multiple places for a potential match.
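A minimal sketch of the ID resolution described above, assuming new-style IDs of the form {collection}-{volume}.{item}. The names parse_anthology_id and ParsedID are hypothetical and are not the Python library's API; older ID formats are not handled.

```python
from typing import NamedTuple


class ParsedID(NamedTuple):
    """The three levels an ID resolves to (hypothetical helper type)."""
    collection_id: str
    volume_id: str
    item_id: str


def parse_anthology_id(full_id: str) -> ParsedID:
    """Split an ID such as '2025.acl-long.1' into collection, volume, and item."""
    # The collection ID is everything before the first hyphen.
    collection_id, sep, rest = full_id.partition("-")
    if not sep:
        raise ValueError(f"no volume part in {full_id!r}")
    # The item ID is everything after the last dot of the remainder.
    volume_id, sep, item_id = rest.rpartition(".")
    if not sep or not volume_id or not item_id:
        raise ValueError(f"cannot split volume and item in {full_id!r}")
    return ParsedID(collection_id, volume_id, item_id)


parsed = parse_anthology_id("2025.acl-long.1")
print(list(parsed))  # ['2025.acl', 'long', '1']
```

Under this scheme the three parts map directly onto <collection id="2025.acl">, <volume id="long">, and <paper id="1">, so a lookup never has to guess which element type to search. An ID with no item part, such as 2025.acl-talks, raises ValueError.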

One solution would be to reserve a special "volume name" for these, e.g. make them all go under talks, so when we see an ID like 2025.acl-talks.1 we know it's a <talk> under the <event> block.

However, based on the comments so far — that maybe talks should be presented on the website in a similar vein to papers, that they should be citeable etc. — I am thinking that we should maybe really just move them into an ordinary volume.

  • Volumes already have a type attribute that can currently be either "journal" or "proceedings". We could add another option, type="talks", or maybe more generically type="media".

  • The <volume type="..."> attribute already controls the generation of bibliographic entries (journals and proceedings trigger different bibentry types etc.), so it would be a natural way to define the correct type of bibliography entry for recordings or other media.

  • No new rules for ID resolution are needed.

  • Presentation on the website can be reused from how ordinary volumes work, and only needs to be overridden where desired.

  • The same approach could be taken to represent other kinds of media, like the NLP Highlights podcast ("Ingesting the NLP Highlights podcast", #497) or whatever else we want to archive and make citeable in the future, without having to invent new mechanisms.
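To illustrate how a volume type could select the bibliography entry type, here is a minimal sketch. The mapping and the bibtex_entry helper are hypothetical, and "misc" for type="media" is only an assumption; the thread does not settle which BibTeX type recordings should use.

```python
# Hypothetical mapping from <volume type="..."> to a BibTeX entry type.
# "journal" and "proceedings" are the existing volume types; "media" is the proposed one.
BIBTYPE_BY_VOLUME_TYPE = {
    "journal": "article",
    "proceedings": "inproceedings",
    "media": "misc",  # assumption, not decided in this discussion
}


def bibtex_entry(volume_type: str, key: str, fields: dict) -> str:
    """Render a BibTeX entry whose type is chosen by the volume type."""
    entry_type = BIBTYPE_BY_VOLUME_TYPE[volume_type]
    body = ",\n".join(f"    {name} = {{{value}}}" for name, value in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}"


print(bibtex_entry("media", "pinhanez-2024-keynote", {"title": "Keynote 1"}))
```

Adding a new kind of item then means adding one row to the mapping rather than a new code path.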

For completeness, I'm thinking of something along the lines of:

<volume id="keynote" type="media">
    <talk id="1">
      <title>Keynote 1: Harnessing the Power of <fixed-case>LLM</fixed-case>s to Vitalize Indigenous Languages</title>
      <speaker><first>Claudio</first><last>Pinhanez</last></speaker>
      <url>2024.naacl-keynote.1.mp4</url>
    </talk>
</volume>
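A minimal sketch of how full IDs could be assembled from this proposed structure, using only the standard library. It assumes the volume sits inside a <collection id="2024.naacl"> wrapper (not shown in the snippet above), and list_talks is a hypothetical helper, not part of the Python library.

```python
import xml.etree.ElementTree as ET

# The proposed volume, wrapped in an assumed <collection> element.
XML = """
<collection id="2024.naacl">
  <volume id="keynote" type="media">
    <talk id="1">
      <title>Keynote 1: Harnessing the Power of <fixed-case>LLM</fixed-case>s to Vitalize Indigenous Languages</title>
      <speaker><first>Claudio</first><last>Pinhanez</last></speaker>
      <url>2024.naacl-keynote.1.mp4</url>
    </talk>
  </volume>
</collection>
"""


def list_talks(xml_text: str) -> list:
    """Return (full_id, title) pairs for every <talk> in a media volume."""
    collection = ET.fromstring(xml_text.strip())
    talks = []
    for volume in collection.iter("volume"):
        if volume.get("type") != "media":
            continue
        for talk in volume.iter("talk"):
            # Same hierarchical assembly as for papers: collection-volume.item
            full_id = f"{collection.get('id')}-{volume.get('id')}.{talk.get('id')}"
            # itertext() flattens inline markup such as <fixed-case>.
            title = "".join(talk.find("title").itertext())
            talks.append((full_id, title))
    return talks


for full_id, title in list_talks(XML):
    print(full_id, "|", title)
```

The assembled ID matches the stem of the video file name in <url>, which is the consistency the earlier build-script approach lacked.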

@mjpost
Member

mjpost commented Aug 6, 2025

I like this approach. It would also let us intermingle talks in a proceedings volume, and we'd have flexibility of how to group them (all talks in one volume, or separate out meetings and plenaries, etc).

@davidstap
Collaborator Author

davidstap commented Aug 7, 2025

Thanks @mjpost and @mbollmann for the detailed comments. I'll be traveling from tomorrow and will probably not have time in the next ~1.5 weeks to work on this, but will find time after that.

The main thing to decide seems to be how to handle the IDs. I also like @mbollmann's latest suggestion, and will try to implement that.

I agree with @mjpost it'd be great to list the videos like papers are listed (with title, authors, and handy buttons). I couldn't find an obvious (non-hacky) way to make <speaker> link to author pages, but I'm sure that can be sorted out.

@mbollmann
Member

mbollmann commented Aug 7, 2025

<volume id="keynote" type="media">
    <talk id="1">
      <title>Keynote 1: Harnessing the Power of <fixed-case>LLM</fixed-case>s to Vitalize Indigenous Languages</title>
      <speaker><first>Claudio</first><last>Pinhanez</last></speaker>
      <url>2024.naacl-keynote.1.mp4</url>
    </talk>
</volume>

The Python side of this solution might require a bit more work and technical refactoring than I initially anticipated, now that I think of it. Volume objects are currently dictionaries mapping IDs to Paper objects; if we want to make this Paper | Talk, that has many implications e.g. for typing. We might need to create an abstract VolumeItem class or something that defines the interface and make both Paper and Talk inherit from it. If we go this route, I wonder if we should lay the groundwork for this in a separate PR and use this one only for the front-end parts.
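A rough sketch of what such a shared interface could look like. All classes here are simplified stand-ins written for illustration, not the library's actual Paper, Talk, or Volume classes, and the contributors property is a hypothetical example of a common interface member.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class VolumeItem(ABC):
    """Hypothetical common base for anything a volume can contain."""
    id: str
    title: str

    @property
    @abstractmethod
    def contributors(self) -> list[str]:
        """People to credit: authors for a paper, speakers for a talk."""


@dataclass
class Paper(VolumeItem):
    authors: list[str] = field(default_factory=list)

    @property
    def contributors(self) -> list[str]:
        return self.authors


@dataclass
class Talk(VolumeItem):
    speakers: list[str] = field(default_factory=list)
    video_url: str | None = None

    @property
    def contributors(self) -> list[str]:
        return self.speakers


@dataclass
class Volume:
    """Maps item IDs to items; values are typed as VolumeItem, not Paper."""
    id: str
    type: str
    items: dict[str, VolumeItem] = field(default_factory=dict)


volume = Volume(id="keynote", type="media")
volume.items["1"] = Talk(id="1", title="Keynote 1", speakers=["Claudio Pinhanez"])
for item in volume.items.values():
    print(item.id, item.contributors)
```

Code that only needs the shared interface can then iterate over a volume without checking whether each entry is a paper or a talk.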

@mjpost
Member

mjpost commented Aug 21, 2025

@mbollmann I remembered today that I was in talks some time ago to ingest the "NLP Highlights" podcast. That fell through for lack of time, but it might become relevant again. This is all the more reason to generalize to a VolumeItem so we could also include an audio or podcast type.

@mbollmann
Member

@mjpost I mentioned the issue with the NLP Highlights podcast in my comment above from two weeks ago ;) #5612 (comment) – but good to know it might still be relevant!
