Removing empty nodes early breaks medium.com images #337

@Kdecherf

Considering the following snippet from a medium.com article:

<figure class="pd pe pf pg ph md mo mp paragraph-image">
   <div role="button" tabindex="0" class="pi pj ff pk bg pl">
      <div class="mo mp pc">
         <picture>
            <source srcset="[…]" type="image/webp"></source>
            <source data-testid="og" srcset="[…]"></source>
            <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
         </picture>
      </div>
   </div>
   <figcaption class="ml mm mn mo mp mq mr be b bf z dt">Dow</figcaption>
</figure>

Currently, graby removes these source tags, which it treats as empty nodes, because of this routine in Graby.php:

        // Remove empty lines to avoid runaway evaluation of following regex on badly coded websites
        $re = '/^[ \t]*[\r\n]+/m';
        $htmlCleaned = preg_replace($re, '', $html);
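
To make the effect easier to reproduce, here is a minimal sketch. The lines quoted above only show the empty-line cleanup; the pattern below is a hypothetical approximation of an empty-node pass that skips only iframe, td and th (the exclusions listed further down in this issue). It is written for illustration and is not copied from Graby.php.

```php
<?php
// Hypothetical approximation of an empty-node cleanup (an assumption, not Graby's code):
// removes any element whose body is empty or whitespace-only,
// unless its tag name starts with iframe, td or th.
$emptyNodeRe = '/<(?!iframe|td|th)([^>\s]+)[^>]*>\s*<\/\1>/m';

// Reduced version of the medium.com markup from this issue.
$html = <<<'HTML'
<picture>
   <source srcset="[…]" type="image/webp"></source>
   <source data-testid="og" srcset="[…]"></source>
   <img alt="" width="651" height="478" loading="lazy" role="presentation">
</picture>
HTML;

$cleaned = preg_replace($emptyNodeRe, '', $html);

// Both <source> elements are gone; only the src-less <img> is left.
echo $cleaned, PHP_EOL;
```

With a pattern of this shape, each `<source …></source>` pair matches as an "empty node", while the `img` survives because it has no closing tag.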

However, dropping these tags breaks the images, because the img element does not define any src and the image URLs only live in the source tags' srcset (thanks, medium).

As this routine is run before ContentExtractor, we can't use find/replace in site-config to prevent that.

Thus, I see two ways to deal with it, either:

  • exclude more nodes (in addition to iframe, td and th)
  • or remove this routine and stop removing empty nodes at this point
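
For comparison, the first option could look roughly like the sketch below. It assumes the exclusion is implemented as a negative lookahead over tag names; `buildEmptyNodeRegex` and the list passed to it are hypothetical names introduced here for illustration, not part of Graby.

```php
<?php
// Hypothetical helper (not part of Graby): builds an empty-node regex
// that leaves the given tag names alone.
function buildEmptyNodeRegex(array $keep): string
{
    $names = implode('|', array_map(
        fn (string $tag): string => preg_quote($tag, '/'),
        $keep
    ));

    // \b makes the exclusion match whole tag names only,
    // so "th" does not also exclude "thead".
    return '/<(?!(?:' . $names . ')\b)([^>\s]+)[^>]*>\s*<\/\1>/';
}

// Option 1: extend the exclusion list with "source".
$re = buildEmptyNodeRegex(['iframe', 'td', 'th', 'source']);

$html = '<picture><source srcset="[…]"></source><img alt=""></picture><p> </p>';
$cleaned = preg_replace($re, '', $html);

// The empty <p> is removed, the <source> element is kept.
echo $cleaned, PHP_EOL;
```

Every new exception means another entry in that list, which is the maintenance cost of this option.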

The former feels like an endless pain, as we may see other exceptions over time. I would go for the latter, imo.

Any thoughts @j0k3r @jtojnar?

On a side note: no, medium.com is not compliant with the HTML specification here, as the source tag is a "void element" and must not have a closing tag, see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it
