Removing empty nodes early breaks medium.com images #337

Consider the following snippet from a medium.com article:

``` html
<figure class="pd pe pf pg ph md mo mp paragraph-image">
   <div role="button" tabindex="0" class="pi pj ff pk bg pl">
      <div class="mo mp pc">
         <picture>
            <source srcset="[…]" type="image/webp"></source>
            <source data-testid="og" srcset="[…]"></source>
            <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
         </picture>
      </div>
   </div>
   <figcaption class="ml mm mn mo mp mq mr be b bf z dt">Dow</figcaption>
</figure>
```

Currently graby removes these `source` tags, since they have no content, as part of the cleanup routine in [`Graby.php`](https://github.com/j0k3r/graby/blob/master/src/Graby.php#L277-L279):

``` php
        // Remove empty lines to avoid runaway evaluation of following regex on badly coded websites
        $re = '/^[ \t]*[\r\n]+/m';
        $htmlCleaned = preg_replace($re, '', $html);
```

However, dropping these tags breaks the image, because the `img` element has no `src` attribute and relies on the `source` elements for its URLs (_thanks medium_).
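
A minimal sketch reproducing the effect. The pattern below is an assumption written to match the behaviour described in this issue (drop every element with an empty body except `iframe`, `td` and `th`); it is not copied from graby, and `$html` is a shortened version of the snippet above.

```php
<?php
// Hypothetical pattern, for illustration only: removes elements whose body is
// empty or whitespace-only, unless the tag name starts with iframe, td or th.
$re = '/<(?!iframe|td|th)([^>\s]+)[^>]*>\s*<\/\1>/';

// Shortened version of the medium.com markup; the srcset values are elided
// in the original snippet and are kept elided here.
$html = '<picture>'
    . '<source srcset="[…]" type="image/webp"></source>'
    . '<source data-testid="og" srcset="[…]"></source>'
    . '<img alt="" width="651" height="478" loading="lazy">'
    . '</picture>';

$htmlCleaned = preg_replace($re, '', $html);

echo $htmlCleaned, PHP_EOL;
// <picture><img alt="" width="651" height="478" loading="lazy"></picture>
```

Both `source` elements match as "empty" and disappear, leaving an `img` with no URL to load.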

Because this routine runs before ContentExtractor, a `find/replace` rule in the site config can't prevent it.

I see two ways to deal with it:
- exclude more nodes (_in addition to iframe, td and th_), or
- remove this routine and stop removing empty nodes at this point

The former feels like an endless pain, as we will probably hit other exceptions over time, so I would go for the latter.
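
For reference, the first option could look roughly like the sketch below. The base pattern and the `$excluded` list are assumptions made for illustration, not graby's actual code; the only change from the behaviour described above is that `source` joins the exclusions.

```php
<?php
// Hypothetical exclusion list: empty elements are removed unless their tag
// name starts with one of these names. "source" is the new entry.
$excluded = ['iframe', 'td', 'th', 'source'];
$re = '/<(?!' . implode('|', $excluded) . ')([^>\s]+)[^>]*>\s*<\/\1>/';

$html = '<picture><source srcset="[…]"></source><img alt=""></picture><p></p>';

$cleaned = preg_replace($re, '', $html);

echo $cleaned, PHP_EOL;
// <picture><source srcset="[…]"></source><img alt=""></picture>
```

The empty `p` is still removed while `source` survives, but every future exception would need another entry in the list, which is the maintenance cost mentioned above.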

Any thoughts @j0k3r @jtojnar?

On a side note: no, medium.com's markup is not compliant with the HTML specification, since `source` is a "void element" and must not have a closing tag; see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it
