Skip to content

Separate AstNode variants for expressions, statements, types #54

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

ysthakur
Copy link
Member

@ysthakur ysthakur commented Mar 9, 2025

This PR adds three variants, AstNode::Expr, AstNode::Stmt, and AstNode::Type, and adds corresponding types for these variants to wrap around. I'm not sure if this would make the size of AstNode slightly bigger, but I think being able to separate these nodes into their own types is helpful. I also separated Variable into VarRef and VarDecl, since the meaning of each is different.

One advantage of this is that, if a method needs to check that the NodeId it received points to an expression, it can simply check if the node is an AstNode::Expr. We can also pass around Exprs themselves now.

I'm not sure if this is what you meant by using generics, @kubouch, but in the future, we may be able to make NodeId typed, i.e., create a NodeId<Expr> when an Expr node is added to the compiler, or a NodeId<AstNode> when something like AstNode::Name is added. Then we could have an accompanying trait looking like this:

trait ExtractNode { fn extract(node: AstNode) -> Option<Self>; }
impl ExtractNode for AstNode { fn extract(node: AstNode) -> Option<Self> { Some(node) } }
impl ExtractNode for Expr { // Similar impl for Stmt and Type too
  fn extract(node: AstNode) -> Option<Self> {
    match node { AstNode::Expr(e) => Some(e), _ => None }
  }
}

And then you could have a method in Compiler like

fn get_node_typed<T: ExtractNode>(&self, node_id: NodeId<T>) -> T {
  ExtractNode::<T>::extract(self.get_node(node_id)).expect("idk");
}

Apologies in advance for the massive diff. The changes to the snapshots are mostly just wrapping things in Stmt() or Expr()

@ysthakur ysthakur requested a review from kubouch March 9, 2025 19:14
@kubouch
Copy link
Contributor

kubouch commented Mar 12, 2025

I'm not sure how much value we get by splitting expr, statement and type nodes into separate types. The AstNode enum is shorter but now, you need to type AstNode::Expr(<foo>) instead of just <foo> everywhere. Not sure how valuable the passing of expressions is because we generally pass around the NodeIds.

I've been thinking we could instead follow a more compact memory layout similar to https://alic.dev/blog/dense-enums. Currently, many AstNodes contain just one or few bytes of information (like operators) but the enum itself takes 48 bytes, so there is both a lot of wasted space and a potential for optimization. This one I'd do once we have the parser more complete.

VarRef and VarDecl separation looks good to me.

By generics, I meant making each AstNode variant a separate type, then instead of having methods like .lsquare() etc. copy-pasted all over the parser, we'd have just one .node<T>(), for example .node<LSquare>(). Just to shorten the code.

@ysthakur
Copy link
Member Author

My main motivation is a function I'm making to do typechecking/type inference on expressions. It looks like this:

fn typecheck_expr(&mut self, expr: Expr, expected: TypeId) -> TypeId {
  // match on the different Expr variants
}

I could totally make it accept a NodeId instead and keep the current design of AstNode:

fn typecheck_expr(&mut self, expr: NodeId, expected: TypeId) -> TypeId {
  match self.compiler.get_node(expr) {
    AstNode::Int => ...
    ...more valid expression variants
    _ => panic!("expected expression")
  }
}

But it makes calling this function from typecheck_node more annoying. I don't want to have to do something like this:

fn typecheck_node(&mut self, node_id: NodeId) {
  match self.compiler.ast_nodes[node_id.0] {
    AstNode::Int
    | AstNode::List(_)
    | AstNode::Closure { ... }
    | ...every single expression variant => self.typecheck_expr(node_id, UNKNOWN_TYPE),
    // other cases
  }
}

An alternative would be to first match on every single variant that isn't an expression, then leave the default case for expressions. But that makes me uncomfortable, because we might easily miss a variant:

fn typecheck_node(&mut self, node_id: NodeId) {
  match self.compiler.ast_nodes[node_id.0] {
    AstNode::Let { .. } => ...
    AstNode::Name => ...
    AstNode::Def { ... } => ...
    ...everything but the expression variants
    _ => self.typecheck_expr(node_id, UNKNOWN_TYPE),
  }
}

So I pulled Expr out to make my life easier. It's not as big a problem with Stmt and Type, but I pulled them out while I was at it in case we run into the same problem with them in the future too.


You're right about the parser wasting memory for smaller variants. Are you thinking about just pulling out operators into a separate vector, or also grouping other variants of similar sizes together?

Side note: Bumpalo would've been really nice for avoiding the padding without us having to think too hard about variant sizes and stuff, but it requires giving up the ability to iterate through a list of nodes, not rolling back so we can parse closures, and using Boxes rather than NodeIds.


By generics, I meant making each AstNode variant a separate type, then instead of having methods like .lsquare() etc. copy-pasted all over the parser, we'd have just one .node(), for example .node(). Just to shorten the code.

Ah, I think I understand now.

@kubouch
Copy link
Contributor

kubouch commented Mar 13, 2025

Could you do this using a helper? For example:

fn typecheck_node(&mut self, node_id: NodeId) {
  if self.compiler.ast_nodes[node_id.0].is_expr()
    self.typecheck_expr(node_id, UNKNOWN_TYPE),
  } else {
    // other cases
  }
}

But if you think it's more helpful to have them separate, we can do that. I checked and the AstNode size remains 48 bytes so there shouldn't be a performance difference.

Are you thinking about just pulling out operators into a separate vector, or also grouping other variants of similar sizes together?

I'm not sure yet, but for example an X-bit tag and Y-bit index could be packed into one Z-bit integer for each node. The tag identifies the node enum variant, index would point at another vector with the node data (if relevant, operators wouldn't point anywhere). X depends on the number of variants, Z is either 32 or 64, which then gives Y.

I'm not sure how bumpalo fits here. We don't really do many allocations, we have just a Vec of nodes (with some additional vecs of node data, like blocks) that can be pre-allocated in advance to some sensible size. Otherwise, we shouldn't really be allocating anything at runtime, unless we overflow some of the vector capacity. But maybe I'm missing something, I haven't checked the crate closely.

Ah, I think I understand now.

The is_xxx() methods could also be modified to is<AstNode::Xxx>() the same way. Some nodes have custom checking, but most of the simple ones should be possible to do like that.

@ysthakur
Copy link
Member Author

ysthakur commented Mar 20, 2025

Sorry, forgot to reply to this.

Could you do this using a helper? For example:

Yeah, that does work, thanks. I can use that for now, and if it ends up not working out, we can always come back to this.

I'm not sure how bumpalo fits here. We don't really do many allocations, we have just a Vec of nodes (with some additional vecs of node data, like blocks) that can be pre-allocated in advance to some sensible size. Otherwise, we shouldn't really be allocating anything at runtime, unless we overflow some of the vector capacity. But maybe I'm missing something, I haven't checked the crate closely.

There's a with_capacity function you can use to create an arena of a specified size to avoid allocating at the start, not sure if that's what you meant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants