
Re: Regular expression too big

hey stefan,

thank you for your response. it was helpful to know that this isn't going to be an easy fix. we're already using regexp-opt, which is supposed to optimize and shrink the regex. i can anticipate the regexes getting much much bigger, and i'm keen to avoid having to dig into the guts of the regexp C code, so i think we're going to figure out a way to do our regexing outside of emacs.

thanks again,

Stefan Monnier wrote:
I'm writing a new mode for Emacs that involves a massive
regular expression, auto-generated from a list of files in
the directory. If the number of files is too large (c. 1500,
depending on the length of the filenames), then the regular
expression that gets built is too big, and Emacs flashes up
an error: Invalid regexp: "regular expression too big".

So it looks as though this is a known issue, and that the
solution was just to hardcode a ceiling on regexp size. This
is a showstopper for us. At the moment, the only workaround
that we can think of would be to chop the regexp into
multiple pieces, run them separately, and then somehow
combine the results. As you can imagine, this is going to be
much slower, and much much uglier.
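
For illustration only, the "chop it into pieces and combine the results" workaround could look roughly like the Python sketch below. It is not Emacs Lisp and not taken from the mode being discussed. The helper names (`build_chunked_patterns`, `search_all`) and the chunk size of 500 are made up for the example. Python's `re` module does not have the size limit in question; the point is only the combining logic.

```python
import re

def build_chunked_patterns(names, chunk_size=500):
    """Compile one alternation per chunk of names instead of one giant regex."""
    # Longest names first, so a longer name wins over a name that is its prefix.
    ordered = sorted(names, key=len, reverse=True)
    patterns = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        patterns.append(re.compile("|".join(re.escape(n) for n in chunk)))
    return patterns

def search_all(patterns, text, pos=0):
    """Return the earliest match across all chunks (longest wins on a tie)."""
    best = None
    for pattern in patterns:
        m = pattern.search(text, pos)
        if m is None:
            continue
        if (best is None
                or m.start() < best.start()
                or (m.start() == best.start() and m.end() > best.end())):
            best = m
    return best

if __name__ == "__main__":
    names = [f"file{i}.txt" for i in range(1500)]
    patterns = build_chunked_patterns(names)
    match = search_all(patterns, "see file1499.txt and file7.txt")
    print(len(patterns), match.group(0))
```

Every search has to run each chunk over the text and then compare the hits, which is where the extra cost and the ugliness come from.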

Is there anything that can be done to extend the allowed
size of the regexp?

Well, you can rewrite regexp.c if you want.  Currently it works by compiling
your regexp to a non-deterministic (i.e. backtracking) byte-code machine,
which uses 2-byte offsets to jump around, so it makes it difficult to write
regexps much larger than about 32KB (after compilation).
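
To make the 2-byte limit concrete, here is a minimal Python sketch, not the actual Emacs code. It assumes a jump offset is stored as a signed 16-bit little-endian value, and the helper names `store_jump` and `read_jump` are hypothetical. Anything outside -32768..32767 simply cannot be encoded.

```python
import struct

def store_jump(offset: int) -> bytes:
    """Encode a relative jump offset as a signed 2-byte value."""
    try:
        # "<h" = little-endian signed short (assumed layout for this sketch).
        return struct.pack("<h", offset)
    except struct.error:
        raise ValueError(f"offset {offset} does not fit in 2 bytes") from None

def read_jump(encoded: bytes) -> int:
    """Decode a 2-byte jump offset written by store_jump."""
    return struct.unpack("<h", encoded)[0]

if __name__ == "__main__":
    print(read_jump(store_jump(-1234)))   # round-trips a short backward jump
    try:
        store_jump(40000)                 # a jump across more than 32KB
    except ValueError as err:
        print(err)
```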

There could be various ways to change regexp.c so as to allow
larger regexps.  One would be to make the "too large" check more precise
(right now, I believe it just complains as soon as the whole compiled
regexp exceeds 32KB, but we could allow larger ones, as long as all offsets
fit within the ±32KB limit), or one could add "long jumps" with 4-byte
offsets and try to insert them where needed, or one could make all offsets
4 bytes, or one could rewrite regexp.c completely (ideally just adapting GNU
libc's regexp engine or some other).

But maybe you can circumvent the limit without removing it.  Tell us more
about your regexps: maybe we can optimize them.

