
Re: Regular expression too big

> I'm writing a new mode for Emacs that involves a massive
> regular expression, auto-generated from a list of files in
> the directory. If the number of files is too large (c. 1500,
> depending on the length of the filenames), then the regular
> expression that gets built is too big, and Emacs flashes up
> an error: Invalid regexp: "regular expression too big".

> So it looks as though this is a known issue, and that the
> solution was just to hardcode a ceiling on regexp size. This
> is a showstopper for us. At the moment, the only workaround
> that we can think of would be to chop the regexp into
> multiple pieces, run them separately, and then somehow
> combine the results. As you can imagine, this is going to be
> much slower, and much much uglier.

> Is there anything that can be done to extend the allowed
> size of the regexp?

Well, you can rewrite regexp.c if you want.  Currently it works by compiling
your regexp to byte-code for a non-deterministic (i.e. backtracking) machine
whose jumps use 2-byte offsets, which makes it difficult to support
regexps much larger than about 32KB (after compilation).
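
To illustrate where the 32KB figure comes from, here is a minimal sketch of
a jump operand stored in two bytes.  It is not the actual regexp.c code: the
names offset_fits, store_offset and extract_offset are hypothetical, and the
low-byte-first layout is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not taken from regexp.c.  A relative jump is
   assumed to be stored as a signed 16-bit value, low byte first. */

/* True if a relative jump of `offset` bytes fits in a 2-byte operand. */
static bool offset_fits(long offset)
{
    return offset >= INT16_MIN && offset <= INT16_MAX;
}

/* Write `offset` into dest[0..1].  Returns false, leaving dest untouched,
   if the offset cannot be encoded in two bytes. */
static bool store_offset(unsigned char *dest, long offset)
{
    if (!offset_fits(offset))
        return false;
    uint16_t bits = (uint16_t)offset;   /* two's-complement bit pattern */
    dest[0] = (unsigned char)(bits & 0xff);
    dest[1] = (unsigned char)(bits >> 8);
    return true;
}

/* Read the signed offset back from src[0..1]. */
static long extract_offset(const unsigned char *src)
{
    long bits = (long)src[0] | ((long)src[1] << 8);
    return bits >= 0x8000 ? bits - 0x10000 : bits;
}
```

With this encoding a jump can reach at most 32767 bytes forward or 32768
bytes backward, so store_offset(buf, 32768) fails while
store_offset(buf, -300) succeeds and extract_offset(buf) gives back -300.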

There are various ways to change regexp.c so as to allow larger regexps.
One would be to make the "too large" check more precise (right now, I
believe it just complains as soon as the whole compiled regexp exceeds 32KB,
but we could allow larger ones, as long as all offsets fit within the ±32KB
limit).  Another would be to add "long jumps" with 4-byte offsets and insert
them where needed.  Or one could make all offsets 4 bytes, or rewrite
regexp.c completely (ideally just adapting GNU libc's regexp engine or some
other).
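
As a rough sketch of the "more precise check" idea, the difference between
rejecting on total size and rejecting only when an individual jump is out of
range could look like the following.  The struct jump type and both function
names are hypothetical, and measuring the offset from the jump's own position
is an assumption; the real compiler in regexp.c is organised differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one jump in the compiled program:
   byte position of the jump and byte position of its target. */
struct jump {
    size_t from;
    size_t to;
};

/* Coarse check: refuse any compiled program larger than a 2-byte
   signed offset could span, whether or not any jump is that long. */
static bool coarse_size_ok(size_t compiled_size)
{
    return compiled_size <= (size_t)INT16_MAX;
}

/* Precise check: accept a program of any size, provided every
   individual jump offset still fits in a signed 2-byte operand. */
static bool all_jumps_fit(const struct jump *jumps, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        long offset = (long)jumps[i].to - (long)jumps[i].from;
        if (offset < INT16_MIN || offset > INT16_MAX)
            return false;
    }
    return true;
}
```

Under these assumptions a 100000-byte program made of many short
alternatives fails coarse_size_ok but passes all_jumps_fit, whereas a single
jump from position 0 to position 40000 fails the precise check as well and
would need a "long jump" or wider offsets.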

But maybe you can circumvent the limit without removing it.  Tell us more
about your regexps: maybe we can optimize them.

