Re: Nepomuk: re-checking the strigi index constantly
On Sunday 28 March 2010, Michael Schuerig wrote:
> Apparently, a new
> /usr/bin/nepomukservicestub nepomukstrigiservice
> is started every 8 or 9 minutes. From ~/.xsession-errors I can't see
> anything that indicates that these processes are crashing.
Well, strace and gdb know better. Indexing is interrupted when an
assert statement fails and raises SIGABRT:
# strigi-0.7.1/src/streamanalyzer/lineeventanalyzer.cpp:180
void
LineEventAnalyzer::handleUtf8Data(const char* data, uint32_t length) {
    assert(!(sawCarriageReturn && missingBytes > 0));
I haven't tried to understand the code intimately, but from looking
around a bit, I gather that the assert is there to ensure that
multi-byte characters are complete when the end of a line is reached.
I take it that one of my files either contains broken UTF-8, or strigi
mistakes it for UTF-8 when it actually isn't.
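
To track down the offending file, a small checker along the following
lines might help. This is only a sketch of the idea as I understand it,
not strigi's code: the function name firstBrokenUtf8Line is made up, and
the check is simplified (it looks only at lead and continuation bytes
and does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of strigi.
// Returns the 1-based number of the first line on which a multi-byte
// UTF-8 sequence is cut short or starts with an invalid byte,
// or 0 if no such line is found.
std::size_t firstBrokenUtf8Line(const std::string& text) {
    std::size_t line = 1;
    int missingBytes = 0;  // continuation bytes still expected

    for (unsigned char c : text) {
        if (missingBytes > 0) {
            // Inside a multi-byte character: only 10xxxxxx is allowed.
            if ((c & 0xC0) != 0x80) return line;
            --missingBytes;
            continue;
        }
        if (c == '\n') {
            ++line;
        } else if (c < 0x80) {
            // Plain ASCII, nothing to do.
        } else if ((c & 0xE0) == 0xC0) {
            missingBytes = 1;  // 110xxxxx starts a 2-byte sequence
        } else if ((c & 0xF0) == 0xE0) {
            missingBytes = 2;  // 1110xxxx starts a 3-byte sequence
        } else if ((c & 0xF8) == 0xF0) {
            missingBytes = 3;  // 11110xxx starts a 4-byte sequence
        } else {
            return line;       // stray continuation or invalid lead byte
        }
    }
    // Input ended in the middle of a character.
    return missingBytes > 0 ? line : 0;
}
```

Run over the text that pdftotext produces for each PDF, a non-zero
result would point at the line where a character is interrupted by a
carriage return or line feed, which is the situation the assert above
appears to guard against.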
Now I'm wondering: is this something I ought to report as a bug against
strigi, or is the problem with Nepomuk for not logging the abnormal
termination of its child processes? Or is pdftotext at fault for
apparently producing invalid UTF-8 from a PDF (iconv doesn't complain
about it, though)?
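
For what it's worth, noticing that a child died from a signal is not
hard on POSIX. The sketch below is not how Nepomuk manages its
services (I haven't looked at that code); describeExit and
runAbortingChild are made-up names, and the child simply calls abort()
to stand in for a failed assert.

```cpp
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: turns a waitpid() status into a log-friendly text.
std::string describeExit(int status) {
    if (WIFSIGNALED(status))
        return "killed by signal " + std::to_string(WTERMSIG(status));
    if (WIFEXITED(status))
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    return "unknown state";
}

// Hypothetical demo: forks a child that aborts, as a failed assert
// would, waits for it, and reports how it ended.
std::string runAbortingChild() {
    pid_t pid = fork();
    if (pid < 0) return "fork failed";
    if (pid == 0) {
        std::abort();  // child: raises SIGABRT, never returns
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return "waitpid failed";
    return describeExit(status);
}
```

A parent that logged the result of describeExit for every service it
reaps would have left a trace of the SIGABRT in ~/.xsession-errors.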
Michael
--
Michael Schuerig
mailto:michael@schuerig.de
http://www.schuerig.de/michael/