
Re: Nepomuk: re-checking the strigi index constantly



On Sunday 28 March 2010, Michael Schuerig wrote:

> Apparently, a new
> /usr/bin/nepomukservicestub nepomukstrigiservice
> is started every 8 or 9 minutes. From ~/.xsession-errors I can't see
> anything that indicates that these processes are crashing.

Well, strace and gdb know better. Indexing is interrupted when an 
assert statement fails and raises SIGABRT:

# strigi-0.7.1/src/streamanalyzer/lineeventanalyzer.cpp:180
void
LineEventAnalyzer::handleUtf8Data(const char* data, uint32_t length) {
    assert(!(sawCarriageReturn && missingBytes > 0));

I haven't tried to understand the code in depth, but from looking 
around a bit, I gather that this ensures that multi-byte characters 
are complete when the end of a line is reached. I take it that either 
one of my files contains broken UTF-8, or strigi mistakes it for UTF-8 
when it actually isn't.
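
To illustrate the condition the assert appears to guard against, here is 
a minimal standalone sketch. It is not strigi's code: the function name 
findBrokenLineEnd is hypothetical, and the logic is only an assumption 
about what "a line break while bytes are still missing" means. It does 
not check for every kind of invalid UTF-8, only for this one case.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of strigi: returns the byte offset of the
// first CR or LF that arrives while a multi-byte UTF-8 character is still
// incomplete, or -1 if no such line break exists.
long findBrokenLineEnd(const std::string& data) {
    int missingBytes = 0; // continuation bytes still expected
    for (std::size_t i = 0; i < data.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (missingBytes > 0) {
            if ((c & 0xC0) == 0x80) { // valid continuation byte
                --missingBytes;
                continue;
            }
            if (c == '\r' || c == '\n') {
                return static_cast<long>(i); // line ended mid-character
            }
            missingBytes = 0; // some other invalid byte: resynchronise
        }
        // Lead bytes announce how many continuation bytes must follow.
        if ((c & 0xE0) == 0xC0) missingBytes = 1;
        else if ((c & 0xF0) == 0xE0) missingBytes = 2;
        else if ((c & 0xF8) == 0xF0) missingBytes = 3;
    }
    return -1;
}
```

Run over the text extracted from each indexed file, a check like this 
could help narrow down which file triggers the abort.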

Now, I'm wondering, is this something I ought to report as a bug against 
strigi or is the problem with Nepomuk for not logging abnormal 
termination of child processes? Or is it pdftotext for apparently 
producing invalid UTF-8 from a PDF (iconv doesn't complain about it, 
though)?

Michael

-- 
Michael Schuerig
mailto:michael@schuerig.de
http://www.schuerig.de/michael/

