[Popcon-developers] invalid UTF-8 in by_inst



On Tue, Feb 03, 2015 at 11:08:03PM +0100, Bill Allombert wrote:
> On Tue, Feb 03, 2015 at 12:21:16PM +0800, Paul Wise wrote:
> > Hi,
> > 
> > The current by_inst has invalid UTF-8 on lines 96386 and 144364, causing
> > the qa.d.o code that consumes these files to crash. I'll add a fix for
> > this in the qa.d.o code but it would be nice to have the popcon service
> > cope with very invalid data being sent to it.
> 
> It is not quite easy to fix this on the popcon server-side because broken
> reports are still signal (potentially).
> 
> In general, popcon data should be considered as untrusted and any program
> processing them should be ready to handle anything.
> 
> One major cause of broken reports is the lack of checksum.
> Hopefully with the generalization of encrypted reports (which are checksummed),
> this issue will be solved.

In this instance, the issue was due to a corrupted report, which I have removed.
It was probably corrupted in transit; a stronger checksum than the TCP checksum
should have caught it.

However, a lot of the reports I receive are not valid UTF-8, so I cannot simply
discard all such reports. This is due to the filenames that appear in the reports:
they are not always encoded in UTF-8. For example, aspell-es includes the file
/usr/lib/aspell/español.alias. In older versions of the package, the name is
encoded in latin1 instead of UTF-8.
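
For consumers of by_inst, one way to cope with such names is to try UTF-8 first
and fall back to latin1. The following is only a minimal sketch in Python, not
what popcon or qa.d.o actually does; the function name decode_report_line is
hypothetical, and the latin1 fallback is an assumption that will mislabel names
written in any other legacy encoding.

```python
def decode_report_line(raw: bytes) -> str:
    """Decode one line of popcon data without ever raising on bad bytes."""
    try:
        # Most reports are valid UTF-8, so try that first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: legacy file names are latin1. latin1 maps every byte
        # to a code point, so this fallback cannot fail, even on a line
        # that was corrupted in transit.
        return raw.decode("latin-1")


# The same file name as shipped by newer and older package versions.
utf8_name = "/usr/lib/aspell/español.alias".encode("utf-8")
latin1_name = "/usr/lib/aspell/español.alias".encode("latin-1")

print(decode_report_line(utf8_name))
print(decode_report_line(latin1_name))
```

Both calls should give the same name. A corrupted line still decodes to some
string, so the consumer keeps running and can decide later whether to drop it.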

In any case, thanks for letting us know about the broken report!
-- 
Bill. <ballombe at debian.org>

Imagine a large red swirl here. 


