
Re: coreutils wc count multi bytes question



Hi,

I am a Japanese speaker, and "this is a 文件 vi 打的" looks like 6 words to me.

  (Please note that both Japanese and Chinese rarely use spaces as
   word separators, so this should be parsed by English syntax.  Also,
   Japanese tends to treat a pair of Kanji as a word.  For example,
   "Kanji" is 漢字.)
  
I counted it with my brain, so I think it should be correct :)

Seriously, the question is: what do you want to do?

On Sat, Feb 07, 2009 at 01:16:56AM +0100, Samuel Thibault wrote:
> Neo Anderson, on Fri 06 Feb 2009 15:50:34 -0800, wrote:
> > If I remember correctly that there is a mapping table, so possibly
> > this can be done. But of course, perhaps this is just my wishful
> > thinking.
> 
> The problem is that posix says
> 
> `The wc utility shall consider a word to be a non-zero-length string
> of characters delimited by white space.'
> 
> So that the -w behavior can't be changed, that'd need to be another
> option.  

I agree that it cannot be changed.
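
As a rough illustration of the POSIX rule quoted above, here is a minimal
Python sketch that counts maximal runs of non-whitespace characters.  It
is not how wc is implemented; posix_word_count is a name made up for this
example, and Python's notion of white space is only assumed to be close to
what wc uses in a given locale.

```python
def posix_word_count(text: str) -> int:
    """Count non-empty strings of characters delimited by white space."""
    # str.split() with no arguments splits on runs of whitespace and
    # drops empty strings, which approximates the rule quoted above.
    # Assumption: Python's whitespace set may differ from the locale's.
    return len(text.split())


sample = "this is a 文件 vi 打的"
print(posix_word_count(sample))  # prints 6
```

Under this rule the sample has 6 words only because its author happened
to put spaces around the Chinese parts.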

How a human counts words depends on the grammar of the language in
question.  I am not sure it is worth adding such complexity to a simple
base tool such as wc.

Having another tool that counts words the way a human does may be
interesting in itself.  It cannot be simple Chinese character counting
and space checking.  It would be a utility that is part of a
morphological analysis system.  I see the following in our archive.

 chasen
 juman
 mecab
 lttoolbox
 ...

They may already have such a thing, or provide a base for one.
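
To show why simple character counting and space checking is not enough,
here is a small Python sketch comparing two naive rules.  The helper
names is_han and naive_han_count are made up for this example, and the
check covers only the basic CJK Unified Ideographs block (U+4E00 to
U+9FFF), which is an assumption made to keep it short.  This is not
morphological analysis, only a demonstration of the ambiguity.

```python
def is_han(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def naive_han_count(text: str) -> int:
    """Count every ideograph as one word, other tokens by white space."""
    count = 0
    for token in text.split():
        han = sum(1 for ch in token if is_han(ch))
        # Each ideograph counts as a word; any remaining non-ideograph
        # characters in the same token count as one more word.
        count += han + (1 if han < len(token) else 0)
    return count


sample = "this is a 文件 vi 打的"
print(len(sample.split()))      # white space rule: prints 6
print(naive_han_count(sample))  # one word per ideograph: prints 8
print(naive_han_count("漢字"))  # prints 2, although I read it as one word
```

Neither rule knows that 漢字 is one word, which is why a dictionary-based
morphological analyzer is needed for counting like a human.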

I hope this may help.

Osamu

