Re: OT: strip hebrew vowels and accents from utf-8 text



On Thu, 5 Nov 2009 14:50:56 +1030
David Purton <dcpurton@marshwiggle.net> wrote:

> Can anyone suggest a simple way to strip vowels out of utf-8 encoded
> Hebrew text, leaving just the consonants?
> 
> i.e., given something like בָָּ֟֟רָא, pipe it through something so that the
> output is ברא. The Unicode characters <U+0591> to <U+05C7> ideally
> should be stripped. This includes accents, etc.

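#!/usr/bin/perl

use strict;
use warnings;
use Encode;

# Filter: reads UTF-8 from stdin or from files named on the command line.
while (<>) {
	# Decode the UTF-8 bytes into characters so the regex matches code points.
	$_ = Encode::decode('utf-8', $_);
	# Delete every character from U+0591 through U+05C7 (accents, vowel points, etc.).
	s/[\x{0591}-\x{05C7}]//g;
	# Re-encode the result as UTF-8 bytes for output.
	print Encode::encode('utf-8', $_);
}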

This works (tested on your example, and on a sample from here:
http://www.mechon-mamre.org/c/ct/c0101.htm).
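
One thing to be aware of: the range U+0591 to U+05C7 also contains a few punctuation characters that are not combining marks, such as the maqaf (U+05BE), so stripping the whole range joins maqaf-linked words together. If that matters, here is a rough sketch of both behaviours side by side. The sub names strip_hebrew_range and strip_hebrew_marks are my own, made up for illustration, and the second sub assumes that "nonspacing marks only" is what you want; I have not tried it on real text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every character in U+0591..U+05C7,
# as the script above does. Expects an already-decoded string.
sub strip_hebrew_range {
    my ($text) = @_;
    $text =~ s/[\x{0591}-\x{05C7}]//g;
    return $text;
}

# Hypothetical variant: remove only the nonspacing marks (\p{Mn}) inside
# that range, so punctuation such as the maqaf (U+05BE) is kept.
sub strip_hebrew_marks {
    my ($text) = @_;
    $text =~ s/(?=[\x{0591}-\x{05C7}])\p{Mn}//g;
    return $text;
}

# Encode output as UTF-8 through an I/O layer instead of calling Encode.
binmode(STDOUT, ':encoding(UTF-8)');

# bet + qamats + dagesh, resh + qamats, alef (written with escapes so the
# source file itself stays plain ASCII).
my $word = "\x{05D1}\x{05B8}\x{05BC}\x{05E8}\x{05B8}\x{05D0}";
print strip_hebrew_range($word), "\n";    # prints ברא

# kaf + qamats, lamed, maqaf, he
my $joined = "\x{05DB}\x{05B8}\x{05DC}\x{05BE}\x{05D4}";
print strip_hebrew_range($joined), "\n";  # maqaf removed
print strip_hebrew_marks($joined), "\n";  # maqaf kept
```

To use either sub as a filter, add binmode(STDIN, ':encoding(UTF-8)') and call it on each line read from STDIN.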

Celejar
-- 
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator