Naive text cleaning
Text is unstructured data from which we can extract entities. Assume that the text is an arbitrary sequence of words, punctuation and noise. A really naive text cleaner is shown below.
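import java.util.regex.Pattern;

public class NaiveCleaning {
	
	public static void main(String[] args) {
		String s = clean("<title>Hello :), %world!</title>");
		System.out.println(s);
		// Prints (quotes added here to show the whitespace):
		// " Hello   ,  world! "
	}
	
	static String clean(String s) {
		// p1 matches simple tags without attributes, e.g. <title> or </title>
		String p1 = "<[/a-zA-Z]+>";
		// p2 matches every character except letters, digits and . , !
		String p2 = "[^.,!a-zA-Z0-9]";
		// Replace tags first, then every remaining noise character, with a space
		s = Pattern.compile(p1).matcher(s).replaceAll(" ");
		return Pattern.compile(p2).matcher(s).replaceAll(" ");
	}
	
}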
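The cleaner replaces each removed character with a space, so the result keeps runs of spaces and a leading and trailing space. A possible follow-up, sketched here under the assumption that whitespace runs carry no meaning for later processing, is to collapse them and trim the ends. The class name NormalizingCleaner and the extra SPACES pattern are hypothetical additions for illustration; the two other patterns are the ones from the naive cleaner.

```java
import java.util.regex.Pattern;

class NormalizingCleaner {

	// Same two patterns as in the naive cleaner, compiled once and reused.
	private static final Pattern TAGS = Pattern.compile("<[/a-zA-Z]+>");
	private static final Pattern NOISE = Pattern.compile("[^.,!a-zA-Z0-9]");
	// Assumption (not in the original cleaner): whitespace runs can be collapsed.
	private static final Pattern SPACES = Pattern.compile("\\s+");

	static String clean(String s) {
		s = TAGS.matcher(s).replaceAll(" ");   // drop simple tags
		s = NOISE.matcher(s).replaceAll(" ");  // drop noise characters
		// Collapse every run of whitespace into one space, then trim the ends.
		return SPACES.matcher(s).replaceAll(" ").trim();
	}

	public static void main(String[] args) {
		System.out.println(clean("<title>Hello :), %world!</title>"));
		// Prints: Hello , world!
	}
}
```

Note that the space before the comma survives, because the smiley that stood between "Hello" and the comma is turned into whitespace rather than deleted.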
May 11th, 2011
Moi krug - Yernat Assanov
(C) 2010, kseeker
Email: kseeker@yandex.kz