Naive text cleaning
Text is unstructured data from which we can extract entities.
Assume that the text is a random sequence of words, punctuation and noise.
A really naive text cleaner follows.
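import java.util.regex.Pattern;

public class NaiveCleaning {
    public static void main(String[] args) {
        String s = clean("<title>Hello :), %world!</title>");
        System.out.println(s);
        // Prints (every removed tag or character becomes one space):
        // " Hello   ,  world! "
    }

    static String clean(String s) {
        // Matches simple tags such as <title> or </title> (no attributes).
        String p1 = "<[/a-zA-Z]+>";
        // Matches anything that is not a letter, a digit, '.', ',' or '!'.
        String p2 = "[^\\.,!a-zA-Z0-9]";
        s = Pattern.compile(p1).matcher(s).replaceAll(" ");
        return Pattern.compile(p2).matcher(s).replaceAll(" ");
    }
}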
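The cleaner above replaces every removed tag or character with a space, so its result keeps runs of spaces as well as leading and trailing ones. A minimal sketch of one possible follow-up step, assuming we want single spaces and trimmed ends; the class name NormalizedCleaning and the method normalize are hypothetical and not part of the original listing:

```java
import java.util.regex.Pattern;

public class NormalizedCleaning {
    // Precompiled once so the patterns are not recompiled on every call.
    private static final Pattern TAGS = Pattern.compile("<[/a-zA-Z]+>");
    private static final Pattern NOISE = Pattern.compile("[^.,!a-zA-Z0-9]");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Same two replacements as the naive cleaner.
    public static String clean(String s) {
        s = TAGS.matcher(s).replaceAll(" ");
        return NOISE.matcher(s).replaceAll(" ");
    }

    // Hypothetical extra step: collapse whitespace runs and trim the ends.
    public static String normalize(String s) {
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize(clean("<title>Hello :), %world!</title>")));
        // Prints: Hello , world!
    }
}
```

The space before the comma survives because the smiley next to it was turned into spaces; a less naive cleaner would need extra rules for punctuation.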
May 11th, 2011