SOLVED Exclude set of keywords from string (using Regex)
Hello community!
I am currently using this piece of code
import System.Text.RegularExpressions;
var theString : String = "word1 and word3 in the word6 blah blah";
var editString : String = Regex.Replace(theString, "( and )+", " ");
editString = Regex.Replace(editString, "( in )+", " ");
editString = Regex.Replace(editString, "( the )+", " ");
editString = Regex.Replace(editString, "( )+", " ");
in order to exclude some common words from a string, which I then split at the white spaces to get an array of the words. Despite trying several different syntaxes (and doing some research), I couldn't figure out how to combine the above "Replace" lines (at least the first three) into one. Is it possible? As the code sample indicates, I am using Unity JavaScript and the Regex class from System.Text.RegularExpressions. Suggestions to optimize the method in general are welcome, of course.
EDIT: Just noticed that the ( keyword )+ method will replace only the first match, so please let me know how I would be able to replace all the matches in the string.
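A possible explanation for the behaviour described in the EDIT above, sketched in Python rather than UnityScript (the pattern syntax is the same; the helper name strip_padded is made up for illustration): a replace call does process every match, but each match consumes its surrounding spaces, so a keyword that immediately follows another one no longer has a leading space left to match.

```python
import re

def strip_padded(text, word):
    # Same shape as the "( keyword )+" pattern from the question:
    # the keyword must have a literal space on both sides.
    return re.sub("( " + re.escape(word) + " )+", " ", text)

# The first " and " match consumes the space that the second "and"
# would need in front of it, so the second occurrence is left behind.
result = strip_padded("a and and b", "and")
print(result)  # a and b
```

This is only a sketch of the overlap effect; it assumes the .NET engine scans left to right over non-overlapping matches in the same way.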
EDIT2: My actual goal is to create a keyword search method, where from a string input, which represents several words, I get every single keyword in a different string (so let's say an array of the substring keywords), excluding some predefined terms. I don't necessarily want to use regular expressions, but I thought it would be the most straightforward way to do it. I am now thinking that I might create the array with ALL the keyword substrings first and then edit this array to remove unwanted inputs...I'll give it a try, but if a regex could do the job, I'd be pleased to learn something new! :)
FINAL EDIT: Note that UnityScript will not accept a single backslash in the pattern string, so it needs a second backslash. In another case, for example, "\\n" would be needed instead of "\n" to represent the line break character.
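The split-first idea from EDIT2 can be sketched without any regex. This is a minimal Python illustration, not UnityScript; the names STOP_WORDS and extract_keywords are hypothetical, and the three excluded words are just the ones from the question.

```python
# Hypothetical list of common words to exclude (taken from the question).
STOP_WORDS = {"and", "in", "the"}

def extract_keywords(text):
    # split() with no arguments splits on any run of whitespace,
    # so repeated spaces do not produce empty entries.
    words = text.split()
    # Keep every word that is not in the exclusion set (case-insensitive).
    return [w for w in words if w.lower() not in STOP_WORDS]

keywords = extract_keywords("word1 and word3 in the word6 blah blah")
print(keywords)  # ['word1', 'word3', 'word6', 'blah', 'blah']
```

Because each word is compared as a whole, consecutive common words ("in the") need no special handling here.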
Answer by Alanisaac · Feb 11, 2015 at 01:39 AM
I think I've got this right, but I'm by no means a regex expert, and I also am not sure on what exactly you want.
I do know that you can combine different regex matches using a group with the alternation symbol, "|". I also think you might want to use something other than just the space character to represent word boundaries: "\W" matches any non-word character. On top of that, you might need the start-of-line and end-of-line anchors ("^" and "$") for when one of the words starts or ends a sentence. And what if the words appear in sequence ("...in the...")? You might also want to match a single space character inside your group. "+" gets you one or more of the preceding expression, which will be useful for all of these elements. Put that all together, and you get something like:
(\W|^)+(and|in|the| )+(\W|$)+
Check out this link, where I tested the Regex out.
In implementation, the backslash "\" character is actually the escape character. So you'll need to escape it by using a second "\" character, such as the following:
var editKeyword : String = Regex.Replace(newKeyword, "(\\W|^)+(and|in|the| )+(\\W|$)+", " ");
Edit: Revised to include the final scripted version.
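For comparison, here is the same pattern tried out in Python (an assumption for illustration only; the thread itself uses UnityScript with the .NET Regex class, and the names below are made up). It shows the two points from the answer: the doubled backslash is just string escaping, and one alternation group can stand in for the separate Replace calls.

```python
import re

# Raw string: backslashes reach the regex engine untouched.
PATTERN_RAW = r"(\W|^)+(and|in|the| )+(\W|$)+"
# Ordinary string: each backslash is doubled, as in the UnityScript version.
PATTERN_ESCAPED = "(\\W|^)+(and|in|the| )+(\\W|$)+"

def strip_common_words(text):
    # Replace every run of the listed words (plus the non-word
    # characters around it) with a single space.
    return re.sub(PATTERN_RAW, " ", text)

cleaned = strip_common_words("word1 and word3 in the word6 blah blah")
keywords = cleaned.split()
print(keywords)  # ['word1', 'word3', 'word6', 'blah', 'blah']
```

Both pattern constants hold the identical string, so either spelling gives the same result in this sketch.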
Thank you alfalfasprout for your quick and analytic reply! I am also sorry for my late response, but it was getting late here and I went to sleep. In my defence, the reason I posted is that I have tried several expressions that I found and that are supposed to do this task, but my actual problem is probably that I don't know how to use them in the Replace function. As with other expressions I tried, if I write
var editKeyword : String = Regex.Replace(newKeyword, '(\W|^)+(and|in|the|)+(\W|$)+', " ");
the compiler complains about "unexpected char: 'W'.". You're absolutely right, though, that I should mention my exact goal: from one string, which represents several words (separated by white space), I'd like to get an array of the words it contains, excluding predefined words (which are common words and not specific terms). It's a keyword search, after all. Regarding the common words in sequence, this is the reason I replace each matched word with a white space, so that the following Replace call can still find substrings surrounded by white spaces. The method I use is rubbish, though, and I only posted it in order to explain myself better. In conclusion, any thoughts on the unexpected character error? By the way, the site you posted is very neat! Thanks again.
Not sure if this will also work in UnityScript vs JavaScript, but since it's a Regex, try writing it between forward slashes.
var regex = /(\W|^)+(and|in|the|)+(\W|$)+/;
var editKeyword : String = Regex.Replace(newKeyword, regex, " ");
Thanks alfalfasprout for helping me out! I cannot use the
var regex = /(\W|^)+(and|in|the|)+(\W|$)+/;
line as is (unexpected token :/, unexpected character ), and if I quote the regex expression instead, it returns the same error as before: unexpected char: W. I also tried to find out how to create a new Regex pattern, but I couldn't figure it out from the autocomplete options suggested when I type new Regex. in MonoDevelop.
Using a double backslash before W ("\\W") does not produce any errors, but I messed up my code trying to fix something else and I haven't run it yet. I'll report back as soon as I can.
Well, unfortunately, the \\ would not help either. There are no errors, but nothing is replaced... :(