Parsing input text into individual words

Kent · May 28, 2006

I have an input file that contains words. I am supposed to scan this text and save each word into a data structure (not part of my question). My question is: how do I get the words but ignore the symbols? The text contains symbols like - , : ; ( ) _ + and so on. Here is what I have so far:

while (fscanf(fpname, "%s", stun) != EOF)

{

******

****

***

pname= AllocateName( stun);

*******

*****

***

*

}

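/* needs <stdio.h>, <stdlib.h> and <string.h> */
char *AllocateName(char *stun)
{
    char *name;
    char let;
    size_t num;

    num = strlen(stun);
    if (num > 0)
    {
        --num;                  /* index of the last character */
        let = stun[num];
        if (let == '.' || let == ',' || let == ':')
        {
            stun[num] = '\0';   /* cut off the trailing punctuation */
        }
    }

    /* malloc takes a single size argument */
    if (!(name = malloc(strlen(stun) + 1)))
    {
        printf("problem allocating name\n");
        exit(2);
    }

    strcpy(name, stun);         /* copy the word; strcmp would only compare */
    return name;
}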

Yes, yes. It only handles words that end with a comma or a period.

So it is not a perfect solution. There can be words like:

(log)(base4)(12) <-- considered one word

but not this:

I have the utter-most-hatred-this-lab, where "utter", "most", "hatred", "this", "lab" are considered individual words, without the god damn '-'. In other words, if I save "utter-most-hatred-this-lab" as a string in stun, then it must be broken up into some unknown number of pieces, each individually allocated on the heap!

example of the input text:

..........data are used, consisting ......

.................so-called B-trees.............

.................................nodes (leaves)........................

........in O(log(n)) time............................

......................................(Used in internet routers.)...................

The "O(log(n))" is considered to be one word. I am not sure how I should proceed.

Turtle · May 28, 2006

I haven't evaluated your code; however, in your example no word begins with a symbol.

I think you meant 'parsing' not 'pausing', yes? :eek:

Kent · May 28, 2006

Yes.

Do you have any idea how I should proceed? How can I process the input string so that "O(log(n))" is kept as one word, while in "(Used in internet routers.)" each word inside the parentheses becomes an independent string?

Jay-qu · May 28, 2006

What language is this?

Turtle · May 28, 2006

Yes.

Do you have any idea how I should proceed? How can I process the input string so that "O(log(n))" is kept as one word, while in "(Used in internet routers.)" each word inside the parentheses becomes an independent string?

Perhaps by referring to a library of the specified symbols and then specifying that what follows them begins a word unless it too is a symbol? :eek:

PS: That only takes care of the beginning of a word!?

Kent · May 28, 2006

What language is this?

I know C, and a bit of C++ (I am taking it this quarter).

This lab I am doing is data structures using C.

Turtle · May 28, 2006

Also, as I see no contractions, you can make the rule that a word contains no spaces, no "-", no ".", etc., but may contain parentheses?? Hope I help more than hinder. :cup: :eek:

Kent · May 28, 2006

One way is to scanf a string into "stun".

I look for opening parentheses, and when I find one, I push it onto a stack.

When I spot a closing parenthesis, I pop from the stack.

If the stack is empty at the end, I consider the whole god damn stun as one string.

If the stack is not empty, I need to keep only the letters and discard the opening and closing parentheses.

Qfwfq · May 29, 2006

If, by word, you mean consecutive alpha characters, wouldn't it be enough to cycle through the text using a function such as:

int isalpha(char c){return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');}

to find where a word starts and when it has ended?

Of course, if you want "one word" to be one word you can accommodate that too, but I don't see how you could work around cases where something starts with a single " unless you can suppose that it's the last " in the text.
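A minimal sketch of that scan in C, assuming the standard isalpha from <ctype.h> is acceptable in place of a hand-written test. The helper name next_word and its parameters are made up for illustration; it only reports where each run of letters starts and how long it is.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: find the next run of letters in s at or after *pos.
 * On success it stores the run's start index and length, moves *pos just
 * past the run, and returns 1. It returns 0 when no letters remain. */
int next_word(const char *s, size_t *pos, size_t *start, size_t *len)
{
    assert(s != NULL && pos != NULL && start != NULL && len != NULL);
    size_t i = *pos;

    /* skip everything that is not a letter */
    while (s[i] != '\0' && !isalpha((unsigned char)s[i]))
        i++;
    if (s[i] == '\0') {
        *pos = i;
        return 0;
    }

    /* the word runs until the first non-letter (or the end of the string) */
    *start = i;
    while (isalpha((unsigned char)s[i]))
        i++;
    *len = i - *start;
    *pos = i;
    return 1;
}
```

Called in a loop on "so-called B-trees." this yields the spans for so, called, B and trees. It also shows the limitation discussed in this thread: O(log(n)) would come out as three separate runs.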

Qfwfq · May 29, 2006

Wait, :) I can see now it isn't so simple, sorry but I had a bit of trouble with the clarity of your posts.

So, you want O(log(n)) to be handled as one word but not (Used in internet routers.) or happy-go-lucky, how about O(log(n-m))? Perhaps Turtle is right, if there's a space before the '(' then it isn't like a single word. Wouldn't it be enough to have a simple count, rather than a stack, incrementing at '(' and decrementing at ')' so as to know when things like (log)(base4)(12) or O(log(n-m)) have ended?

Kent · May 29, 2006

Wait, :) I can see now it isn't so simple, sorry but I had a bit of trouble with the clarity of your posts.

So, you want O(log(n)) to be handled as one word but not (Used in internet routers.) or happy-go-lucky, how about O(log(n-m))? Perhaps Turtle is right, if there's a space before the '(' then it isn't like a single word.

By design, there is no word like O(log(n-m)), nor any space between the '(' and the next character.

Wouldn't it be enough to have a simple count, rather than a stack, incrementing at '(' and decrementing at ')' so as to know when things like (log)(base4)(12) or O(log(n-m)) have ended?

By design, things like (log)(base4)(12) are considered a single word.

Southtown · May 29, 2006

Feel free to check my JavaScript word counter.

http://st10.startlogic.com/~thedawgs/mostuff/southie/extras/CountWords.html

The "Count Words" button does just that. But the "Clean Text" button only deletes hard returns to allow natural word wrapping and limits returns between paragraphs to two. It doesn't flag special characters, but it does differentiate in a way you can utilize: ASCII codes. Here's a taste.

   // --- validate at least one character
  for ( i = 0; i < text.length; i++ )
  {
     if ( text.charCodeAt( i ) > 32 )
     {
        // --- count first word
        wordCount++;
        break;
     }
  }

So say you were going to pipe each word in a string into individual strings or an array you would do it kinda like:

initialize vars

loop through string

---if charCode > 64 && < 91 || > 96 && < 123 (alpha chars only)

------pipe me into var (put sequential alpha chars into string)

---else (we have a non-alpha char)

------pipe var into array and clear var (save string without this and start over)

repeat

You could be more specific of course and look for individual chars such as spaces or increment a counter or whatever.
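Since the lab is in C, here is a rough C version of the pseudocode above, offered as a sketch only. The name split_alpha is hypothetical, and the code treats every non-letter as a word break, so it does not yet keep something like O(log(n)) together. Each word is copied into its own heap allocation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy every run of letters in text into its own
 * heap-allocated string. Returns a heap-allocated array holding *count
 * strings. Returns NULL with *count == 0 if there are no words or if an
 * allocation fails. The caller frees each string and then the array. */
char **split_alpha(const char *text, size_t *count)
{
    assert(text != NULL && count != NULL);
    char **words = NULL;
    size_t n = 0;
    const char *p = text;

    while (*p != '\0') {
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;                               /* skip non-letters */
        const char *start = p;
        while (isalpha((unsigned char)*p))
            p++;                               /* walk over the word */
        size_t len = (size_t)(p - start);
        if (len == 0)
            break;                             /* only symbols were left */

        char *w = malloc(len + 1);
        char **grown = w ? realloc(words, (n + 1) * sizeof *words) : NULL;
        if (grown == NULL) {                   /* clean up on failure */
            free(w);
            for (size_t i = 0; i < n; i++)
                free(words[i]);
            free(words);
            *count = 0;
            return NULL;
        }
        memcpy(w, start, len);
        w[len] = '\0';
        words = grown;
        words[n++] = w;
    }
    *count = n;
    return words;
}
```

For "utter-most-hatred-this-lab," this gives five separate strings: utter, most, hatred, this and lab.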

Kent · May 29, 2006

If there are words like: (home

the '(' would be deleted, and home would be the word to be allocated.

If it was (home), then (home) would be one word.

Here is my thought (in C).

Let ary be the string with everything in it.

Here is my pseudocode:

1) set ptr to ary
2) loop (*ptr not equal to '\0')
2.1) if *ptr is '(', then push it onto a stack.
2.2) if *ptr is ')', then pop a '(' from the stack
2.3) increment ptr by 1.
3) end loop
4) if (stack is empty)
4.1) set det to ary.
4.2) loop (*det != '\0')
4.2.1) if (isalpha(*det) || *det == '(')
4.2.2) strung[num] = *det
4.2.3) increment num
4.2.4) end if
4.2.5) increment det
4.3) end loop
5) end if

Is this a viable way? Is there a better way to do this?

Southtown · May 29, 2006

Oh sorry. I misunderstood your situation. You could go the long route and attempt to ascertain the purpose of each '(' or ')'... Or you could just flag the 'delimiters', the characters that will always constitute word breaks, such as spaces.

My counter, for example only counts spaces and carriage returns. It would be just as easy, though, to count multiple delimiters at the same time; like spaces, hyphens, commas, and periods. If a character such as a parenthesis does not always act as a delimiter, though, ignore it. The counter would then count ((x)(base10)log(y)) as one word and (this phrase) as two words because of the space alone. Just be careful not to over-increment.

I'm no pro, but I would set an incrementor and a flag: wordCount/doCount or similar. (I hate little names, they confuse me.) You would initialize both of these and then create a flip-flop situation.

wordCount = 0;
doCount = true;

loop through string
if (char == alphabetical) {
  if (doCount == true) {
     wordCount++;
     doCount = false;
  }
}
else if (char == delimiter) { // but ignore parentheses
  doCount = true;
}
end loop

This will catch double counting. You just need to specify your alpha characters and delimiters to define words and word breaks.

wordCount = 0;
doCount = true;
ptr = 0;

// "char" is a reserved word in older JavaScript, so the code goes in "code";
// charCodeAt returns NaN past the end of the string, which ends the loop
while ( code = ary.charCodeAt(ptr) ) {
  if ( code > 64 &&   // between 'A' and
       code < 91 ||   // 'Z' or
       code > 96 &&   // between 'a' and
       code < 123 ) { // 'z'
     if ( doCount ) {
        wordCount++;
        doCount = false;
     }
  }
  else if ( code == 32 ||   // space or
            code == 44 ||   // comma or
            code == 45 ||   // hyphen or
            code == 46 ||   // period or
            code == 58 ) {  // colon
            // and so on (ignore parentheses)
     doCount = true;
  }
  ptr++;
}

This is JavaScript, though. I don't know C very well yet. But the same logic would apply.
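For comparison, the same flip-flop logic could be written in C roughly as follows. This is a sketch, not a tested lab solution: the function name count_words and the exact delimiter set are assumptions, and parentheses are left out of the delimiters on purpose.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Count words: a word starts at the first letter seen after a delimiter.
 * Parentheses are not in the delimiter set, so they never start a new
 * word on their own. */
size_t count_words(const char *text)
{
    assert(text != NULL);
    const char *delims = " \t\r\n,-.:;";   /* assumed delimiter set */
    size_t word_count = 0;
    int do_count = 1;                      /* 1 while waiting for a word */

    for (const char *p = text; *p != '\0'; p++) {
        if (isalpha((unsigned char)*p)) {
            if (do_count) {
                word_count++;
                do_count = 0;
            }
        } else if (strchr(delims, *p) != NULL) {
            do_count = 1;
        }
        /* anything else, such as '(' or a digit, changes nothing */
    }
    return word_count;
}
```

With this delimiter set, "in O(log(n)) time" counts as three words and "(this phrase)" as two.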

Southtown · May 31, 2006

Sorry, I misunderstood again. I'm slow. You're not counting words at all, are you? The good news is that in C a char is treated as a small integer (US-ASCII codes on most systems). :phones:

Kent · May 31, 2006

Well, it turns out the solution is ridiculously simple by "design". All I had to do was check the front and back of a string for ( and ).
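The actual code was not posted, so the following is only one possible reading of "check the front and back", written as a C sketch. The names paren_net and trim_token are hypothetical, and the rule used here is an assumption: a '(' at the front or a ')' at the back is dropped only when it has no partner inside the same token. Trailing '.', ',', ':' and ';' are dropped too. Hyphenated words are not handled here.

```c
#include <assert.h>
#include <string.h>

/* Opens minus closes: > 0 means an unmatched '(', < 0 an unmatched ')'. */
static int paren_net(const char *s)
{
    int net = 0;
    for (; *s != '\0'; s++) {
        if (*s == '(')
            net++;
        else if (*s == ')')
            net--;
    }
    return net;
}

/* Hypothetical helper: trim one whitespace-delimited token in place. */
void trim_token(char *tok)
{
    assert(tok != NULL);
    size_t len = strlen(tok);

    /* back: drop trailing punctuation and any ')' without a partner */
    while (len > 0) {
        char last = tok[len - 1];
        if (strchr(".,:;", last) != NULL ||
            (last == ')' && paren_net(tok) < 0))
            tok[--len] = '\0';
        else
            break;
    }

    /* front: drop each leading '(' that has no partner */
    size_t skip = 0;
    while (tok[skip] == '(' && paren_net(tok + skip) > 0)
        skip++;
    memmove(tok, tok + skip, len - skip + 1);   /* +1 keeps the '\0' */
}
```

Under these assumptions "(Used" becomes "Used" and "routers.)" becomes "routers", while "O(log(n))" and "(leaves)" are left untouched.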
