
josephy02/NLP-Basics

 
 

NLP Basics

Step 1: Tokenization

Convert "the cat sat on the mat" to ["the", "cat", "sat", "on", "the", "mat"].

Things that may need to be considered: upper vs. lower case ("Apple" or "apple"), stop words ("the", "a", "of", etc.), and typo correction ("gooood" to "good").
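As a rough sketch of the tokenization step, the snippet below splits on letters, lowercases, and optionally drops stop words. The `tokenize` function and the tiny `STOP_WORDS` set are hypothetical names made up for illustration, and typo correction is not covered.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "of"}

def tokenize(text, lowercase=True, remove_stop_words=False):
    """Split text into word tokens, optionally lowercasing and dropping stop words."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters (and apostrophes) as tokens; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z']+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

print(tokenize("the cat sat on the mat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(tokenize("The cat sat on the mat", remove_stop_words=True))
# ['cat', 'sat', 'on', 'mat']
```

Whether to lowercase or remove stop words depends on the task, which is why both are left as switches here.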

Step 2: Build Dictionary

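Assign each unique word an index. The repeated "the" gets a single entry:

{"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

Step 3: One-Hot Encoding

With the 5 words in the dictionary above, "the" (index 1) becomes [1, 0, 0, 0, 0]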
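Steps 2 and 3 can be sketched together in plain Python. This assumes that each unique word gets exactly one index (so a repeated word such as "the" maps to a single entry) and that indices start at 1; `build_dictionary` and `one_hot` are hypothetical helper names.

```python
def build_dictionary(tokens):
    """Map each unique token to an index starting at 1, in order of first appearance."""
    dictionary = {}
    for token in tokens:
        if token not in dictionary:
            dictionary[token] = len(dictionary) + 1
    return dictionary

def one_hot(index, vocab_size):
    """Return a vector of length vocab_size with a 1 at position index (1-based)."""
    vector = [0] * vocab_size
    vector[index - 1] = 1
    return vector

tokens = ["the", "cat", "sat", "on", "the", "mat"]
dictionary = build_dictionary(tokens)
print(dictionary)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(one_hot(dictionary["the"], len(dictionary)))
# [1, 0, 0, 0, 0]
```

The one-hot vector is as long as the dictionary, so it has 5 entries here even though the sentence has 6 tokens.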

Step 4: Align Sequences

Align all sample sentences to the same length / number of tokens. Perform zero padding on sentences that are too short by filling missing words with 0's.

Screen Shot 2023-08-19 at 11 06 34 AM

Screen Shot 2023-08-19 at 11 06 04 AM
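A minimal sketch of the alignment step is shown below. It assumes zeros are appended at the end of short sequences and that sequences longer than the target length are cut at the end; both choices are assumptions, and libraries may pad or truncate at the front instead. `pad_sequences` here is a hand-written helper, not a library call.

```python
def pad_sequences(sequences, length, pad_value=0):
    """Pad short sequences with pad_value and truncate long ones to a fixed length."""
    aligned = []
    for seq in sequences:
        seq = list(seq)[:length]                       # cut sequences that are too long
        seq = seq + [pad_value] * (length - len(seq))  # fill the rest with zeros
        aligned.append(seq)
    return aligned

print(pad_sequences([[1, 2, 3, 4, 1, 5], [1, 2, 3]], length=5))
# [[1, 2, 3, 4, 1], [1, 2, 3, 0, 0]]
```

Using 0 for padding is one reason dictionary indices often start at 1: index 0 then never collides with a real word.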

Step 5: Word Embedding

Screen Shot 2023-08-19 at 11 10 13 AM

v is the number of words in the dictionary, which is also the length of a one-hot encoded vector

d is the dimension of the vector that represents a word

Multiplying the embedding matrix by a one-hot vector selects out the vector of that specific word
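To see why the multiplication acts as a lookup, here is a small sketch with v = 5 and d = 3. It assumes the embedding matrix is stored as v rows of d values (one row per word), so a 1 x v one-hot vector times the v x d matrix returns one row. The numbers are made up; in a real model they are learned during training.

```python
def row_times_matrix(row_vector, matrix):
    """Multiply a 1 x v row vector by a v x d matrix, giving a list of d values."""
    d = len(matrix[0])
    return [sum(row_vector[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(d)]

# Hypothetical embedding matrix: v = 5 words, d = 3 dimensions.
embedding = [
    [0.1, 0.2, 0.3],  # word 1
    [0.4, 0.5, 0.6],  # word 2
    [0.7, 0.8, 0.9],  # word 3
    [1.0, 1.1, 1.2],  # word 4
    [1.3, 1.4, 1.5],  # word 5
]

one_hot_word_2 = [0, 1, 0, 0, 0]
word_vector = row_times_matrix(one_hot_word_2, embedding)
print(word_vector)
# [0.4, 0.5, 0.6]
```

Because every term except one is multiplied by 0, the result is simply row 2 of the matrix, which is why implementations usually index the row directly instead of multiplying.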

Screen Shot 2023-08-19 at 12 20 01 PM

Screen Shot 2023-08-19 at 12 38 30 PM

Simple RNN

Screen Shot 2023-08-19 at 5 19 49 PM

An RNN model has only one parameter matrix A, which is reused at every time step. The values in A are initialized randomly and then adjusted during training.

Screen Shot 2023-08-19 at 5 21 03 PM Screen Shot 2023-08-19 at 5 27 12 PM
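The sketch below illustrates the point that a single A is shared across all time steps. It assumes the common formulation h_t = tanh(A · [h_{t-1}; x_t]), where the previous state and the current word vector are concatenated, and it leaves out the bias term and training. The sizes and the input vectors are hypothetical.

```python
import math
import random

def rnn_step(A, h_prev, x):
    """One step: h_t = tanh(A . [h_prev; x]), where [;] is concatenation."""
    concat = h_prev + x
    return [math.tanh(sum(w * v for w, v in zip(row, concat))) for row in A]

random.seed(0)
state_size, input_size = 2, 3

# A is created once, with random values, and reused at every time step.
A = [[random.uniform(-0.5, 0.5) for _ in range(state_size + input_size)]
     for _ in range(state_size)]

h = [0.0] * state_size
# Hypothetical word vectors standing in for an embedded sentence.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for x in sequence:
    h = rnn_step(A, h, x)  # the same A is applied to every word

print(h)
```

A has state_size rows and state_size + input_size columns, so its size does not depend on how long the sentence is.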

LSTM

Conveyor Belt: information directly flows from the past to the future

Screen Shot 2023-08-19 at 5 57 47 PM

Forget Gate:

Screen Shot 2023-08-19 at 5 32 46 PM

For example, if a = [1, 3, 0, -2], we get:

Screen Shot 2023-08-19 at 5 33 49 PM

Screen Shot 2023-08-19 at 5 35 22 PM

Screen Shot 2023-08-19 at 5 53 45 PM

In the above, 0.2 will not go through because it is matched with (multiplied by) 0. -0.5 will fully go through because it is matched with 1.

In a real model, the gate values are not given as fixed numbers; they are computed from the previous state and the current word:
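The gate mechanics can be reproduced in a few lines. This sketch assumes the example applies sigmoid to a, and that the gate is an elementwise multiplication. Only the pairs (0.2, 0) and (-0.5, 1) come from the text; the other values in `c` and `f` are hypothetical fillers.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

a = [1, 3, 0, -2]
print([round(sigmoid(z), 3) for z in a])
# [0.731, 0.953, 0.5, 0.119]

# Elementwise gating: each value in c is scaled by the matching value in f.
c = [0.9, 0.2, -0.5, -0.1]   # conveyor belt values
f = [0.5, 0.0, 1.0, 0.8]     # gate values between 0 and 1
gated = [ci * fi for ci, fi in zip(c, f)]
print([round(v, 2) for v in gated])
# [0.45, 0.0, -0.5, -0.08]
```

A gate value of 0 blocks the matching entry completely, 1 lets it pass unchanged, and anything in between lets part of it through.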

Screen Shot 2023-08-19 at 6 03 50 PM

The vector of the previous state is concatenated with the current word's vector, multiplied by a weight matrix, and then passed through the sigmoid activation function to become f_t.
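A minimal sketch of that computation follows, assuming no bias term. `forget_gate`, `W_f`, and all the numbers are hypothetical and chosen only so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, h_prev, x):
    """f_t = sigmoid(W_f . [h_prev; x]), with [;] meaning concatenation."""
    concat = h_prev + x
    return [sigmoid(sum(w * v for w, v in zip(row, concat))) for row in W_f]

# Hypothetical weights: 2 state units, 2 input dimensions (so 4 columns).
W_f = [
    [0.5, -0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
h_prev = [1.0, 1.0]   # previous state
x = [2.0, 0.0]        # current word vector

f_t = forget_gate(W_f, h_prev, x)
print([round(v, 3) for v in f_t])
# [0.881, 0.5]
```

The first row sums to 2.0 before the sigmoid and the second to 0, which gives roughly 0.881 and exactly 0.5. Every entry of f_t lies between 0 and 1, which is what lets it act as a gate.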

Input Gate: decides which values of the conveyor belt to update

Screen Shot 2023-08-19 at 5 55 14 PM

There are two sets of operations here: one with sigmoid, which gives i_t, and one with tanh, which gives the candidate values c̃_t.

Now that we have everything, we can find the new conveyor belt value C_t.

Screen Shot 2023-08-19 at 6 10 39 PM
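Putting the forget gate and input gate together, the conveyor belt update can be sketched as below. It assumes the standard LSTM update C_t = f_t * C_{t-1} + i_t * c̃_t (elementwise), with no bias terms. The helper names and weight matrices are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, concat, activation):
    """Apply a weight matrix to the concatenated vector, then an activation."""
    return [activation(sum(w * v for w, v in zip(row, concat))) for row in W]

def update_conveyor_belt(W_f, W_i, W_c, h_prev, x, C_prev):
    """C_t = f_t * C_prev + i_t * c_tilde, all elementwise."""
    concat = h_prev + x
    f_t = layer(W_f, concat, sigmoid)        # forget gate: what to keep
    i_t = layer(W_i, concat, sigmoid)        # input gate: what to update
    c_tilde = layer(W_c, concat, math.tanh)  # candidate values to add
    return [f * c + i * n for f, c, i, n in zip(f_t, C_prev, i_t, c_tilde)]

# Hypothetical all-zero weights make the result easy to verify by hand:
# both sigmoid gates output 0.5 and the tanh candidates are 0.
zeros = [[0.0] * 4 for _ in range(2)]
C_t = update_conveyor_belt(zeros, zeros, zeros,
                           h_prev=[1.0, 1.0], x=[2.0, 0.0], C_prev=[0.2, -0.5])
print(C_t)
# [0.1, -0.25]
```

Each of the three layers has its own weight matrix, but all of them read the same concatenated vector.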

Output Gate: decides what flows from the conveyor belt to the state

Screen Shot 2023-08-19 at 6 13 14 PM

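o_t is calculated in exactly the same way as the previous gates: the concatenated vector is multiplied by the gate's own weight matrix and passed through sigmoid.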

Screen Shot 2023-08-19 at 6 14 46 PM
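The last step can be sketched as follows, assuming the standard LSTM state update h_t = o_t * tanh(C_t) (elementwise) and no bias term. `output_state` and `W_o` are hypothetical names, and the numbers are chosen for easy checking.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_state(W_o, h_prev, x, C_t):
    """h_t = o_t * tanh(C_t), where o_t = sigmoid(W_o . [h_prev; x])."""
    concat = h_prev + x
    o_t = [sigmoid(sum(w * v for w, v in zip(row, concat))) for row in W_o]
    return [o * math.tanh(c) for o, c in zip(o_t, C_t)]

# Hypothetical all-zero weights, so every output gate value is 0.5.
W_o = [[0.0] * 4 for _ in range(2)]
h_t = output_state(W_o, h_prev=[1.0, 1.0], x=[2.0, 0.0], C_t=[0.0, 1.0])
print([round(v, 3) for v in h_t])
# [0.0, 0.381]
```

The new state h_t is what the LSTM outputs at this step, and it is also fed into the next step together with C_t.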

About

Text generation using TensorFlow
