How does a Keras 1D convolution layer work with word embeddings in a text classification problem? (Filters, kernel size, and other hyperparameters)

I am currently developing a text classification tool using Keras. It works fine (I got up to 98.7% validation accuracy), but I can’t wrap my head around how exactly a 1D convolution layer works with text data.

What hyper-parameters should I use?

I have the following sentences (input data):

Maximum number of words in a sentence: 951 (shorter sentences are padded)
Vocabulary size: ~32000
Number of sentences (for training): 9800
embedding_vecor_length: 32 (the dimension of each word's embedding vector)
batch_size: 37 (it doesn’t matter for this question)
Number of labels (classes): 4

It’s a very simple model (I have built more complicated architectures but, strangely, this one works better, even without an LSTM):

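from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# top_words, embedding_vecor_length, max_review_length and labels_count
# hold the values listed above (~32000, 32, 951 and 4).
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(labels_count, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None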

My main question is: what hyper-parameters should I use for the Conv1D layer?

model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))

If I have the following input data:

Max word count: 951
Word-embeddings dimension: 32

Does it mean that filters=32 will only scan the first 32 words, completely discarding the rest (with kernel_size=2)? And should I set filters to 951 (the maximum number of words in a sentence)?

Examples on images:

So, for instance, this is the input data: http://joxi.ru/krDGDBBiEByPJA

This is the first step of a convolution layer (stride 2): http://joxi.ru/Y2LB099C9dWkOr

This is the second step (stride 2): http://joxi.ru/brRG699iJ3Ra1m

And if filters = 32, the layer repeats this 32 times? Am I correct?
So it will never reach, say, the 156th word in the sentence, and that information will be lost?

Answers:

Method 1

I will try to explain how 1D convolution is applied to sequence data. I use the example of a sentence consisting of words, but it is not specific to text data; it works the same way with other sequence data and time series.

Suppose we have a sentence consisting of m words, where each word is represented by its word-embedding vector, so the whole sentence is a matrix with m rows and one column per embedding dimension.

Now we would like to apply a 1D convolution layer consisting of n different filters with a kernel size of k to this data. To do so, sliding windows of length k are extracted from the data, and then each filter is applied to each of those extracted windows. In the illustration I have assumed k=3 and removed the bias parameter of each filter for simplicity.

As you can see in the figure above, the response of each filter is the result of its convolution (i.e. element-wise multiplication followed by summing all the results) with the extracted window of length k (i.e. the i-th to (i+k-1)-th words of the given sentence). Further, note that each filter has the same number of channels as the number of features (i.e. the word-embedding dimension) of the training sample, which is what makes the convolution (i.e. the element-wise multiplication) possible. Essentially, each filter detects the presence of a particular feature or pattern in a local window of the training data (e.g. whether a couple of specific words occur in this window or not). After all the filters have been applied to all the windows of length k, we have the result of the convolution: one response per filter for each window.
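To make the multiply-and-sum step concrete, here is a minimal NumPy sketch of the same computation written as explicit loops. It is only an illustration, not how Keras implements it: the function name conv1d_valid and the (n, k, emb_dim) filter layout are my own choices for readability, the weights are random, and no activation is applied.

```python
import numpy as np

def conv1d_valid(sentence, filters, bias=None, stride=1):
    """Naive 1D convolution without padding (i.e. padding='valid').

    sentence: array of shape (m, emb_dim), one row per word
    filters:  array of shape (n, k, emb_dim), one (k, emb_dim) kernel per filter
              (hypothetical layout chosen for this sketch)
    returns:  array of shape (number_of_windows, n)
    """
    m, emb_dim = sentence.shape
    n, k, _ = filters.shape
    if bias is None:
        bias = np.zeros(n)
    starts = range(0, m - k + 1, stride)
    out = np.empty((len(starts), n))
    for row, i in enumerate(starts):
        window = sentence[i:i + k]  # words i .. i+k-1, shape (k, emb_dim)
        for j in range(n):
            # element-wise multiplication, then sum over all k * emb_dim entries
            out[row, j] = np.sum(window * filters[j]) + bias[j]
    return out

rng = np.random.default_rng(0)
m, k, n, emb_dim = 20, 3, 32, 100
sentence = rng.normal(size=(m, emb_dim))    # stand-in for an embedded sentence
filters = rng.normal(size=(n, k, emb_dim))  # stand-in for learned filter weights
response = conv1d_valid(sentence, filters)
print(response.shape)  # (18, 32)
```

Every filter visits every window, so the number of filters only sets the size of the last output dimension; it does not limit how many words are scanned.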

As you can see, there are m-k+1 windows in the figure, since we have assumed padding='valid' and stride=1 (the default behavior of the Conv1D layer in Keras). The stride argument determines how far the window slides (i.e. shifts) to extract the next window (e.g. in our example above, a stride of 2 would extract the windows of words (1,2,3), (3,4,5), (5,6,7), ... instead). The padding argument determines whether each window should consist entirely of words from the training sample or whether padding should be added at the beginning and at the end; with padding, the convolution response can have the same length as the training sample (i.e. m and not m-k+1). For example, in our example above, padding='same' would extract the windows of words (PAD,1,2), (1,2,3), (2,3,4), ..., (m-2,m-1,m), (m-1,m,PAD).
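The window lists above can be reproduced with a few lines of plain Python. This is a rough sketch under stated assumptions: window_indices is a hypothetical helper (not a Keras function), None stands for a PAD slot, padding='same' is only handled for stride 1, and the way an odd number of pad slots is split between left and right is my own choice here.

```python
def window_indices(m, k, stride=1, padding="valid"):
    """Return the 1-based word positions covered by each window.

    None marks a PAD slot. Hypothetical helper for illustration only.
    """
    positions = list(range(1, m + 1))
    if padding == "same":
        if stride != 1:
            raise ValueError("this sketch only handles padding='same' with stride=1")
        total = k - 1      # pad slots needed so that the output length stays m
        left = total // 2  # assumption: any extra pad slot goes on the right
        positions = [None] * left + positions + [None] * (total - left)
    elif padding != "valid":
        raise ValueError("padding must be 'valid' or 'same'")
    return [tuple(positions[i:i + k])
            for i in range(0, len(positions) - k + 1, stride)]

print(window_indices(7, 3))                      # (1,2,3), (2,3,4), ..., (5,6,7)
print(window_indices(7, 3, stride=2))            # [(1, 2, 3), (3, 4, 5), (5, 6, 7)]
print(window_indices(7, 3, padding="same")[0])   # (None, 1, 2)
print(window_indices(7, 3, padding="same")[-1])  # (6, 7, None)
```

With m=7 and k=3 there are m-k+1 = 5 windows for padding='valid' and m = 7 windows for padding='same'.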

You can verify some of the things I mentioned using Keras:

from keras import models
from keras import layers

n = 32  # number of filters
m = 20  # number of words in a sentence
k = 3   # kernel size of filters
emb_dim = 100  # embedding dimension

model = models.Sequential()
model.add(layers.Conv1D(n, k, input_shape=(m, emb_dim)))

model.summary()

Model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_2 (Conv1D)            (None, 18, 32)            9632      
=================================================================
Total params: 9,632
Trainable params: 9,632
Non-trainable params: 0
_________________________________________________________________

As you can see, the output of the convolution layer has a shape of (m-k+1, n) = (18, 32), and the number of parameters (i.e. filter weights) in the convolution layer is equal to: num_filters * (kernel_size * n_features) + one_bias_per_filter = n * (k * emb_dim) + n = 32 * (3 * 100) + 32 = 9632.
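The same arithmetic can be wrapped in two small helper functions and applied to the layer from the question. This is only a sketch: the function names are made up for illustration, and the output-length rule for padding='same' (the ceiling of m / stride) is the one I understand Keras to use.

```python
import math

def conv1d_output_length(m, k, stride=1, padding="valid"):
    """Number of time steps produced by a 1D convolution over m input steps."""
    if padding == "valid":
        return (m - k) // stride + 1
    if padding == "same":
        return math.ceil(m / stride)
    raise ValueError("padding must be 'valid' or 'same'")

def conv1d_param_count(n_filters, k, n_features, use_bias=True):
    """Number of trainable weights: one (k x n_features) kernel plus one bias per filter."""
    return n_filters * k * n_features + (n_filters if use_bias else 0)

# The example from this answer: m=20, k=3, n=32, emb_dim=100
print(conv1d_output_length(20, 3), conv1d_param_count(32, 3, 100))  # 18 9632

# The layer from the question: 951 words, 32-dim embeddings, filters=32, kernel_size=2, padding='same'
print(conv1d_output_length(951, 2, padding="same"), conv1d_param_count(32, 2, 32))  # 951 2080
```

So for the model in the question the Conv1D output should have shape (951, 32): all 951 word positions are covered, and filters=32 only determines how many features are computed at each position.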

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
