CNN과 LSTM과 결합한 프로젝트를 정리해보려한다.

CNN (합성곱 신경망)

CNN은 이미지 분류, 객체 탐지, 분할 등 이미지나 동영상과 관련된 머신러닝 문제를 해결하는 일종의 딥러닝 모델이다.

CNN 특징

CNN은 convolution layer라는 특수한 유형의 layer를 사용하게 때문에 테두리나 윤곽선 등 이미지에서 학습한 패턴은 이미지의 픽셀 위치에 독립적인 것으로 가정하므로 가중치나 매개변수를 공유할 수 있다.

LSTM

LSTM은 Sequential Data와 관련한 머신러닝 문제를 해결하는 데 효과적이다. Sequential Data의 예시는 텍스트이다. 예를들어, 문장에서 각 단어는 앞의 단어에 종속된다. LSTM 모델은 이러한 순차적 종속성을 모델링하기 위한 것이다.

CNN과 LSTM으로 신경망 만들기

CNN과 LSTM을 연결해, 이미지나 동영상을 가져와 텍스트를 출력하는 하이브리드 모델을 구성할 수 있다.

구성

CNN-LSTM 네트워크 아키텍처는 Input data(이미지)에서 특징을 추출하는 합성곱 계층에 뒤이어 순차적 예측을 수행하는 LSTM 계층이 나오는 형태로 구성된다.

→ 이 모델에서 CNN은 주로 입력 이미지를 가져와 고차원 특징이나 임베딩을 출력하는 인코더로 사용됨, LSTM은 텍스트를 생성하는 디코더로 사용됨

디코더, 인코더란?

encoder 부분은 입력 받은 정보에서 불필요한 정보를 제외하고 차원을 축소하는 역할.

위 사진에는 auto-encoder(데이터가 입력되면 encoder를 통해서 저차원의 벡터로 변환되고, 이를 decoder가 받아서 결과물을 출력한다.)

오토 인코더는 다음과 같은 목적으로 사용될 수 있다.

차원 축소
Denoising
Anomaly & Outlier Detection
추천시스템

텍스트 인코딩 데모

텍스트 데이터를 다룰 때는 토큰(단어 및 문장 부호)을 숫자로 나타내어야 함 → 텍스트 인코딩이 필요함

ex) Text Encoding

<start> PyTorch is a deep learning library. <end>

------------------------------------------------------------------------

<start> : 0
PyTorch : 1
is : 2
a : 3
deep : 4
learning : 5
library : 6
. : 7
<end> : 8

------------------------------------------------------------------------

<start> PyTorch is a deep learning library. <end> -> [0,1,2,3,4,5,6,7,8] (list)

만약 “ PyTorch is deep. ”을 인코딩하면 [0, 1, 2, 4, 7, 8]이 된다. 이와 같은 매핑을 **vocabulary(사전)**이라고 한다.

CNN-LSTM 아키텍처 예제

LSTM 모델은 t=0일 때 CNN 임베딩을 입력으로 가져온다 → LSTM의 각 셀은 시간 단계마다 토큰을 예측해, 다음 LSTM 셀에 입력으로 제공한다 → 위의 사진처럼 전체 아키텍처를 다이어그램으로 나타낼 수 있음.

만약 단일 이미지 대신 동영상을 CNN layer에 입력으로 넣는다면, t=0에서만이 아니라 각 시간 단계에서 LSTM 셀의 입력으로 CNN 임베딩이 포함됨 → 행동인식, 동영상 설명 같은 분야에서 유용함.

파이토치로 이미지 캡션 생성하기

COCO 데이터셋을 이용

이미지 캡션 데이터셋 다운로드

  
# Windows를 포함한 모든 환경에서 실습할 수 있게 파이썬 코드를 작성했습니다. - 위키북스import urllib
import zipfile
from tqdmimport tqdm

#https://stackoverflow.com/a/53877507/1558946class DownloadProgressBar(tqdm):
def update_to(self, b=1, bsize=1, tsize=None):
if tsizeisnotNone:
            self.total= tsize
        self.update(b* bsize- self.n)

def download_data(url):
    print(f"{url} 다운로드 중 ...")
with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1])as t:
        zip_path, _= urllib.request.urlretrieve(url, reporthook=t.update_to)

    print("압축을 푸는 중 ...")
with zipfile.ZipFile(zip_path, "r")as f:
for namein tqdm(iterable=f.namelist(), total=len(f.namelist())):
            f.extract(member=name, path="data_dir")

download_data("http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip")
download_data("http://images.cocodataset.org/zips/train2014.zip")
download_data("http://images.cocodataset.org/zips/val2014.zip")

캡션(텍스트) 데이터 전처리

Libaray Import

  
import nltk # 자연어 툴 킷으로 사전을 구축할 때 유용
from pycocotools.cocoimport COCO # COCO 데이터셋을 가지고 작업할 때 유용
import torch.utils.data as data
import torchvision.models as model
import torchvision.transforms as transforms
from torch.nn.utils.rnn import pack_padded_sequence # 다양한 길이의 문장에 패딩을 적용해 고정된 길이의 문장으로 변환할 때 사용

nltk.download('punkt') # 주어진 텍스트를 구성된 단어로 토큰화할 수 있다.

실제 텍스트 토큰을 숫자 토큰으로 전환할 수 있는 vocabulary를 구축.

  
def build_vocabulary(json, threshold):
    """Build a simple vocabulary wrapper."""
    coco= COCO(json)
    counter= Counter()
    ids= coco.anns.keys()
for i, idin enumerate(ids):
        caption= str(coco.anns[id]['caption'])
        tokens= nltk.tokenize.word_tokenize(caption.lower())
        counter.update(tokens)

if (i+1)% 1000== 0:
            print("[{}/{}] Tokenized the captions.".format(i+1, len(ids)))

함수 내에서 JSON 텍스트 annotation을 로딩하고 annotation/캡션 내의 개별 단어를 토큰화하거나 숫자로 전환하고 카운터에 저장.

  
# If the word frequency is less than 'threshold', then the word is discarded.tokens= [tokenfor token, cntin counter.items()if cnt>= threshold]

# Create a vocab wrapper and add some special tokens.vocab= Vocab()
    vocab.add_token('<pad>')
    vocab.add_token('<start>')
    vocab.add_token('<end>')
    vocab.add_token('<unk>')

# Add the words to the vocabulary.for i, tokenin enumerate(tokens):
        vocab.add_token(token)
return vocab

함수 내에서 특정 횟수 이하로 발생한 토큰을 제거하고 나머지 토큰을 사전 객체에 추가한다. 사전 객체에는 이렇게 추가된 토큰 외에 문장시작(start), 끝(end), 모르는 단어(unknown_word), 패딩 토큰 같은 와일드카드 토큰이 포함된다.

  
vocab= build_vocabulary(json='data_dir/annotations/captions_train2014.json', threshold=4)
vocab_path= './data_dir/vocabulary.pkl'
with open(vocab_path, 'wb')as f:
    pickle.dump(vocab, f)
print("Total vocabulary size: {}".format(len(vocab)))
print("Saved the vocabulary wrapper to '{}'".format(vocab_path))

끝으로 사전 객체라고 하는 사전 구축 함수를 생성하고 나중에 재사용할 수 있게 로컬 시스템에 저장.

이 과정을 통해 실행시간에 텍스트 데이터를 숫자로 전환해 다룰 수 있다.

이미지 데이터 전처리

데이터셋의 이미지 크기와 모양은 다양할 수 있으므로, CNN 모델의 첫 번째 계층에 입력으로 제공될 수 있도록 전체 이미지를 고정된 모양으로 바꿔줘야 함.

  
def reshape_image(image, shape):
    """Resize an image to the given shape."""
    return image.resize(shape, Image.ANTIALIAS)
 
def reshape_images(image_path, output_path, shape):
    """Reshape the images in 'image_path' and save into 'output_path'."""
    if not os.path.exists(output_path):
        os.makedirs(output_path)
 
    images = os.listdir(image_path)
    num_im = len(images)
    for i, im in enumerate(images):
        with open(os.path.join(image_path, im), 'r+b') as f:
            with Image.open(f) as image:
                image = reshape_image(image, shape)
                image.save(os.path.join(output_path, im), image.format)
        if (i+1) % 100 == 0:
            print ("[{}/{}] Resized the images and saved into '{}'."
                   .format(i+1, num_im, output_path))

image_path = './data_dir/train2014/'
output_path = './data_dir/resized_images/'
image_shape = [256, 256]
reshape_images(image_path, output_path, image_shape)

CNN 모델 아키텍처에 맞도록 전체 이미지를 256*256 픽셀로 모양을 변경.

이미지 캡션 데이터 로더 정의하기

  
class CustomCocoDataset(data.Dataset):
    """COCO Custom Dataset compatible with torch.utils.data.DataLoader."""
    def __init__(self, data_path, coco_json_path, vocabulary, transform=None):
        """Set the path for images, captions and vocabulary wrapper.
        
        Args:
            root: image directory.
            json: coco annotation file path.
            vocab: vocabulary wrapper.
            transform: image transformer.
        """
        self.root = data_path
        self.coco_data = COCO(coco_json_path)
        self.indices = list(self.coco_data.anns.keys())
        self.vocabulary = vocabulary
        self.transform = transform
 
    def __getitem__(self, idx):
        """Returns one data pair (image and caption)."""
        coco_data = self.coco_data
        vocabulary = self.vocabulary
        annotation_id = self.indices[idx]
        caption = coco_data.anns[annotation_id]['caption']
        image_id = coco_data.anns[annotation_id]['image_id']
        image_path = coco_data.loadImgs(image_id)[0]['file_name']
 
        image = Image.open(os.path.join(self.root, image_path)).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
 
        # Convert caption (string) to word ids.
        word_tokens = nltk.tokenize.word_tokenize(str(caption).lower())
        caption = []
        caption.append(vocabulary('<start>'))
        caption.extend([vocabulary(token) for token in word_tokens])
				caption.append(vocabulary('<end>'))
        ground_truth = torch.Tensor(caption)
        return image, ground_truth
 
    def __len__(self):
        return len(self.indices)

X,y 형태로 데이터의 미니 배치를 반환하는 collate_function을 정의.

  
def collate_function(data_batch):
    """Creates mini-batch tensors from the list of tuples (image, caption).
    
    We should build custom collate_fn rather than using default collate_fn, 
    because merging caption (including padding) is not supported in default.
    Args:
        data: list of tuple (image, caption). 
            - image: torch tensor of shape (3, 256, 256).
            - caption: torch tensor of shape (?); variable length.
    Returns:
        images: torch tensor of shape (batch_size, 3, 256, 256).
        targets: torch tensor of shape (batch_size, padded_length).
        lengths: list; valid length for each padded caption.
    """
    # Sort a data list by caption length (descending order).
    data_batch.sort(key=lambda d: len(d[1]), reverse=True)
    imgs, caps = zip(*data_batch)
 
    # Merge images (from list of 3D tensors to 4D tensor).
    # Originally, imgs is a list of <batch_size> number of RGB images with dimensions (3, 256, 256)
    # This line of code turns it into a single tensor of dimensions (<batch_size>, 3, 256, 256)
    imgs = torch.stack(imgs, 0)
 
    # Merge captions (from list of 1D tensors to 2D tensor), similar to merging of images donw above.
    cap_lens = [len(cap) for cap in caps]
    tgts = torch.zeros(len(caps), max(cap_lens)).long()
    for i, cap in enumerate(caps):
        end = cap_lens[i]
        tgts[i, :end] = cap[:end]        
    return imgs, tgts, cap_lens

마지막으로 다음 코드에서 COCO 데이터셋을 위한 맞춤형 데이터 로더를 반환하는 get_loader 함수를 구현

  
def get_loader(data_path, coco_json_path, vocabulary, transform, batch_size, shuffle, num_workers):
    """Returns torch.utils.data.DataLoader for custom coco dataset."""
    # COCO caption dataset
    coco_dataser = CustomCocoDataset(data_path=data_path,
                       coco_json_path=coco_json_path,
                       vocabulary=vocabulary,
                       transform=transform)
    
    # Data loader for COCO dataset
    # This will return (images, captions, lengths) for each iteration.
    # images: a tensor of shape (batch_size, 3, 224, 224).
    # captions: a tensor of shape (batch_size, padded_length).
    # lengths: a list indicating valid length for each caption. length is (batch_size).
    custom_data_loader = torch.utils.data.DataLoader(dataset=coco_dataser, 
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              num_workers=num_workers,
                                              collate_fn=collate_function)
    return custom_data_loader

CNN-LSTM 모델 정의하기

CNN모델과 RNN 모델 정의. 심층 CNN 모델은 152개 계층으로 구성되며 ImageNet 데이터셋에서 사전 훈련됨

  
class CNNModel(nn.Module):
    def __init__(self, embedding_size):
        """Load the pretrained ResNet-152 and replace top fc layer."""
        super(CNNModel, self).__init__()
        resnet = models.resnet152(pretrained=True)
        module_list = list(resnet.children())[:-1]      # delete the last fc layer.
        self.resnet_module = nn.Sequential(*module_list)
        self.linear_layer = nn.Linear(resnet.fc.in_features, embedding_size)
        self.batch_norm = nn.BatchNorm1d(embedding_size, momentum=0.01)
        
    def forward(self, input_images):
        """Extract feature vectors from input images."""
        with torch.no_grad():
            resnet_features = self.resnet_module(input_images)
        resnet_features = resnet_features.reshape(resnet_features.size(0), -1)
        final_features = self.batch_norm(self.linear_layer(resnet_features))
        return final_features

LSTM 모델 정의.

  
class LSTMModel(nn.Module):
    def __init__(self, embedding_size, hidden_layer_size, vocabulary_size, num_layers, max_seq_len=20):
        """Set the hyper-parameters and build the layers."""
        super(LSTMModel, self).__init__()
        self.embedding_layer = nn.Embedding(vocabulary_size, embedding_size)
        self.lstm_layer = nn.LSTM(embedding_size, hidden_layer_size, num_layers, batch_first=True)
        self.linear_layer = nn.Linear(hidden_layer_size, vocabulary_size)
        self.max_seq_len = max_seq_len
        
    def forward(self, input_features, capts, lens):
        """Decode image feature vectors and generates captions."""
        embeddings = self.embedding_layer(caps)
        embeddings = torch.cat((input_features.unsqueeze(1), embeddings), 1)
        lstm_input = pack_padded_sequence(embeddings, lens, batch_first=True) 
        hidden_variables, _ = self.lstm_layer(lstm_input)
        model_outputs = self.linear_layer(hidden_variables[0])
        return model_outputs
    
    def sample(self, input_features, lstm_states=None):
        """Generate captions for given image features using greedy search."""
        sampled_indices = []
        lstm_inputs = input_features.unsqueeze(1)
        for i in range(self.max_seq_len):
            hidden_variables, lstm_states = self.lstm_layer(lstm_inputs, lstm_states)          # hiddens: (batch_size, 1, hidden_size)
            model_outputs = self.linear_layer(hidden_variables.squeeze(1))            # outputs:  (batch_size, vocab_size)
            _, predicted_outputs = model_outputs.max(1)                        # predicted: (batch_size)
            sampled_indices.append(predicted_outputs)
            lstm_inputs = self.embedding_layer(predicted_outputs)                       # inputs: (batch_size, embed_size)
            lstm_inputs = lstm_inputs.unsqueeze(1)                         # inputs: (batch_size, 1, embed_size)
        sampled_indices = torch.stack(sampled_indices, 1)                # sampled_ids: (batch_size, max_seq_length)
        return sampled_indices

max_seq_len은 20으로 정의 20보다 커지면 20으로 축소

CNN-LSTM 모델 훈련하기

장치 정의. GPU를 사용할 수 있으면 훈련시킬 때 GPU를 사용하고 그렇지 않으면 CPU 사용

  
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

파이토치의 transform 모듈을 사용해 입력 이미지 픽셀 값을 정규화한다.
→ 이미 이미지 전처리를 했지만 데이터 차원에 따라 분포가 서로 다르므로 전체 최적화 공간이 왜곡되고 비효율적인 경사 하강이 발생할 수 있어 정규화가 중요.

  
# Image preprocessing, normalization for the pretrained resnet
transform = transforms.Compose([ 
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(), 
    transforms.ToTensor(), 
    transforms.Normalize((0.485, 0.456, 0.406), 
                         (0.229, 0.224, 0.225))

캡션(텍스트) 데이터 전처리 부분에서 구성한 vocabulary를 로딩

  
# Load vocabulary wrapper
with open('data_dir/vocabulary.pkl', 'rb') as f:
    vocabulary = pickle.load(f)

    
# Build data loader
custom_data_loader = get_loader('data_dir/resized_images', 'data_dir/annotations/captions_train2014.json', vocabulary, 
                         transform, 128,
                         shuffle=True, num_workers=0)

CNN과 LSTM 모델을 인코더와 디코더 모델 형태로 인스턴스화한다. 손실 함수로 cross entropy loss를, 최적화 스케줄에는 Adam optimizer를 정의.

  
# Build the models
encoder_model = CNNModel(256).to(device)
decoder_model = LSTMModel(256, 512, len(vocabulary), 1).to(device)
 
    
# Loss and optimizer
loss_criterion = nn.CrossEntropyLoss()
parameters = list(decoder_model.parameters()) + list(encoder_model.linear_layer.parameters()) + list(encoder_model.batch_norm.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.001)

Why adam? → 아담은 희소성 있는 데이터를 다룰 때 선택할 수 있는 가장 좋은 최적화 스케줄이기 때문

데이터 로더를 사용해 COCO 데이터셋의 미니 배치를 가져오고, 미니 배치를 인코더와 디코터 네트워크를 통해 순전파하고, 역전파를 사용해 CNN-LSTM 모델 매개변수를 조정하는 훈련 루프를 5 epoch 동안 실행.

  
# Train the models
total_num_steps = len(custom_data_loader)
for epoch in range(5):
    for i, (imgs, caps, lens) in enumerate(custom_data_loader):
 
        # Set mini-batch dataset
        imgs = imgs.to(device)
        caps = caps.to(device)
        tgts = pack_padded_sequence(caps, lens, batch_first=True)[0]
 
        # Forward, backward and optimize
        feats = encoder_model(imgs)
        outputs = decoder_model(feats, caps, lens)
        loss = loss_criterion(outputs, tgts)
        decoder_model.zero_grad()
        encoder_model.zero_grad()
        loss.backward()
        optimizer.step()
 
        # Print log info
        if i % 10 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}'
                  .format(epoch, 5, i, total_num_steps, loss.item(),
                          np.exp(loss.item()))) 
 
        # Save the model checkpoints
        if (i+1) % 1000 == 0:
            torch.save(decoder_model.state_dict(), os.path.join(
                'models_dir/', 'decoder-{}-{}.ckpt'.format(epoch+1, i+1)))
            torch.save(encoder_model.state_dict(), os.path.join(
                'models_dir/', 'encoder-{}-{}.ckpt'.format(epoch+1, i+1)))

훈련된 모델로 이미지 캡션 생성하기

장치 설정과 이미지 전처리

  
image_file_path = 'sample.jpg'
 

# Device configuration (장치 설정)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 

def load_image(image_file_path, transform=None):
    img = Image.open(image_file_path).convert('RGB')
    img = img.resize([224, 224], Image.LANCZOS)
    
    if transform is not None:
        img = transform(img).unsqueeze(0)
    
    return img
 

# Image preprocessing (이미지 전처리)
transform = transforms.Compose([
    transforms.ToTensor(), 
    transforms.Normalize((0.485, 0.456, 0.406), 
                         (0.229, 0.224, 0.225))])

사전을 로딩하고 인코더와 디코더 모델을 인스턴스화한다.

  
# Load vocabulary wrapper
with open('data_dir/vocabulary.pkl', 'rb') as f:
    vocabulary = pickle.load(f)

# Build models
encoder_model = CNNModel(256).eval()  # eval mode (batchnorm uses moving mean/variance)
decoder_model = LSTMModel(256, 512, len(vocabulary), 1)
encoder_model = encoder_model.to(device)
decoder_model = decoder_model.to(device)

훈련된 모델 매개변수 로딩

  
# Load the trained model parameters
encoder_model.load_state_dict(torch.load('models_dir/encoder-2-3000.ckpt'))
decoder_model.load_state_dict(torch.load('models_dir/decoder-2-3000.ckpt'))

중요부분 실제로 이미지를 로딩하고 이미지에 대해 추론을 실행한다.

  
# Prepare an image
img = load_image(image_file_path, transform)
img_tensor = img.to(device)

# Generate an caption from the image
feat = encoder_model(img_tensor)
sampled_indices = decoder_model.sample(feat)
sampled_indices = sampled_indices[0].cpu().numpy()          # (1, max_seq_length) -> (max_seq_length)

숫자 토큰을 텍스트 토큰으로 변환

  
# Convert word_ids to words
predicted_caption = []
for token_index in sampled_indices:
    word = vocabulary.i2w[token_index]
    predicted_caption.append(word)
    if word == '<end>':
        break
predicted_sentence = ' '.join(predicted_caption)

출력

  
# Print out the image and the generated caption
%matplotlib inline
print (predicted_sentence)
img = Image.open(image_file_path)
plt.imshow(np.asarray(img))

Result

출처 : 실전! 파이토치 딥러닝 프로젝트

CNN과 LSTM 결합하기