Image Based Product Recommendation System

Zaki Mustafa
10 min read · May 22, 2020


Leveraging image descriptors and deep learning to get accurate, user-specific product recommendations

Source: https://www.lynda.com/Data-Science-tutorials/Content-based-recommendations-Recommending-based-product-attributes/563030/600814-4.html

Why Image recommendation?

Recommendation systems are in demand across industries and domains because matching a user's preferences is central to modern-day business. This project deals with recommending the most suitable products to a user based on their current product selection. Our model relies on the fact that multiple features can be extracted from images and used for similarity computation.

The project initially uses the information retrieval paper "A content-based goods image recommendation system" as a baseline model to generate similarity scores for images. The same content-based image retrieval technique is then extended to deep learning models and architectures to achieve better results and generate the most similar recommendations.

Image Recommendation vs Image Classification

These two terms might look similar, but there is a big difference between them. Unlike image classification, where the sole aim is to predict the class of an object, image recommendation deals with finding the most similar objects for a given object, where the object is presented as an image. This undoubtedly involves image classification as an underlying process, but in a broad sense, image recommendation relies on extracting the most descriptive and distinctive features and using them to find matching recommendations.

Bottom-wear input image (left) and top 5 recommendations

About the data

We obtained a rich, high-resolution fashion product data-set from Kaggle, which consists of about 44k images across 143 distinct classes such as T-shirts, jeans, and watches. Each image is roughly 2400 × 1600 pixels, which makes the data-set fairly large and poses a real challenge for image pre-processing as well as for training deep learning models.

Class wise Data-set distribution (top frequent classes)

Low-level features extraction technique

While building an image recommendation system, the first task is to identify which feature descriptors of an image need to be taken into consideration. These may vary from image to image. Some of the most commonly used feature descriptors capture the color, texture, and shape of an image. We used the following features.

The HSV histogram feature captures the color distribution of the image over pixel intensity values between 0 and 255. The feature vector has dimensions 255 × 1 and represents the pixel frequency distribution across the image.

The edge detection feature captures the edges of the image, detected using the Sobel edge detection algorithm, which returns a feature vector the size of the image.

Low level features visualization

Texture is also an important feature for analyzing pixel distribution in an image. Our approach uses a Gabor filter to obtain an image-sized one-dimensional texture feature vector.

Histogram of oriented gradients (HOG) plays an important role in object and shape detection. We obtain a 3780 × 1 one-dimensional feature vector.

Although these features represented the images well in numerical form, they were insufficient for capturing the level of detail we required.

# Getting HOG features
from skimage.feature import hog
from skimage import exposure, filters
import cv2
import numpy as np

fd, hog_image = hog(img, orientations=9, pixels_per_cell=(8, 8),
                    cells_per_block=(2, 2), visualize=True, multichannel=True)
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10))

# Getting edge detection features (Sobel on the grayscale image)
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
sobel_img = filters.sobel(gray_img)

# Getting texture features (Gabor filter)
g_kernel = cv2.getGaborKernel((21, 21), 8.0, np.pi / 4, 10.0, 0.5, 0,
                              ktype=cv2.CV_32F)
texture_img = cv2.filter2D(gray_img, cv2.CV_8UC3, g_kernel)
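
For completeness, the HSV histogram feature can be sketched in the same way (a minimal example assuming OpenCV's calcHist and the img variable from the block above; the choice of the V channel and the 255-bin size follow the description given earlier and are assumptions):

# Getting HSV histogram features (sketch; channel and bin count are assumptions)
hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
hsv_hist = cv2.calcHist([hsv_img], [2], None, [255], [0, 256])  # V channel, 255 bins
hsv_hist = cv2.normalize(hsv_hist, hsv_hist).flatten()          # 255 x 1 feature vector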

Similarity Computation and Generating Recommendations

We generate feature vectors for all training and test images. Then, for each test image, we compute its cosine similarity with the leaders of all 143 predefined clusters, where the leader of each cluster is chosen arbitrarily. We do this for each of the feature descriptors described above, average the cosine similarities across descriptors, and select the 5 most similar leaders. Cosine similarity is then computed between the test image and all images in those 5 clusters, again for each feature descriptor. Finally, the top K most similar images are returned as recommendations.
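
A minimal sketch of this leader-based retrieval is shown below (NumPy/scikit-learn; test_vecs, leader_vecs, and cluster_members are hypothetical names standing in for the per-descriptor feature structures described above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(test_vecs, leader_vecs, cluster_members, k=5):
    # test_vecs: {descriptor: 1 x d vector}, leader_vecs: {descriptor: 143 x d matrix},
    # cluster_members: {cluster_id: {descriptor: n_i x d matrix}} (hypothetical layout)
    # Cosine similarity with each leader, averaged over all feature descriptors
    sims = np.mean([cosine_similarity(test_vecs[d], leader_vecs[d])[0]
                    for d in test_vecs], axis=0)
    top_clusters = np.argsort(sims)[::-1][:5]   # 5 most similar leaders

    # Compare the test image with every image in the chosen clusters
    scores, images = [], []
    for c in top_clusters:
        s = np.mean([cosine_similarity(test_vecs[d], cluster_members[c][d])[0]
                     for d in test_vecs], axis=0)
        scores.extend(s)
        images.extend((c, i) for i in range(len(s)))
    order = np.argsort(scores)[::-1][:k]        # top K most similar images overall
    return [images[i] for i in order]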

The accuracy achieved with this low-level features model was not impressive, reaching only 51%.

Digging Deeper with Deep Learning

The results achieved with low-level features were not convincing enough for us, so we decided to explore techniques that could extract more descriptive and distinctive feature representations of the images, which could then be used to compute similarity scores from which the top K recommendations are returned to the user. In this quest, we landed on deep learning techniques, which are quite powerful at extracting patterns and features from images.

Pre-trained models based features extraction technique

Before using our data-set to train any deep learning model, we decided to first try pre-trained models and check how they perform on our data-set. For this purpose, we mainly used five of the most popular CNN-based architectures, namely VGG, ResNet, MobileNet, DenseNet, and Inception.

VGG features model flow-chart

Features Extraction using pre-trained models

To use pre-trained deep learning models as feature extractors, the very first step was to remove the final output layer, as we did not intend to use these models as classifiers. With this step, we are left with the output of a convolutional layer, which is pooled and reduced using a global pooling layer (global max pooling in our code) to obtain a linear feature vector for the image.

Feature vectors are generated for the test image and all training images, cosine similarity is computed between them, and the images with the top K scores are returned as recommendations.

# Defining ResNet and VGG pre-trained models as feature extractors
from tensorflow.keras.applications import ResNet50, VGG16
from tensorflow.keras.layers import GlobalMaxPooling2D
from tensorflow.keras.models import Model

def resNetModel(height, width):
    base = ResNet50(weights='imagenet', include_top=False,
                    input_shape=(height, width, 3))
    base.trainable = False
    output = GlobalMaxPooling2D()(base.output)  # pool conv features to a vector
    model = Model(inputs=base.input, outputs=output)
    model.summary()
    return model

def vggModel(height, width):
    base = VGG16(weights='imagenet', include_top=False,
                 input_shape=(height, width, 3))
    base.trainable = False
    output = GlobalMaxPooling2D()(base.output)
    model = Model(inputs=base.input, outputs=output)
    model.summary()
    return model
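
With either extractor, retrieval itself takes only a few lines (a sketch; train_feats is a hypothetical precomputed matrix of training-image features, and test_img is an already pre-processed input batch):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_recommendations(extractor, test_img, train_feats, k=5):
    # test_img: pre-processed array of shape (1, height, width, 3)
    test_feat = extractor.predict(test_img)              # 1 x d feature vector
    scores = cosine_similarity(test_feat, train_feats)[0]
    return np.argsort(scores)[::-1][:k]                  # indices of the top K images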

Performance Evaluation of Pre-trained models

On testing our pre-trained models on a test set of about 1200 images, we obtained the following results.

Pre-trained models precision vs recall comparison

We used precision and recall as evaluation metrics and, based on the above results, concluded that VGG and ResNet were the best-performing models for our data-set.
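
As a rough illustration of how such metrics can be computed for top-K retrieval, assuming a recommendation counts as relevant when it shares the query image's class (this relevance criterion is our assumption):

def precision_recall_at_k(rec_classes, true_class, n_relevant, k=5):
    # rec_classes: classes of the top-K recommended images (hypothetical input)
    hits = sum(c == true_class for c in rec_classes[:k])
    precision = hits / k            # fraction of recommendations that are relevant
    recall = hits / n_relevant      # fraction of all relevant images retrieved
    return precision, recall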

VGG + Resnet based Weighted Ensemble Technique

On evaluating our models, we concluded that VGG and ResNet performed best, so we decided to build further on these two models to get better results.

To create a rich feature representation combining both VGG and ResNet, we used a weighted average of the two feature sets to obtain the final feature vector. Since ResNet's feature vector was larger than VGG's, we used scikit-learn's SelectKBest feature reduction, which selects the top K features from a feature set based on how well each feature predicts the target labels.

Weighted Ensemble Technique flowchart

After generating the weighted feature representation, the same representation was used to compute cosine similarity between test and training images and to return the top K recommendations based on the top K scores.
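
A minimal sketch of the weighted combination (scikit-learn's SelectKBest with the f_classif scorer; vgg_feats, resnet_feats, and train_labels are hypothetical precomputed arrays, and the 0.5/0.5 weights are illustrative):

from sklearn.feature_selection import SelectKBest, f_classif

# Reduce the larger ResNet features to VGG's dimensionality, scoring each
# feature against the training labels
selector = SelectKBest(score_func=f_classif, k=vgg_feats.shape[1])
resnet_reduced = selector.fit_transform(resnet_feats, train_labels)

# Weighted average of the two representations (weights are illustrative)
w_vgg, w_resnet = 0.5, 0.5
ensemble_feats = w_vgg * vgg_feats + w_resnet * resnet_reduced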

We compared the standalone VGG and ResNet models' performance with our ensemble model, and the ensemble technique performed somewhat better than the two standalone models.

Ensemble model vs stand-alone VGG and Resnet comparison

As we saw an accuracy boost with this method, we tried the technique with different weight ratios; the results are below.

Ensemble model precision vs recall comparison for different weights combinations

After trying multiple weight combinations, we didn't observe much change in the accuracy and therefore we decided to move ahead to a newer technique.

CNN Classifier Based Retrieval technique (CCBR)

Having experimented with several feature representations from pre-trained models, we decided to test a technique quite different from the previous ones: first classify the input image, then generate recommendations from the predicted class.

Training CNN model

As we had success with the pre-trained models, we decided to train our own CNN classifier on the data-set and use it both for feature representation and for predicting the class of a test image. Since we already had the classes (true labels) of all images, we trained the CNN on the data-set as an image classifier. We split the 44K images in an 80:20 ratio, giving about 9,000 images in the test set and the rest for training.

# CNN model's architecture
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(filters=128, kernel_size=(3, 3), input_shape=X.shape[1:],
                 padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=128, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(output_size, activation='softmax'))  # output_size = number of classes

Classifying input image

Before using the last layer's output as a feature representation, we use our model to classify the test image and get its class. This class determines which training images are used when computing cosine similarity with the test image.

Generating Feature representations

For every test image, its class is first predicted using the classifier. The final softmax layer of the model is then removed, and the output of the last remaining (dense) layer is used as the feature vector. This is done for all test and training images.
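
Both steps can be sketched with the trained classifier above (Keras; model is the Sequential network defined earlier, and test_img is a hypothetical pre-processed input batch):

import numpy as np
from tensorflow.keras.models import Model

# Strip the softmax layer; the 256-unit dense layer's output becomes the feature vector
feature_model = Model(inputs=model.input, outputs=model.layers[-3].output)

pred_class = np.argmax(model.predict(test_img), axis=1)[0]  # classify the test image
test_feat = feature_model.predict(test_img)                 # 1 x 256 feature vector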

CCBR technique flow chart

Similarity Computation and getting Recommendations

Since the class of the test image has been predicted, its cosine similarity is computed only with training images of the same class. This saves time and generates better results. The training images with the top K scores are returned as recommendations.
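
Restricting the candidate set to the predicted class is then a simple filter (a sketch; train_feats and train_labels are hypothetical arrays holding the training features and their classes, continuing from the previous snippet):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Compare only against training images of the predicted class
mask = train_labels == pred_class
scores = cosine_similarity(test_feat, train_feats[mask])[0]
top_k = np.argsort(scores)[::-1][:5]   # indices into the filtered candidate set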

CCBR technique precision vs recall comparison

The CCBR technique performs quite well, thanks to the initial image classification step that precedes recommendation generation.

Summing it up

In our project, we tried multiple approaches and obtained different types of feature representations. The accuracy table shows how the various feature techniques performed on the test set.

Accuracy comparison of the different techniques applied

Why did the CCBR technique work better than pre-trained models?

There are two major reasons:

  • The CNN model was used as a classifier to predict the class of the input image before generating feature vectors. This increased the chances of getting the right recommendations, because similarity was computed only with the training images in the predicted class, unlike the pre-trained models, where direct feature representations were compared against all training images irrespective of class.
  • The CNN model was trained specifically on the fashion data-set, so it learned feature representations specific to these images, whereas the pre-trained models were trained on ImageNet, where they may have learned broader, more generic image representations.

Bag input image (left) and top 5 recommendations

What is next?

We plan to work further on this project to develop a much better recommendation system. There are two major possible extensions:

  • Using transfer learning and fine-tuning to adapt VGG and ResNet specifically to the fashion data-set
  • Using generative adversarial networks (GANs) to get more accurate recommendations

Contribution of each member

Each member/author contributed equally to the project's outline, the development of techniques and algorithms, model training, and data collection. However, the major contributions to the various components are as follows —

  • Abdul Wajid Nasar (MT19083) — Data pre-processing and data balancing, low-level features extraction, metric learning, KNN algorithm, image classification, CCBR evaluation
  • Minnet Khan (MT19040) — Data pre-processing and splitting, CCBR technique implementation, model hyperparameter tuning, baseline evaluation, image features extraction
  • Zaki Mustafa Farooqi (MT19048) — Data pre-processing and visualization, deep features extraction, pre-trained models technique, weighted ensemble technique, features vectorization technique, models performance evaluation

Acknowledgement

We are extremely grateful to our Information Retrieval course professor Dr. Tanmoy Chakraborty for his continuous guidance and help throughout the semester. Moreover, we would like to acknowledge the efforts of the course TAs — Abhinav, Anubhav, Hridoy, Jasmeet and Vrutti — in making our course journey smooth.

References

  1. Shobhit Bhatnagar, Deepanway Ghosal, Maheshkumar H. Kolekar, "Classification of fashion article images using convolutional neural networks," 2017 Fourth International Conference on Image Information Processing (ICIIP).
  2. Manali Shaha, Meenakshi Pawar, "Transfer Learning for Image Classification," 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA).
  3. Arpana Mahajan, Sanjay Chaudhary, "Categorical Image Classification Based on Representational Deep Network (RESNET)," 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA).
  4. Li Yu, Fangjian Han, Shaobing Huang, Yiwen Luo, "A content-based goods image recommendation system," 2017.
