[{"authors":["admin"],"categories":null,"content":"Jianchao Li is a software engineer with expertise in deep learning, machine learning, and computer vision. He is currently based in Zurich, Switzerland and works at Meta. Prior to joining Meta, he has held software engineering roles at leading technology companies such as Indeed, ByteDance, and ViSenze.\n","date":-62135596800,"expirydate":-62135596800,"kind":"term","lang":"en","lastmod":-62135596800,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://jianchao-li.github.io/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"Jianchao Li is a software engineer with expertise in deep learning, machine learning, and computer vision. He is currently based in Zurich, Switzerland and works at Meta. Prior to joining Meta, he has held software engineering roles at leading technology companies such as Indeed, ByteDance, and ViSenze.","tags":null,"title":"Jianchao Li","type":"authors"},{"authors":[],"categories":[],"content":"Starting with this post, I want to venture into a new area — investing — distinct from my usual topics of computer vision, deep learning, etc. Investing has become my latest hobby, and after diving into it, I have found it both fascinating and inspiring.\nIn this first post, I will explore a fundamental concept in the world of investing: compound interest. To me, it represents the \u0026ldquo;first principle\u0026rdquo; of investing.\nWhat is Compound Interest? Compound interest is a form of interest. When you deposit money into a bank account or lend it to someone, you typically earn interest. The amount you deposit or lend is called the principal, and the interest is calculated as a percentage (interest rate) of the principal. For example, if you have a principal of $100 and an interest rate of 10%, your interest would be $10 ($100 x 10% = $10).\nInterest is typically paid at fixed intervals. In the example above, let\u0026rsquo;s assume the $10 is paid annually. After the first year, you would have $100 in principal and $10 in interest. Now, imagine that the $10 interest is added to the $100 principal. This means that, in the next period, interest will be calculated based on the new principal amount of $110. If the interest rate remains at 10%, your next interest payment will be $11 ($110 * 10% = $11).\nHere, you may notice a snowball effect: as your interest payments are added to your principal (increasing it from $100 to $110), your interest payments also grow (from $10 to $11). This cycle continues, with the interest added to your principal, which increases the principal further, leading to higher interest payments and, in turn, an even higher principal, and so on.\nIf we continue the above calculations for 10 years, the principals and interest amounts would look as follows (with rounding errors). The principal is the amount at the beginning of each year, and the interest is the amount at the end of each year. The principal for year n+1 is the principal of year n plus the interest earned in year n.\nYear Principal ($) Interest ($) 1 100 10 2 110 11 3 121 12.1 4 133.1 13.31 5 146.41 14.641 6 161.051 16.1051 7 177.1561 17.7156 8 194.8717 19.4872 9 214.3589 21.4359 10 235.7948 23.5795 The magic of Compound Interest \u0026ldquo;Compound interest is the eighth wonder of the world. 
He who understands it, earns it\u0026hellip; he who doesn\u0026rsquo;t\u0026hellip; pays it.\u0026rdquo; - Albert Einstein\nLooking at the numbers in the table above, you might realize the magic of compound interest: starting with $100 in principal, earning 10% interests per year, by the beginning of the 9th year, you have doubled your money. If this doesn\u0026rsquo;t amaze you, let\u0026rsquo;s explore two more calculations to help you fully appreciate it.\nCompounding for Another 10 Years Let’s extend the calculations from the table above for another 10 years.\nYear Principal ($) Interest ($) 11 259.3742 25.9374 12 285.3117 28.5312 13 313.8428 31.3843 14 345.2271 34.5227 15 379.7498 37.9750 16 417.7248 41.7725 17 459.4973 45.9497 18 505.4470 50.5447 19 555.9917 55.5992 20 611.5909 61.1591 As you can see, at the beginning of the 10th year, your principal is around $236. By the start of the 20th year, it has grown (non-linearly) to approximately $612. Continuing the calculations, by the 26th year, you will receive an interest payment of $108.35, the first time your interest payment exceeds your initial $100 principal! The longer you continue the calculations, the wilder the numbers become.\nThe Case of Non-Compounding Now, let\u0026rsquo;s consider the non-compounding case, where the annual interest payment is not added back to the principal. In this scenario, you receive interest based on the initial $100 principal, meaning you earn $10 in interest each year. After 10 years, your total amount would be $200 ($100 + $100 x 0.1 x 10). While this is not too far from the compounded case, where your total would be around $260, the gap widens significantly after another 10 years. In the non-compounding case, you would have $300, while in the compounding case, you would have approximately $672. As before, the longer you continue the calculations, the more drastic the gap becomes.\nExponential Explosion The snowball effect of compound interest described above is known as exponential growth, or more dramatically, exponential explosion. This is the essence of the magic of compound interest — over time, the amount of your money grows exponentially to a \u0026ldquo;formidable\u0026rdquo; sum (though not so dramatic in the example above, largely due to the small $100 principal).\nIf we plot the results of compounding versus non-compounding over 50 years, the \u0026ldquo;explosion\u0026rdquo; or magic of compound interest becomes much more evident. After 50 years, your initial $100 principal grows to around $11,739! And remember, this 117-fold growth comes from a \u0026ldquo;modest\u0026rdquo; 10% compound interest rate. Take a moment to think about it: a 117-times increase from just 0.1, and you will start to appreciate the magic of compound interest. By comparison, in the non-compounding case, it would take 1,164 years to grow to $11,739!\nFor what it\u0026rsquo;s worth, the non-compounding case is also known as simple interest.\nWhy Care About Compound Interest Now that we have a clear understanding of compound interest, why is it important to investing? The answer is simple: it describes your return on investment. If you are familiar with investing, you have likely heard terms like \u0026ldquo;the annualized rate of return is X% over the past Y years\u0026rdquo;. This \u0026ldquo;annualized rate of return\u0026rdquo; is similar (but not identical) to the compound interest rate we discussed earlier. 
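If you would like to verify the tables above yourself, here is a minimal Python sketch of the year-by-year compounding calculation, using the $100 principal and 10% rate from the example (the function name is just for illustration):

```python
def compound_growth(principal, rate, years):
    """Yield (year, principal at the start of the year, interest earned that year)."""
    for year in range(1, years + 1):
        interest = principal * rate
        yield year, principal, interest
        principal += interest  # the interest is added back to the principal

for year, principal, interest in compound_growth(100, 0.10, 10):
    print(f"{year:>4} {principal:>12.4f} {interest:>10.4f}")
```

Running it reproduces the table above: year 1 starts at $100 and earns $10, year 10 starts at about $235.79 and earns about $23.58.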
As we have seen, compound interest has the magic to grow your money into a formidable sum over time.\nMisconceptions While the \u0026ldquo;annualized rate of return\u0026rdquo; is similar to the compound interest rate in our example, there is a crucial difference that often leads to misconceptions. In our compound interest example, the 10% interest rate is fixed every year. However, in real-world investing, the rate is rarely, if ever, fixed. For instance, an annualized rate of return of 10% does NOT mean you are guaranteed to earn 10% each year! Instead, the annualized rate of return is calculated by comparing the starting and ending values of an investment over a given time period and determining the equivalent compound interest rate.\nIf you are interested in the details of this computation, you can refer to this article. It is crucial for all investors to dispel this misconception (even smart people can fall into this trap) and set realistic expectations.\nWhy Compounding Is a Double-Edged Sword From the discussions above, I hope you now appreciate the magic of compounding in investing. However, it is equally important to understand why compounding can become a double-edged sword. The reason is simple: while compounding can work wonders, it can also work against you.\nIn the real world, investing is not free. While the market might generate an annualized return of 10%, you don\u0026rsquo;t get to keep all of it. A portion is deducted as investment costs. A common example of such costs is the total expense ratio (TER) of investment funds, where you pay a percentage of your money to the fund providers each year. When it comes to costs, compounding works against you.\nLet’s revisit the compound interest example of 10% and consider a principal of $10,000. Suppose the TER is 1%. After one year, your gross amount would grow to $11,000 ($10,000 * 10%). However, you would pay $100 ($10,000 * 1%) in fees to the fund providers, leaving you with a net amount of $10,900. Effectively, your net compound interest rate is 9%. This leads to the following simple formula:\nNet annualized rate of return = Gross annualized rate of return - Total expense ratio\nOr\nInvestment return rate = Market return rate - Total expense ratio\nNow, let’s perform similar calculations to compare a 10% market return with a 9% investment return. Specifically, we will calculate the total asset value (with rounding errors) at 10-year intervals for both cases.\nYears Market Value (10%, $) Investment Value (9%, $) Percentage of Market Value Kept by Investment (%) 10 25,937.42 23673.64 91.27 20 67,275.00 56044.11 83.31 30 174,494.02 132676.78 76.04 40 452,592.56 314094.20 69.40 50 1,173,908.53 743,575.20 63.34 From the table, it is clear that over time, you retain less and less of the market value. After 10 years, you keep over 91%, but after 50 years, this drops to just over 63%. In other words, after 10 years, you lose around 9% of the market value, and after 50 years, the loss grows to 37%. Where does this 37% go? You guessed it - it\u0026rsquo;s taken as fees.\nNow, consider this: the 37% loss stems from a seemingly \u0026ldquo;small\u0026rdquo; 1% annual fee. 
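To see how a 1% TER compounds against you, a small sketch (using the $10,000 principal, 10% gross return, and 1% TER from the example) can compare the two growth paths; the fees are simply the gap between them:

```python
def grow(principal, annual_rate, years):
    """Compound a principal at a fixed annual rate for a number of years."""
    return principal * (1 + annual_rate) ** years

principal, market_rate, ter = 10_000, 0.10, 0.01
for years in (10, 20, 30, 40, 50):
    market = grow(principal, market_rate, years)
    net = grow(principal, market_rate - ter, years)  # net rate = gross rate - TER
    print(f"{years:>2}y  market={market:>12,.2f}  net={net:>12,.2f}  kept={net / market:.2%}")
```

This reproduces the table above, e.g. about $25,937 versus $23,674 (91.27% kept) after 10 years.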
To better understand the impact, let’s calculate the total amount of fees you paid to fund providers over time.\nYears Market Value (10%, $) Investment Value (9%, $) Investment Fees (%) 10 25,937.42 23,673.64 2,263.79 20 67,275.00 56,044.11 11,230.89 30 174,494.02 132,676.78 41,817.24 40 452,592.56 314,094.20 138,498.36 50 1,173,908.53 743,575.20 430,333.32 As shown, over 50 years, you pay a staggering $430,333 in fees to fund providers — more than 43 times your initial $10,000 principal! And again, this immense cost results from a seemingly \u0026ldquo;small\u0026rdquo; 1% annual fee.\nBy now, it should be clear: costs also compound, and over time, they can amount to a formidable sum. This is why compounding can be a double-edged sword.\nHow to Avoid the Double-Edged Sword The solution is simple: minimize investment costs, or at the very least, reduce them to a level you can accept. For instance, if the fund cost were magically reduced to a TER of 0.03% (and such funds do exist), you would still be able to retain over 98% of the market value after 50 years!\nYears Market Value (10%, $) Investment Value (9.97%, $) Percentage of Market Value Kept by Investment (%) 10 25,937.42 25,866.77 99.73 20 67,275.00 66,908.99 99.46 30 174,494.02 173,071.98 99.19 40 452,592.56 447,681.35 98.91 50 1,173,908.53 1,158,007.18 98.65 Is My Fund Cost High? I have seen people ask this question when evaluating whether to select a fund: \u0026ldquo;Is this X% fund cost high?\u0026rdquo; To answer this, you can simply repeat the calculations I provided earlier. Determine the percentage of money that you will need to pay to fund providers over your investment horizon and decide whether that cost is acceptable to you. If you find it unacceptable, the fund cost might be considered high, and you may want to consider looking for one with lower fees.\nSummary John C. Bogle aptly summarized the concepts discussed in this post as \u0026ldquo;The magic of compounding investment returns\u0026rdquo; and \u0026ldquo;The tyranny of compounding investment costs\u0026rdquo; in his book, The Little Book of Common Sense Investing. This post is heavily inspired by Mr. Bogle\u0026rsquo;s insights.\nDisclaimers I am not a professional investor or financial advisor, nor do I claim to be. You are solely responsible for your financial decisions. I do not accept responsibility for any inaccuracies in this blog post. The content reflects my personal research and understanding and should not be interpreted as professional advice.\n","date":1735938480,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1735951980,"objectID":"45ae77da4f283297727d6ca724c2f722","permalink":"https://jianchao-li.github.io/post/the-double-edged-sword-of-compounding/","publishdate":"2025-01-03T22:08:00+01:00","relpermalink":"/post/the-double-edged-sword-of-compounding/","section":"post","summary":"While many investors understand the power of compound returns, they may not fully grasp that compounding can also result in a significant downside. This post explores this double-edged sword.","tags":[],"title":"The Double-Edged Sword of Compounding","type":"post"},{"authors":[],"categories":[],"content":"While deep neural networks have achieved state-of-the-art performance in many problems(e.g., image classification, object detection, scene parsing etc.), it is always not trivial to intepret their outputs. Till now, the most common and useful way to interpret the output of a deep neural network is still by visualization. 
You may refer to this CS231n course note for some introduction.\nIn this post, I will describe how to interpret an image classification model using Captum. Captum, which means \u0026ldquo;comprehension\u0026rdquo; in Latin, is an open-source project with many model interpretability algorithms implemented in PyTorch. Specifically, I adopted LayerGradCam for this post.\nInstall Captum As LayerGradCam was still not released at the time of writing this post, to use it, clone the Captum repository locally and install it from there.\ngit clone git@github.com:pytorch/captum.git cd captum pip install -e . Then import all the required packages.\nimport json import requests from io import BytesIO import cv2 import numpy as np import torch from torchvision import models, transforms from PIL import Image import matplotlib.pyplot as plt %matplotlib inline from captum.attr import LayerAttribution, LayerGradCam Prepare a Model and an Image I use the MobileNetV2 pretrained on ImageNet from torchvision and an image of a Hornbill from Wikipedia. Later I will use LayerGradCam to interpret and visualize why the model gives the specific output for this image.\nNote that the model needs to be set to evaluation mode.\n# use MobileNetV2 model = models.mobilenet_v2(pretrained=True) model = model.eval() For the image, I first read its encoded string from its URL and then use the PIL.Image format to decode it. In this way, the channels of the image are in RGB order.\nimg_url = \u0026#39;https://upload.wikimedia.org/wikipedia/commons/8/8f/Buceros_bicornis_%28female%29_-feeding_in_tree-8.jpg\u0026#39; resp = requests.get(img_url) img = Image.open(BytesIO(resp.content)) plt.figure(figsize=(10, 10)) plt.imshow(img) I also prepare the class names for the 1000 classes in ImageNet. This will let me know the specific class names instead of only the index of the predicted class. The class names are loaded from the following URL.\nurl = \u0026#39;https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json\u0026#39; resp = requests.get(url) class_names_map = json.loads(resp.text) Preprocessing For torchvision models, the following preprocessing needs to be applied to an image before passing it to the model (reference). This is a key step to make the model run on images from the same distribution as those it was trained on.\npreprocessing = transforms.Compose([ transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), transforms.Normalize( mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] ), ]) LayerGradCam Now we can apply LayerGradCam to \u0026ldquo;attribute\u0026rdquo; the output of the model to a specific layer of the model. What LayerGradCam does is basically compute the gradients of the output with respect to that specific layer. The following function is used to get a layer from the model by its name.\ndef get_layer(model, layer_name): for name in layer_name.split(\u0026#34;.\u0026#34;): model = getattr(model, name) return model The features.18 layer of MobileNetV2 will be used in this notebook.\nlayer = get_layer(model, \u0026#39;features.18\u0026#39;) We will use LayerGradCam to compute the attribution map (gradients) of the model\u0026rsquo;s top-1 output with respect to layer. This map can be interpreted as indicating to what extent the output is influenced by each unit in layer. 
This makes sense as the larger the gradient, the larger the influence.\nThis attribution map (with the same size as the output of layer, in this case, 7*7) is further upsampled to the size of the image and overlaid on the image as a heatmap. So this heatmap reflects how much influence each pixel has on the output of the model. The pixels with larger influence (the red regions in the heatmap) can thus be interpreted as the main regions in the image that drive the model to generate its output.\nTo enable all above processing of the attribution map, two functions are implemented as follows. The first function to_gray_image converts an np.array to a gray-scale image by normalizing its values to [0, 1], multiplying it by 255, and converting its data type to uint8. The second one compute_heatmap utilizes cv2 to overlay a torch.Tensor as a heatmap on an image.\ndef to_gray_image(x): x -= x.min() x /= x.max() + np.spacing(1) x *= 255 return np.array(x, dtype=np.uint8) def overlay_heatmap(img, grad): # convert PIL Image to numpy array img_np = np.array(img) # convert gradients to heatmap grad = grad.squeeze().detach().numpy() grad_img = to_gray_image(grad) heatmap = cv2.applyColorMap(grad_img, cv2.COLORMAP_JET) heatmap = heatmap[:, :, ::-1] # convert to rgb # overlay heatmap on image return cv2.addWeighted(img_np, 0.5, heatmap, 0.5, 0) In overlay_heatmap, note that img is in RGB order while the heatmap returned by cv2.applyColorMap is in BGR order. So we convert heatmap to RGB order first before the overlay.\nUsing all above functions, the following function attribute computes and overlays the LayerGradCam heatmap on an image.\ndef attribute(img): # preprocess the image preproc_img = preprocessing(img) # forward propagation to get the model outputs inp = preproc_img.unsqueeze(0) out = model(inp) # construct LayerGradCam layer_grad_cam = LayerGradCam(model, layer) # generate attribution map _, out_index = torch.topk(out, k=1) out_index = out_index.squeeze(dim=1) attr = layer_grad_cam.attribute(inp, out_index) upsampled_attr = LayerAttribution.interpolate(attr, (img.height, img.width), \u0026#39;bicubic\u0026#39;) # generate heatmap heatmap = overlay_heatmap(img, upsampled_attr) return heatmap, out_index.item() Specifically, what attribute does is as follows.\nPreprocess the image; Run a forward propagation on the image to get the model\u0026rsquo;s output; Construct a LayerGradCam object using model and layer; Generate the attribution map of the model\u0026rsquo;s top-1 output to layer; Upsample the attribution map to the same size as the image; Overlay the attribution map as a heatmap on the image. Now it is time to run an example! Let\u0026rsquo;s see what class the model predicts on the Hornbill image, and more importantly, why.\nvis, out_index = attribute(img) fig = plt.figure(figsize=(10, 10)) ax = fig.add_subplot(111) ax.set_title(class_names_map[str(out_index)], fontsize=30) plt.imshow(vis) We can see that the model makes a correct prediction. From the above visualization, we can also see that the red regions are mostly around the head and beak of the Hornbill, especiall its heavy bill. The red regions are the main regions that drive the model to generate its output. This makes great sense as those regions are just the distinctive features of a Hornbill.\nNow you can also apply the above technique (and more from Captum) to interpret the output of your PyTorch model. 
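As a quick usage sketch, interpreting an image of your own only takes a few lines, assuming the model, class_names_map, and the attribute function defined above are already in scope (the file name my_image.jpg is hypothetical):

```python
from PIL import Image
import matplotlib.pyplot as plt

# load any RGB image you want to interpret (path is hypothetical)
img = Image.open('my_image.jpg').convert('RGB')

heatmap, out_index = attribute(img)  # reuses the attribute function defined above
plt.figure(figsize=(10, 10))
# class_names_map maps an index string to [wordnet_id, readable_name]
plt.title(class_names_map[str(out_index)][1], fontsize=30)
plt.imshow(heatmap)
plt.show()
```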
Have fun!\nNotes: This post is also available as a Jupyter notebook.\n","date":1582944558,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1582944558,"objectID":"8cb66b84d74de50567bc0380372a92c8","permalink":"https://jianchao-li.github.io/post/interpret-pytorch-models-with-captum/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/post/interpret-pytorch-models-with-captum/","section":"post","summary":"I used Captum to interpret the output of a MobileNetV2, which visualized the main regions in the input image that drove the model to generate its output.","tags":[],"title":"Interpret PyTorch Models with Captum","type":"post"},{"authors":[],"categories":[],"content":"After working with PyTorch in my daily work for some time, recently I got a chance to work on something completely new - Core ML. After converting a PyTorch model to the Core ML format and seeing it work on an iPhone 7, I believe this deserves a blog post.\nWhat is Core ML? Core ML is a framework developed by Apple to integrate machine learning models into iOS applications. Like every other framework, Core ML has its own model format (.mlmodel), like .pth of PyTorch or .params of MXNet.\nCompared to PyTorch or MXNet, Core ML is mainly used as an inference engine on iOS. That means you will first train a model using PyTorch (.pth) or MXNet (.params), then convert it to the Core ML format (.mlmodel) and deploy it to an iOS app.\nGet a Sense First Before diving into details, it is better to get a sense of what a Core ML model looks like. You may download the sample code in Classifying Images with Vision and Core ML. There is a MobileNet.mlmodel inside it. You can open it with Xcode to see what it looks like.\nThe following is a screenshot of the model details. In the center area, there are 3 sections: Machine Learning Model, Model Class and Prediction.\nThe interesting part is the Prediction. It tells us that the input to the model is a color (RGB) image of size 224 x 224 and the outputs have two parts: the top-1 category classLabel and the probabilities of all categories classLabelProbs. This will guide the model conversion later.\nThen you can click the triangle in the following red rectangle to build the project. You can also select the device simulator that you want to run the project on in the blue rectangle.\nYou may need to configure the \u0026ldquo;Signing \u0026amp; Capabilities\u0026rdquo; by clicking the Vision+ML Example folder (an Apple ID will be needed). After that, you should see an iPhone appear on your screen and you can start to add a photo and play with it! If you want to try it on a real iPhone, just connect your iPhone to the computer (USB or Type-C) and then you should be able to select your iPhone in the blue rectangle.\nYou can try more open source Core ML models here. To add a model to the project, you need to drag it to the project structure and set it up as follows. Some files will be generated automatically for you to use the model.\nYou need to change the line let model = try VNCoreMLModel(for: MobileNet().model) in ImageClassificationViewController.swift to use the model. You may also need to update the target iOS version shown in the red rectangle of the following screenshot.\nModel Conversion Now we take a step back. We have just trained a model using PyTorch or MXNet and we want to run it on iOS. Obviously, we need to convert the .pth or .params to .mlmodel. This is model conversion.\nFor Caffe and Keras, their models can be converted to Core ML models directly. 
However, such direct conversion is not supported for PyTorch. Fortunately, we have ONNX, an excellent exchange format between models of various frameworks.\nThe conversion flow from PyTorch to Core ML is as follows. I will use the mobilenet_v2 of torchvision as an example to walk through the conversion process.\nLoading TorchVision Model First I load a MobileNet v2 pretrained on ImageNet. Note that I add a Softmax layer to get the probabilities of all categories (remember by the output classLabelProbs of the Core ML model?).\nimport torch import torch.nn as nn import torchvision model = torchvision.models.mobilenet_v2(pretrained=True) # torchvision models do not have softmax outputs model = nn.Sequential(model, nn.Softmax()) PyTorch to ONNX Then I convert the above PyTorch model to onnx (model.onnx). Note that the input_names and output_names are consistent with the above Core ML model.\ndummy_input = torch.randn(1, 3, 224, 224) torch.onnx.export(model, dummy_input, \u0026#39;mobilenet_v2.onnx\u0026#39;, verbose=True, input_names=[\u0026#39;image\u0026#39;], output_names=[\u0026#39;classLabelProbs\u0026#39;]) ONNX to Core ML Finally, convert the ONNX model to a Core ML model (mobilenet_v2.mlmodel). In this process, the class labels of ImageNet is required, which can be dowloaded to imagenet_class_index.json from here. The image_input_names=['image'] means we treat the image (input of the onnx model) as an image (remember the input image of the above Core ML model?). predicted_feature_name='classLabel' will generate the other output of the above Core ML model.\nimport json import requests from onnx_coreml import convert IMAGENET_CLASS_INDEX_URL = \u0026#39;https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json\u0026#39; def load_imagenet_class_labels(): response = requests.get(IMAGENET_CLASS_INDEX_URL) index_json = response.json() class_labels = [index_json[str(i)][1] for i in range(1000)] return class_labels model_onnx = onnx.load(\u0026#39;mobilenet_v2.onnx\u0026#39;) class_labels = load_imagenet_class_labels() model_coreml = convert(model_onnx, mode=\u0026#39;classifier\u0026#39;, image_input_names=[\u0026#39;image\u0026#39;], class_labels=class_labels, predicted_feature_name=\u0026#39;classLabel\u0026#39;) model_coreml.save(\u0026#39;mobilenet_v2.mlmodel\u0026#39;) Now you can drag the mobilenet_v2.mlmodel to your project and play with it. Have fun!\n","date":1571158818,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1571158818,"objectID":"68dfa2e6f400df506f02f6cbe9603051","permalink":"https://jianchao-li.github.io/post/from-pytorch-to-coreml/","publishdate":"2019-10-16T01:00:18+08:00","relpermalink":"/post/from-pytorch-to-coreml/","section":"post","summary":"I converted a PyTorch model to Core ML and ran it on an iPhone.","tags":[],"title":"From PyTorch to Core ML","type":"post"},{"authors":[],"categories":[],"content":"Introduction In recent years, computer vision has witnessed extremely rapid progress. Somewhat surprisingly, this thriving field was originated from a summer project at MIT in 1966. 
Richard Szeliski wrote in Computer Vision: Algorithms and Applications:\nin 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman to spend the summer linking a camera to a computer and getting the computer to describe what it saw\nThis see-and-describe summarizes the original goal of the pioneers: let the computer see the world around it (expressed in images/videos) and describe it.\nTill now, several granularities of descriptions have been developed: image-level category descriptions (image classification), object-level location descriptions (object detection), and pixel-level dense descriptions (image segmentation).\nHowever, the most natural way to describe something (for humans) is to use natural language. Actually, the above story of Marvin Minsky, though often been cited as an evidence of how those masters underestimated the difficulties of vision problems, also shows that computer vision was born with an expectation of being connected with natural language.\nCurrent Research Many researchers have been seeking to build the connection between computer vision and natural language, which poses a challenging modeling problem with two modalities of data (images and natural language). Nowadays, the research community has generally come to a consensus on modeling images with convolutional neural networks (CNNs) and natural language with recurrent neural networks (RNNs). Both of these architectures can be made deeper by adding layers with homogeneous computations for better performance. Specifically, researchers have tried to build the connection between vision and natural language in the following ways.\nImage/Video Captioning In image/video captioning, an image/video is given and a sentence describing its content is returned. In current research, the image/video is usually encoded into a feature vector by a CNN. Then an RNN generates the captions using this vector as the initial hidden state, as shown in the following two figures (taken from Deep Visual-Semantic Alignments for Generating Image Descriptions and Translating Videos to Natural Language Using Deep Recurrent Neural Networks).\nImage Generation (from Text) This is the inverse problem of image captioning: a sentence is given and an image matching the meaning of the sentence is returned. The recent advances of generative adversarial networks (GANs) have opened up tons of opportunities for generation tasks like this. Typically, image generation makes use of GANs with the text being encoded by an RNN and fed into the generator/discriminator networks, as shown below (taken from Generative Adversarial Text to Image Synthesis).\nVisual Question Answering In visual question answering (VQA), an image and a question about it are given and the answer is returned. This is arguably a very natural way for humans to interact with computers. In recent years, computers have learned to answer questions like is this a cat (classification) or where is the cat (detection/segmentation). Now they are asked more questions like counting, spatial/logical reasoning, and analyzing graphical plots and figures. 
In VQA, the visual content and the question content are often encoded by CNNs and RNNs respectively and then combined in some way to generate the answer, as shown below (taken from Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering).\nFuture Opportunities Current research in connecting computer vision and natural language has achieved great breakthroughs thanks to the success of CNNs and RNNs in the two areas respectively. However, there are still some limitations in current research that open up future opportunities.\nFine-grained Image Captioning Current image captioning generates captions which give an overall description of images. The results of applying a state-of-the-art image captioning algorithm to images from the COCO dataset are shown below.\nIn the COCO dataset, images are of various scenes and objects. And the captioning algorithm is able to capture the overall content of what is happening in the image, except for some mistakes like the cat is not sitting on the laptop. But, in general, the captions are very discriminative considering the large differences between images. Given images and captions, it is very easy to tell which image corresponds to which caption. Image captioning makes great sense in this case.\nThen I applied the same captioning algorithm to the Clothing Co-Parsing (CCP) dataset, whose images are all clothing images. The captions are shown below.\nIn the CCP dataset, images are all coming from the clothing domain and thus they are very similar to each other in the overall content. And the differences are mostly reflected in fine-grained details. In this case, the captions which only capture the overall content become meaningless and are not very helpful for distinguishing one image from others. Moreover, the captions make more mistakes, like a lot of false positives of cell phones.\nFor classifying images in the same domain, researchers have come up with fine-grained image classification. Now to caption these images, whose fine-grained details are much more important than the overall content, it makes sense to state that we need fine-grained image captioning.\nSimilar to fine-grained image classification, which finds many applications in online advertising (like searching by images), fine-grained image captioning also finds an important application in this area, that is, to write captions for goods.\nActually, businesses are always trying to describe the attractive details of their goods to convince customers to make the buying decision. For example, the advertising captions of two clothing images in Toutiao are shown below.\nThe above captions are very different from those of the COCO and CCP datasets. Instead of merely focusing on the overall image content, they try to capture more fine-grained details of the clothes. They even go beyond those details to present customers a sense of how the clothes will look on him/her. These captions are also more flexible since they are manually written by businesses, though a mistake about the color of the dress is made in the right image. So a natural question is whether we can apply image captioning to write such captions for advertising. Obviously, general image captioning is still unable to perform well on it, as shown in the captions of the CCP dataset. So fine-grained image captioning comes into use. 
However, there are still very few works on it.\nIt is worth noting that though I am using clothing as an example domain to present fine-grained image captioning, it is definitely not limited to clothing and can also be applied to many other domains like food and cars.\nTo solve the fine-grained image captioning problem, the considerable number of online advertising captions serves as a good basis. The pipeline of fine-grained image captioning may also be similar to that of general image captioning: a CNN learns a domain-specific representation of the image (maybe via fine-tuning the network in a fine-grained image classification task) and then an RNN generates a fine-grained caption conditioned on the representation. There should be many problems waiting to be discovered and solved in fine-grained image captioning.\nShort Videos In recent years, we have witnessed the increasing popularity of short videos. Many companies like Facebook, Instagram, Kuaishou, Douyin etc. have developed products to enable their users to upload and share short videos. Compared to static images and long videos, short videos have the flexibility and authenticity of videos, and can also be as concentrated (on a topic) as static images.\nLong videos (like movies) contain much more information than what several sentences can describe, which poses challenges to video captioning. They also have a relatively large and non-trivial search space for visual question answering. Given the large number of short videos, a moderate next step is to work on these tasks using short videos.\nModeling short videos can be done by combining CNNs and RNNs: each frame can be modeled by a CNN and the sequence of frames is well suited for an RNN. On the way to connecting computer vision and natural language, short videos act as a good transition state between static images and long videos. Their popularity also provides a lot of applications in recommendation, structured analysis, etc. Successful modeling of short videos will also be helpful for long videos since long videos can be treated as a sequence of relatively short videos (shots).\nConclusions Computer vision has been expected to be connected with natural language since its birth. And humans are good at both of these tasks. So an intelligent agent in the future should preferably have these two kinds of abilities. However, the two areas present two modalities of data, which poses a challenging modeling problem. In recent years, the success of CNNs and RNNs has solved the modeling problem much better than ever before. Based on these homogeneous network architectures, breakthroughs have already been achieved in tasks like image/video captioning, image generation and visual question answering, which all seek to build the connection in some way. These advancements open up more opportunities, like fine-grained image captioning for online advertising and modeling short videos. And efforts spent on solving these problems will become the next \u0026ldquo;small step\u0026rdquo; to enable the computer to describe what it sees.\nFurther Reading If you find this topic (connecting computer vision and natural language) interesting, I strongly recommend reading Andrej Karpathy\u0026rsquo;s PhD thesis - Connecting Images and Natural Language. 
Actually the title of this blog post is inspired by his thesis.\n","date":1543304160,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1543304160,"objectID":"634c163457627b1765d597dc6a7e5381","permalink":"https://jianchao-li.github.io/post/connecting-computer-vision-and-natural-language/","publishdate":"2018-11-27T15:36:00+08:00","relpermalink":"/post/connecting-computer-vision-and-natural-language/","section":"post","summary":"I combined my previous posts on image captioning and visual question answering and extended them to a wider topic - connecting computer vision and natural language.","tags":[],"title":"Connecting Computer Vision and Natural Language","type":"post"},{"authors":[],"categories":[],"content":"As you may have noticed from the title, this post is somewhat different from my previous ones. I would like to talk about a PyTorch DataLoader issue I encountered recently. I feel like devoting a post to it because it has taken me a long time to figure out how to fix it.\nMemory consumption Since my last post on FCNs, I have been working on semantic segmentation. Nowadays, we have deep neural networks for it, like the state-of-the-art PSPNet from CVPR 2017.\nIn practice, segmentation networks are much more memory-intensive than recognition/classification networks. The reason is that semantic segmentation requires dense pixel-level predictions. For example, in the ImageNet classification task, you may use a neural network to transform a 224x224 image into 1000 real numbers (class probabilities). However, in semantic segmentation, suppose you have 20 semantic classes, you need to transform the 224x224 image into 20 224x224 probability maps, each representing the probabilities of pixels belonging to one class. The output size changes from 1000 to 20x224x224=1003520, which is more than 1000 times larger!\nBesides the output, the intermediate feature maps in segmentation networks also consume more memory. In recognition networks, the sizes of intermediate feature maps usually decrease monotonically. However, since segmentation requires an output of the same spatial dimensions as the input, the feature maps will go through an extra process with their sizes increased (upsampled) back to the size of the input image. This extra upsampling process further increases the memory consumption of segmentation networks.\nSo, when we fit segmentation networks on a GPU, we need to reduce the batch size of the data. However, batch size is crucial to the performance of networks, especially those containing the batch normalization layer. Since no more data can be held on a single GPU, a natural solution is to use multiple GPUs and split the data across them (or more formally, data parallelism).\nSynchronized batch normalization Here I would like to make a digression and mention an interesting layer, the synchronized batch normalization layer, which is introduced to increase the working batch size for multi-GPU training. You may refer to the section Cross-GPU Batch Normalization in MegDet for more details.\nWhen we use data parallelism to train on multiple GPUs, a batch of images will be split across several GPUs. Suppose your batch size is 16 (a common setting in semantic segmentation) and you train on 8 GPUs with data parallelism, then each GPU will have 2 images. 
A normal batch norm layer will only uses the 2 images on a single GPU to compute the mean and standard deviation, which is highly inaccurate and will make the training unstable.\nTo effectively increase the working batch size, we need to synchronize all the GPUs in the batch norm layer, and fetch the mean and standard deviation computed at each GPU to compute a global value using all images. This is what synchronized batch norm layer does. If you would like to learn more about its implementation details, you may have a look at Synchronized-BatchNorm-PyTorch.\nThe Issue After so much background information, the main idea is that semantic segmentation networks are very memory-intensive and require multiple GPUs to train a reasonable batch size. And synchronized batch norm can be used to increase the working batch size in multi-GPU training.\nNow comes the issue that I encountered recently. I was working with a semantic segmentation codebase written in PyTorch on a machine with 8 GPUs. The codebase incorporates synchronized batch norm and uses PyTorch multiprocessing for its custom DataLoader. I ran the training program for some time and then I killed it (I was running the program in a virtualized docker container in a cloud GPU cluster. So killing it is just to click a button in the cloud GUI).\nThen I checked the GPUs using nvidia-smi and everything looked good.\n+-----------------------------------------------------------------------------+ | NVIDIA-SMI 390.46 Driver Version: 390.46 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Tesla V100-PCIE... Off | 00000000:1A:00.0 Off | 0 | | N/A 32C P0 25W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 1 Tesla V100-PCIE... Off | 00000000:1F:00.0 Off | 0 | | N/A 34C P0 25W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 2 Tesla V100-PCIE... Off | 00000000:20:00.0 Off | 0 | | N/A 33C P0 25W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 3 Tesla V100-PCIE... Off | 00000000:21:00.0 Off | 0 | | N/A 33C P0 23W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 4 Tesla V100-PCIE... Off | 00000000:B2:00.0 Off | 0 | | N/A 32C P0 26W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 5 Tesla V100-PCIE... Off | 00000000:B3:00.0 Off | 0 | | N/A 35C P0 26W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 6 Tesla V100-PCIE... Off | 00000000:B4:00.0 Off | 0 | | N/A 34C P0 25W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 7 Tesla V100-PCIE... 
Off | 00000000:B5:00.0 Off | 0 | | N/A 35C P0 25W / 250W | 11MiB / 16160MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ But then when I tried to start a new training program. An OOM error occurred. For the sake of privacy, some traceback logs were omitted by ....\nTraceback (most recent call last): ... File \u0026#34;/usr/local/lib/python3.6/site-packages/torch/cuda/streams.py\u0026#34;, line 21, in __new__ return super(Stream, cls).__new__(cls, priority=priority, **kwargs) RuntimeError: CUDA error (2): out of memory Exception in thread Thread-1: Traceback (most recent call last): File \u0026#34;/usr/local/lib/python3.6/threading.py\u0026#34;, line 916, in _bootstrap_inner self.run() File \u0026#34;/usr/local/lib/python3.6/threading.py\u0026#34;, line 864, in run self._target(*self._args, **self._kwargs) ... File \u0026#34;/usr/local/lib/python3.6/multiprocessing/queues.py\u0026#34;, line 337, in get return _ForkingPickler.loads(res) File \u0026#34;/usr/local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py\u0026#34;, line 151, in rebuild_storage_fd fd = df.detach() File \u0026#34;/usr/local/lib/python3.6/multiprocessing/resource_sharer.py\u0026#34;, line 57, in detach with _resource_sharer.get_connection(self._id) as conn: File \u0026#34;/usr/local/lib/python3.6/multiprocessing/resource_sharer.py\u0026#34;, line 87, in get_connection c = Client(address, authkey=process.current_process().authkey) File \u0026#34;/usr/local/lib/python3.6/multiprocessing/connection.py\u0026#34;, line 493, in Client answer_challenge(c, authkey) File \u0026#34;/usr/local/lib/python3.6/multiprocessing/connection.py\u0026#34;, line 732, in answer_challenge message = connection.recv_bytes(256) # reject large message File \u0026#34;/usr/local/lib/python3.6/multiprocessing/connection.py\u0026#34;, line 216, in recv_bytes buf = self._recv_bytes(maxlength) File \u0026#34;/usr/local/lib/python3.6/multiprocessing/connection.py\u0026#34;, line 407, in _recv_bytes buf = self._recv(4) File \u0026#34;/usr/local/lib/python3.6/multiprocessing/connection.py\u0026#34;, line 379, in _recv chunk = read(handle, remaining) ConnectionResetError: [Errno 104] Connection reset by peer I ran nvidia-smi again and everything still seemed good. 
So I wrote a check.cu to check the GPU memory using CUDA APIs.\n#include \u0026lt;iostream\u0026gt; #include \u0026#34;cuda.h\u0026#34; #include \u0026#34;cuda_runtime_api.h\u0026#34; using namespace std; int main( void ) { int num_gpus; size_t free, total; cudaGetDeviceCount( \u0026amp;num_gpus ); for ( int gpu_id = 0; gpu_id \u0026lt; num_gpus; gpu_id++ ) { cudaSetDevice( gpu_id ); int id; cudaGetDevice( \u0026amp;id ); cudaMemGetInfo( \u0026amp;free, \u0026amp;total ); cout \u0026lt;\u0026lt; \u0026#34;GPU \u0026#34; \u0026lt;\u0026lt; id \u0026lt;\u0026lt; \u0026#34; memory: free=\u0026#34; \u0026lt;\u0026lt; free \u0026lt;\u0026lt; \u0026#34;, total=\u0026#34; \u0026lt;\u0026lt; total \u0026lt;\u0026lt; endl; } return 0; } Again, everything looked good.\n$ nvcc check.cu -o check \u0026amp;\u0026amp; ./check GPU 0 memory: free=16488464384, total=16945512448 GPU 1 memory: free=16488464384, total=16945512448 GPU 2 memory: free=16488464384, total=16945512448 GPU 3 memory: free=16488464384, total=16945512448 GPU 4 memory: free=16488464384, total=16945512448 GPU 5 memory: free=16488464384, total=16945512448 GPU 6 memory: free=16488464384, total=16945512448 GPU 7 memory: free=16488464384, total=16945512448 Since the error happened to PyTorch, I moved on to write a check.py, which created a single-element PyTorch CUDA tensor for sanity check. And this script reproduced the OOM error.\nimport torch if __name__ == \u0026#39;__main__\u0026#39;: num_gpus = torch.cuda.device_count() for gpu_id in range(num_gpus): try: torch.cuda.set_device(gpu_id) torch.randn(1, device=\u0026#39;cuda\u0026#39;) print(\u0026#39;GPU {} is good\u0026#39;.format(gpu_id)) except Exception as exec: print(\u0026#39;GPU {} is bad: {}\u0026#39;.format(gpu_id, exec)) The output was as follows: GPU 1 and 2 were OOM.\n$ python3 check.py GPU 0 is good GPU 1 is bad: CUDA error: out of memory THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=2 : out of memory GPU 2 is bad: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCGeneral.cpp:663 GPU 3 is good GPU 4 is good GPU 5 is good GPU 6 is good GPU 7 is good So, my GPUs 2 and 3 should be magically occupied by some zombie process. And I had to restart the machine to fix it. I think the zombie process was generated due to my incorrect way of killing the training program. So I decided not to use the kill button in the cloud GUI but logged into the docker container to kill it in the terminal.\nI searched on Google for how to kill a PyTorch multi-GPU training program. And I found @smth\u0026rsquo;s suggestion in this reply.\n@rabst so, I remember this issue. When investigating, we found that there’s actually a bug in python multiprocessing that might keep the child process hanging around, as zombie processes. It is not even visible to nvidia-smi . The solution is killall python , or to ps -elf | grep python and find them and kill -9 [pid] to them.\nIt explained why nvidia-smi failed to reveal the memory issue. Great! But, the above commands did not work for me\u0026hellip;\nNothing is so fatiguing as the eternal haning on of an uncompleted task.\nThe Solution After several days of searching, failing, searching again, failing again etc., I finally found one solution. It is just to find out the processes that occupied the GPUs and kill them. To find out those processes, I ran fuser -v /dev/nvidia*, which listed all the processes that were occupying my NVIDIA GPUs. 
Since I have 8 GPUs, the output of this command is a bit log.\n$ fuser -v /dev/nvidia* USER PID ACCESS COMMAND /dev/nvidia0: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia1: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia2: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia3: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia4: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia5: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia6: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia7: root 5284 F...m python3 root 5416 F...m python3 root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidiactl: root 5284 F...m python3 root 5416 F...m python3 
root 5417 F...m python3 root 5418 F...m python3 root 5419 F...m python3 root 5420 F...m python3 root 5421 F...m python3 root 5422 F...m python3 root 5423 F...m python3 root 5424 F...m python3 root 5425 F...m python3 root 5426 F...m python3 root 5427 F...m python3 root 5428 F...m python3 root 5429 F...m python3 root 5430 F...m python3 root 5431 F...m python3 /dev/nvidia-uvm: root 5284 F.... python3 root 5416 F.... python3 root 5417 F.... python3 root 5418 F.... python3 root 5419 F.... python3 root 5420 F.... python3 root 5421 F.... python3 root 5422 F.... python3 root 5423 F.... python3 root 5424 F.... python3 root 5425 F.... python3 root 5426 F.... python3 root 5427 F.... python3 root 5428 F.... python3 root 5429 F.... python3 root 5430 F.... python3 root 5431 F.... python3 As can be seen from above, the main process had a PID of 5284. I spawned 16 workers for the DataLoader so there were 16 subprocesses whose PIDs were consecutive (from 5416 to 5431). First I used kill -9 to kill all of them. Then killed the main process.\n$ for pid in {5416..5431}; do kill -9 $pid; done # kill subprocesses $ kill -9 5284 # kill main process After killing the subprocesses and main process, I ran check.py again and this time every GPU was good.\n$ python3 check.py GPU 0 is good GPU 1 is good GPU 2 is good GPU 3 is good GPU 4 is good GPU 5 is good GPU 6 is good GPU 7 is good Another Trick If the above solution still does not work for you (it does happen to me sometimes), the following trick may be helpful. First, find out the training loop of your program. In most cases it will contain a loop based on the number of iterations. Then add the following code to that loop.\nif os.path.isfile(\u0026#39;kill.me\u0026#39;): num_gpus = torch.cuda.device_count() for gpu_id in range(num_gpus): torch.cuda.set_device(gpu_id) torch.cuda.empty_cache() exit(0) Inside the if statement, the code empties the caches of all GPUs and exits. After you add this code to the training iteration, once you want to stop it, just cd into the directory of the training program and run\ntouch kill.me Then in the current or next iteration (based on whether the above code has been executed), the if check will become true and all GPUs will be cleared and the program will exit. Since you directly tell Python to exit in the program, it will take care of everything for you. You may use anything instead of kill.me. But just make sure it is special enough and thus you will not terminate the training inadvertently by creating a file with the same name.\nConclusions The issue made me stuck for long time. And in this process of looking for the solution, I made some expensive trial and error. Several GPU servers in the cloud had a card OOM due to my incorrect way of killing the training program. And I had to ask the administrators to restart them. So I would definitely like others to avoid such a case.\nFrom another perspective, I would like to highlight the importance of engineering capabilities and experiences. Though I was working on semantic segmentation, I spent most of my time digging through all sorts of problems while running the multi-GPU codes.\nA final remark, I would not like to leave you an impression that I am blaming the issue on PyTorch, CUDA, the cloud GPU cluster, or any others. Actually it is mainly due to that I do not understand how PyTorch multi-GPU and multiprocessing work. And I think I will need to study these topics more systematically. 
Hopefully I will write a new post after learning more about them :-)\n","date":1541141890,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1541141890,"objectID":"fe49ad10b708b26f066da5214af9fda9","permalink":"https://jianchao-li.github.io/post/killing-pytorch-multi-gpu-training-the-safe-way/","publishdate":"2018-11-02T14:58:10+08:00","relpermalink":"/post/killing-pytorch-multi-gpu-training-the-safe-way/","section":"post","summary":"Recently I was working with PyTorch multi-GPU training and I came across a nightmare GPU memory problem. After some expensive trial and error, I finally found a solution for it.","tags":[],"title":"Killing Pytorch Multi Gpu Training the Safe Way","type":"post"},{"authors":[],"categories":[],"content":"Fully convolutional networks, or FCNs, were proposed by Jonathan Long, Evan Shelhamer and Trevor Darrell in CVPR 2015 as a framework for semantic segmentation.\nSemantic segmentation Semantic segmentation is a task in which given an image, we need to assign a semantic label (like cat, dog, person, background etc.) to each of its pixels. The following are some examples taken from The PASCAL VOC data sets with different colors representing different semantic classes.\nThe PASCAL VOC data sets define 20 semantic classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor. Actually these are all object classes. For pixels falling into non-object classes (which are called stuff) like sky, they will be labeled as \u0026ldquo;background\u0026rdquo;.\nSome later semantic segmentation data sets like The Cityscapes Dataset and The COCO-Stuff dataset account for both object and stuff classes. The following are some examples taken from the COCO-Stuff dataset, with class names also shown.\nHow to use CNNs for segmentation Back in 2015, convolutional neural networks (CNNs) have achieved tremendous success in image classification (like AlexNet and GoogLeNet), doing extremely well in learning from a fixed-size image to a single category label. Generally, the image will go through several convolutional layers and then fully connected layers, giving a fixed-size (equal to the number of classes) output, whose softmax loss with respect to the ground truth label will be computed and back-propagated to update parameters.\nHowever, in semantic segmentation, we need pixel-wise dense predictions instead of just a single label. For example, for a 256x128 image, the segmentation mask is also of size 256x128. We need to use CNNs to generate a predicted mask and then compute the loss between it and the ground truth one.\nI was working on object segmentation at the end of 2014. Object segmentation is arguably much simpler than semantic segmentation in that it only classifies the pixels into two classes: the backround and the foreground (a specific object, like pedestrian, horse, bird etc.). The following are some examples taken from Caltech-UCSD Birds-200-2011 with the object masks shown below the images.\nSince CNNs are good at handling fixed-size data, a simple idea is to fix both the sizes of the image and the object mask. Then we set the number of output units of the last fully connected layer to be equal to the number of mask elements.\nFor loss computation, since I was working on object segmentation with only two values (0 for background and 1 for foreground) in the mask, I simply adopted the Euclidean loss. 
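To make this concrete, here is a minimal, hypothetical PyTorch sketch of the regress-a-flattened-mask idea. The toy layer sizes and the MSE loss stand in for the original network and Euclidean loss; this is not the model from the ICIP 2015 paper:

```python
import torch
import torch.nn as nn

# toy CNN that regresses a flattened 64x64 object mask from a 224x224 image
class MaskRegressor(nn.Module):
    def __init__(self, mask_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 224 -> 112
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 112 -> 56
            nn.AdaptiveAvgPool2d(7),                               # 56 -> 7
        )
        self.regressor = nn.Linear(32 * 7 * 7, mask_size * mask_size)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return torch.sigmoid(self.regressor(x))  # one value in [0, 1] per mask pixel

model = MaskRegressor()
images = torch.randn(4, 3, 224, 224)
masks = torch.randint(0, 2, (4, 64 * 64)).float()  # flattened binary masks
loss = nn.MSELoss()(model(images), masks)          # Euclidean-style regression loss
loss.backward()
```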
This extremely simple idea achieved state-of-the-art results on some object segmentation datasets and was published in ICIP 2015 under the title Object segmentation with deep regression.\nChanging fully connected layers to convolutional layers However, the above idea is just not elegant. It mechanically resizes the object mask to the designated size and flattens it into a vector as the regression target. For semantic segmentation, we will have one mask for each class. If we still would like to use the same idea, we need to flatten all the masks into vectors and concatenate them into a giant vector for regression. For example, in PASCAL VOC, there are 21 classes. Suppose we set the mask size to be 64x64, then we need to regress a vector with 64x64x21=86016 elements using a fully connected layer. If the input to this layer is 512x14x14 (you will come across this size in the section on VGG), the weight matrix will have 512x14x14x86016=8631877632 (over 8.6 billion) elements! If represented using 4-byte floats, this matrix alone will occupy more than 34 gigabytes! Well, let\u0026rsquo;s try to save some parameter space.\nA common idea to avoid the large number of parameters consumed by fully connected layers is to use convolutional layers. For example, we may use a convolutional layer to generate a 21x64x64 output feature map (we no longer need to flatten the masks into vectors). In this way, we only need 21 convolutional kernels, which use considerably fewer parameters. Actually, for each of the 64x64 positions, the values across the 21 maps represent the probabilities of that location belonging to each of the 21 classes. So we may compute a softmax loss for each position and take an average over all the positions as the final loss. Semantic segmentation can actually be treated as a pixel-wise classification task. Therefore, computing a softmax loss makes more sense than using the Euclidean loss.\nLet\u0026rsquo;s summarize the above idea. Given an image, we resize it to be, say, 224x224. And we resize the segmentation masks (suppose there are 21 of them) to be 64x64. Then we carefully design a convolutional network to transform the 3x224x224 image (3 represents the R, G, B channels) into a 21x64x64 feature map, compute the softmax loss over it, and use back-propagation to learn the network parameters.\nI am not sure how you would perceive this idea. From my perspective, a natural question to ask is: why not just set the masks to be of the same size as the image? Well, we can actually make this happen by using same padding in all the convolutional layers and discarding all the pooling layers.\nHowever, CNNs typically include pooling layers to downsample the feature maps, which have proven effective in image classification and have become standard practice. So it is not a good idea to discard the pooling layers. Now, since pooling layers will reduce the size of feature maps, how can we enlarge them to be of the same size as the input image? In fully convolutional networks, the authors proposed to use deconvolutional layers to upsample the feature maps.\nDeconvolutional layers Actually, deconvolution is a misleading name, as pointed out in A guide to convolution arithmetic for deep learning, which suggests using transposed convolution instead. There are also other names for your choice: upconvolution, fractionally strided convolution, and backward strided convolution, as mentioned in CS231n 2018 Lecture 11.\nIn deep learning, deconvolution is just a convolution with the input size and output size swapped.
In convolution, the relationship between the input size and the output size can be expressed as follows.\n$H_{out} = \\frac{H_{in} + 2P - K}{S} + 1 \\tag{1}\\label{eq1}$\n$W_{out} = \\frac{W_{in} + 2P - K}{S} + 1 \\tag{2}\\label{eq2}$\n$H_{out}$, $H_{in}$, $W_{out}$, $W_{in}$, $P$, $K$, and $S$ represent the output height, input height, output width, input width, padding, kernel size and stride respectively. In this post, we assume that the same padding and stride are used in both the height and width dimensions.\nFor example, if you are convolving a 4x4 input ($H_{in} = W_{in} = 4$) with a 3x3 kernel ($K = 3$) with stride 1 ($S = 1$) and no padding ($P = 0$), you will get an output of size 2x2 ($\\frac{4 + 2 \\times 0 - 3}{1} + 1 = 2$). So this is a convolution that transforms a 4x4 input to a 2x2 output with a 3x3 convolutional kernel.\nNow, in deconvolution, we would like to transform a 2x2 input to a 4x4 output using the same 3x3 convolutional kernel. Since deconvolution is still convolution, equations $\\eqref{eq1}$ and $\\eqref{eq2}$ still hold. Suppose we also use stride 1, then we need to solve $4 = \\frac{2 + 2P - 3}{1} + 1$, which gives $P = 2$. So the corresponding deconvolution is just a convolution of a 2x2 input with the same 3x3 kernel, stride 1, and padding 2.\nNow let\u0026rsquo;s look at the relationship between the input size and output size in deconvolution. As mentioned above, deconvolution is a convolution with swapped input size and output size. So we can derive the relationship just by swapping $H_{in}$ with $H_{out}$ and $W_{in}$ with $W_{out}$ in equations $\\eqref{eq1}$ and $\\eqref{eq2}$, which gives\n$$H_{in} = \\frac{H_{out} + 2P - K}{S} + 1 \\tag{3}\\label{eq3}$$\n$$W_{in} = \\frac{W_{out} + 2P - K}{S} + 1 \\tag{4}\\label{eq4}$$\nBy moving $H_{out}$ and $W_{out}$ to the left-hand side, we finally get\n$$H_{out} = SH_{in} + K - S - 2P \\tag{5}\\label{eq5}$$\n$$W_{out} = SW_{in} + K - S - 2P \\tag{6}\\label{eq6}$$\nNow you know how to design a deconvolutional layer to upsample an input size to a specific output size. With deconvolution, we can upsample downsampled feature maps back to the same size as the input image. I think deconvolution is indeed the key to FCNs. For more details of deconvolution, please refer to A guide to convolution arithmetic for deep learning.\nVGG Before introducing FCNs, I would like to talk about VGG, which is the backbone network for the FCN example that will be presented in the next section. It was developed by Karen Simonyan and Andrew Zisserman and won second place in the image classification task of ILSVRC 2014.\nSpecifically, I will use the 16-layer VGG as an example. The following table is a breakdown of the network layer by layer. Note that the stride and padding are 1 and 0 by default if not specified. And the last softmax layer for loss computation is ignored.
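As a quick aside before walking through VGG, equations $\\eqref{eq1}$, $\\eqref{eq2}$, $\\eqref{eq5}$ and $\\eqref{eq6}$ are easy to sanity-check in code. The following sketch uses PyTorch purely for illustration (the examples in this post are in Caffe) and reproduces the 4x4 to 2x2 convolution and the 2x2 to 4x4 deconvolution from the example above. Note that the padding argument of ConvTranspose2d plays exactly the role of $P$ in equations $\\eqref{eq5}$ and $\\eqref{eq6}$.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)  # a 4x4 input
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
y = conv(x)
print(y.shape)    # torch.Size([1, 1, 2, 2]), as given by equations (1) and (2)

deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0)
z = deconv(y)
print(z.shape)    # torch.Size([1, 1, 4, 4]), as given by equations (5) and (6)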
By Size, we mean the shape of the output blobs, which is computed using equations $\\eqref{eq1}$ and $\\eqref{eq2}$.\nName Type Params Size data Data 3x224x224 conv1_1 Convolution 64 3x3 kernels, padding 1 64x224x224 relu1_1 ReLU 64x224x224 conv1_2 Convolution 64 3x3 kernels, padding 1 64x224x224 relu1_2 ReLU 64x224x224 pool1 Pooling max 2x2, stride 2 64x112x112 conv2_1 Convolution 128 3x3 kernels, padding 1 128x112x112 relu2_1 ReLU 128x112x112 conv2_2 Convolution 128 3x3 kernels, padding 1 128x112x112 relu2_2 ReLU 128x112x112 pool2 Pooling max 2x2, stride 2 128x56x56 conv3_1 Convolution 256 3x3 kernels, padding 1 256x56x56 relu3_1 ReLU 256x56x56 conv3_2 Convolution 256 3x3 kernels, padding 1 256x56x56 relu3_2 ReLU 256x56x56 conv3_3 Convolution 256 3x3 kernels, padding 1 256x56x56 relu3_3 ReLU 256x56x56 pool3 Pooling max 2x2, stride 2 256x28x28 conv4_1 Convolution 512 3x3 kernels, padding 1 512x28x28 relu4_1 ReLU 512x28x28 conv4_2 Convolution 512 3x3 kernels, padding 1 512x28x28 relu4_2 ReLU 512x28x28 conv4_3 Convolution 512 3x3 kernels, padding 1 512x28x28 relu4_3 ReLU 512x28x28 pool4 Pooling max 2x2, stride 2 512x14x14 conv5_1 Convolution 512 3x3 kernels, padding 1 512x14x14 relu5_1 ReLU 512x14x14 conv5_2 Convolution 512 3x3 kernels, padding 1 512x14x14 relu5_2 ReLU 512x14x14 conv5_3 Convolution 512 3x3 kernels, padding 1 512x14x14 relu5_3 ReLU 512x14x14 pool5 Pooling max 2x2, stride 2 512x7x7 fc6 InnerProduct 25088x4096 weight, 1x4096 bias 4096 relu6 ReLU 4096 drop6 Dropout p=0.5 4096 fc7 InnerProduct 4096x4096 weight, 1x4096 bias 4096 relu7 ReLU 4096 drop7 Dropout p=0.5 4096 fc8 InnerProduct 4096x1000 weight, 1x1000 bias 1000 As can be seen, VGG only uses 3x3 convolutional kernels and 2x2 pooling kernels with stride 2. This simple and homogeneous structure accounts for its popularity to some degree.\nThere is a nice visualization of the 16-layer VGG in netscope. By hovering your mouse over the layers, you will be able to see their parameters and output shapes, which should be the same to those in the above table. The CS231n 2018 Lecture 9 also covers this popular network.\nFully convolutional networks Now you are ready to embrace the idea of FCNs. It is fairly simple: first downsample the image to smaller feature maps and then upsample them to the segmentation masks (of the same size as the image). CS231n 2018 Lecture 11 has the following nice illustration which summarizes this process. Actually I think the $D_3 \\times H/4 \\times W/4$ of Low-res should be $D_3 \\times H/8 \\times W/8$. Anyway, you can just ignore the captions. The picture has reflected the core idea.\nTo gain more understanding, let\u0026rsquo;s walk through a concrete example - voc-fcn32s, an adaptation of the 16-layer VGG into an FCN for semantic segmentation in the PASCAL VOC data sets. Since this dataset has 21 classes, we need to learn 21 segmentation masks.\nLet\u0026rsquo;s also break the voc-fcn32s down layer by layer. Note that the size of data is now $3 \\times H \\times W$. In this way, we will show that FCN is able to handle input of any size! 
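Before the layer-by-layer breakdown, here is a tiny illustration of why being fully convolutional enables arbitrary input sizes. This is a toy network written in PyTorch for illustration only (not voc-fcn32s itself): with only convolutional and pooling layers and no fixed-size fully connected layer, the same network runs on inputs of different sizes.

import torch
import torch.nn as nn

# A toy fully convolutional stack (for illustration only)
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=2),
    nn.Conv2d(16, 21, 1))  # a 1x1 classifier convolution producing 21 maps

for h, w in [(224, 224), (375, 500)]:
    out = net(torch.randn(1, 3, h, w))
    print(out.shape)  # 1 x 21 x (h // 2) x (w // 2), whatever the input size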
All default settings are the same to those in the above table.\nName Type Params Size data Data $3 \\times H \\times W$ conv1_1 Convolution 64 3x3 kernels, padding 100 $64 \\times \\left(H + 198\\right) \\times \\left(W + 198\\right)$ relu1_1 ReLU $64 \\times \\left(H + 198\\right) \\times \\left(W + 198\\right)$ conv1_2 Convolution 64 3x3 kernels, padding 1 $64 \\times \\left(H + 198\\right) \\times \\left(W + 198\\right)$ relu1_2 ReLU $64 \\times \\left(H + 198\\right) \\times \\left(W + 198\\right)$ pool1 Pooling max 2x2, stride 2 $64 \\times \\left(\\frac{H}{2} + 99\\right) \\times \\left(\\frac{W}{2} + 99\\right)$ conv2_1 Convolution 128 3x3 kernels, padding 1 $128 \\times \\left(\\frac{H}{2} + 99\\right) \\times \\left(\\frac{W}{2} + 99\\right)$ relu2_1 ReLU $128 \\times \\left(\\frac{H}{2} + 99\\right) \\times \\left(\\frac{W}{2} + 99\\right)$ conv2_2 Convolution 128 3x3 kernels, padding 1 $128 \\times \\left(\\frac{H}{2} + 99\\right) \\times \\left(\\frac{W}{2} + 99\\right)$ relu2_2 ReLU $128 \\times \\left(\\frac{H}{2} + 99\\right) \\times \\left(\\frac{W}{2} + 99\\right)$ pool2 Pooling max 2x2, stride 2 $128 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ conv3_1 Convolution 256 3x3 kernels, padding 1 $256 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ relu3_1 ReLU $256 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ conv3_2 Convolution 256 3x3 kernels, padding 1 $256 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ relu3_2 ReLU $256 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ conv3_3 Convolution 256 3x3 kernels, padding 1 $256 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ relu3_3 ReLU $256 \\times \\left(\\frac{H + 2}{4} + 49\\right) \\times \\left(\\frac{W + 2}{4} + 49\\right)$ pool3 Pooling max 2x2, stride 2 $256 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ conv4_1 Convolution 512 3x3 kernels, padding 1 $512 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ relu4_1 ReLU $512 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ conv4_2 Convolution 512 3x3 kernels, padding 1 $512 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ relu4_2 ReLU $512 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ conv4_3 Convolution 512 3x3 kernels, padding 1 $512 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ relu4_3 ReLU $512 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ pool4 Pooling max 2x2, stride 2 $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ conv5_1 Convolution 512 3x3 kernels, padding 1 $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ relu5_1 ReLU $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ conv5_2 Convolution 512 3x3 kernels, padding 1 $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ relu5_2 ReLU $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ conv5_3 Convolution 512 3x3 kernels, padding 
1 $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ relu5_3 ReLU $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ pool5 Pooling max 2x2, stride 2 $512 \\times \\left(\\frac{H + 6}{32} + 6\\right) \\times \\left(\\frac{W + 6}{32} + 6\\right)$ fc6 Convolution 4096 7x7 kernels $4096 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ relu6 ReLU $4096 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ drop6 Dropout p=0.5 $4096 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ fc7 Convolution 4096 1x1 kernels $4096 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ relu7 ReLU $4096 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ drop7 Dropout $4096 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ score_fr Convolution 21 1x1 kernels $21 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ upscore Deconvolution 21 64x64 kernels, stride 32 $21 \\times \\left(H + 38\\right) \\times \\left(W + 38\\right)$ score Crop Explained below $21 \\times H \\times W$ Several interesting facts worth notice have been highlighted in red. Let\u0026rsquo;s go over them one by one.\nThe most interesting and confusing one is probably the padding 100 in conv1_1. Why do FCNs use padding 100 instead of just 1 as does VGG? Well, let\u0026rsquo;s try to use padding 1 and see what will happen. Using equations $\\eqref{eq1}$ and $\\eqref{eq2}$ repeatedly, we can compute that the corresponding output size of pool5 will be $512 \\times \\frac{H}{32} \\times \\frac{W}{32}$.\nSo far so good. But now comes fc6 with 4096 7x7 kernels. By plugging the variables into $\\eqref{eq1}$ and $\\eqref{eq2}$, the output size of fc6 will be $4096 \\times \\frac{H - 192}{32} \\times \\frac{W - 192}{32}$. To make $\\frac{H - 192}{32}$ and $\\frac{W - 192}{32}$ positive (at least 1), both $H$ and $W$ should be greater than or equal to 224. This means that if we use padding 1 in conv1_1, the FCN will only be able to handle images not smaller than 224x224. However, we would like FCN to be able to handle input of any size, which is one of its main advantages. So we need to add more padding in conv1_1 and 100 is a sensible value.\nWe also see that both fc6 and fc7 are now convolutional layers, fitting the name fully convolutional networks. In the deconvolutional layer upscore, the feature maps of score_frwith size $\\frac{H + 6}{32}$x$\\frac{W + 6}{32}$ are upsampled to $\\left(H + 38\\right) \\times \\left(W + 38\\right)$. You may try to verify the correctness of this output size using equations $\\eqref{eq5}$ and $\\eqref{eq6}$.\nAfter upscore, we have an output feature map of $21 \\times \\left(H + 38\\right) \\times \\left(W + 38\\right)$. However, what we want is $21 \\times H \\times W$. So here comes the last but not least Crop layer, which is used to crop the input and defined as follows in Caffe.\nlayer { name: \u0026#34;score\u0026#34; type: \u0026#34;Crop\u0026#34; bottom: \u0026#34;upscore\u0026#34; bottom: \u0026#34;data\u0026#34; top: \u0026#34;score\u0026#34; crop_param { axis: 2 offset: 19 } } This Crop layer accepts upscore ($21 \\times \\left(H + 38\\right) \\times \\left(W + 38\\right)$) and data ($3 \\times H \\times W$) from the two bottom fields. It also has two parameters: axis: 2 and offset: 19. In Caffe, a feature map (blob) is of size $N \\times C \\times H \\times W$, with $N$, $C$, $H$ and $W$ being the 0th, 1st, 2nd and 3rd dimension. 
So upscore and data are actually of size $N \\times 21 \\times \\left(H + 38\\right) \\times \\left(W + 38\\right)$ and $N \\times 3 \\times H \\times W$ respectively. axis: 2 means to crop from the 2nd dimension (inclusive). So only the dimension $H$ and $W$ of upscore ($\\left(H + 38\\right) \\times \\left(W + 38\\right)$) will be cropped to be the same as data ($H \\times W$). And offset: 19 specifies the starting index of the cropping, which means that upscore will be cropped to be upscore[19: 19 + H, 19: 19 + W], literally the central part of upscore. The following is an illustration of this process, with the green part being the cropped region score.\nSimplifying FCN to a stack of convolutional layers As shown in the above example, we use a Crop layer with offset: 19 to crop the feature maps. This cropping layer comes into use since sometimes the deconvolutional (upsampling) layer may not precisely generate an $H \\times W$ feature map. Instead, it may give us something like $\\left(H + 2T\\right) \\times \\left(W + 2T\\right)$. In this case we need to determine the offset $T$ to do the cropping.\nIn the previous section, we break down the network layer by layer and write down the output shape for each layer, based on which we compute the offset of the Crop layer. In the coming sections, we will look at a general case and derive the offset.\nIn this section, we first simplify an FCN into a stack of $n$ convolutional layers, as shown below. This will make later derivation easier.\n$$ \\begin{align} \\overset{\\text{Input}}{\\longrightarrow}\\fbox{conv-1}\\longrightarrow\\fbox{conv-2}\\longrightarrow\\dots\\longrightarrow\\fbox{conv-n}\\overset{\\text{Output}}{\\longrightarrow} \\end{align} $$\nHowever, FCNs also have other layers like pooling layers, deconvolutional layers, or ReLU layers. So why do we only consider convolutional layers?\nWell, if you have walked through the computation of the offset in voc-fcn32s, you will notice that the offset is only related to the size ($H$ and $W$) of the feature maps. And in FCNs, only convolutional layers, deconvolutional layers and pooling layers will change the feature map size. So we can safely ignore other layers like ReLU and Dropout.\nFor deconvolutional layers, they are just convolutional layers. So we only need to check pooling layers. For pooling layers, they are actually equivalent to convolutional layers regarding the size relationship between the input and output. Specifically, for pooling layers, equations $\\eqref{eq1}$ and $\\eqref{eq2}$ exactly hold true. For example, in pool1, we use 2x2 max pooling kernels ($K = 2$) with stride 2 ($S = 2$). And the default padding is 0 ($P = 0$). According to $\\eqref{eq1}$ and $\\eqref{eq2}$, we have\n$$ \\begin{equation} \\begin{aligned} H_{out} \u0026amp;= \\frac{H_{in} + 2 \\times 0 - 2}{2} + 1 \\\\\\ \u0026amp;= \\frac{H_{in}}{2} \\end{aligned} \\end{equation}\\tag{7}\\label{eq7} $$\nand\n$$ \\begin{equation} \\begin{aligned} W_{out} \u0026amp;= \\frac{W_{in} + 2 \\times 0 - 2}{2} + 1 \\\\\\ \u0026amp;= \\frac{W_{in}}{2} \\end{aligned} \\end{equation}\\tag{8}\\label{eq8} $$\nwhich match our expectation to downsample the input by a factor of 2.\nSo it makes sense to simplify an FCN to be a stack of convolutional layers since we only care about the size of the feature maps.\nReparameterizing convolutional layers Now, we only need to deal with convolutional layers. But, before diving into the derivation, let\u0026rsquo;s further simplify it by reparameterizing convolution. 
Specifically, we rewrite equations $\\eqref{eq1}$ and $\\eqref{eq2}$ by moving $H_{in}$ and $W_{in}$ to the left-hand side.\n$$ \\begin{equation} \\begin{aligned} H_{in} \u0026amp;= S\\left(H_{out} - 1\\right) + K - 2P \\\\\\ \u0026amp;= SH_{out} + 2\\left(\\frac{K - S}{2} - P\\right) \\\\\\ \u0026amp;= SH_{out} + 2P^\\prime \\end{aligned} \\end{equation}\\tag{9}\\label{eq9} $$\n$$ \\begin{equation} \\begin{aligned} W_{in} \u0026amp;= S\\left(W_{out} - 1\\right) + K - 2P \\\\\\ \u0026amp;= SW_{out} + 2\\left(\\frac{K - S}{2} - P\\right) \\\\\\ \u0026amp;= SW_{out} + 2P^\\prime \\end{aligned} \\end{equation}\\tag{10}\\label{eq10} $$\nAs can be seen, we introduce a new parameter $P^\\prime$ in equations $\\eqref{eq9}$ and $\\eqref{eq10}$, which is defined as follows.\n$$P^\\prime = \\frac{K - S}{2} - P \\tag{11}\\label{eq11}$$\nGiven $P^\\prime$, a convolutional layer with parameters $K$, $S$ and $P$ can be reparameterized by $S$ and $P^\\prime$. $S$ still stands for the stride. And we name $P^\\prime$ offset. Notice that $P^\\prime$ is the offset of a convoltional layer, which is different from the aforementioned $T$, the offset of the FCN.\nFor pooling layers, equations $\\eqref{eq9}$ and $\\eqref{eq10}$ also apply to them exactly. For deconvolutional layers, we rewrite equations $\\eqref{eq5}$ and $\\eqref{eq6}$ by moving $H_{out}$ and $W_{out}$ to the left-hand side.\n$$ \\begin{equation} \\begin{aligned} H_{out} \u0026amp;= SH_{in} + 2\\left(\\frac{K - S}{2} - P\\right) \\\\\\ \u0026amp;= SH_{in} + 2P^\\prime \\end{aligned} \\end{equation}\\tag{12}\\label{eq12} $$\n$$ \\begin{equation} \\begin{aligned} W_{out} \u0026amp;= SW_{in} + 2\\left(\\frac{K - S}{2} - P\\right) \\\\\\ \u0026amp;= SW_{in} + 2P^\\prime \\end{aligned} \\end{equation}\\tag{13}\\label{eq13} $$\nSince a deconvolutional layer is just a convolutional layer with its input size and output size swapped. Let\u0026rsquo;s swap $H_{out}$ with $H_{in}$ and $W_{out}$ with $W_{in}$ in $\\eqref{eq12}$ and $\\eqref{eq13}$. 
Then we get the following convolutional layer expressed by equations $\\eqref{eq14}$ and $\\eqref{eq15}$.\n$$H_{in} = SH_{out} + 2P^\\prime \\tag{14}\\label{eq14}$$\n$$W_{in} = SW_{out} + 2P^\\prime \\tag{15}\\label{eq15}$$\nSimilarly, we move $H_{out}$ and $W_{out}$ to the left-hand side.\n$$H_{out} = \\frac{1}{S}H_{in} + 2\\left(-\\frac{P^\\prime}{S}\\right) \\tag{16}\\label{eq16}$$\n$$W_{out} = \\frac{1}{S}W_{in} + 2\\left(-\\frac{P^\\prime}{S}\\right) \\tag{17}\\label{eq17}$$\nNote that equations $\\eqref{eq12}$ and $\\eqref{eq13}$ represent a deconvolutional layer with stride $S$ and offset $P^\\prime$ while equations $\\eqref{eq16}$ and $\\eqref{eq17}$ represent a convolutional layer, whose stride and offset are $\\frac{1}{S}$ and $-\\frac{P^\\prime}{S}$ respectively.\nBased on the above analysis, we can obtain the following theorem, which will come into use later.\nTheorem A deconvolution with stride $S$ and offset $P^\\prime$ is equavilent to a convolution with stride $\\frac{1}{S}$ and offset $-\\frac{P^\\prime}{S}$.\nComputing the offset Based on the above reparameterization of convolutional layers, the simplified FCN stack of convolutional layers can be represented as follows.\n$$ \\begin{align} \\underset{H_0, W_0}{\\overset{L_0}{\\longrightarrow}}\\underset{\\text{conv-1}}{\\boxed{S_1, P^\\prime_1}}\\underset{H_1, W_1}{\\overset{L_1}{\\longrightarrow}}\\underset{\\text{conv-2}}{\\boxed{S_2, P^\\prime_2}}\\underset{H_2, W_2}{\\overset{L_2}{\\longrightarrow}}\\dots\\underset{H_{n - 1}, W_{n - 1}}{\\overset{L_{n - 1}}{\\longrightarrow}}\\underset{\\text{conv-n}}{\\boxed{S_n, P^\\prime_n}}\\underset{H_n, W_n}{\\overset{L_n}{\\longrightarrow}} \\end{align} $$\nThe input to the network is denoted as $L_0$ and the output as $L_n$. For layer $L_i, i = 0, 1, 2, \\dots, n$, its height, width, stride and offset are denoted as $H_i$, $W_i$, $S_i$ and $P^\\prime_i$ respectively. Note that we ignore the $N$ and $C$ dimensions since the offset of FCN ($T$) is only related to $H$ and $W$. Let\u0026rsquo;s further assume that $H_0 = W_0$ such that $H_i = W_i$ for all $i = 0, 1, 2, \\dots, n$. Now we only need to consider a single dimension $H$.\nBased on equations $\\eqref{eq9}$ and $\\eqref{eq10}$, we can write down\n$$ \\begin{equation} \\begin{aligned} H_0 \u0026amp;= S_1H_1 + 2P^\\prime_1 \\\\\\ H_1 \u0026amp;= S_2H_2 + 2P^\\prime_2 \\\\\\ \u0026amp;\\dots \\\\\\ H_{n - 1} \u0026amp;= S_nH_n + 2P^\\prime_n \\end{aligned} \\end{equation}\\tag{18}\\label{eq18} $$\nIf we plug in the expression of $H_i$ into that of $H_{i - 1}$, we can get\n$$ \\begin{equation} \\begin{aligned} H_0 \u0026amp;= S_1H_1 + 2P^\\prime_1\\\\\\ \u0026amp;= S_1\\left(S_2H_2 + 2P^\\prime_2\\right) + 2P^\\prime_1 \\\\\\ \u0026amp;= \\left(S_1S_2\\right)H_2 + 2\\left(S_1P^\\prime_2 + P^\\prime_1\\right) \\\\\\ \u0026amp;= \\left(S_1S_2\\right)\\left(S_3H_3 + 2P^\\prime_3\\right) + 2\\left(S_1P^\\prime_2 + P^\\prime_1\\right) \\\\\\ \u0026amp;= \\left(S_1S_2S_3\\right)H_3 + 2\\left(S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\\\\\ \u0026amp;= \\dots \\end{aligned} \\end{equation}\\tag{19}\\label{eq19} $$\nHave you noticed the regularities? 
If you move on, you will end up with\n$$ H_0 = \\left(S_1S_2 \\dots S_n\\right)H_n + 2\\left(S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\tag{20}\\label{eq20}$$\nAccording to the definition of $T$, we have\n$$H_n = H_0 + 2T \\tag{21}\\label{eq21}$$\nBy plugging equation $\\eqref{eq21}$ into equation $\\eqref{eq20}$, we have\n$$ \\begin{equation} \\begin{aligned} H_0 \u0026amp;= \\left(S_1S_2 \\dots S_n\\right)\\left(H_0 + 2T\\right) + 2\\left(S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\\\\\ \u0026amp;= \\left(S_1S_2 \\dots S_n\\right)H_0 + 2\\left(S_1S_2 \\dots S_n\\right)T + 2\\left(S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\end{aligned} \\end{equation}\\tag{22}\\label{eq22} $$\nTypically we design the network to make $S_1S_2 \\dots S_n = 1$. Let\u0026rsquo;s take voc-fcn32s as an example to verify this point. In this network, we have the following general convolutional layers (both pooling and deconvolutional layers are also counted as convolutional layers).\nName Type Old parameterization ($K, S, P$) Reparameterization ($S, P^\\prime$) conv1_1 Convolution $K_1 = 3, S_1 = 1, P_1 = 100$ $S_1 = 1, P^\\prime_1 = -99$ conv1_2 Convolution $K_2 = 3, S_2 = 1, P_2 = 1$ $S_2 = 1, P^\\prime_2 = 0$ pool1 Pooling $K_3 = 2, S_3 = 2, P_3 = 0$ $S_3 = 2, P^\\prime_3 = 0$ conv2_1 Convolution $K_4 = 3, S_4 = 1, P_4 = 1$ $S_4 = 1, P^\\prime_4 = 0$ conv2_2 Convolution $K_5 = 3, S_5 = 1, P_5 = 1$ $S_5 = 1, P^\\prime_5 = 0$ pool2 Pooling $K_6 = 2, S_6 = 2, P_6 = 0$ $S_6 = 2, P^\\prime_6 = 0$ conv3_1 Convolution $K_7 = 3, S_7 = 1, P_7 = 1$ $S_7 = 1, P^\\prime_7 = 0$ conv3_2 Convolution $K_8 = 3, S_8 = 1, P_8 = 1$ $S_8 = 1, P^\\prime_8 = 0$ conv3_3 Convolution $K_9 = 3, S_9 = 1, P_9 = 1$ $S_9 = 1, P^\\prime_9 = 0$ pool3 Pooling $K_{10} = 2, S_{10} = 2, P_{10} = 0$ $S_{10} = 2, P^\\prime_{10} = 0$ conv4_1 Convolution $K_{11} = 3, S_{11} = 1, P_{11} = 1$ $S_{11} = 1, P^\\prime_{11} = 0$ conv4_2 Convolution $K_{12} = 3, S_{12} = 1, P_{12} = 1$ $S_{12} = 1, P^\\prime_{12} = 0$ conv4_3 Convolution $K_{13} = 3, S_{13} = 1, P_{13} = 1$ $S_{13} = 1, P^\\prime_{13} = 0$ pool4 Pooling $K_{14} = 2, S_{14} = 2, P_{14} = 0$ $S_{14} = 2, P^\\prime_{14} = 0$ conv5_1 Convolution $K_{15} = 3, S_{15} = 1, P_{15} = 1$ $S_{15} = 1, P^\\prime_{15} = 0$ conv5_2 Convolution $K_{16} = 3, S_{16} = 1, P_{16} = 1$ $S_{16} = 1, P^\\prime_{16} = 0$ conv5_3 Convolution $K_{17} = 3, S_{17} = 1, P_{17} = 1$ $S_{17} = 1, P^\\prime_{17} = 0$ pool5 Pooling $K_{18} = 2, S_{18} = 2, P_{18} = 0$ $S_{18} = 2, P^\\prime_{18} = 0$ fc6 Convolution $K_{19} = 7, S_{19} = 1, P_{19} = 0$ $S_{19} = 1, P^\\prime_{19} = 3$ fc7 Convolution $K_{20} = 1, S_{20} = 1, P_{20} = 0$ $S_{20} = 1, P^\\prime_{20} = 0$ score_fr Convolution $K_{21} = 1, S_{21} = 1, P_{21} = 0$ $S_{21} = 1, P^\\prime_{21} = 0$ upscore Deconvolution $K_{22} = 64, S_{22} = 32, P_{22} = 0$ $S_{22} = \\frac{1}{32}, P^\\prime_{22} = -\\frac{1}{2}$ upscore is a deconvolutional layer, so we make use of the Theorem to compute its reparameterization parameters.
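Concretely, as a deconvolutional layer, upscore has $K_{22} = 64$, $S_{22} = 32$ and $P_{22} = 0$, so its offset as a deconvolution is $\\frac{K_{22} - S_{22}}{2} - P_{22} = \\frac{64 - 32}{2} - 0 = 16$. Applying the Theorem, the equivalent convolution has stride $\\frac{1}{32}$ and offset $-\\frac{16}{32} = -\\frac{1}{2}$, which is exactly the pair listed in the last row of the table.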
Multiplying the above $S_1S_2 \\dots S_{22}$ will give us $2 ^ 5 \\times 1 ^ {16} \\times \\frac{1}{32} = 1$.\nGiven $S_1S_2 \\dots S_n = 1$, equation $\\eqref{eq22}$ will be simplified into\n$$H_0 = H_0 + 2T + 2\\left(S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\tag{23}\\label{eq23}$$\nNow we can derive the equation for computing the offset $T$.\n$$T=-\\left(S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\tag{24}\\label{eq24}$$\nI computed $T$ for voc-fcn32s using the following Python code according to equation $\\eqref{eq24}$ and the result is 19.0, which is exactly the offset of the Crop layer.\n\u0026gt;\u0026gt;\u0026gt; S = [1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1./32] \u0026gt;\u0026gt;\u0026gt; P = [-99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, -0.5] \u0026gt;\u0026gt;\u0026gt; N = 22 \u0026gt;\u0026gt;\u0026gt; total = 0 \u0026gt;\u0026gt;\u0026gt; for i in range(N): ... prod = P[i] ... for j in range(i): ... prod *= S[j] ... total += prod ... \u0026gt;\u0026gt;\u0026gt; T = -total \u0026gt;\u0026gt;\u0026gt; print(T) 19.0 How MXNet computes the offset Now you know one way to compute $T$. I would like to show you one more, which is the one used in MXNet.\nFrom equation $\\eqref{eq19}$, we can write down the equations of $H_0$ expressed in $H_i$ for all $i = 1, 2, \\dots, n$.\n$$ \\begin{align} H_0 \u0026amp;= S_1H_1 + 2P^\\prime_1 \\tag{25}\\label{eq25} \\\\\\ H_0 \u0026amp;= \\left(S_1S_2\\right)H_2 + 2\\left(S_1P^\\prime_2 + P^\\prime_1\\right) \\tag{26}\\label{eq26} \\\\\\ H_0 \u0026amp;= \\left(S_1S_2S_3\\right)H_3 + 2\\left(S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\tag{27}\\label{eq27} \\\\\\ \u0026amp;\\dots \\\\\\ H_0 \u0026amp;= \\left(S_1S_2 \\dots S_n\\right)H_n + 2\\left(S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\tag{28}\\label{eq28} \\end{align} $$\nAs aforementioned, equation $\\eqref{eq25}$ is a reparameterization of the convolutional layer conv-1 connecting $L_0$ and $L_1$. Obviously, equations $\\eqref{eq26}$ to $\\eqref{eq28}$ all have a similar form. We can actually treat them as a compound convolutional layer connecting $L_0$ and $L_2, L_3, \\dots, L_n$.
Let\u0026rsquo;s call the compound convolutional layer connecting $L_0$ and $L_i \\left(i = 1, 2, \\dots, n\\right)$ the $i$-th compound convolutional layer, whose compound stride $S_i^{\\text{compound}}$ and compound offset $P_i^{\\text{compound}}$ are as follows.\n$$ \\begin{equation} \\begin{aligned} \\left(S_1^{\\text{compound}}, P_1^{\\text{compound}}\\right) \u0026amp;= \\left(S_1, P^\\prime_1\\right) \\\\\\ \\left(S_2^{\\text{compound}}, P_2^{\\text{compound}}\\right) \u0026amp;= \\left(S_1S_2, S_1P^\\prime_2 + P^\\prime_1\\right) \\\\\\ \\left(S_3^{\\text{compound}}, P_3^{\\text{compound}}\\right) \u0026amp;= \\left(S_1S_2S_3, S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\\\\\ \u0026amp;\\dots \\\\\\ \\left(S_n^{\\text{compound}}, P_n^{\\text{compound}}\\right) \u0026amp;= \\left(S_1S_2 \\dots S_n, S_1S_2 \\dots S_{n - 1}P^\\prime_n + S_1S_2 \\dots S_{n - 2}P^\\prime_{n - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1\\right) \\end{aligned} \\end{equation}\\tag{29}\\label{eq29} $$\nAs you may have noticed, $P_n^{\\text{compound}}$ is just $-T$. If we can compute $P_n^{\\text{compound}}$, then we know the value of $T$.\nSince $H_0 = 1 \\cdot H_0 + 0$, let\u0026rsquo;s introduce two auxiliary variables $\\left(S_0^{\\text{compound}}, P_0^{\\text{compound}}\\right) = \\left(1, 0\\right)$.\nNow the problem is: given $\\left(S_1, P^\\prime_1\\right), \\left(S_2, P^\\prime_2\\right), \\dots, \\left(S_n, P^\\prime_n\\right)$ and $\\left(S_0^{\\text{compound}}, P_0^{\\text{compound}}\\right)$, how to compute $\\left(S_i^{\\text{compound}}, P_i^{\\text{compound}}\\right)$ for $i = 1, 2, \\dots, n$.\nThis problem can be further reduced to: given $\\left(S_{i - 1}^{\\text{compound}}, P_{i - 1}^{\\text{compound}}\\right)$ and $\\left(S_i, P^\\prime_i\\right)$, how to compute $\\left(S_{i}^{\\text{compound}}, P_{i}^{\\text{compound}}\\right)$ for $i = 1, 2, \\dots, n$.\nAccording to the expressions of $S_{i}^{\\text{compound}}$ and $P_{i}^{\\text{compound}}$, we have\n$$ \\begin{equation} \\begin{aligned} S_{i}^{\\text{compound}} \u0026amp;= S_1S_2 \\dots S_i \\\\\\ \u0026amp;= \\underbrace{S_1S_2 \\dots S_{i - 1}}_{S_{i - 1}^{\\text{compound}}}S_i \\\\\\ \u0026amp;= S_{i - 1}^{\\text{compound}}S_i \\end{aligned} \\end{equation}\\tag{30}\\label{eq30} $$\n$$ \\begin{equation} \\begin{aligned} P_{i}^{\\text{compound}} \u0026amp;= S_1S_2 \\dots S_{i - 1}P^\\prime_i + S_1S_2 \\dots S_{i - 2}P^\\prime_{i - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1 \\\\\\ \u0026amp;= \\underbrace{S_1S_2 \\dots S_{i - 1}}_{S_{i - 1}^{\\text{compound}}}P^\\prime_i + \\underbrace{S_1S_2 \\dots S_{i - 2}P^\\prime_{i - 1} + \\dots + S_1S_2P^\\prime_3 + S_1P^\\prime_2 + P^\\prime_1}_{P_{i - 1}^{\\text{compound}}} \\\\\\ \u0026amp;= S_{i - 1}^{\\text{compound}}P^\\prime_i + P_{i - 1}^{\\text{compound}} \\end{aligned} \\end{equation}\\tag{31}\\label{eq31} $$\nEquations $\\eqref{eq30}$ and $\\eqref{eq31}$ are actually how MXNet compute the compound stride and compound offset. Let\u0026rsquo;s dive into the codes to see how it is implemented.\nIn function filter_map, a convolution with parameters $K$ (kernel), $S$ (stride) and $P$ (pad) is reparameterized to $S$ (stride) and $P^\\prime_i$. (kernel-stride)/2-pad is just $P^\\prime_i$ according to equation $\\eqref{eq11}$.\ndef filter_map(kernel=1, stride=1, pad=0): return (stride, (kernel-stride)/2-pad) In function inv_fp, a deconvolutional layer is transformed to an equivalent convolutional layer according to Theorem. 
fp_in just stores $\\left(S, P^\\prime\\right)$.\ndef inv_fp(fp_in): return (1.0/fp_in[0], -1.0*fp_in[1]/fp_in[0]) In compose_fp, equations $\\eqref{eq30}$ and $\\eqref{eq31}$ are implemented. fp_first represents $\\left(S_{i - 1}^{\\text{compound}}, P_{i - 1}^{\\text{compound}}\\right)$ and fp_second represents $\\left(S_i, P^\\prime_i\\right)$. The returned result is $\\left(S_i^{\\text{compound}}, P_i^{\\text{compound}}\\right)$.\ndef compose_fp(fp_first, fp_second): return (fp_first[0]*fp_second[0], fp_first[0]*fp_second[1]+fp_first[1]) Finally, in compose_fp_list, $\\left(S_{n}^{\\text{compound}}, P_{n}^{\\text{compound}}\\right)$ are computed iteratively using $\\left(S_1, P^\\prime_1\\right), \\left(S_2, P^\\prime_2\\right), \\dots, \\left(S_n, P^\\prime_n\\right)$ (stored in fp_list) and $\\left(S_0^{\\text{compound}}, P_0^{\\text{compound}}\\right)$ (fp_out) by repeatedly calling compose_fp. You may convince yourself of this point by manually running several steps of the for loop.\ndef compose_fp_list(fp_list): fp_out = (1.0, 0.0) for fp in fp_list: fp_out = compose_fp(fp_out, fp) return fp_out Finer Details In the upscore layer of voc-fcn32s, the feature maps are directly upsampled by a large factor of 32 (this is why it is named voc-fcn32s), which will produce relatively coarse predictions due to missing finer detils from intermediate resolutions. So, in voc-fcn16s and voc-fcn8s, the shrinked feature maps are upsampled more than once before being recovered to the size of the image.\nFor voc-fcn16s, the feature maps from score_fr will first be upsampled by a factor of 2 in upscore2. Then, we generate another set of outputs from pool4 using convolution in score_pool4 and crop it to be the same size as that of upscore2 in score_pool4c. Finally, we combine upscore2 and score_pool4c using element-wise summation in fuse_pool4, upsample it by a factor of 16 in upscore16 and crop it in score to obtain the output. We show the network architecture for this process while omitting the previous layers in the following figure. Moreover, this process is broken down in the table below.\nName Type Params Size pool4 Pooling max 2x2, stride 2 $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ score_pool4 Convolution 21 1x1 kernels $21 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ score_fr Convolution 21 1x1 kernels $21 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ upscore2 Deconvolution 21 4x4 kernels, stride 2 $21 \\times \\left(\\frac{H + 6}{16} + 2\\right) \\times \\left(\\frac{W + 6}{16} + 2\\right)$ score_pool4c Crop axis 2, offset 5 $21 \\times \\left(\\frac{H + 6}{16} + 2\\right) \\times \\left(\\frac{W + 6}{16} + 2\\right)$ fuse_pool4 Eltwise sum $21 \\times \\left(\\frac{H + 6}{16} + 2\\right) \\times \\left(\\frac{W + 6}{16} + 2\\right)$ upscore16 Deconvolution 21 32x32 kernels, stride 16 $21 \\times \\left(H + 54\\right) \\times \\left(W + 54\\right)$ score Crop axis 2, offset 27 $21 \\times H \\times W$ As can be seen, the finer details from the intermediate resolution in pool4 are incorporated into later feature maps, which will produce finer outputs than those of fcn-voc32s. Actually, fcn-voc16s utilizes two resolutions of a factor of 16 and a factor of 32.\nWe may combine more resolutions in the same way. In fcn-voc8s, we generate one more set of outputs from pool3 and combine it with later feature maps. 
The network architecture is similarly shown in the figure below with the process broken down in the following table.\nName Type Params Size pool3 Pooling max 2x2, stride 2 $256 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ score_pool3 Convolution 21 1x1 kernels $21 \\times \\left(\\frac{H + 6}{8} + 24\\right) \\times \\left(\\frac{W + 6}{8} + 24\\right)$ pool4 Pooling max 2x2, stride 2 $512 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ score_pool4 Convolution 21 1x1 kernels $21 \\times \\left(\\frac{H + 6}{16} + 12\\right) \\times \\left(\\frac{W + 6}{16} + 12\\right)$ score_fr Convolution 21 1x1 kernels $21 \\times \\frac{H + 6}{32} \\times \\frac{W + 6}{32}$ upscore2 Deconvolution 21 4x4 kernels, stride 2 $21 \\times \\left(\\frac{H + 6}{16} + 2\\right) \\times \\left(\\frac{W + 6}{16} + 2\\right)$ score_pool4c Crop axis 2, offset 5 $21 \\times \\left(\\frac{H + 6}{16} + 2\\right) \\times \\left(\\frac{W + 6}{16} + 2\\right)$ fuse_pool4 Eltwise sum $21 \\times \\left(\\frac{H + 6}{16} + 2\\right) \\times \\left(\\frac{W + 6}{16} + 2\\right)$ upscore_pool4 Deconvolution 21 4x4 kernels, stride 2 $21 \\times \\left(\\frac{H + 6}{8} + 6\\right) \\times \\left(\\frac{W + 6}{8} + 6\\right)$ score_pool3c Crop axis 2, offset 9 $21 \\times \\left(\\frac{H + 6}{8} + 6\\right) \\times \\left(\\frac{W + 6}{8} + 6\\right)$ fuse_pool3 Eltwise sum $21 \\times \\left(\\frac{H + 6}{8} + 6\\right) \\times \\left(\\frac{W + 6}{8} + 6\\right)$ upscore8 Deconvolution 21 16x16 kernels, stride 8 $21 \\times \\left(H + 62\\right) \\times \\left(W + 62\\right)$ score Crop axis 2, offset 31 $21 \\times H \\times W$ In fcn-voc8s, one more intermediate resolution pool3 are incorpoeated. From fcn-voc32s, fcn-voc16s to fcn-voc8s, more intermediate resolutions are incorporated and the results will contain more details, as shown below (taken from the FCN paper).\nConclusion We cover fully convolutional networks in great detail. To summarize, we have learend:\nSemantic segmentation requires dense pixel-level classification while image classification is only in image-level. Fully convolutional networks (FCNs) are a general framework to solve semantic segmentation. The key to generate outputs with the same size as the input in FCNs is to use deconvolution layers, which are just convolutional layers with input and output swapped. The offset parameter in the Crop layers of FCNs can be computed by breaking down the network layer by layer or using an analytic equation. Outputs with higher resolutions from intermediate layers of the network can be incorporated to enhance the details in the segmentation results. ","date":1537010962,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1537010962,"objectID":"51323c66870cff31853e5e8caf6ed8b4","permalink":"https://jianchao-li.github.io/post/understanding-fully-convolutional-networks/","publishdate":"2018-09-15T19:29:22+08:00","relpermalink":"/post/understanding-fully-convolutional-networks/","section":"post","summary":"I will start from the problem of semantic segmentation, introduce how to use CNNs to solve it, and talk about fully convolutional networks, a widely used framework for semantic segmentation, in great details. 
Moreover, I will analyze the MXNet implementation of FCNs.","tags":[],"title":"Understanding Fully Convolutional Networks","type":"post"},{"authors":[],"categories":[],"content":"I have been very interested in the interplay between vision and natural language for some time. In recent years, an emerging research topic has combined these two areas: visual question answering (VQA). Recently, I took a dive into this topic and wrote some notes, which you are reading now.\nWhat is VQA? VQA is a task that involves understanding the semantic information of both an image and a natural language question and returning an answer also expressed in natural language. You may play with the Visual Chatbot to get a sense of VQA.\nAs can be seen, this is a multi-modal task involving two modalities of data (an image and a piece of text). To answer the question, the semantics of both the image and the question should be well understood.\nImportance of VQA VQA is an important research topic, mainly for three reasons. The first is a historical one, somewhat related to the origin of computer vision, a summer project at MIT back in 1966 [1]. Richard Szeliski wrote about this in his famous book [2]:\nin 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman to spend the summer linking a camera to a computer and getting the computer to describe what it saw\nThis see-and-describe summarizes the original goal of the pioneers of computer vision: let the computer see the world around it (expressed in images) and describe it. In terms of this goal, a highly related task is image captioning, which I played with in this post. However, image captioning typically gives a general description of the image. If we would like the computer to describe some specific details, a natural way is to ask it to do so explicitly, which is what we do in VQA.\nThe second reason that accounts for the significance of VQA is its potential to become an AI-complete task [3]. Most tasks in artificial intelligence, especially computer vision, can more or less be boiled down to answering questions over images. For example, image classification is to answer a multiple-choice question about the category of an image.\nThe last but not least reason is that VQA has many promising applications. The most evident one is human-computer interaction, which benefits from VQA since it teaches a computer both to see and to speak. In the future, a human may be able to talk to an intelligent agent about a scene in natural language. This can further enable applications like navigation for blind people (asking the navigation agent about what it sees to help them know where to go) and video processing (asking a VQA agent to find someone or something of interest in a large number of surveillance videos).\nBreaking down VQA Currently, researchers generally break the VQA problem down into four subproblems.\nHow to represent the image Convolutional neural networks (CNNs) have achieved great success in many image-related tasks. Therefore, many VQA pipelines make use of a pre-trained CNN to extract activations of specific layers as the image\u0026rsquo;s bottom-up features.
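For example, the following is a minimal sketch of this kind of feature extraction, assuming PyTorch and torchvision are available (any pre-trained CNN would do, and this is not tied to a particular VQA system).

import torch
import torchvision.models as models

# Use a pre-trained CNN as a fixed feature extractor for the image
cnn = models.resnet50(pretrained=True)
cnn.fc = torch.nn.Identity()         # drop the classification head
cnn.eval()

image = torch.randn(1, 3, 224, 224)  # a preprocessed image tensor
with torch.no_grad():
    features = cnn(image)            # a 1x2048 image representation
print(features.shape)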
A relatively new idea is to use some detection networks, like Faster R-CNN, to extract bottom-up attention features, as in the state-of-the-art [4].\nHow to represent the question This subproblem is solved much better using LSTM, possibly with a concatenation with GloVe features.\nHow to combine the two representations There are several possibilities of combing the two representations of images and questions: concatenation, element-wise multiplication/summation and outer product. Outer product is preferred since it allows all elements of the two representations to interact with each other. But it comes with a high dimension and thus large memory consumption and long computation time. A good solution to this problem is compact bilinear coding [5], which projects the outer product to a lower dimensional space.\nHow to generate the answer There are mainly two ways: generating the answer using an RNN or by choosing it from a set of candidate answers as in classification. Most works use the classification approach, including the state-of-the-art [4].\nBottlenecks of VQA There are mainly two bottlenecks of the current VQA research.\nThe first one is on the side of algorithms, specifically, the features of images/questions are computed in advance and then fed into the pipeline and fixed. This is kind of similar to the pre-deep-learning age of computer vision that researchers hand-engineered features (features were not learned end-to-end). It will be more preferable if the features can be learned by back-propagating answer errors to the input images and questions.\nThe second one is on the side of datasets, specifically, the lack of datasets that ask questions which require external knowledge to answer. Incorporating external knowledge (like common sense or those from the encyclopedia) into VQA will push it to be an AI-complete task [3].\nThoughts about the bottlnecks For the first bottleneck that features are not learned, one difficulty of learning those features for the image/question is that the pipeline includes some non-differentiable operations and thus back-propagation cannot be applied. An idea to overcome this difficulty is to use policy gradient [6].\nFor the second bottleneck, the idea is to first collect a dataset for it. And the main challenge lies in how to incorporate the external knowledge into VQA. An idea, proposed in [7], is to learn a mapping from the image and question to a query into the knowledge database and incorporate the results of the query into the pipeline.\nReferences [1] S. Papert. The summer vision project. Technical Report Vision Memo. No. 100, Artificial Intelligence Group, Massachusetts Institute of Technology, 1966.\n[2] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010. http://szeliski.org/Book/.\n[3] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 2017.\n[4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.\n[5] A. Fukui, D. H. Park and D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. CoRR abs/1606.01847, 2016.\n[6] J. Johnson, B. Hariharan, L. v. d. Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. 
Inferring and Executing Programs for Visual Reasoning. In Proceedings of the International Conference on Computer Vision, 2017.\n[7] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, A. Dick. FVQA: fact-based visual question answering. CoRR abs/1606.05433, 2016.\n","date":1535351675,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1535351675,"objectID":"10deafb2451d430822e7327c466b71c3","permalink":"https://jianchao-li.github.io/post/a-dive-into-visual-question-answering/","publishdate":"2018-08-27T14:34:35+08:00","relpermalink":"/post/a-dive-into-visual-question-answering/","section":"post","summary":"I read some papers on VQA and summarized its state-of-the-art, bottlenecks and possible solutions.","tags":[],"title":"A Dive Into Visual Question Answering","type":"post"},{"authors":[],"categories":[],"content":"I have been fascinated by image captioning for some time but still have not played with it. I gave it a try today using the open source project neuraltalk2 written by Andrej Karpathy.\nThe theory The working mechanism of image captioning is shown in the following picture (taken from Andrej Karpathy).\nThe image is encoded into a feature vector by a convolutional neural network (CNN) and then fed into a recurrent neural network (RNN) to generate the captions. The RNN works word by word. Each time it receives an input word and a hidden state and generates the next word, which is used as the input word in the next time. The CNN feature vector of the image is used as the initial hidden state, which is updated in each time step of the RNN.\nIn the above picture, the RNN receives the initial hidden state and START (a special word incidating the RNN to start generation) and generates the first word straw. Then straw is fed into the RNN together with the updated hidden state to generate hat. Finally, hat is fed into the RNN with the latest hidden state to generate END (a special word indicating the RNN to stop). So the caption of the image is straw hat.\nThe experiment I played with neuraltalk2 to get a sense of how image captioning performs.\nWorking environment I ran the code in a VM instance of Google Cloud. If you also want to use Google Cloud, you may refer to the Google Cloud Tutorial of CS231n to learn about how to set up a virtual instance. The tutorial is a bit long and you should only need to reach Connect to Your Virtual Instance.\nThe following screenshots show the settings of the VM instance. I made several changes:\nChanged Name to neuraltalk2 Changed Region and Zone to us-west1 (Oregon) and us-west1-b Changed Boot disk to Ubuntu 16.04 LTS Checked Allow HTTP traffic and Allow HTTPS traffic Installing Torch neuraltalk2 is written in Torch. So you need to install Torch first. You can simply follow the steps in Getting started with Torch:\n$ git clone https://github.com/torch/distro.git ~/torch --recursive $ cd ~/torch; bash install-deps; $ ./install.sh At the end of the last command, you will be prompted a question. Just answer yes.\nDo you want to automatically prepend the Torch install location to PATH and LD_LIBRARY_PATH in your /home/jianchao/.bashrc? (yes/no) [yes] \u0026gt;\u0026gt;\u0026gt; yes Finally, run\n$ source ~/.bashrc Now Torch should be ready.\nInstalling dependencies I ran neuraltalk2 on the CPU (since GPU is very expensive in Google Cloud). So I only need a part of the dependencies. 
I ran the following commands from my $HOME directory to install the dependencies.\n$ luarocks install nn $ luarocks install nngraph $ luarocks install image $ # Install Lua CJSON $ wget https://www.kyne.com.au/~mark/software/download/lua-cjson-2.1.0.tar.gz $ tar -xvzf lua-cjson-2.1.0.tar.gz $ cd lua-cjson-2.1.0 $ luarocks make $ cd # go back $HOME $ # Install loadcaffe $ sudo apt-get install libprotobuf-dev protobuf-compiler $ CC=gcc-5 CXX=g++-5 luarocks install loadcaffe $ # Install torch-hdf5 $ sudo apt-get install libhdf5-serial-dev hdf5-tools $ git clone https://github.com/deepmind/torch-hdf5 $ cd torch-hdf5 $ luarocks make hdf5-0-0.rockspec LIBHDF5_LIBDIR=\u0026#34;/usr/lib/x86_64-linux-gnu/\u0026#34; $ cd # go back $HOME Notice that Andrej listed loadcaffe and torch-hdf5 under For training, but they are actually also required for inference. And if you woud like to run neuraltalk2 on a GPU, please follow the README.md to install those additional dependencies.\nCaptioning images Now we can use neuraltalk2 to caption images. Just clone the repository and download the pretrained model. Since I ran it on CPU, I downloaded the CPU model. You may need to download the GPU model to run it on GPU.\n$ git clone https://github.com/karpathy/neuraltalk2.git $ cd neuraltalk2 $ mkdir models $ cd models $ wget --no-check-certificate https://cs.stanford.edu/people/karpathy/neuraltalk2/checkpoint_v1_cpu.zip $ unzip checkpoint_v1_cpu.zip I created another folder images in the root directory of neuraltalk2 to store the test images. I downloaded two datasets for the experiment: the [2017 Val Images of COCO](2017 Val images [5K/1GB]) and the Clothing Co-Parsing (CCP) Dataset.\nAfter everything is ready, just run the following command to apply neuraltalk2 to caption the images. Since I used CPU, I set -gpuid -1.\nth eval.lua -model models/model_id1-501-1448236541.t7_cpu.t7 -image_folder images/ -num_images -1 -gpuid -1 Results COCO In the COCO dataset, images are of various scenes and objects. And neuraltalk2 is able to capture the overall content of what is happening in the image, except for some mistakes like the cat is not sitting on the laptop. But, in general, the captions are very discriminative considering the large differences between images. Given images and captions, it is very easy to tell which image corresponds to which caption. Image captioning makes great sense in this case.\nCCP In the CCP dataset, images are all coming from the clothing domain and thus they are very similar to each other in the overall content. And the differences are mostly reflected in fine-grained details. In this case, the captions of neuraltalk2 which only capture the overall content become meaningless and are not very helpful for distinguishing one image from others. Moreover, the captions make more mistakes, like a lot of false positives of cell phones.\nThoughts For classifying images in the same domain, researchers have come up with fine-grained image classification. Now to caption these images, whose fine-grained details are much more important than the overall content, it makes sense to state that we need fine-grained image captioning.\nTo solve the fine-grained image captioning problem, we need to collect a dataset of images in the same domain with fine-grained captions. The considerable number of advertising captions for clothes/food/cars serve as a good basis. 
The pipeline of fine-grained image captioning may also be similar to that of general image captioning: a CNN learns a domain-specific representation of the image (maybe via fine-tuning the network in a fine-grained image classification task) and then an RNN generates a fine-grained caption conditioned on the representation. There should be many problems waiting to be discovered and solved in fine-grained image captioning.\n","date":1533730365,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1533730365,"objectID":"cd6fd0baa1b607d466b26a622abe9d16","permalink":"https://jianchao-li.github.io/post/playing-with-image-captioning/","publishdate":"2018-08-08T20:12:45+08:00","relpermalink":"/post/playing-with-image-captioning/","section":"post","summary":"I have been fascinated by image captioning for some time but still have not played with it. I gave it a try today using the open source project neuraltalk2 written by Andrej Karpathy.\nThe theory The working mechanism of image captioning is shown in the following picture (taken from Andrej Karpathy).\nThe image is encoded into a feature vector by a convolutional neural network (CNN) and then fed into a recurrent neural network (RNN) to generate the captions.","tags":[],"title":"Playing With Image Captioning","type":"post"},{"authors":["Jianchao Li","Dan Wang","Canxiang Yan","Shiguang Shan"],"categories":null,"content":"","date":1420070400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1420070400,"objectID":"e6f835046ed150600a0f87cea7ab27ec","permalink":"https://jianchao-li.github.io/publication/icip-15/","publishdate":"2019-06-12T15:30:32.130655Z","relpermalink":"/publication/icip-15/","section":"publication","summary":"","tags":null,"title":"Object Segmentation with Deep Regression","type":"publication"}]