Computational Nucleus Detection in H&E Stained Whole Slide Images

Benjamin Ionescu

Grey Slide Imaging

1.PNG

Introduction

The diagnosis of certain diseases involves extracting a sample of tissue from a patient and analyzing it under a microscope. The samples are stained with chemicals that make it easier to distinguish the components of the tissue, such as cell membranes, cytoplasm and nuclei. A pathologist must manually review slides of stained tissue to look for features that signify the presence of a disease. This is a time-consuming and repetitive task, which makes it well suited to computational assistance. To support the pathologist’s review, the tissue can be scanned into a digital image file in a process called Whole Slide Imaging. Computer Vision (CV) methods can then be applied to the images to automatically detect features such as cell membranes, nuclei, and rare events. Subsequently, the morphological characteristics of these features, such as area, perimeter, and circularity, as well as their stain intensity, can be computed. Grey Slide Imaging is developing a library of such CV algorithms for digital pathology. In this paper, we discuss Grey Slide Imaging’s nucleus detection algorithm for H&E stained Whole Slide Images.

1-1.PNG

Fig. 1: Examples of cropped subsections of H&E stained Whole Slide Images.

Problem Statement

2.PNG

Fig. 2: Example of expected output of solution. The user should be able to select a rectangle and retrieve its computed area, perimeter, centre coordinates, circularity and average brightness.

The problem of nucleus detection for Whole Slide Images can be stated thus: given images such as those in Fig. 1, what method can be used to identify the nuclei and then compute their areas, perimeters, circularities, brightnesses (stain intensities), positions and overall number? We expect a solution to produce output images of the kind seen in Fig. 2. In addition to annotated images, the solution should associate the numerical values of these characteristics with each detected nucleus. In other words, it should be possible to retrieve the area, perimeter, stain intensity, etc. for the nucleus in a particular rectangle.

Nucleus Detection Algorithm

The approach taken to solve the problem outlined in Section 2 is based on classical computer vision methods, as opposed to deep learning-based methods. A training procedure with a large dataset of images, for tuning the weights of a neural network, was not implemented here. Instead, the functions used were deterministic and utilized only three tunable parameters. The algorithm consisted of the following steps:

1. Create a grayscale image by taking a particular linear combination of the input image’s colour channels (Red + Green - Blue)

2. Threshold the result from Step 1, and apply OpenCV’s Connected Components function to distinguish clusters from each other. Thresholding involves one tunable parameter.

3. Filter out false positives, based on two tunable parameters.

4. For each remaining detected nucleus, threshold the red channel of the input image within the nucleus’s bounding rectangle to obtain a refined segmentation.

5. Compute the aforementioned properties for each identified cluster from Step 2, as well as the bounding rectangle, for image annotation.


Note that the Watershed function is not used, despite being important to other nucleus detection algorithms such as QuPath’s. Watershedding tends to divide single nuclei into multiple clusters. Further note that Step 4 represents a second segmentation of the input image. This is to leverage the benefits of two methods of segmentation, as will be explained in the following section.

Technical Details

The nucleus detection algorithm was written in Python and relies primarily on the OpenCV and NumPy libraries. To detect nuclei robustly, it is not sufficient to simply convert the colour image to grayscale and then threshold it. Extracting the red channel of the image performs slightly better than converting to grayscale, but it remains impossible to tune the parameters so that a reasonable number of nuclei are detected while a reasonable number of false candidates are ignored. To overcome this, the first step of the algorithm subtracts the blue channel from the sum of the red and green channels: R + G - B. Recall that each pixel in a colour image is an array of three values (blue, green and red), each varying between 0 and 255.

We then threshold the R + G - B image and apply OpenCV’s cv2.connectedComponents function to the resulting binary image, which assigns a distinct label to each connected cluster. The labels are rendered as a colour image with a red background, on which each nucleus is drawn in a unique colour. This effectively segments the nuclei in the image from their surroundings as well as from other nuclei. An example of the output of cv2.connectedComponents is shown in Fig. 3.

3.PNG

Fig. 3: Output of cv2.connectedComponents, when fed a binary image obtained by thresholding the R + G - B of the original image

Once a segmentation of the image is obtained, we iterate over each unique colour in the segmented image (except red, the background). At each step of the iteration, a binary image is produced in which only the nucleus/cluster associated with that step’s colour appears. In other words, we produce a binary image with a black background and a white blob corresponding to one nucleus. Its area and brightness are calculated and compared to two tunable parameters in order to filter out false positives. These are the two parameters mentioned in Step 3 of the procedure in Section 3.

If the individual nucleus/cluster passes this filter, its segmentation then needs to be refined. As seen in Fig. 3, the detected nuclei are full of holes. To resolve this, Step 4 uses an additional segmentation method. First, the nucleus/cluster’s bounding rectangle is computed. A crop of the original input image is taken, which excludes everything outside this bounding rectangle. Then, the red channel of the crop is taken and thresholded. The binarized crop replaces the corresponding, badly segmented region produced by the R + G - B binary image. A comparison of the two segmentation results is shown in Fig. 4.

4.PNG

Fig. 4: Segmenting using a thresholding of the R channel (left) vs. the R+G-B image (right). The left is without holes but has several artifacts.

The red-channel segmentation method, as mentioned before, fails to simultaneously detect sufficient nuclei and ignore false positives. It is therefore not used for initial detection. However, when applied to a correctly detected nucleus, its segmentation quality is high. The opposite is true of the R + G - B segmentation method, which accurately locates nuclei, i.e. their bounding rectangles, but does not segment the nuclei it detects well. This is why the two methods are combined. High segmentation quality is necessary for accurately computing the brightness and morphological properties.

In the second segmentation, several smaller clusters appear, as seen in Fig. 4. These are filtered out by using OpenCV’s contour functions (such as cv2.findContours) to identify each cluster’s area within the crop separately. Only the cluster with the largest area is kept, which eliminates these artifacts.

Finally, the brightness and morphological properties of the nucleus are computed. The area is computed by simply counting the white pixels. The perimeter is computed by drawing a contour of green pixels and counting them. The brightness is computed by taking the crop from the original image using the same bounding rectangle and setting all non-nucleus pixels to black. The average intensity of all non-black pixels is then readily computed. Fig. 5 shows an example of a nucleus crop on a black background, as used for calculating brightness.

5.PNG

Fig. 5: The same nucleus from Fig. 4, arranged so as to compute its average brightness. This can be done using the numpy.mean() and numpy.count_nonzero() functions.
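One way to express the brightness calculation with NumPy is sketched below. The function name mean_brightness is hypothetical, and the sketch assumes the crop has already been converted to a single-channel grayscale array and that the nucleus mask has the same shape as the crop.

```python
import numpy as np


def mean_brightness(gray_crop, mask):
    """Average intensity of the pixels selected by a nucleus mask."""
    # Black out everything outside the nucleus, as in Fig. 5.
    isolated = np.where(mask > 0, gray_crop, 0)
    # Average over non-black pixels only. Nucleus pixels whose intensity
    # is exactly 0 are therefore excluded, as in the method described.
    nonblack = isolated[isolated > 0]
    return float(nonblack.mean()) if nonblack.size else 0.0
```

The guard for an empty selection avoids taking the mean of zero pixels when a mask contains no nucleus.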

Circularity is computed according to the following formula:

Circularity = 4*pi*area/(perim**2)

Note that circularity is generally between 0 and 1, but can exceed 1 due to the discrete nature of the area and perimeter of the objects involved. The centre is determined by computing the centre of mass of the white blob, such as the one in Fig. 4. This vector is expressed with respect to the crop’s origin (top-left corner). Therefore, the coordinates of the top-left corner of the crop’s bounding rectangle are added to the centre-of-mass coordinates, so that the centre is finally given in the reference frame of the full image. These results are stored in separate lists in a consistent order, so that a single index retrieves all the properties of a given detected nucleus.

Two further features were added as a proposed solution to cases where multiple nuclei appear in a single rectangle, or a single nucleus is detected with multiple rectangles. A Pygame script was written which allows the user to interact with the annotated image, such as the one in Fig. 2. In the GUI, a pen is provided that allows the user to manually draw boundaries between nuclei. The pen modifies the binary image (Fig. 4) and the segmented image (Fig. 3). To achieve this, the binary and segmented images are produced from an input image in one function, and the computation of the nuclei’s properties based on the segmentation is done in another function. The first function is run once, and the second is run every time the pen modifies the binary and segmented images. The user can select between three pen settings: one which separates nuclei, one which joins two nearby nuclei, and one which displays the properties of the nucleus in a rectangle that is clicked on. A demonstration of the effects of using the pen for reprocessing is shown in Fig. 6.

6.PNG

Fig. 6: Example of using a pen for manually separating clumped nuclei (top row) and joining them (bottom row). Leftmost image is before applying the pen. Middle image shows green lines indicating where the pen is used. Rightmost image shows the reprocessed image.
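The circularity formula and the change of reference frame for the centre can be illustrated with a short sketch. The function names circularity and centre_in_image are hypothetical, and the centre of mass is taken here as the mean position of the non-zero mask pixels, which is an assumption about how it is computed.

```python
import math

import numpy as np


def circularity(area, perimeter):
    """Circularity = 4*pi*area / perimeter**2."""
    return 4.0 * math.pi * area / perimeter ** 2


def centre_in_image(mask, box_origin):
    """Centre of mass of a nucleus mask, in full-image coordinates.

    box_origin is the (x, y) of the crop's top-left corner in the image.
    """
    ys, xs = np.nonzero(mask)
    x0, y0 = box_origin
    # Shift the crop-relative centre by the crop's top-left corner.
    return (x0 + float(xs.mean()), y0 + float(ys.mean()))
```

For instance, a 3x3 blob has an area of 9 pixels and 8 boundary pixels, giving 4*pi*9/64, roughly 1.77, which shows how pixel counting can push circularity above 1.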

Discussion

The detection accuracy of the algorithm described above, based on a manual analysis of 120 images, is approximately 90% without making use of the pen feature. This performance is satisfactory when compared with other nucleus detection algorithms such as the one used in QuPath: the same set of images processed with QuPath yielded a detection accuracy of 70%. The metric used was as follows:

Detection_accuracy = (TDN - FP) / (TDN + MN)

TDN: Total Detected Nuclei

FP: False Positives

MN: Missed Nuclei

This comparison was done without domain expertise in identifying nuclei in H&E stained tissue images. Furthermore, the difference in accuracy is influenced by the nature of the mistakes each method makes.
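As a worked illustration of the metric, the sketch below evaluates it for made-up counts. The function name detection_accuracy and the numbers in the example are hypothetical and are not taken from the 120-image evaluation.

```python
def detection_accuracy(total_detected, false_positives, missed):
    """Detection_accuracy = (TDN - FP) / (TDN + MN)."""
    return (total_detected - false_positives) / (total_detected + missed)


# Hypothetical counts: 100 detections, 5 of them false, 5 nuclei missed.
example = detection_accuracy(100, 5, 5)  # (100 - 5) / (100 + 5)
```

Because false positives reduce the numerator and missed nuclei increase the denominator, both kinds of error lower the score.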

7.PNG

Fig. 7: Proposed GUI, showing the three pen settings. Clicking on a rectangle when ‘Analyze’ is selected displays the corresponding properties.

8.PNG

Fig. 8: Sample results from QuPath’s nucleus detection algorithm. The tendency to segment a single nucleus into multiple clusters can be seen particularly in the top-right, and there is a somewhat high rate of false positives. Overall, the performance seems qualitatively comparable to our algorithm.

While the algorithm described here tended to put large boxes around closely clumped nuclei, identifying them as one cluster, QuPath tended to draw many separate boxes for a single nucleus. Multiple detections of the same nucleus were considered to be false positives. For example, a single nucleus with three detections was counted as one correct detection and two false positives. However, a single detection covering multiple nuclei was not considered to imply missed nuclei. Though the same standards were applied to both algorithms in determining their accuracy, the nature of the mistakes made by each clearly contributes to the discrepancy. Regardless, the purpose of this algorithm is not necessarily to outperform existing software. A qualitative comparison of the results of this algorithm and the results of QuPath strongly suggests that this algorithm meets the standards of existing nucleus detection software. Samples of QuPath outputs are shown in Fig. 8.

References

For this work, 10 images were taken from various locations of a single .svs image file, acquired from: https://portal.gdc.cancer.gov/repository

Each of the 10 images was divided into 12 crops of 300x300 pixels for faster processing, hence the 120 images used for determining the detection accuracy of QuPath and of this algorithm.