Text analysis

What questions can we answer using the parsed text data?

The headline text can be used to answer a wide range of questions about North Korean politics, society, and especially its military. One of the biggest debates about the early North Korean regime concerns control. Who was in control of the formation of communist North Korea? Was it the North Koreans, the Soviets, or some mixture? Scholars' opinions are divided on this issue, but they converge on the importance of individuals. Josef Stalin and Kim Il Sung, the respective leaders of the Soviet Union and North Korea, are undoubtedly of supreme importance for understanding early North Korea, and the newspaper headlines allow us to understand the public perception of these men at the time. The words used to describe the two leaders in a newly analyzed source can provide new insights into their relative positions and their roles in the emerging North Korean state, society, and military.

 

What kind of data visualization would best answer the question?

To answer this question, we must return to the nature and characteristics of the dataset. The newspaper headlines at hand are long, mostly full sentences, each delivering an unequivocal summary of the subject reported in the article. Simply counting the morphemes that occur in the same headline as Stalin or Kim would give us a solid idea of what the editors of this newspaper wanted the soldiers to know about them. A simple chart that aggregates the number of occurrences of every unique morpheme would suffice for this purpose. A treemap suits it best, as it can indicate both the proportion of each term and the diversity of terms used in a single view.
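The counting idea described above can be sketched in Python. This is only an illustration, not the method used in this project: it assumes each headline has already been parsed into a list of morphemes, and the function name `cooccurring_counts` and the English-glossed sample data are hypothetical placeholders.

```python
from collections import Counter

def cooccurring_counts(headlines, target):
    """Count morphemes that share a headline with `target`.

    `headlines` is assumed to be a list of headlines, each already
    parsed into a list of morphemes (an assumption for this sketch).
    """
    counts = Counter()
    for morphemes in headlines:
        if target in morphemes:
            # Count every other morpheme in this headline.
            counts.update(m for m in morphemes if m != target)
    return counts

# Hypothetical placeholder data, not taken from the actual newspaper.
sample = [
    ["Stalin", "leader", "telegram"],
    ["Kim", "leader", "speech"],
    ["Stalin", "leader", "anniversary"],
]

stalin = cooccurring_counts(sample, "Stalin")
kim = cooccurring_counts(sample, "Kim")
print(stalin.most_common())
print(kim.most_common())
```

Comparing the two resulting tallies side by side is the same comparison that the treemaps are meant to make visible.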

 

Preparing a treemap using the dataset

A variety of programs can be used to create a treemap, but this project uses the latest version of Microsoft Excel.

  1. Arrange each word into one column on an Excel spreadsheet
    The text file created earlier is separated by line breaks at the end of each headline. Using a regular expression, line breaks were also added after every word (a line break is expressed as "\n"). Once these are added, simply copy and paste the entire text into a spreadsheet, and each word will be placed in its own cell in a single column.

  2. Create a duplicate of the existing column on words
    An identical column of words was created next to the existing column. The new column lists the words to be counted, while the old column serves as the range in which they are counted. Repeating words were removed from the new column using Remove Duplicates on the Data tab.

  3. Use COUNTIF to count word frequency
    To the right of the new column, use an empty column to hold the counts produced by the COUNTIF function. A formula would look like "=COUNTIF($D$2:$D$1682,A2)", with the range inside the parentheses ("$D$2:$D$1682") indicating the full list of words and "A2" indicating the specific word being counted. Note that the dollar signs are attached to the range to prevent Excel's autofill from shifting the range of the word list as the formula is copied down the column.

  4. Create the treemap
    This is the easiest part: select the words and their counts, go to the Insert tab, and click on Treemap.
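For readers who prefer a script to a spreadsheet, steps 1 to 3 above can be approximated in Python. This is a rough sketch rather than the workflow used in this project: it assumes the words in the text file are separated by whitespace, and the function names `one_word_per_line` and `word_frequencies` and the sample text are hypothetical.

```python
import re
from collections import Counter

def one_word_per_line(text):
    """Step 1: put every word on its own line.

    Assumes words are separated by whitespace; each run of
    whitespace is replaced with a single line break ("\n").
    """
    return re.sub(r"\s+", "\n", text.strip())

def word_frequencies(text):
    """Steps 2 and 3: list each unique word once with its count.

    Counter plays the role of Remove Duplicates plus COUNTIF.
    """
    words = one_word_per_line(text).split("\n")
    return Counter(w for w in words if w)

# Hypothetical two-headline sample, one headline per line.
sample_text = "Kim speech army\nStalin telegram Kim"

freq = word_frequencies(sample_text)
for word, count in freq.most_common():
    print(word, count)
```

The resulting word and count pairs correspond to the two columns that the treemap is built from in step 4.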
[Image: excel countif.jpg]
by Hyung-joon Kim