
J Humanit AI : Journal of Humanities and AI

Research Article

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

Journal of Humanities and AI 2026;1(1):64-80.
Published online: March 31, 2026

*Department of Korean Language and Literature at Seoul National University

Copyright © Institute for Digital Humanities and Interdisciplinary Studies, Korea University

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The date of a published historical document can usually be determined from the date of its preface or copyright page. Dating an unpublished manuscript, however, is much more difficult. In the case of old Hangeul documents, various linguistic features can be used to estimate the approximate date of a document, but as the number of features increases, the task goes beyond what an individual human researcher can handle and becomes better suited to AI. This paper shows how artificial neural networks can be trained to estimate the date of documents using material whose date is known. For this purpose, several kinds of neural network models are examined: bag-of-words, CNN, RNN and Transformer. These models can be further subdivided: unigram or bigram, and character (syllable)-based or grapheme (phoneme)-based. After being trained on documents with known dates, the models are applied to new (unseen) data, and the results are evaluated.
Among materials on the history of the Korean language, printed editions with colophons have been actively utilized because their publication dates are clear. For example, the prefaces of Seokbosangjeol and Weolinseokbo contain Chinese era names, so these indicate the years of compilation or publishing (Figure 1).
On the other hand, manuscripts have generally been used less frequently in research on the history of the Korean language due to their unclear dates. For example, Daemyeongyeongryeoljeon, a manuscript novel, has no indication of the year of writing (Figure 2).
Nevertheless, given the substantial volume of manuscript materials, it would be beneficial to estimate their dates in some way so that they can be utilized in research on the history of the Korean language.
Linguists can use linguistic clues (e.g. consonants ㅸ, ㅿ, ㅭ or phonological phenomena such as palatalization 아디>아지) to estimate the approximate dates of these documents (Figure 3).
However, an enormous number of linguistic features can serve as clues for estimating the date. Human researchers cannot consider them in their entirety, so they generally use only a very small part of this evidence. Machines, and deep learning models in particular, are good at taking many such features into account at once, so using deep learning for this task seems promising.
Many linguists and literary scholars are not familiar with machine learning concepts and techniques. Some of them think that they need not know much about machine learning, expecting that AI will solve their problems. However, applying machine learning to a task requires many decisions (e.g. about models and hyperparameters), and these decisions require knowledge of machine learning. We need informed guidance in deciding which options to try.
Training a deep learning model requires many samples, usually tens of thousands. The amount of extant Korean historical documents is fixed. If each sample is short, we get many samples, but each sample may then contain few clues to its date. If each sample is long, we get fewer samples. After some experimentation, I set the sample length to 300 characters, which yields about 50 thousand samples (Table 1).
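The trade-off can be made concrete with a short sketch. The helper names `chunk` and `split_counts` are hypothetical, and the splitting rule (two successive 80/20 splits, with each held-out part rounded up) is an assumption rather than something stated in the text; it yields a 64/16/20 division.

```python
def chunk(text, length):
    """Cut a text into consecutive samples of at most `length` characters."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def split_counts(n_samples):
    """Return (train, validation, test) sizes for a 64/16/20 split.

    Assumption: 20% is held out for testing, then 20% of the remainder for
    validation, rounding each held-out part up (integer ceiling division).
    """
    test = -(-n_samples // 5)
    remainder = n_samples - test
    validation = -(-remainder // 5)
    return remainder - validation, validation, test


samples = chunk("가나다라마바사", 3)  # a toy 7-character "document"
print(samples)                       # the last sample is shorter than 3
print(split_counts(48504))           # split sizes for 48,504 samples
```

Halving the sample length roughly doubles the number of samples, but each sample then carries fewer dating clues.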
As a first rough attempt, I tried a bag-of-words unigram model. ‘Bag-of-words’ means that each sample is treated as a set of tokens, disregarding their order. ‘Unigram’ means that each single character is a basic unit. (In a bigram model, each pair of adjacent characters, as well as each single character, is a basic unit; this will be touched upon later.)
To be fed into a neural network, each sample must be represented as a number or a series of numbers (i.e. a vector). This process is called vectorization, embedding or encoding, and several encoding methods have been devised. I used one-hot encoding (when applied to a whole sample, this is often called multi-hot encoding). In this scheme, only the ten thousand most frequent characters are considered, and each sample is represented as a vector of ten thousand binary digits: ‘1’ means that the sample contains the corresponding character, and ‘0’ means that it does not. This process is handled by the Text Vectorization Layer.
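The encoding described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Keras layer itself; the function names `build_vocab` and `multi_hot` and the toy samples are hypothetical.

```python
from collections import Counter


def build_vocab(samples, max_tokens):
    """Map the `max_tokens` most frequent characters to vector positions."""
    counts = Counter(ch for sample in samples for ch in sample)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(max_tokens))}


def multi_hot(sample, vocab):
    """Return a 0/1 vector: 1 where the sample contains the character."""
    vector = [0] * len(vocab)
    for ch in set(sample):
        if ch in vocab:          # characters outside the vocabulary are ignored
            vector[vocab[ch]] = 1
    return vector


training_samples = ["아비아", "아비지"]          # toy data
vocab = build_vocab(training_samples, max_tokens=2)
print(vocab)
print(multi_hot("지비", vocab))
```

With the real data, `max_tokens` would be 10,000 and each vector would have 10,000 positions. Note that the order and the count of characters within a sample are both discarded.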
As for the architecture of the neural network (Figure 4), I used a dense (also called fully connected) network consisting of three dense layers. In Dense Layer 1, the dimensionality of each vector representing a sample is reduced from 10,000 to 32. In Dense Layers 2 and 3, the dimensionality is further reduced to 16 and finally to 1. This final floating-point number is taken as the estimated year of the input sample.
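To get a feel for the size of this network, the number of trainable parameters can be counted by hand. The sketch below assumes that each dense layer has a full weight matrix plus one bias per output node and nothing else; the function name `dense_parameter_counts` is hypothetical.

```python
def dense_parameter_counts(layer_sizes):
    """Parameters per dense layer: inputs * outputs weights + outputs biases."""
    counts = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        counts.append(n_in * n_out + n_out)
    return counts


# Input vector of 10,000, then dense layers of 32, 16 and 1 nodes.
counts = dense_parameter_counts([10_000, 32, 16, 1])
print(counts, sum(counts))
```

Under these assumptions almost all parameters sit in the first layer, which is why the size of the vocabulary matters so much for a bag-of-words model.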
This model was trained for 100 epochs. The 95th epoch showed the best performance (Figure 5). At this epoch, the losses, i.e. MSE (mean squared error) and MAE (mean absolute error), were as in Table 2.
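For readers unfamiliar with the two measures, here is a minimal sketch of how they are computed from true and predicted years. The function names and the toy numbers are illustrative only.

```python
def mean_squared_error(true_years, predicted_years):
    """Average of squared differences; the unit is years squared."""
    errors = [p - t for t, p in zip(true_years, predicted_years)]
    return sum(e * e for e in errors) / len(errors)


def mean_absolute_error(true_years, predicted_years):
    """Average of absolute differences; the unit is years."""
    errors = [p - t for t, p in zip(true_years, predicted_years)]
    return sum(abs(e) for e in errors) / len(errors)


true_years = [1447, 1447]
predicted_years = [1444, 1453]      # off by 3 and by 6 years
print(mean_squared_error(true_years, predicted_years))
print(mean_absolute_error(true_years, predicted_years))
```

MSE punishes large misses much more heavily than MAE, whereas MAE can be read directly as "years off on average".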
An error of about 20 years seems promising. Human experts usually estimate dates only to within about half a century, so they cannot match this level of accuracy. The first, very simple model thus achieved a very good (super-human) result.
The above model shows quite a large difference between the training set and the validation set, which means that overfitting is occurring. When a model overfits, its performance on new data tends to degrade. Many methods have been devised to restrict a model’s capacity and thereby prevent or alleviate overfitting. The first is layer regularization, in which the sum of the absolute values (L1) or squares (L2) of the weights is added to the loss to keep the weights from becoming too large. The second is batch normalization, in which the activations within each batch of samples are normalized. The third is dropout, in which a randomly chosen proportion (in our case 0.3) of the nodes in each layer is set to zero during training. To compensate for the regularization, I doubled the number of nodes in Dense Layers 1 (32 to 64) and 2 (16 to 32).
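The first of these methods is easy to write out. The sketch below shows how an L1 or L2 penalty is added to the loss; the regularization factor 0.01 and the toy weights are hypothetical, since the text does not state the values used.

```python
def l1_penalty(weights, factor):
    """Sum of absolute weight values, scaled by the regularization factor."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """Sum of squared weight values, scaled by the regularization factor."""
    return factor * sum(w * w for w in weights)


weights = [0.5, -1.0, 2.0]   # toy weights of one layer
data_loss = 1.0              # toy prediction loss (e.g. MSE)

# The optimizer minimizes the sum, so large weights now cost something.
print(data_loss + l1_penalty(weights, 0.01))
print(data_loss + l2_penalty(weights, 0.01))
```

Because the penalty is part of the training loss only, a regularized model can report a higher loss during training than it obtains on held-out data.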
This model was also trained for 100 epochs. The 65th epoch showed the best performance (Figure 6). At this epoch, the losses were as in Table 3.
The performance on the test set is the most important, because it estimates the error on new data. The error of 15.85 years seems very good, much better than the previous result of about 20 years. But it seems odd that the performance on the test set is better than that on the training set.
This odd result may be due to the model having too little capacity relative to the added regularization. (If the training-set figures were computed during training, dropout, which is active only then, would also make them look worse.) So I again doubled the number of nodes in Dense Layers 1 (64 to 128) and 2 (32 to 64), and decreased the dropout rate from 0.3 to 0.2.
This model was also trained for 100 epochs. The 67th epoch showed the best performance (Figure 7). At this epoch, the losses were as in Table 4.
The same odd result, with better performance on the test set than on the training set, still appears. The good news is that the test error has decreased to 11.40 years.
In the above experiments, each single character was treated as a basic unit. Whether a particular character appears in a sample can certainly be a good indicator of its date, but whether a particular sequence of two consecutive characters appears can also be a good clue. For example, the sequence of 아 and 비 can be taken as evidence that umlaut has not yet occurred in the sample, whereas the sequence of 애 and 비 can be an indicator (although not a perfect one) of umlaut.
Therefore, I experimented with bigrams. The Keras library provides the option ‘ngrams=2’ when creating an object of the class TextVectorization. As bigrams must be considered in addition to unigrams, I increased the number of token types from 10,000 to 20,000. The architecture of the model is the same as the previous one. In Dense Layer 1, the dimensionality of each sample is reduced from 20,000 to 128, which is further reduced to 64 and to 1 in Dense Layers 2 and 3 respectively. I applied the three kinds of regularization to Dense Layers 1 and 2.
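What a combined unigram-and-bigram token list looks like can be shown with a small pure-Python helper. The function `unigrams_and_bigrams` is hypothetical and does not reproduce the exact token format that Keras generates internally; it only illustrates the idea.

```python
def unigrams_and_bigrams(sample):
    """Return every single character plus every pair of adjacent characters."""
    chars = list(sample)
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams


print(unigrams_and_bigrams("아비지"))
print(unigrams_and_bigrams("애비"))
```

A sample of n characters yields n unigrams and n - 1 bigrams, and the number of distinct token types grows quickly, which is why the vocabulary limit had to be raised.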
This model was also trained for 100 epochs. The 83rd epoch showed the best performance (Figure 8). At this epoch, the losses were as in Table 5.
The same odd result, with better performance on the test set than on the training set, still appears. The good news is that the test error has decreased to 8.54 years.
In computer vision, CNN (convolutional neural network) models have been popular. To extract local patterns from a 2-dimensional image, a CNN slides a small 2-D window over it. As a text is a 1-dimensional sequence, a 1-D window is used in CNN models for textual data.
To feed a sample (a 1-D sequence of tokens) into the network, we must encode each token (character) as a vector. This is done in the Embedding Layer (Figure 9). In our case, each character was encoded as a 32-dimensional vector. In Convolution Layer 1, a window of size 3 moves from the beginning to the end of a sample and extracts information. As a result, the length of each sample is reduced from 300 to 298 positions. In the following two Convolution Layers, the same process is repeated, and the length is further reduced to 296 and to 294. In Convolution Layer 3, each node corresponds to seven consecutive tokens in the input sample. In the Global Pooling Layer, the information from all positions is aggregated. In the Dense Layer, each sample is reduced to one floating-point number, which is the estimated year of the sample.
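The numbers 298, 296, 294 and seven follow from simple arithmetic, sketched below. The sketch assumes a stride of 1 and no padding, which is consistent with the lengths reported here; the function names are hypothetical.

```python
def conv_output_lengths(input_length, kernel_size, n_layers):
    """Sequence length after each stacked convolution (stride 1, no padding)."""
    lengths = []
    length = input_length
    for _ in range(n_layers):
        length = length - kernel_size + 1   # the window fits this many times
        lengths.append(length)
    return lengths


def receptive_field(kernel_size, n_layers):
    """Number of input tokens seen by one node of the last layer."""
    return 1 + n_layers * (kernel_size - 1)


print(conv_output_lengths(300, 3, 3))
print(receptive_field(3, 3))
```

Each extra layer of window size 3 shortens the sequence by two positions and widens the view of a node by two tokens.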
This model was trained for 50 epochs. The 24th epoch showed the best performance (Figure 10). At this epoch, the losses were as in Table 6.
Overfitting occurs, as the difference between the performance in the training set and that in the test set is quite large. The error in the test set is about 27 years, which is not so good.
CNN is appropriate for extracting position-invariant information. ‘Position-invariant’ means that it doesn’t matter where in the input sample a feature is located. This accords with image data, but not with textual data, in which the positional information matters. For textual data, RNN (recurrent neural network) is better suited than CNN. Among several variants of RNN, LSTM performs very well, so I chose that (Figure 11).
As in the CNN model, each token is first encoded as a vector in the Embedding Layer, here a vector of 64 numbers. Next, in the bidirectional LSTM Layer, each sample is summarized as a single vector by reading the sample both forward and backward. Finally, in the Dense Layers, each sample is reduced to a single number, the estimated year.
This model was trained for 30 epochs. The 25th epoch showed the best performance (Figure 12). At this epoch, the losses were as in Table 7.
The gap between the training set and the test set is much smaller than for the CNN model, so overfitting is limited. The error in the test set is about 11 years, the best among the sequence models so far, although the bag-of-words bigram model remains ahead at 8.54 years.
In Hangeul, a character consists of up to three graphemes (an onset, a nucleus and an optional coda). Information about which grapheme follows which (especially across syllable/character boundaries) is very important and can contribute to estimating the date. Therefore, I experimented with decomposing each character into graphemes.
Each sample consists of up to 300 characters, and when decomposed into graphemes, its length (number of graphemes) increases. The longest sample was 933 tokens long. There were 13,737 distinct characters, but after decomposition into graphemes the number of distinct tokens decreases to 9,805. Among these 9,805 tokens, only the 5,000 most frequent are considered.
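For modern precomposed Hangeul syllables, one way to carry out such a decomposition is Unicode canonical decomposition (NFD), available in the Python standard library. This is an assumption for illustration: the text does not say which method was used, and old Hangeul texts may be encoded in ways that need different handling. The function name `to_graphemes` is hypothetical.

```python
import unicodedata


def to_graphemes(text):
    """Split precomposed Hangeul syllables into conjoining jamo via NFD.

    Characters without a canonical decomposition (e.g. Chinese characters)
    are passed through unchanged.
    """
    return list(unicodedata.normalize("NFD", text))


sample = "아지"
graphemes = to_graphemes(sample)
print(len(sample), len(graphemes))              # characters vs. graphemes
print([f"U+{ord(g):04X}" for g in graphemes])   # code points of the jamo
```

After decomposition, a bigram can span a syllable boundary, for example the coda of one character followed by the onset of the next, which is exactly the kind of information mentioned above.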
With graphemes as tokens, the CNN unigram model showed an MAE of 20.73 years, whereas the RNN model showed an MAE of 16.00 years.
Bigram models become more important with grapheme decomposition. For example, the sequence of ㄷ and ㅣ can be taken as evidence that palatalization has not yet occurred in the sample, whereas the sequence of ㅈ and ㅣ can be an indicator (although not a perfect one) of palatalization. I considered only the 20,000 most frequent unigrams and bigrams. The MAE of the CNN model was 21.55 years, the RNN model 13.59 years, the Bag-of-words model 10.51 years, and the Bag-of-words model using TF-IDF 9.87 years. These results show only small changes in performance; the margins are not large.
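TF-IDF replaces the 0/1 entries of the bag-of-words vector with weights that favor tokens occurring in few samples. The sketch below uses one common textbook variant (raw count times the logarithm of the inverse document frequency); this is an assumption, and library implementations, including the one in Keras, use somewhat different formulas. The function name `tf_idf` and the toy samples are hypothetical.

```python
import math
from collections import Counter


def tf_idf(samples):
    """Weight each token by count * log(N / number of samples containing it)."""
    n_samples = len(samples)
    doc_freq = Counter(tok for sample in samples for tok in set(sample))
    weighted = []
    for sample in samples:
        counts = Counter(sample)
        weighted.append(
            {tok: c * math.log(n_samples / doc_freq[tok]) for tok, c in counts.items()}
        )
    return weighted


# Two toy samples of grapheme unigrams and bigrams.
samples = [["ㄷ", "ㅣ", "ㄷㅣ"], ["ㅈ", "ㅣ", "ㅈㅣ"]]
weights = tf_idf(samples)
print(weights[0])
print(weights[1])
```

In this toy case ㅣ occurs in every sample and receives weight zero, while the bigrams ㄷㅣ and ㅈㅣ, which distinguish the two samples, keep a positive weight.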
When an input sample is a sequence of tokens and the order of the tokens matters, the traditional approach to capturing sequential patterns has been the RNN, including LSTM. But because an RNN must process input tokens one by one, the computation cannot be parallelized across positions, and training takes an enormous amount of time. The Transformer model, released in 2017, captures the dependency patterns among input tokens through the attention mechanism; because this computation can be parallelized, training takes much less time. Therefore, I experimented with a Transformer model, using the TransformerEncoder layer in the library Keras. The architecture of the model is almost identical to that of the RNN model, with the LSTM layer replaced by the TransformerEncoder layer.
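The core of the attention mechanism can be written out in plain Python. The sketch below is a deliberately simplified, single-head version in which the token vectors themselves serve as queries, keys and values; a real Transformer encoder layer adds learned projections, several heads, residual connections and a feed-forward block. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
outputs, weights = self_attention(tokens)
print(weights[0])
```

Every row of `weights` depends only on the input vectors, not on the result for the previous token, which is why all positions can be computed in parallel.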
This model was trained for 30 epochs. The 25th epoch showed the best performance (Figure 13). At this epoch, the losses were as in Table 8.
The gap between the training set and the test set is not so large, so overfitting is limited. The error in the test set is 10.84 years, the best among the sequence models so far.
A heatmap is a plot showing the activation level of each node in a neural network. In the case of images, a node corresponds to a pixel or a tiny local set of pixels. For example, in an image of an African elephant, pixels corresponding to its trunk are more important than pixels corresponding to the sky, the grass or the body in judging the identity of the object in the image (Figure 14).
When constructing a heatmap, the nodes in the last convolution layer are considered, and the heat scores are aggregated for each token. The more frequent a token is, the higher its aggregated heat score, so the effect of frequency needs to be normalized away.
The following two plots (Figure 15) show the relationship between each token’s heat score and its frequency. Before normalization, the two are strongly correlated, but after normalization, the effect of frequency is close to zero. (The two peaks in the plot correspond to unigrams and bigrams.)
I ordered the tokens (unigrams and bigrams) according to the normalized heat scores (Tables 9 and 10). As can be seen in these tables, the model pays attention to tokens characteristic of the date of the sample.
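The text does not say how the frequency effect was removed. One simple possibility, offered here only as an assumption, is to replace the summed heat of a token by its mean heat per occurrence. The function name `mean_heat_per_token` and the toy numbers are hypothetical.

```python
from collections import defaultdict


def mean_heat_per_token(tokens, heats):
    """Aggregate heat per token, then divide by the token's frequency."""
    total = defaultdict(float)
    count = defaultdict(int)
    for token, heat in zip(tokens, heats):
        total[token] += heat
        count[token] += 1
    return {token: total[token] / count[token] for token in total}


tokens = ["ㅣ", "ㅿ", "ㅣ", "ㅣ"]   # ㅣ is frequent, ㅿ is rare
heats = [0.1, 0.9, 0.2, 0.3]       # toy heat scores per occurrence
scores = mean_heat_per_token(tokens, heats)
print(sorted(scores, key=scores.get, reverse=True))
```

Without the division, a frequent token can accumulate a large total heat from many weak activations; after it, a rare but strongly activating token ranks first.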
I applied the LSTM model to documents of the 15th century (Table 11). The MSE is 163.10 and the MAE is 7.46 years, which shows that the model’s prediction is quite accurate. This result is unsurprising because these documents were seen repeatedly in the training process. Speaking metaphorically, a student can get a high score in a test if the questions were already known to her, which is a matter of course.
More important is the performance when the model is applied to data which were never seen by the model in the training process. For this purpose, I used documents from Jangseogak in the Academy of Korean Studies. These documents are mostly manuscripts, so the precise dates are not known. However, some researchers have been estimating the dates of these documents using linguistic features and historical events recorded in the documents.
Table 12 shows the date of these documents predicted by the LSTM model.
Experts on these documents say that these predictions are quite accurate. Although the predictions for some very short documents were somewhat wide of the mark, most predictions were close to the estimates of domain experts.
Table 13 shows the date of these documents predicted by the Transformer model.
The predictions of these two models are quite similar, as can be seen in Table 14. The MAE between these two predictions is 32.52 years. This shows that the predictions of these two models are quite stable and reliable.
A document can consist of many samples, and the model’s predictions as to samples coming from the same document can vary. The standard deviation of these predictions indicates the homogeneity or heterogeneity of the document. According to the standard deviations, Cheonjusilui is most homogeneous, whereas Gyehaebanjeongrok is most heterogeneous, and the two models agree in this respect too (Tables 12 and 13).
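The per-document figures can be obtained by grouping the sample-level predictions by title. The sketch below uses only the few rows of Table 14 that are printed (the `pred_lstm` column), so its numbers will not match the full-document values in Table 12. The use of the sample standard deviation is an assumption, and the function name `per_document_stats` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_document_stats(titles, predictions):
    """Return {title: (mean predicted year, standard deviation)}.

    Each document needs at least two samples for a standard deviation.
    """
    grouped = defaultdict(list)
    for title, prediction in zip(titles, predictions):
        grouped[title].append(prediction)
    return {title: (mean(p), stdev(p)) for title, p in grouped.items()}


titles = ["계해반정록"] * 5 + ["천주실의"] * 5
predictions = [
    1821.7692, 1856.4450, 1618.0725, 1600.5903, 1891.7070,   # 계해반정록
    1900.3711, 1890.8717, 1897.4349, 1891.5488, 1886.4396,   # 천주실의
]
stats = per_document_stats(titles, predictions)
for title, (avg, sd) in stats.items():
    print(title, round(avg, 1), round(sd, 1))
```

Even on these few rows, the spread for Gyehaebanjeongrok (계해반정록) is far larger than that for Cheonjusilui (천주실의), in line with the contrast described above.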
The plots in Figure 16 are histograms showing the dates of the samples of Gyehaebanjeongrok and Cheonjusilui predicted by the LSTM model. The samples of Cheonjusilui are quite homogeneous, whereas the samples of Gyehaebanjeongrok are divided into two clusters.
This pattern is repeated exactly in Figure 17, showing the dates of the samples of Gyehaebanjeongrok and Cheonjusilui predicted by the Transformer model.
In order to estimate the date of historical documents, I trained several neural network models using documents with known dates. As a result of the experiments, the Bag-of-words bigram model, the LSTM model and the Transformer model showed the best results. Whether we use characters (syllables) or graphemes (phonemes) as basic units doesn’t matter so much. It is promising that the LSTM and Transformer models produced very similar results. This indicates the stability and reliability of the models. It is also helpful that we can investigate the variability among samples coming from the same document.
Figure 1. Prefaces to Seokbosangjeol and Weolinseokbo
Figure 2. The first and last pages of Daemyeongyeongryeoljeon
Figure 3. Linguistic clues indicating the dates in Seokbosangjeol, and Gomunjinbo Eonhae
Figure 4. Architecture of the bag-of-words unigram model
Figure 5. Loss during training of the Bag-of-words unigram model
Figure 6. Loss during training of the regularized Bag-of-words unigram model
Figure 7. Loss during training of the power-increased Bag-of-words unigram model
Figure 8. Loss during training of the bag-of-words bigram model
Figure 9. Architecture of the CNN model
Figure 10. Loss during training of the CNN model
Figure 11. Architecture of the RNN model
Figure 12. Loss during training of the RNN model
Figure 13. Loss during training of the Transformer model
Figure 14. Heatmap showing the activation level of each pixel of an image
Figure 15. Plots showing heat score vs. frequency
Figure 16. Histograms by the LSTM model
Figure 17. Histograms by the Transformer model
Table 1. Trade-off between the length of a sample and the number of samples

| Length of a sample (characters) | # of samples | Training set (64%) | Validation set (16%) | Test set (20%) |
|---|---|---|---|---|
| 100 | 145,700 | 93,248 | 23,312 | 29,140 |
| 200 | 72,850 | 46,624 | 11,656 | 14,570 |
| 300 | 48,504 | 31,042 | 7,761 | 9,701 |
| 350 | 41,605 | 26,627 | 6,657 | 8,321 |
| 400 | 36,425 | 23,312 | 5,828 | 7,285 |
Table 2. Loss of the Bag-of-words unigram model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 55.65 | 4.60 |
| Validation set | 1332.23 | 19.06 |
| Test set | 1579.12 | 20.68 |

Table 3. Loss of the regularized Bag-of-words unigram model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 2769.32 | 40.62 |
| Validation set | 585.44 | 11.79 |
| Test set | 827.23 | 15.85 |

Table 4. Loss of the power-increased Bag-of-words unigram model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 1080.93 | 24.94 |
| Validation set | 489.35 | 10.01 |
| Test set | 712.51 | 11.40 |

Table 5. Loss of the bag-of-words bigram model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 1036.47 | 24.09 |
| Validation set | 392.33 | 7.19 |
| Test set | 482.18 | 8.54 |

Table 6. Loss of the CNN model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 568.11 | 16.03 |
| Validation set | 1752.27 | 26.32 |
| Test set | 2111.83 | 27.10 |

Table 7. Loss of the RNN model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 103.44 | 6.80 |
| Validation set | 575.09 | 11.26 |
| Test set | 649.61 | 11.60 |

Table 8. Loss of the Transformer model

| | MSE (years squared) | MAE (years) |
|---|---|---|
| Training set | 80.84 | 5.45 |
| Validation set | 594.07 | 10.56 |
| Test set | 698.46 | 10.84 |
Table 9. Unigram graphemes with the highest normalized heat scores (image in the original)

Table 10. Bigram graphemes with the highest normalized heat scores (image in the original)
Table 11. Ground truth and predicted years of the documents of the 15th century

| | title | year | pred |
|---|---|---|---|
| 0 | 석보상절03 | 1447 | 1444.26123 |
| 1 | 석보상절03 | 1447 | 1449.315063 |
| 2 | 석보상절03 | 1447 | 1464.890869 |
| 3 | 석보상절03 | 1447 | 1447.827393 |
| 4 | 석보상절03 | 1447 | 1434.583618 |
| ... | ... | ... | ... |
| 6464 | 진언권공 | 1496 | 1476.577759 |
| 6465 | 진언권공 | 1496 | 1530.682373 |
| 6466 | 진언권공 | 1496 | 1498.449951 |
| 6467 | 진언권공 | 1496 | 1485.773193 |
| 6468 | 진언권공 | 1496 | 1499.771729 |

6469 rows × 3 columns
Table 12. Predictions and standard deviations of the LSTM model

Dates of the Jangseogak documents predicted by the LSTM model

| title | pred |
|---|---|
| 계해반정록 | 1755.16872 |
| 고문백선 | 1813.535319 |
| 국조고사 | 1794.245642 |
| 낙성비룡 | 1775.297715 |
| 명행정의록 | 1846.645625 |
| 무오연행록 | 1841.143464 |
| 벽허담관제언록 | 1848.944505 |
| 병자록 | 1743.86153 |
| 사문대의록 | 1765.28703 |
| 산성일기_병자 | 1783.439747 |
| 선보잡락언해 | 1888.506897 |
| 선택요람 | 1853.460227 |
| 선부군언행유사 | 1766.100093 |
| 신미록 | 1791.836272 |
| 실록초본 | 1890.142447 |
| 엄씨효문청행록 | 1851.211567 |
| 열성지장통기 | 1728.978923 |
| 열성후비지문 | 1743.873079 |
| 완월회맹연 | 1836.439343 |
| 유씨삼대록 | 1817.196457 |
| 유이양문록 | 1840.931107 |
| 윤하정삼문취록 | 1848.837982 |
| 임신평난록 | 1866.658294 |
| 정미가례시일기 | 1835.123678 |
| 조야기문 | 1750.045986 |
| 조야첨재 | 1828.512066 |
| 조야회통 | 1828.990723 |
| 학석집(한글) | 1878.469922 |
| 한조삼성기봉 | 1859.522524 |
| 현몽쌍룡기 | 1831.715601 |
| 홍경내전 | 1781.141518 |
| 화씨충효록 | 1836.499052 |
| 화정선행록 | 1831.80543 |
| 천주실의 | 1896.516627 |

Standard deviations of predicted dates per document

| title | pred |
|---|---|
| 천주실의 | 5.623350 |
| 실록초본 | 9.907878 |
| 선보잡락언해 | 13.722932 |
| 임신평난록 | 21.163782 |
| 한조삼성기봉 | 30.573881 |
| 엄씨효문청행록 | 32.308008 |
| 윤하정삼문취록 | 33.930213 |
| 명행정의록 | 36.441314 |
| 벽허담관제언록 | 37.182772 |
| 유이양문록 | 37.575597 |
| ... | ... |
| 열성지장통기 | 80.957604 |
| 선부군언행유사 | 81.263283 |
| 열성후비지문 | 81.842455 |
| 조야기문 | 82.426335 |
| 병자록 | 83.255140 |
| 신미록 | 91.391200 |
| 홍경내전 | 104.377288 |
| 계해반정록 | 110.574844 |
Table 13. Predictions and standard deviations of the Transformer model

Dates of the Jangseogak documents predicted by the Transformer model

| title | pred |
|---|---|
| 계해반정록 | 1776.894681 |
| 고문백선 | 1820.294533 |
| 국조고사 | 1830.799083 |
| 낙성비룡 | 1802.047043 |
| 명행정의록 | 1862.019467 |
| 무오연행록 | 1843.966993 |
| 벽허담관제언록 | 1867.81161 |
| 병자록 | 1767.580819 |
| 사문대의록 | 1767.979588 |
| 산성일기_병자 | 1776.561644 |
| 선보잡락언해 | 1900.322198 |
| 선택요람 | 1817.476385 |
| 선부군언행유사 | 1810.446473 |
| 신미록 | 1818.575 |
| 실록초본 | 1898.757509 |
| 엄씨효문청행록 | 1852.775256 |
| 열성지장통기 | 1774.778172 |
| 열성후비지문 | 1781.626921 |
| 완월회맹연 | 1850.06383 |
| 유씨삼대록 | 1843.345866 |
| 유이양문록 | 1864.132671 |
| 윤하정삼문취록 | 1855.230312 |
| 임신평난록 | 1868.725029 |
| 정미가례시일기 | 1794.487139 |
| 조야기문 | 1764.18373 |
| 조야첨재 | 1849.432562 |
| 조야회통 | 1855.332041 |
| 학석집(한글) | 1877.044336 |
| 한조삼성기봉 | 1873.623453 |
| 현몽쌍룡기 | 1850.699927 |
| 홍경내전 | 1834.164509 |
| 화씨충효록 | 1856.086874 |
| 화정선행록 | 1853.198077 |
| 천주실의 | 1897.263221 |

Standard deviations of predicted dates per document

| title | pred |
|---|---|
| 천주실의 | 5.582687 |
| 실록초본 | 9.633203 |
| 선보잡락언해 | 11.201670 |
| 임신평난록 | 23.407985 |
| 한조삼성기봉 | 27.563781 |
| 윤하정삼문취록 | 33.265788 |
| 명행정의록 | 33.336713 |
| 엄씨효문청행록 | 34.391305 |
| 벽허담관제언록 | 34.903389 |
| 유이양문록 | 38.701650 |
| ... | ... |
| 사문대의록 | 82.510204 |
| 병자록 | 82.621584 |
| 홍경내전 | 84.433661 |
| 신미록 | 84.848653 |
| 조야기문 | 84.979514 |
| 열성후비지문 | 89.220922 |
| 정미가례시일기 | 89.991773 |
| 계해반정록 | 110.655007 |
Table 14. LSTM and Transformer models’ predictions on the Jangseogak documents

| | title | pred | pred_lstm |
|---|---|---|---|
| 0 | 계해반정록 | 1853.388672 | 1821.7692 |
| 1 | 계해반정록 | 1876.112305 | 1856.4450 |
| 2 | 계해반정록 | 1609.185303 | 1618.0725 |
| 3 | 계해반정록 | 1710.976562 | 1600.5903 |
| 4 | 계해반정록 | 1885.3573 | 1891.7070 |
| ... | ... | ... | ... |
| 49474 | 천주실의 | 1893.72644 | 1900.3711 |
| 49475 | 천주실의 | 1896.940674 | 1890.8717 |
| 49476 | 천주실의 | 1887.2854 | 1897.4349 |
| 49477 | 천주실의 | 1897.868408 | 1891.5488 |
| 49478 | 천주실의 | 1889.480225 | 1886.4396 |

49479 rows × 3 columns
