DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

Jinho Park

doi:10.66532/jhai.2026.0005

Articles

Page Path

Research Article

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

Jinho Park^*

Journal of Humanities and AI 2026;1(1):64-80.
Published online: March 31, 2026

DOI: https://doi.org/10.66532/jhai.2026.0005

^*Department of Korean Language and Literature at Seoul National University

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

227 Views
31 Download

prev next

Full Article

Download PDF

Abstract
1. SETTING THE SCENE
2. PREPARING THE DATA
3. BAG-OF-WORDS UNIGRAM MODEL
4. ADDING REGULARIZATION TO A BAG-OF-WORDS / UNIGRAM MODEL
5. INCREASING THE MODEL’S POWER
6. BAG-OF-WORDS / BIGRAM MODEL
7. CNN MODEL
8. RNN MODEL
9. DECOMPOSITION INTO GRAPHEMES
10. TRANSFORMER MODEL
11. HEATMAP
12. INFERENCE AND EVALUATION
13. CONCLUSION
REFERENCES

Abstract

The date of a historical document, if it was published, can be recognized through the date of its preface or copyright page. However, dating an unpublished manuscript is much more difficult. In the case of old Hangeul documents, various linguistic features can be used to estimate the approximate date of the document, but as the number of features increases, the task tends to go beyond the purview of an individual human researcher, and become more appropriate to AI. This paper shows how artificial neural networks can be trained to estimate the date of documents using material whose date is known. For this purpose, various kinds of neural networks are examined: Bag-of-words model, CNN, RNN and Transformer. In addition, these models can be further sub-divided: unigram or bigram, character(syllable)-based or grapheme(phoneme)-based. After trained on documents with known dates, these models are applied to new (unseen) data, and the results are evaluated.
Keywords: deep learning, neural network, dating, historical document, inference

1. SETTING THE SCENE

Among materials on the history of the Korean language, printed editions with colophons have been actively utilized because their publication dates are clear. For example, the prefaces of Seokbosangjeol and Weolinseokbo contain Chinese era names, so these indicate the years of compilation or publishing (Figure 1).

On the other hand, manuscripts have generally been used less frequently in research on the history of the Korean language due to their unclear dates. For example, Daemyeongyeongryeoljeon, a manuscript novel, has no indication of the year of writing (Figure 2).

Nevertheless, given the substantial volume of manuscript materials, it would be beneficial to estimate their dates in some way so that they can be utilized in research on the history of the Korean language.

Linguists can use linguistic clues (e.g. consonants ㅸ, ㅿ, ㅭ or phonological phenomena such as palatalization 아디>아지) to estimate the approximate dates of these documents (Figure 3).

However, there are enormous number of linguistic features which can be used as clues for estimating the date, which cannot be considered by human researchers in their entirety, so they generally use very small part of these evidences. In such cases, machines (especially deep learning models) are excellent in taking all these features into account, so using deep learning for this task seems to be promising.

Many linguists and literary scholars are not familiar with machine learning concepts and techniques. Some of them think that they need not know much about machine learning, expecting that AI will solve their problems. However, when experimenting with machine learning to solve a task, one should make many decisions (about e.g. models, hyperparameters, etc.). These decisions require knowledge of machine learning. We need educated guidance in deciding which option to try.

2. PREPARING THE DATA

In order to train a deep learning model, we need many samples, usually several dozens of thousand. The amount of the extant Korean historical documents is fixed. If we set the length of each sample short, we can get many samples. But then we can risk having few clues of the date in a sample. If we set the length of each sample long, we have fewer samples. As a result of some experimenting, I set 300 characters as the length of each sample, which results in about 50 thousand samples (Table 1).

3. BAG-OF-WORDS UNIGRAM MODEL

As a first rough attempt, I tried a bag-of-words unigram model. ‘Bag-of-words’ means that each sample is considered as a set of tokens, disregarding the order. ‘Unigram’ means that each single character is a basic unit. (Bigram models mean that each pair of characters, as well as each single character, is a basic unit, which will be touched upon later.)

In order to enter each sample into a neural network, it should be represented as a number or a series of numbers (i.e. a vector). This process is called vectorization, embedding or encoding. Several methods of encoding have been invented. I used one-hot encoding. In this encoding scheme, only the most frequent ten thousand characters were considered. Each sample is represented as a vector of ten thousand digits. ‘1’ means that the sample contains the letter, whereas ‘0’ means otherwise. This process is covered at the Text Vectorization Layer.

As for the architecture of the neural network (Figure 4), I used a dense (also called fully connected) network consisting of three dense layers. In Dense Layer 1, the dimensionality of each vector representing a sample is reduced from 10,000 to 32. In Dense Layers 2 and 3, the dimensionality is further reduced to 16 and finally to 1. This final floating number is supposed to correspond to the estimated year of the input sample.

This model was trained for 100 epochs. The 95th epoch showed the best performance (Figure 5). At this epoch, the losses, i.e. MSE (mean squared error) and MAE (mean absolute error), were as in Table 2.

The error of about 20 years seems to be promising. Even human experts cannot produce this level of accuracy. They usually estimate at the level of half century. The first very simple model achieved a very good (super-human) result!

4. ADDING REGULARIZATION TO A BAG-OF-WORDS / UNIGRAM MODEL

The above model shows a quite large difference between the training set and the validation set, which means that overfitting is occurring. When overfitting occurs, the model’s performance for new data tends to degrade. Many methods have been created for restricting the model’s power to prevent or alleviate overfitting. The first is layer regularizers, by which the sum of absolute values (L1) or squares (L2) of weights is added to the loss in order to prevent the weights from becoming too large. The second is batch normalization, by which each batch of samples is normalized. The third is dropout, by which a fixed proportion (in our case 0.3) of nodes in each layer is set to zero. In compensation for regularization, I doubled the number of nodes of Dense Layer 1 (32 to 64) and 2 (16 to 32).

This model was also trained for 100 epochs. The 65th epoch showed the best performance (Figure 6). At this epoch, the losses were as in Table 3.

The performance in the test set is most important, because it shows the estimated error rate for new data. The error of 15.85 years seems very good, much better than the previous result of about 20 years. But it seems odd that the performance in the test set is better than that in the training set.

5. INCREASING THE MODEL’S POWER

The bizarre phenomenon in which the performance in the test set is better than that in the training set, may be due to the lack of power of the model, compared to the added regularization. So I doubled again the number of nodes of Dense Layer 1 (64 to 128) and 2 (32 to 64), and decreased the dropout rate from 0.3 to 0.2.

This model was also trained for 100 epochs. The 67th epoch showed the best performance (Figure 7). At this epoch, the losses were as in Table 4.

The bizarre phenomenon in which the performance in the test set is better than that in the training set, still appears. The good news is that the error has decreased to 11.4 years.

6. BAG-OF-WORDS / BIGRAM MODEL

In the above experiment, each single character was considered as a basic unit. Of course, whether a particular single character appears in a sample can be a good indicator of the date of the sample, but in addition, whether a particular sequence of two consecutive characters appears in a sample can also be a good clue. For example, the sequence of 아 and 비 can be considered as an evidence that umlaut has not yet occurred in this sample, whereas the sequence of 애 and 비 can be an (although not perfect) indicator of umlaut.

Therefore, I experimented with bigrams. The library Keras provides the option ‘ngrams=2’ when generating the object of the class TextVectorization. As we should consider bigrams as well as unigrams, I increased the number of token types from 10,000 to 20,000. The architecture of the model is the same as the previous one. In Dense Layer 1, the dimensionality of each sample is reduced from 20,000 to 128, which is further reduced to 64 and to 1 in Dense Layer 2 and 3 respectively. I added the three kinds of regularization to Dense Layer 1 and 2.

This model was also trained for 100 epochs. The 83th epoch showed the best performance (Figure 8). At this epoch, the losses were as in Table 5.

The bizarre phenomenon in which the performance in the test set is better than that in the training set, still appears. The good news is that the error has decreased to 8.54 years.

7. CNN MODEL

In the area of computer vision, CNN (convolutional neural network) models have been popular. In order to extract local patterns from 2-dimensional image, CNN uses a small 2-D window. As a text is a 1-dimensional sequence, 1-D window is used in CNN models for textual data.

In order to enter a sample (1-D sequence of tokens), we should encode each token (character) as a vector. This task is covered in Embedding Layer (Figure 9). In our case, each character was encoded as a 32-dimensional vector. In Convolution Layer 1, a window of size 3 moves from the beginning to the end of a sample, and extracts information. Due to this process, the dimensionality of each sample is reduced from 300 to 298. In the following two Convolution Layers, the same process is repeated, and the dimensionality is further reduced to 296 and to 294. In Convolution Layer 3, each node corresponds to seven tokens in the input sample. In Global Pooling Layer, all the information is synthesized. In Dense Layer, each sample is reduced to one floating point number, which is the estimated year of the sample.

This model was trained for 50 epochs. The 24th epoch showed the best performance (Figure 10). At this epoch, the losses were as in Table

Overfitting occurs, as the difference between the performance in the training set and that in the test set is quite large. The error in the test set is about 27 years, which is not so good.

8. RNN MODEL

CNN is appropriate for extracting position-invariant information. ‘Position-invariant’ means that it doesn’t matter where in the input sample a feature is located. This accords with image data, but not with textual data, in which the positional information matters. For textual data, RNN (recurrent neural network) is better suited than CNN. Among several variants of RNN, LSTM performs very well, so I chose that (Figure 11).

Just like CNN, each token is encoded as a vector of 64 numbers in Embedding Layer. Next, in the bidirectional LSTM Layer, each sample is summarized as a single vector, going forward and backward inside the sample. Finally, in Dense Layers, each sample outputs a single number, the estimated year.

This model was trained for 30 epochs. The 25th epoch showed the best performance (Figure 12). At this epoch, the losses were as in Table 7.

Overfitting doesn’t occur, as the difference between the performance in the training set and that in the test set is small. The error in the test set is about 11 years, which is the best record so far.

9. DECOMPOSITION INTO GRAPHEMES

In Hangeul, a character consists of three graphemes (an onset, a nucleus and a coda). The information of which grapheme follows which grapheme (especially across syllable/character boundaries) is very important, and can contribute to the estimation of the date. Therefore, I experimented by decomposing each character into graphemes.

Each sample consists of up to 300 characters, and when decomposed into graphemes, the length (number of graphemes) of a sample increases. The longest sample was 933 tokens long. The number of the kinds of characters was 13737, but when decomposed into graphemes, the number of the kinds of tokens decreases to 9805. Among these 9805 tokens, only the 5000 most frequent tokens are considered.

The CNN unigram model showed the MAE 20.73 years, whereas the RNN model showed the MAE 16.00 years.

The importance of bigram models increases when using grapheme decomposition. For example, the sequence of ㄷ and ㅣ can be considered as an evidence that palatalization has not yet occurred in this sample, whereas the sequence of ㅈ and ㅣ can be an (although not perfect) indicator of palatalization. I only considered the 20,000 most frequent unigrams and bigrams. The MAE of the CNN model was 21.55 years, the RNN model 13.59 years, the Bag-of-words model 10.51 years, and the Bag-of-words model using TF-IDF 9.87 years. These models showed small increases in performance, but the margins are not so large.

10. TRANSFORMER MODEL

When an input sample is a sequence of tokens and the order of tokens matters, the traditional approach to capturing these sequential patterns has been RNN, including LSTM. But as RNN must process input tokens one by one and it is impossible to parallelize, it takes enormous time to train an RNN model. The transformer model, released in 2017, captures the dependency patterns among input tokens through attention mechanism and it is possible to parallelize, it takes much less time to train a transformer model. Therefore, I experimented with a transformer model, using the TransformerEncoder layer in the library Keras. The architecture of the model is almost identical to the RNN model, only replacing the LSTM layer with the TransformerEncoder layer.

This model was trained for 30 epochs. The 25th epoch showed the best performance (Figure 13). At this epoch, the losses were as in Table 8.

Overfitting doesn’t occur, as the difference between the performance in the training set and that in the test set is not so large. The error in the test set is about 10.84 years, which is the best record so far.

11. HEATMAP

A heatmap is a plot showing the activation level of each node in a neural network. In the case of images, a node corresponds to a pixel or a tiny local set of pixels. For example, in an image of an African elephant, pixels corresponding to its trunk are more important than pixels corresponding to the sky, the grass or the body in judging the identity of the object in the image (Figure 14).

When constructing a heatmap, the nodes in the last convolution layer are considered, and the heat scores are aggregated per each token. The higher the frequency of the token is, the higher the aggregated heat score is, so we need to normalize the effect of frequency.

The following two plots (Figure 15) show the relationship between each token’s heat score and frequency. Before normalizing, the two shows strong correlation, but after normalized, the effect of frequency is near to zero. (The two peaks in the plot corresponds to unigrams and bigrams.

I ordered the tokens (unigrams and bigrams) according to the normalized heat scores (Tables 9 and 10). As can be seen in these tables, the model pays attention to tokens characteristic to the date of the sample.

12. INFERENCE AND EVALUATION

I applied the LSTM model to documents of the 15th century (Table 11). The MSE is 163.10 and the MAE is 7.46 years, which shows that the model’s prediction is quite accurate. This result is unsurprising because these documents were seen repeatedly in the training process. Speaking metaphorically, a student can get a high score in a test if the questions were already known to her, which is a matter of course.

More important is the performance when the model is applied to data which were never seen by the model in the training process. For this purpose, I used documents from Jangseogak in the Academy of Korean Studies. These documents are mostly manuscripts, so the precise dates are not known. However, some researchers have been estimating the dates of these documents using linguistic features and historical events recorded in the documents.

Table 12 shows the date of these documents predicted by the LSTM model.

Experts in these documents says that these predictions are quite accurate. Although the predictions as to some very short documents went somewhat wide, most predictions were near the estimation by domain experts.

Table 13 shows the date of these documents predicted by the Transformer model.

The predictions of these two models are quite similar, as can be seen in Table 14. The MAE between these two predictions is 32.52 years. This shows that the predictions of these two models are quite stable and reliable.

A document can consist of many samples, and the model’s predictions as to samples coming from the same document can vary. The standard deviation of these predictions indicates the homogeneity or heterogeneity of the document. According to the standard deviations, Cheonjusilui is most homogeneous, whereas Gyehaebanjeongrok is most heterogeneous, and the two models agree in this respect too (Tables 12 and 13).

The plots in Figure 16 are the histograms showing the dates of the samples of Gyehaebanjeongrok and Cheonjusilui predicted by the LSTM model. The samples of Cheonjusilui are quite homogeneous, whereas the samples Gyehaebanjeongrok of are divided into two clusters.

This pattern is repeated exactly in Figure 17, showing the dates of the samples of Gyehaebanjeongrok and Cheonjusilui predicted by the Tramsformer model.

13. CONCLUSION

In order to estimate the date of historical documents, I trained several neural network models using document with known dates. As a result of the experiments, the Bag-of-words bigram model, the LSTM model and the Transformer model showed the best results. Whether we use characters (syllables) or graphemes (phonemes) as basic units doesn’t matter so much. It is promising that the LSTM and Transformer models brought forward very similar results. This indicates the stability and reliability of the models. It is also helpful that we can investigate the variability among samples coming from the same document.

Figure 1

Prefaces to Seokbosangjeol and Weolinseokbo

Figure 2

The first and last pages of Daemyeongyeongryeoljeon

Figure 3

Linguistic clues indicating the dates in Seokbosangjeol, and Gomunjinbo Eonhae

Figure 4

Architecture of the bag-of-words unigram model

Figure 5

Loss during training of the Bag-of-words

Figure 6

Loss during training of the regularized Bag-of-words unigram model

Figure 7

Loss during training of the power-increased Bag-of-words unigram model

Figure 8

Loss during training of the bag-of-words bigram model

Figure 9

Architecture of the CNN model

Figure 10

Loss during training of the CNN model

Figure 11

Architecture of the RNN model

Figure 12

Loss during training of the RNN model

Figure 13

Loss during training of the Transformer model

Figure 14

Heatmap showing the activation level of each pixel of an image

Figure 15

Plots showing heat score vs. frequency

Figure 16

Histograms by the LSTM model

Figure 17

Histograms by the Transformer model

Table 1.

Trade-off between the length of a sample and the number of samples

Table 1.
Length of a sample	# of samples	Training set (64%)	Validation set (16%)	Test set (20%)
100	145,700	93,248	23,312	29,140
200	72,850	46,624	11,656	14,570
300	48,504	31,042	7,761	9,701
350	41,605	22,627	6,657	8,321
400	36,425	23,312	5,828	7,285

Table 2.

Loss of the Bag-of-words unigram model

Table 2.
	MSE (years squared)	MAE (years)
Training set	55.65	4.60
Validation set	1332.23	19.06
Test set	1579.12	20.68

Table 3.

Loss of the regularized Bag-of-words unigram model

Table 3.
	MSE (years squared)	MAE (years)
Training set	2769.32	40.62
Validation set	585.44	11.79
Test set	827.23	15.85

Table 4.

Loss of the power-increased Bag-of-words unigram model

Table 4.
	MSE (years squared)	MAE (years)
Training set	1080.93	24.94
Validation set	489.35	10.01
Test set	712.51	11.40

Table 5.

Loss of the bag-of-words bigram model

Table 5.
	MSE (years squared)	MAE (years)
Training set	1036.47	24.09
Validation set	392.33	7.19
Test set	482.18	8.54

Table 6.

Loss of the CNN model

Table 6.
	MSE (years squared)	MAE (years)
Training set	568.11	16.03
Validation set	1752.27	26.32
Test set	2111.83	27.10

Table 7.

Loss of the RNN model

Table 7.
	MSE (years squared)	MAE (years)
Training set	103.44	6.80
Validation set	575.09	11.26
Test set	649.61	11.60

Table 8.

Loss of the Transformer model

Table 8.
	MSE (years squared)	MAE (years)
Training set	80.84	5.45
Validation set	594.07	10.56
Test set	698.46	10.84

Table 9.

Unigram graphemes with the highest normalized heat scores

Table 9.

Table 10.

Bigram graphemes with the highest normalized heat scores

Table 10.

Table 11.

Ground truth and predicted years of the documents of the 15th century

Table 11.
	title	year	pred
0	석보상절03	1447	1444.26123
1	석보상절03	1447	1449.315063
2	석보상절03	1447	1464.890869
3	석보상절03	1447	1447.827393
4	석보상절03	1447	1434.583618
...	...	...	...
6464	진언권공	1496	1476.577759
6465	진언권공	1496	1530.682373
6466	진언권공	1496	1498.449951
6467	진언권공	1496	1485.773193
6468	진언권공	1496	1499.771729
6469 rows × 3 columns

Table 12.

Predictions and standard deviations of the LSTM model

Table 12.
Dates of the Jangseogak documents predicted by the LSTM model				Standard deviations of predicted dates per each document
	pred	열성지장통기	1728.978923		pred
title		열성후비지문	1743.873079	title
계해반정록	1755.16872	완월회맹연	1836.439343	천주실의	5.623350
고문백선	1813.535319	유씨삼대록	1817.196457	실록초본	9.907878
국조고사	1794.245642	유이양문록	1840.931107	선보잡락언해	13.722932
낙성비룡	1775.297715	윤하정삼문취록	1848.837982	임신평난록	21.163782
명행정의록	1846.645625	임신평난록	1866.658294	한조삼성기봉	30.573881
무오연행록	1841.143464	정미가례시일기	1835.123678	엄씨효문청행록	32.308008
벽허담관제언록	1848.944505	조야기문	1750.045986	윤하정삼문취록	33.930213
병자록	1743.86153	조야첨재	1828.512066	명행정의록	36.441314
사문대의록	1765.28703	조야회통	1828.990723	벽허담관제언록	37.182772
산성일기_병자	1783.439747	학석집(한글)	1878.469922	유이양문록	37.575597
선보잡락언해	1888.506897	한조삼성기봉	1859.522524	......
선택요람	1853.460227	현몽쌍룡기	1831.715601	열성지장통기	80.957604
선부군언행유사	1766.100093	홍경내전	1781.141518	선부군언행유사	81.263283
신미록	1791.836272	화씨충효록	1836.499052	열성후비지문	81.842455
실록초본	1890.142447	화정선행록	1831.80543	조야기문	82.426335
엄씨효문청행록	1851.211567	천주실의	1896.516627	병자록	83.255140
				신미록	91.391200
				홍경내전	104.377288
				계해반정록	110.574844

Table 13.

Predictions and standard deviations of the Transformer model

Table 13.
Dates of the Jangseogak documents predicted by the LSTM model				Standard deviations of predicted dates per each document
	pred	열성지장통기	1774.778172		pred
title		열성후비지문	1781.626921	title
계해반정록	1776.894681	완월회맹연	1850.06383	천주실의	5.582687
고문백선	1820.294533	유씨삼대록	1843.345866	실록초본	9.633203
국조고사	1830.799083	유이양문록	1864.132671	선보잡락언해	11.201670
낙성비룡	1802.047043	윤하정삼문취록	1855.230312	임신평난록	23.407985
명행정의록	1862.019467	임신평난록	1868.725029	한조삼성기봉	27.563781
무오연행록	1843.966993	정미가례시일기	1794.487139	윤하정삼문취록	33.265788
벽허담관제언록	1867.81161	조야기문	1764.18373	명행정의록	33.336713
병자록	1767.580819	조야첨재	1849.432562	엄씨효문청행록	34.391305
사문대의록	1767.979588	조야회통	1855.332041	벽허담관제언록	34.903389
산성일기_병자	1776.561644	학석집(한글)	1877.044336	유이양문록	38.701650
선보잡락언해	1900.322198	한조삼성기봉	1873.623453	......
선택요람	1817.476385	현몽쌍룡기	1850.699927	사문대의록	82.510204
선부군언행유사	1810.446473	홍경내전	1834.164509	병자록	82.621584
신미록	1818.575	화씨충효록	1856.086874	홍경내전	84.433661
실록초본	1898.757509	화정선행록	1853.198077	신미록	84.848653
엄씨효문청행록	1852.775256	천주실의	1897.263221	조야기문	84.979514
				열성후비지문	89.220922
				정미가례시일기	89.991773
				계해반정록	110.655007

Table 14.

LSTM and Transformer models’ predictions on the Jangseogak documents

Table 14.
	title	pred	pred_lstm
0	계해반정록	1853.388672	1821.7692
1	계해반정록	1876.112305	1856.4450
2	계해반정록	1609.185303	1618.0725
3	계해반정록	1710.976562	1600.5903
4	계해반정록	1885.3573	1891.7070
...	...	...	...
49474	천주실의	1893.72644	1900.3711
49475	천주실의	1896.940674	1890.8717
49476	천주실의	1887.2854	1897.4349
49477	천주실의	1897.868408	1891.5488
49478	천주실의	1889.480225	1886.4396
49479 rows × 3 columns

REFERENCES

Bishop, Christopher. 2006. Pattern Recognition and Machine Learning. Springer.
Bishop, Christopher and Hugh Bishop. 2023. Deep Learning: Foundations and Concepts. Springer.
Chollet, François. 2022. Deep learning with python, 2nd edition. Manning Publications.
Géron, Aurélien. 2022. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 3rd edition. O'Reilly Media.

Figure & Data

References

Citations

Citations to this article as recorded by

Cite

CITE

export

Copy Download

Format
XML Download

Download Citation

Download a citation file in RIS format that can be imported by all major citation management software, including EndNote, ProCite, RefWorks, and Reference Manager.

Format:

RIS — For EndNote, ProCite, RefWorks, and most other reference management software
BibTeX — For JabRef, BibDesk, and other BibTeX-specific software

Include:

Citation for the content below
Citation and abstract for the content below

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

J Humanit AI. 2026;1(1):64-80. Published online March 31, 2026

DOI: https://doi.org/10.66532/jhai.2026.0005

Download Citation

Download a citation file in RIS format that can be imported by all major citation management software, including EndNote, ProCite, RefWorks, and Reference Manager.

Format:

RIS — For EndNote, ProCite, RefWorks, and most other reference management software
bib — For JabRef, BibDesk, and other BibTeX-specific software

Include:

Citation for the content below
Citation and abstract for the content below

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

J Humanit AI. 2026;1(1):64-80. Published online March 31, 2026

DOI: https://doi.org/10.66532/jhai.2026.0005

Figure

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

Figure 1 Prefaces to Seokbosangjeol and Weolinseokbo

Figure 2 The first and last pages of Daemyeongyeongryeoljeon

Figure 3 Linguistic clues indicating the dates in Seokbosangjeol, and Gomunjinbo Eonhae

Figure 4 Architecture of the bag-of-words unigram model

Figure 5 Loss during training of the Bag-of-words

Figure 6 Loss during training of the regularized Bag-of-words unigram model

Figure 7 Loss during training of the power-increased Bag-of-words unigram model

Figure 8 Loss during training of the bag-of-words bigram model

Figure 9 Architecture of the CNN model

Figure 10 Loss during training of the CNN model

Figure 11 Architecture of the RNN model

Figure 12 Loss during training of the RNN model

Figure 13 Loss during training of the Transformer model

Figure 14 Heatmap showing the activation level of each pixel of an image

Figure 15 Plots showing heat score vs. frequency

Figure 16 Histograms by the LSTM model

Figure 17 Histograms by the Transformer model

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Figure 15

Figure 16

Figure 17

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

Length of a sample	# of samples	Training set (64%)	Validation set (16%)	Test set (20%)
100	145,700	93,248	23,312	29,140
200	72,850	46,624	11,656	14,570
300	48,504	31,042	7,761	9,701
350	41,605	22,627	6,657	8,321
400	36,425	23,312	5,828	7,285

	MSE (years squared)	MAE (years)
Training set	55.65	4.60
Validation set	1332.23	19.06
Test set	1579.12	20.68

	MSE (years squared)	MAE (years)
Training set	2769.32	40.62
Validation set	585.44	11.79
Test set	827.23	15.85

	MSE (years squared)	MAE (years)
Training set	1080.93	24.94
Validation set	489.35	10.01
Test set	712.51	11.40

	MSE (years squared)	MAE (years)
Training set	1036.47	24.09
Validation set	392.33	7.19
Test set	482.18	8.54

	MSE (years squared)	MAE (years)
Training set	568.11	16.03
Validation set	1752.27	26.32
Test set	2111.83	27.10

	MSE (years squared)	MAE (years)
Training set	103.44	6.80
Validation set	575.09	11.26
Test set	649.61	11.60

	MSE (years squared)	MAE (years)
Training set	80.84	5.45
Validation set	594.07	10.56
Test set	698.46	10.84

	title	year	pred
0	석보상절03	1447	1444.26123
1	석보상절03	1447	1449.315063
2	석보상절03	1447	1464.890869
3	석보상절03	1447	1447.827393
4	석보상절03	1447	1434.583618
...	...	...	...
6464	진언권공	1496	1476.577759
6465	진언권공	1496	1530.682373
6466	진언권공	1496	1498.449951
6467	진언권공	1496	1485.773193
6468	진언권공	1496	1499.771729
6469 rows × 3 columns

Dates of the Jangseogak documents predicted by the LSTM model				Standard deviations of predicted dates per each document
	pred	열성지장통기	1728.978923		pred
title		열성후비지문	1743.873079	title
계해반정록	1755.16872	완월회맹연	1836.439343	천주실의	5.623350
고문백선	1813.535319	유씨삼대록	1817.196457	실록초본	9.907878
국조고사	1794.245642	유이양문록	1840.931107	선보잡락언해	13.722932
낙성비룡	1775.297715	윤하정삼문취록	1848.837982	임신평난록	21.163782
명행정의록	1846.645625	임신평난록	1866.658294	한조삼성기봉	30.573881
무오연행록	1841.143464	정미가례시일기	1835.123678	엄씨효문청행록	32.308008
벽허담관제언록	1848.944505	조야기문	1750.045986	윤하정삼문취록	33.930213
병자록	1743.86153	조야첨재	1828.512066	명행정의록	36.441314
사문대의록	1765.28703	조야회통	1828.990723	벽허담관제언록	37.182772
산성일기_병자	1783.439747	학석집(한글)	1878.469922	유이양문록	37.575597
선보잡락언해	1888.506897	한조삼성기봉	1859.522524	......
선택요람	1853.460227	현몽쌍룡기	1831.715601	열성지장통기	80.957604
선부군언행유사	1766.100093	홍경내전	1781.141518	선부군언행유사	81.263283
신미록	1791.836272	화씨충효록	1836.499052	열성후비지문	81.842455
실록초본	1890.142447	화정선행록	1831.80543	조야기문	82.426335
엄씨효문청행록	1851.211567	천주실의	1896.516627	병자록	83.255140
				신미록	91.391200
				홍경내전	104.377288
				계해반정록	110.574844

Dates of the Jangseogak documents predicted by the LSTM model				Standard deviations of predicted dates per each document
	pred	열성지장통기	1774.778172		pred
title		열성후비지문	1781.626921	title
계해반정록	1776.894681	완월회맹연	1850.06383	천주실의	5.582687
고문백선	1820.294533	유씨삼대록	1843.345866	실록초본	9.633203
국조고사	1830.799083	유이양문록	1864.132671	선보잡락언해	11.201670
낙성비룡	1802.047043	윤하정삼문취록	1855.230312	임신평난록	23.407985
명행정의록	1862.019467	임신평난록	1868.725029	한조삼성기봉	27.563781
무오연행록	1843.966993	정미가례시일기	1794.487139	윤하정삼문취록	33.265788
벽허담관제언록	1867.81161	조야기문	1764.18373	명행정의록	33.336713
병자록	1767.580819	조야첨재	1849.432562	엄씨효문청행록	34.391305
사문대의록	1767.979588	조야회통	1855.332041	벽허담관제언록	34.903389
산성일기_병자	1776.561644	학석집(한글)	1877.044336	유이양문록	38.701650
선보잡락언해	1900.322198	한조삼성기봉	1873.623453	......
선택요람	1817.476385	현몽쌍룡기	1850.699927	사문대의록	82.510204
선부군언행유사	1810.446473	홍경내전	1834.164509	병자록	82.621584
신미록	1818.575	화씨충효록	1856.086874	홍경내전	84.433661
실록초본	1898.757509	화정선행록	1853.198077	신미록	84.848653
엄씨효문청행록	1852.775256	천주실의	1897.263221	조야기문	84.979514
				열성후비지문	89.220922
				정미가례시일기	89.991773
				계해반정록	110.655007

	title	pred	pred_lstm
0	계해반정록	1853.388672	1821.7692
1	계해반정록	1876.112305	1856.4450
2	계해반정록	1609.185303	1618.0725
3	계해반정록	1710.976562	1600.5903
4	계해반정록	1885.3573	1891.7070
...	...	...	...
49474	천주실의	1893.72644	1900.3711
49475	천주실의	1896.940674	1890.8717
49476	천주실의	1887.2854	1897.4349
49477	천주실의	1897.868408	1891.5488
49478	천주실의	1889.480225	1886.4396
49479 rows × 3 columns

Table 1. Trade-off between the length of a sample and the number of samples

Table 2. Loss of the Bag-of-words unigram model

Table 3. Loss of the regularized Bag-of-words unigram model

Table 4. Loss of the power-increased Bag-of-words unigram model

Table 5. Loss of the bag-of-words bigram model

Table 6. Loss of the CNN model

Table 7. Loss of the RNN model

Table 8. Loss of the Transformer model

Table 9. Unigram graphemes with the highest normalized heat scores

Table 10. Bigram graphemes with the highest normalized heat scores

Table 11. Ground truth and predicted years of the documents of the 15th century

Table 12. Predictions and standard deviations of the LSTM model

Table 13. Predictions and standard deviations of the Transformer model

Table 14. LSTM and Transformer models’ predictions on the Jangseogak documents

Articles

Page Path

DATING HISTORICAL DOCUMENTS USING DEEP LEARNING

Full Article

Abstract

1. SETTING THE SCENE

2. PREPARING THE DATA

3. BAG-OF-WORDS UNIGRAM MODEL

4. ADDING REGULARIZATION TO A BAG-OF-WORDS / UNIGRAM MODEL

5. INCREASING THE MODEL’S POWER

6. BAG-OF-WORDS / BIGRAM MODEL

7. CNN MODEL

8. RNN MODEL

9. DECOMPOSITION INTO GRAPHEMES

10. TRANSFORMER MODEL

11. HEATMAP

12. INFERENCE AND EVALUATION

13. CONCLUSION

REFERENCES

Figure & Data

References

Citations

CITE

Download Citation

Format:

Include:

Figure

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Figure 15

Figure 16

Figure 17

Table 1.

Table 2.

Table 3.

Table 4.

Table 5.

Table 6.

Table 7.

Table 8.

Table 9.

Table 10.

Table 11.

Table 12.

Table 13.

Table 14.

ABOUT

CURRENT ISSUE

ALL ISSUES

PUBLISHING POLICIES

FOR CONTRIBUTORS

E-Submission