Original research
Feasibility of extracting cancer stage and metastasis codes from health insurance claims of outpatients and expressibility in ICD-11: a cross-sectional study using national health insurance data from South Korea
  1. Young-Taek Park1,
  2. Dongwoon Han2,
  3. Kyoung-Hoon Kim3,
  4. Hoguen Kim4,
  5. Hojung Joseph Yoon5,
  6. Chris Lane6,
  7. Byeo-Ri Kim7,
  8. Joo-Yeon Jeong8
  1. 1HIRA Research Institute, Health Insurance Review & Assessment Service (HIRA), Wonju-si, Republic of Korea
  2. 2Department of Preventive Medicine, Hanyang University, Seoul, Republic of Korea
  3. 3Department of Health Administration, Kongju National University, Gongju-si, Republic of Korea
  4. 4Healthcare Review Committee, Health Insurance Review & Assessment Service (HIRA), Seoul, Republic of Korea
  5. 5CEO's office, Moden Medical Group, Minneapolis (MN), Minnesota, USA
  6. 6Health Workforce Analytics and Intelligence, Ministry of Health, Wellington, New Zealand
  7. 7Division of ICD-11 Domestic Implementation, Health Insurance Review & Assessment Service (HIRA), Wonju-si, Republic of Korea
  8. 8Division of Medical Loss Compensation, Health Insurance Review & Assessment Service (HIRA), Wonju-si, Republic of Korea
  1. Correspondence to Dr Dongwoon Han; dwhan{at}hanyang.ac.kr

Abstract

Objectives This study aimed to evaluate the incidence of health insurance claims recording the cancer stage and TNM codes representing tumor extension size (T), lymph node metastasis (N), and distant metastasis (M) for patients diagnosed with cancer and to determine whether this extracted data could be applied to the new ICD-11 codes.

Design A cross-sectional study design was used, with the units of analysis as individual outpatients. Two dependent variables were extraction feasibility of cancer stage and TNM metastasis information from each claim. Expressibility of the two variables in ICD-11 was descriptively analysed.

Setting and participants The study was conducted in South Korea and study participants were outpatients: lung cancer (LC) (46616), stomach cancer (SC) (50103) and colorectal cancer (CC) (54707). The data set consisted of the first health insurance claim of each patient visiting a hospital from 1 July to 31 December 2021.

Results The absolute extraction success rate for cancer stage, based on claims with a recorded stage, was 33.3%. The rates for stage for LC, SC and CC were 30.1%, 35.5% and 34.0%, respectively. The rate for TNM was 11.0%. Compared with CC (the reference group), the relative extraction success rate for stage was lower for patients with LC (adjusted OR (aOR) 0.803; 95% CI 0.782 to 0.825; p<0.0001) but higher for SC (aOR 1.073; 95% CI 1.046 to 1.101; p<0.0001). The relative rates for TNM, compared with that for CC, were 40.7% lower for LC (aOR 0.593; 95% CI 0.569 to 0.617; p<0.0001) and 43.0% lower for SC (aOR 0.570; 95% CI 0.548 to 0.593; p<0.0001). There were limits to expressibility in ICD-11 regarding the detailed cancer stage and TNM metastasis codes.

Conclusion Extracting cancer stage and TNM codes from health insurance claims was feasible, but expressibility in ICD-11 codes was limited. WHO may need to create specific cancer stage and TNM extension codes for ICD-11 because ICD-11 currently has no rules for them.

  • health informatics
  • public health
  • registries
  • information technology
  • oncology

Data availability statement

Data are available upon reasonable request. Data may be obtained from a third party and are not publicly available. Data from this study are national health insurance claims stored on a secure server by the Health Insurance Review & Assessment Service (HIRA) in Korea and cannot be used without permission. Individual researchers may request permission for their use through HIRA per existing protocols and administrative procedures.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

STRENGTHS AND LIMITATIONS OF THIS STUDY

  • One of the key strengths of this study resides in its robust quantitative analysis, achieved by using a substantial volume of empirical health insurance claim data, ensuring a comprehensive and reliable assessment of the research objectives.

  • The study’s multidisciplinary approach significantly enhances its credibility and depth, drawing on the expertise of a diverse team that includes medical specialists, a pathologist, certified health information managers with in-depth knowledge of ICD-11, and statisticians, all contributing as coauthors.

  • A notable limitation of this study is that its findings, derived from health insurance claims in South Korea, may not be generalisable to other data sources, such as national cancer registries.

Introduction

Recently, the World Health Organization (WHO) released the International Classification of Diseases 11th revision (ICD-11) for Mortality and Morbidity Statistics classification and recommended replacing the ICD-10 with the ICD-11 scheme.1 Many nations are trying to introduce the new coding scheme in their healthcare systems.2–4 ICD-11 represents a radical change in recording diagnoses from a basic diagnosis to a more elaborate and flexible system of recording, including the mechanisms of stem codes, cluster coding and post-coordination.5–8

This ICD-11 has various extension codes supporting specific clinical abstraction and classification9 and introduces a postcoordination scheme linking two or more codes into a cluster, which explains a clinical concept.10 By adding extension codes to the main stem codes, medical professionals can record and express more detailed diagnoses of patients in their medical charts or other documents depending on the focal observation point and location.

Given that cancer is a major cause of death worldwide, reaching almost 10 million deaths in 2020,11 we can expect that there will be many cancer cases requiring changes in diagnosis recording in the transition from ICD-10 to ICD-11. In recording cancer diagnoses, it is very crucial to record the stage of cancer progress and biological metastasis status because medical professionals choose appropriate cancer treatments at the point of care based on knowledge of the correct stage and metastasis.

Assuming that ICD-11 has a notation scheme expressing the cancer stage and aetiological facts, is it possible to extract cancer stage and TNM metastasis (representing tumor extension size (T), lymph node metastasis (N), and distant metastasis (M)) from any national or standardised health insurance claims for future ICD-11 use? Furthermore, could we record that information in ICD-11 formats? This study defines the feasibility of extracting cancer stage and TNM information as the success or failure of extracting that information from standardised documents or national health insurance claims. Given a set of health insurance claims, what proportion contains stage codes and what proportion contains TNM codes? Can these codes be recorded in ICD-11? These are the two main research questions of this study.

There have been several studies on ICD-11. Most studies of ICD-11 have dealt with transitioning diagnosis codes from ICD-10 to ICD-11 and mapping between ICD-10 and ICD-11 or comparing two coding results in terms of accuracy.12–14 Some studies have targeted specific topics in ICD-11 such as pain, mental health and psychiatric disorders.15 16 These pioneering studies have clearly represented significant achievements in the development and advancement of ICD-11 codes. However, there have been few studies on the feasibility of extracting the extension codes from documents, explicitly focusing on cancer and investigating the expressibility in ICD-11. This might be due to the relatively short period of time since ICD-11 was released and to a lack of empirical data. Thus, there is a need for research on the degree to which the extension codes for ICD-11 can be extracted from some nationalised health insurance claims.

Studies of ICD-11 specifically focusing on cancer are important from several perspectives. First, studies on ICD-11 and cancer are scarce, although cancer is considered a significant disease category. Second, research on this topic could explore various information sources for extracting the extension codes of cancer from health insurance claims. The results would provide information on the feasibility of constructing longitudinal data on an ICD-11 basis from data based on ICD-10 codes. Third, the findings from this study could be used for future coding plans and for advancing the current ICD-11 as it pertains to cancer. Many international scholars and colleagues may be interested in methods for extracting information from medical records, clinical documents and health insurance claims and transitioning to the ICD-11 format.

The objective of this study was to test the feasibility of extracting information from the nationalised health insurance claims and to figure out how to express it in the ICD-11 format targeting several cancer types. In this study, we also hypothesised that there would be variations in extracting information on cancer stage and TNM metastasis depending on the type of cancer.

Methods

Study design and setting

A cross-sectional study design was adopted. The unit of analysis was the individual patient with lung, stomach or colorectal cancer (CC). This study used national outpatient health insurance claims and selected the first claim of each patient who visited a hospital for one of those cancers from 1 July to 31 December 2021.

The study was conducted in South Korea, which has adopted a national health insurance system. This system requires each healthcare institution (HI) to fill out a health insurance claim form and submit it to the Health Insurance Review and Assessment Service (HIRA) in order to obtain reimbursement from the National Health Insurance Service (NHIS). HIRA reviews all health insurance claims, including both inpatient and outpatient services and, thus, this study could use HIRA’s data. In the form, the HI is required to provide one primary diagnosis along with an unlimited number of secondary diagnoses, and the claim form includes fields for cancer stage and TNM metastasis. Korea currently uses the Korean Standard Classification of Diseases, 8th revision (KCD-8), which is slightly modified from ICD-10.

The following describes how KCD-8 and ICD-10 codes differ for lung, stomach and colorectal cancers. For lung cancer (LC), ICD-10 has codes: C33, C340-C343, C348, C349. KCD-8 has the same codes as ICD-10 except that C340, C341, C343, C348 and C349 have an additional last digit of 0, 1 or 9. For stomach cancer (SC), ICD-10 has codes: C160-C166, C168, C169. KCD-8 has the same codes as ICD-10, but the codes have an additional last digit of 0, 1 or 9. For CC, ICD-10 has the same codes as KCD-8: C180-C189, C19 and C20. This study used three-digit codes if KCD-8 and ICD-10 had the same three-digit codes and the first four digits in other cases. Thus, this study used a data set in which the KCD-8 codes match those of ICD-10. ICD-10 and KCD-8, the modified version of ICD-10, do not carry any information on stage and TNM. However, the national health insurance claim form has fields recording stage and TNM information. The current study used this information.

For tumours, where specified under the national health insurance programme, the HI should also provide the stage of cancer or TNM codes representing tumour extension status and size (T), metastasis status of regional lymph node (N) and presence of distant metastasis (M) to other human body or organ areas, respectively, in order to get reimbursement for specific drugs or treatments from NHIS. Under the current rules, the HI can record either stage, TNM or both. For example, if a patient has been diagnosed with ‘malignant neoplasm of the pylorus, advanced’, ‘C1641’ (KCD-8) (‘malignant neoplasm: pylorus’, ‘C16.4’ (ICD-10)) and he/she is in the tumour stage of ‘stage IIIA’, then the HI should put ‘C1641/3A’ in the claim form of HIRA. If the patient has a TNM classification of T2aN2M0 (this is one of international standards for TNM),17–20 then the HI should put ‘C1641/T2a/N2/M0’. Unlike the international standards, ‘/’ is used in Korean insurance claims. Thus, Korea follows international standards regarding tumour stage and TNM codes except this ‘/’ code. These things are specified in the coding rules of HIRA regarding the claim submission guidelines.
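As a minimal sketch of how the claim notation described above could be parsed, the following Python function splits a diagnosis field on ‘/’. The function name, the dictionary keys and the handling of unexpected input are illustrative assumptions and are not taken from HIRA's guidelines. The case in which both stage and TNM are recorded is not covered, because its layout is not described here.

```python
def parse_claim_code(field: str) -> dict:
    """Split 'diagnosis/stage' or 'diagnosis/T/N/M' into its parts.

    Hypothetical helper: the key names and fallback behaviour are assumptions.
    """
    parts = [p.strip() for p in field.split("/")]
    diagnosis, rest = parts[0], parts[1:]
    result = {"diagnosis": diagnosis, "stage": None, "tnm": None}

    if len(rest) == 1 and rest[0]:
        # A single trailing element is read as a stage code, e.g. '3A'.
        result["stage"] = rest[0]
    elif len(rest) == 3:
        t, n, m = rest
        # Accept the triple only if each element carries its expected prefix.
        if t[:1].upper() == "T" and n[:1].upper() == "N" and m[:1].upper() == "M":
            result["tnm"] = {"T": t, "N": n, "M": m}
    return result


if __name__ == "__main__":
    print(parse_claim_code("C1641/3A"))
    print(parse_claim_code("C1641/T2a/N2/M0"))
```

The two calls use the examples given in the text: the first yields a stage of ‘3A’ with no TNM, and the second yields the T, N and M elements with no stage.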

Sample size and study participants

Regarding the sample size, this study has binary dependent variables (success or failure of extracting stage codes and TNM notation). The required sample size was calculated on the basis of a formula commonly used as a generalised rule,21 22 $n = 4p(1-p)/B^{2}$, where $n$ is the sample size, $p$ is the probability of extracting cancer stage information from the population (this study used $p=0.5$ conservatively following statistical recommendations21 22), and $B$ is the bound on the error of estimation (this study applied $B=0.05$). The required sample size is 400 ($4 \times 0.5 \times 0.5 / 0.05^{2} = 400$). However, our study had many more participants than the calculated sample size and so this calculated sample size was not a consideration.
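The calculation above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and the rounding guard against floating-point noise are assumptions added here, not part of the study's analysis, which was run in SAS.

```python
import math


def required_sample_size(p: float = 0.5, bound: float = 0.05) -> int:
    """Sample size n = 4 * p * (1 - p) / B**2 for a binary outcome."""
    n = 4 * p * (1 - p) / bound ** 2
    # Round away tiny floating-point error before taking the ceiling.
    return math.ceil(round(n, 9))


if __name__ == "__main__":
    print(required_sample_size())  # conservative p = 0.5, B = 0.05
```

With the study's inputs (p=0.5, B=0.05) the function returns 400, matching the figure stated in the text.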

Figure 1 presents how all the health insurance claims data were retrieved and processed for this study. To select study participants, the following steps were taken. First, this study pulled out all the outpatient health insurance claims of patients who visited hospitals due to three types of cancer (lung, stomach and colorectal cancers) from 1 July to 31 December 2021. Second, only the health insurance claims submitted from tertiary hospitals and general hospitals under national health insurance and Medical Aid were selected. Third, the Charlson comorbidity index (CCI) was calculated, and information was extracted on patients’ residential location and on whether patients had inpatient records. Fourth, the data were cleaned. For example, if stage fields were recorded as ‘3C’ or ‘3c’ in the case of SC, then this study corrected the fields to ‘IIIC’. The fields filled with TNM metastasis data were also checked. Fifth, the claims were sorted in chronological order for each patient and the first claim was selected. Finally, the cleaned (checked) data were analysed by type of cancer. In terms of diagnosis codes, C33 and C34 were selected for LC, C16 for SC and C18, C19 and C20 for CC. This study focused on three specific cancer types, chosen primarily for their high annual incidence rates among both men and women in Korea, as evidenced by cancer incidence data spanning from 1999 to 2020. This approach ensures that the study addresses cancers of significant public health relevance and provides insights into the most common cancer types within the Korean population.23 It was expected that studies on this issue would contribute to the development of more accurate records of patient diagnoses. Cancer itself has different incidence, prevalence, treatment and aetiologies depending on the type and demographic characteristics of the patient.24 25 Hence, this study controlled for patients’ individual characteristics as covariates.
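The cleaning and selection steps described above can be sketched as follows. The record layout (`patient_id`, `visit_date`, `stage`) and both function names are hypothetical, chosen only for illustration; the actual processing was done on HIRA's data warehouse, and its field names are not given in the text.

```python
from datetime import date

# Leading Arabic digit -> Roman numeral, so that '3C' or '3c' becomes 'IIIC'.
ARABIC_TO_ROMAN = {"0": "0", "1": "I", "2": "II", "3": "III", "4": "IV"}


def normalise_stage(raw: str) -> str:
    """Upper-case a stage code and convert a leading digit to a Roman numeral."""
    text = raw.strip().upper()
    if text and text[0] in ARABIC_TO_ROMAN:
        return ARABIC_TO_ROMAN[text[0]] + text[1:]
    return text


def first_claim_per_patient(claims: list) -> list:
    """Sort claims chronologically and keep the earliest claim of each patient."""
    first = {}
    for claim in sorted(claims, key=lambda c: c["visit_date"]):
        first.setdefault(claim["patient_id"], claim)
    return list(first.values())


if __name__ == "__main__":
    claims = [
        {"patient_id": "A", "visit_date": date(2021, 9, 1), "stage": "3c"},
        {"patient_id": "A", "visit_date": date(2021, 7, 5), "stage": "3C"},
        {"patient_id": "B", "visit_date": date(2021, 8, 2), "stage": ""},
    ]
    for claim in first_claim_per_patient(claims):
        print(claim["patient_id"], claim["visit_date"], normalise_stage(claim["stage"]))
```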

Figure 1

Data handling process. ¹Health Insurance Review & Assessment (HIRA) Service (Korea); ²TNM: cancer size(T), lymph node metastasis (N), distant metastasis (M); ³International Classification of Diseases.

To reduce and address bias inherent in the claim records, professional experts participated as coauthors: three medical doctors, two certified health information managers and one statistician with a doctoral degree. In particular, one of the coauthors is a retired faculty member in oncology. Whenever issues arose related to coding or professional medical knowledge, his opinion was sought. This study also tried to reduce coding errors through a visual inspection of each variable’s frequency.

In all health insurance claim forms of the national health insurance systems in Korea, it is mandatory for HIs to put patients’ diagnosis codes in the claim form. National health insurance rules also require HIs to include stage and TNM classification status in health insurance claims depending on health insurance reimbursement guidelines.

Data sources

Data were sourced from both HIRA and the Ministry of the Interior and Safety (MOIS). All the clinical information was from HIRA. All the health insurance claims were retrieved from the HIRA’s data warehouse systems. However, HIRA data only provide the patient’s address in terms of a legal district area code number called ‘Legal Dong Code’, which does not provide information on whether a location is urban or rural. By merging HIRA’s data with data having the legal district area code number and location code from MOIS, this study identified patients’ location in terms of whether patients lived in urban or rural areas and used it as one of the control variables. Data from MOIS are publicly available (https://www.code.go.kr/stdcodesrch/codeAllDownloadL.do). Regarding potential sources of bias coming from handling and managing data sets, this study directly used a raw data set without any data correction to minimise this effect, because any corrections might have changed the study results. This study also considered that we could not know the clinical situation and patients’ status at the time of care. Regarding the confidentiality and risk issues of the study data, the project director of this study, and the corresponding author, received the IRB training and its certificate. The data analysis was conducted in confidential settings, and any risks to study participants were minimised by using their health insurance administrative data.

Dependent and independent variables

The dependent variables of this study were success or failure in extracting the cancer stage and the metastasis information (ie, TNM), respectively. Each was measured on a binary scale (yes (1) or no (0)): ‘1’ if a patient had a claim from which it was possible to extract the codes regarding cancer stage or TNM, otherwise ‘0’. For example, if it was possible to extract information indicating the stage from a single health insurance claim, then we defined it as ‘feasible’ to extract cancer stage information and coded it as ‘success’, while inability to extract the information was coded as ‘failure’. We used the same method to determine the feasibility of extracting TNM classification. This study had two outcome variables. One is for the cancer stage codes. Cancer stage codes can also be derived from TNM information. For example, T1N0M0 means stage IA in the staging system for SC of the 8th edition of the American Joint Committee on Cancer.26 However, this study did not count such cases as stage codes, in order to simplify reporting of the study results, although it was possible to count them. The other outcome variable was TNM codes in terms of whether it was possible to extract TNM information from health insurance claims. This study investigated all stage and TNM codes from fields of the health insurance claim form in Korea. In these fields, HIs were supposed to input information on the cancer stage and the TNM classification. Table 1 presents all the stages and the TNM classification codes.
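The binary coding of the two outcome variables can be illustrated with a short Python sketch. The function and key names are hypothetical. In line with the definition above, a stage is not inferred from TNM, and an empty or missing field is coded as 0.

```python
from typing import Optional


def extraction_outcomes(stage_field: Optional[str], tnm_field: Optional[str]) -> dict:
    """Code each outcome as 1 (extractable) or 0 (missing or empty field)."""

    def has_value(value: Optional[str]) -> int:
        return int(value is not None and value.strip() != "")

    # Stage is deliberately NOT derived from TNM, mirroring the study's choice.
    return {
        "stage_extracted": has_value(stage_field),
        "tnm_extracted": has_value(tnm_field),
    }


if __name__ == "__main__":
    print(extraction_outcomes("IIIA", None))
    print(extraction_outcomes("", "T2a/N2/M0"))
```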

Table 1

All stage and TNM classification types retrieved from the fields of national health insurance claims forms

The only targeted independent variable was the type of cancer. There were three types of cancer: lung, stomach and CC. Two dummy variables were created and CC was chosen as the reference group. There were several control variables. They were the patient’s general characteristics, namely age, sex, type of medical coverage (health insurance or Medical Aid paid by the Korean government), type of HI, the locality where patients lived (eg, urban or rural area) and CCI. A location was considered rural if patients lived in an area with fewer than 100 000 residents, while other locations were considered urban. This variable that distinguishes urban and rural areas is the administrative district code, which is determined by the population of the local area in Korea. This variable is also frequently used in academic research related to Korea.27 For patients’ comorbidity index, the CCI representing 1-year mortality was calculated based on 19 disease groups.28 Each group has weights of 1, 2, 3 and 6, depending on type of disease and the sum of these weights constitutes the CCI. After initial development, several other scholars updated the original methods.29 30 This study used the latest version, validated by several other scholars, which was proposed by Quan, with 17 disease groups.30 For calculating CCI, this study used health insurance claims from a diagnosis database table, TWJHE400, of HIRA between 1 January 2020 and 31 December 2021.
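To make the weighted-sum construction of the CCI concrete, the sketch below adds up the weights of the disease groups present for a patient, counting each group once. The four groups and weights shown are an illustrative subset of the original Charlson weights, included here as an assumption. The study used the 17-group version proposed by Quan, which is not reproduced here, and the group labels and function name are hypothetical.

```python
# Illustrative subset only; NOT the 17-group Quan version used in the study.
ILLUSTRATIVE_WEIGHTS = {
    "myocardial_infarction": 1,
    "hemiplegia": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_solid_tumour": 6,
}


def comorbidity_index(groups_present, weights=ILLUSTRATIVE_WEIGHTS) -> int:
    """Sum the weights of the distinct disease groups recorded for a patient."""
    return sum(weights[g] for g in set(groups_present) if g in weights)


if __name__ == "__main__":
    # A group recorded on several claims is still counted once.
    print(comorbidity_index(["hemiplegia", "hemiplegia", "metastatic_solid_tumour"]))
```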

Statistical analysis

As a first step, descriptive statistics were generated for the independent and dependent variables according to the type of cancer (lung, stomach, or colorectal). The three groups of patients were then compared in terms of differences in frequency or mean, depending on the variable. Analysis of variance was used to test mean differences and the Mantel-Haenszel χ2 test for frequency differences. Health insurance claims with missing data for stage and TNM fields were recorded as ‘0’ meaning that it was not possible to extract the extension codes regarding cancer stage or TNM for ICD-11.

Correlations among the covariates were checked to determine whether any needed to be excluded from the final model. There were no high correlations between covariates. For the final model, logistic regression was used to investigate the difference in success in extracting stage or TNM among cancer types after controlling all covariates.

Finally, ‘the extraction success rate’ was defined as ‘an absolute rate’ or ‘a crude rate’ calculated without controlling any control variables. In contrast, we also defined ‘the relative rate’ as a rate having a comparable reference and being calculated after controlling other control variables in the model. We use two terms, ‘absolute extraction success rate’ and ‘relative extraction success rate’, in order to avoid any confusion. But we also use two terms, ‘the absolute rate’ and ‘the relative rate’, after their initial use, respectively, in order to avoid wording redundancy. SAS V.9.4 was used in all statistical analyses.

Patient and public involvement

None.

Results

General characteristics of the study participants

Table 2 presents the general characteristics of the study participants. There were differences in sex, age, local area, financing source, inpatient records, the type of HI and severity among the three types of patients with cancer. Regarding the dependent variables, the absolute extraction success rate of the cancer stage based on stage information without TNM was 33.3%. The rates for stage for LC, SC and CC were 30.1%, 35.5% and 34.0%, respectively. The rate for TNM was 11.0%. The rates for TNM for LC, SC and CC were 8.8%, 8.9% and 14.7%, respectively.
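As a consistency check on the stage figures, the overall absolute rate should equal the patient-weighted average of the three cancer-specific rates. The sketch below uses the participant counts reported in the abstract and the rounded stage rates reported above. The variable and function names are illustrative, and the calculation is not part of the original analysis.

```python
# Participant counts by cancer type and rounded stage extraction rates.
patients = {"LC": 46616, "SC": 50103, "CC": 54707}
stage_rate = {"LC": 0.301, "SC": 0.355, "CC": 0.340}


def pooled_rate(counts: dict, rates: dict) -> float:
    """Patient-weighted average of group-specific rates."""
    total = sum(counts.values())
    successes = sum(counts[group] * rates[group] for group in counts)
    return successes / total


if __name__ == "__main__":
    print(sum(patients.values()))
    print(round(pooled_rate(patients, stage_rate) * 100, 1))
```

The counts sum to 151 426 and the pooled stage rate rounds to 33.3%, in agreement with the reported overall figure.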

Table 2

General characteristics of the study participants (N=151 426)

Extraction of stage and TNM codes by type of cancer

Table 3 presents the association between cancer type and success in extracting the cancer stage and TNM codes, after controlling for patients’ characteristics. For stage codes, the odds of success were 20% lower for patients with LC than in patients with CC (adjusted OR (aOR) 0.80, 95% CI 0.782 to 0.825, p<0.0001), but 7% higher for patients with SC than for patients with CC (aOR 1.07, 95% CI 1.046 to 1.101, p<0.0001). For TNM codes, the relative success rate, compared with patients with CC, was 40.7% lower in patients with LC (aOR, 0.593; 95% CI 0.569 to 0.617; p<0.0001) and 43.0% lower in those with SC (aOR 0.570; 95% CI 0.548 to 0.593; p<0.0001).
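The percentages quoted above follow directly from the adjusted ORs: the percentage difference in odds relative to the reference group is (aOR − 1) × 100. A minimal sketch of that conversion is given below. The function name is an assumption, and the result describes a difference in odds, not in the absolute extraction rates.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage difference in odds relative to the reference group."""
    return round((odds_ratio - 1.0) * 100.0, 1)


if __name__ == "__main__":
    for label, aor in [("stage, LC", 0.803), ("stage, SC", 1.073),
                       ("TNM, LC", 0.593), ("TNM, SC", 0.570)]:
        print(label, percent_change_in_odds(aor))
```

For the TNM models this gives −40.7 for LC and −43.0 for SC, matching the text. For the stage models it gives −19.7 and 7.3, which the text rounds to 20% lower and 7% higher.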

Table 3

Extraction feasibility of cancer stage and TNM codes by type of cancer

Frequency of cancer stage, TNM metastasis, and their expressibility in ICD-11

Table 4 presents the five most frequent combinations of cancer type, stage and TNM. Stage 1 was the most frequent stage for each of the three cancer types. The next highest frequency was for stage 0 in LC and SC, but for stage 2 in CC. A generalisable pattern was not observed for TNM among the three types of cancer. Regarding transferability of stage and TNM metastasis, only a few codes could be recorded in ICD-11 by using &XS76, &XS9R and &XS1G. In other cases, there was no way to record stage in ICD-11. For TNM metastasis, none of the current international standards could be recorded in ICD-11 form.

Table 4

The five most frequent combinations of stage and TNM by type of cancer

Discussion

This study tested the feasibility of extracting the extension codes for ICD-11 from documents based on ICD-10, focusing on patients with cancer. The study used outpatient health insurance claims for patients with three types of cancer: lung, stomach and colorectal. It was found that the extraction success rate for the cancer stage was 33.3%, with the highest rate for SC (35.5%), followed by CC (34.0%) and LC (30.1%). The extraction success rate for TNM was 11.0%, with the highest rate for CC (14.7%), followed by SC (8.9%) and LC (8.8%).

The extraction success rate for the cancer stage was approximately 33%. Although it is debatable whether the rate is high enough or not, this result aligns with those of previous studies on whether ICD-11 extension codes could be extracted from any circumstances and postcoordination using data with ICD-10.9 13 This result from the selected sample of claims may apply to all claims for the three cancer types in the Korean health insurance system. If the government or HIRA required HIs to provide the exact codes as a mandatory rule, then the extraction success rate would increase. Although the extraction success rate is dependent on these conditions, our study result clearly suggests one thing. If we currently need longitudinal data after implementing ICD-11 regarding cancer stages, it can be expected that 33% of health insurance claims for lung, stomach and colorectal cancers may have extension codes in ICD-11.

The extraction success rate for TNM was very low. The possibility of extracting TNM information from health insurance claims was approximately 11%. The reason why the rate for TNM was much lower than for stage might be due to health insurance claim filing policies in Korea, as mentioned above. HIRA requires HIs to file claims, including either cancer stage or TNM codes. From the standpoint of medical professionals, recording the cancer stage code is simpler than recording TNM. In addition, it might be related to the general international guidelines for TNM staging. The guidelines say that ‘TNM staging applies only to cases that have been microscopically confirmed to be malignant’.31 This requirement could place a burden on HIs and, thus, they might choose the simpler method of recording the cancer stage code rather than putting TNM in health insurance claim forms. This is why the rate for the stage may be higher than that for TNM. When providing medical treatment for cancer, recording patients’ biological and clinical status and the progress of cancer is a complex task. If the government or insurance agencies required medical professionals to record their detailed clinical status, it is likely that compliance with these requirements would be much lower than with the existing simpler guidelines.

Regarding the significance of our findings, it is not straightforward to categorise them as significant or to definitively state whether a 33% extraction success rate extracting cancer stage codes is high. However, it is evident that, in the context of Korea, up to 33% of cancer stage and 11% of TNM information can be successfully retrieved from the health insurance claim data of patients with cancer. This benchmark provides a valuable reference point for other nations with similar national health insurance programmes or health services. By comparing their data extraction rates to ours, they can gauge the effectiveness of their systems in extracting cancer stage and TNM information from centralised datasets. Such comparisons could yield important research implications.

Regarding the study’s clinical relevance, the health insurance data were not linked to clinical data in this study. This is a major limitation. However, a small number of medical specialists working in hospitals were asked, through informal meetings, to give their opinions when the research results were interpreted. They said that most electronic medical records or clinical data for patients with cancer have information on cancer stage and TNM metastasis. To account for the low cancer stage and TNM incidence rate in health insurance claims, most of the medical specialists we contacted suggested that health insurance rules might make this difference. In Korea, there are certain drugs or treatments that must be used or provided at certain cancer stages. National health insurance rules require HIs to put the stage of cancer on the claim form after using those drugs to receive reimbursement from the NHIS. Thus, if health insurance rules mandatorily require HIs to put stage and TNM information in health insurance claims, then the extraction or incidence rate would be high because of those rules. However, if a procedure or action by the HIs has nothing to do with any reimbursement, then the extraction or incidence rate would be low because there is no reason for HIs to put cancer stage and TNM information in health insurance claims. Thus, claims with missing data in the stage and TNM fields would likely be due to healthcare provision not directly related to the reimbursement rules. Subsequent studies are necessary to evaluate the differences in cancer stage and TNM metastasis recording rates between health insurance claim data and clinical data.

Interestingly, this study found that there was considerable variation in figures for cancer stages and TNM according to cancer type. The differences might be related to the complexity of medical decision-making and administrative support for filing health insurance claims, which may vary by type of cancer. These factors might affect the stage and TNM coding differences between CC and LC. For example, surgeons may more easily detect stage and TNM information in cases of CC than in cases of the two other cancers, which may contribute to higher recording rates in health insurance claims. However, this is tentative speculation, and these discrepancies merit further investigation due to the lack of previous studies.

Regarding the frequency of cancer TNM metastasis, the fourth most frequent TNM status for LC was T1N0M0. Cases of T1N0M0 seem to include all of T1mi, T1a and T1b. In all these cases, there is an invasion of tumour cells. T1mi refers to a small invasion that can be identified with a microscope. T1a is less than 2 cm, and T1b is within 2–3 cm. Medical professionals often could not differentiate the size and thus recorded T1N0M0. For SC and CC, there was a high frequency of TxN1M0. Tx denotes a case where the T stage cannot be determined. The N1 lymph node spread should be observed only in LC, but it was also observed in patients with SC and CC. This might have been due to complicated filing guidelines or mistakes in filling out the insurance claim forms, as is frequently observed in medical bills.32 Medical professionals might provide TNM information without differentiating the types of cancer. Currently, HIRA does not present or require specific guidelines other than using international standards on stages and TNM. To reduce coding errors, HIRA may need to simplify its claim filing guidelines on TNM along the lines of the WHO, which uses five codes: XS76 (stage 0), XS1G (stage 1), XS4p (stage 2), XS6H (stage 3) and XS9R (stage 4). If HIRA continues to adopt the complicated international standards described in other studies,17 20 then frequent coding errors may continue.
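To illustrate how a TNM string such as T1N0M0 or TxN1M0 breaks down into its T, N and M components, the following Python sketch may help. It is a simplified illustration and not the procedure used in this study: the function name `parse_tnm` and the regular expression are hypothetical, and the pattern assumes a compact form that may not match the layout of HIRA claim fields.

```python
import re

# Hypothetical, simplified pattern for compact TNM strings (an assumption,
# not the HIRA claim format): T, N and M each followed by a category value.
# "x" means the category cannot be determined; "mi" and letter suffixes
# cover subcategories such as T1mi, T1a and T1b.
_TNM_PATTERN = re.compile(
    r"^T(?P<t>x|is|[0-4](?:mi|[a-d])?)"
    r"N(?P<n>x|[0-3][a-c]?)"
    r"M(?P<m>x|[01][a-c]?)$",
    re.IGNORECASE,
)


def parse_tnm(value):
    """Split a compact TNM string into its components.

    Returns a dict with keys "t", "n" and "m" (lower-cased values),
    or None when the string does not fit the simplified pattern.
    """
    match = _TNM_PATTERN.match(value.strip())
    if match is None:
        return None
    return {key: part.lower() for key, part in match.groupdict().items()}


if __name__ == "__main__":
    for raw in ["T1N0M0", "TxN1M0", "T1miN0M0", "stage II"]:
        print(raw, "->", parse_tnm(raw))
```

Under these assumptions, a record coded T1N0M0 cannot be told apart from one that should have been T1mi, T1a or T1b, because the subcategory is simply absent from the parsed T value.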

In addition, this study found that ICD-11 has no specific extension codes for biological node metastasis in relation to cancer other than the five stage codes. This is an issue that needs attention from the WHO. The tumour spread staging scale values of ICD-11 are XS76 for stage 0, XS1G for stage I, XS4p for stage II, XS6H for stage III and XS9R for stage IV. However, extension codes for more specific stages, such as IA1, IIA, IIIA and so on, do not exist. Moreover, there is no code for TNM classification within the scheme of ICD-11. The WHO needs to consider more specific guidelines or suggestions on this issue, as mentioned in one of the previous studies.7 To minimise difficulties arising from drastic changes in the diagnosis coding scheme, the simplest approach would be for the WHO to use the current international standards at the end of the stem code of ICD-11.
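The loss of detail described above can be sketched in Python. This is a minimal illustration only: the five codes are those listed in the text, while the function name `stage_to_icd11` and the substage pattern are hypothetical and are not part of ICD-11 or of this study's method.

```python
import re

# ICD-11 tumour spread staging scale values as listed in the text.
ICD11_STAGE_CODES = {
    "0": "XS76",
    "I": "XS1G",
    "II": "XS4p",
    "III": "XS6H",
    "IV": "XS9R",
}

# Hypothetical pattern: a main stage (0 or a Roman numeral I to IV)
# followed by an optional substage suffix such as A, B or A1.
_STAGE_PATTERN = re.compile(r"^(0|IV|III|II|I)([A-C][0-9]?)?$")


def stage_to_icd11(stage):
    """Map a stage label to an ICD-11 stage extension code.

    Returns (code, substage_lost), where substage_lost is True when the
    label carried a substage (for example IIIA) that the five codes
    cannot express. Returns None for labels outside the assumed pattern.
    """
    match = _STAGE_PATTERN.match(stage.strip().upper())
    if match is None:
        return None
    main_stage, substage = match.group(1), match.group(2)
    return ICD11_STAGE_CODES[main_stage], substage is not None


if __name__ == "__main__":
    for label in ["0", "IA1", "IIA", "IIIA", "IV"]:
        print(label, "->", stage_to_icd11(label))
```

In this sketch, IA1 and a plain stage I both map to XS1G, which shows the granularity that would be lost without additional extension codes.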

Strength and limitations

This study has two main strengths. First, it applies a quantitative statistical analysis to a large dataset of empirical health insurance claims from the national health insurance programme. This provides a robust foundation for the study and can help future research estimate initial extraction rates under similar conditions. Second, the coauthors form a multidisciplinary team of medical specialists, oncologists, certified health information managers well versed in ICD-11, and statisticians, whose combined expertise supports the validity, reliability and relevance of the study’s findings.

However, this study has a number of limitations. First, regarding TNM, this study used the raw data set directly without any data correction, because any correction might have changed the study results and we could not know the clinical situation or the patients’ status at the time of care. Consequently, the accuracy of the ICD-11 coding and mapping relative to ground truth cannot be ascertained. Second, testing the accuracy of the study results would require comparing the health insurance claim data with clinical data; this study used only the claim data and so could not make that comparison. Third, this study lacked appropriate methods for evaluating the accuracy of the extracted codes. Issues such as upcoding and billing fraud, which are inherent in health insurance claims and billing data,33–35 could not be evaluated. However, measures were taken to minimise inaccuracies such as typographical errors in coding, as detailed in the research methods section, to enhance the reliability of the retrieved data. Fourth, this study could not consider another cancer classification, the ICD for Oncology, third edition (ICD-O-3), proposed by the WHO, which has anatomical and morphological coding systems. This is because HIRA’s claim form only considers stage and TNM metastasis following international guidelines other than ICD-O-3. Finally, the interpretation of the study results may be limited to health insurance claims in Korea, and other data sources, such as national cancer registries, may produce different results. However, similar approaches should be applicable to other nations that have health insurance claims or standardised clinical documents at the national level.

Conclusion

ICD-11 has a more complex coding system than ICD-10. As an exploratory study for the future use of longitudinal data, this study examined whether it is feasible to convert healthcare data based on ICD-10 to ICD-11, focusing on stage and TNM node metastasis of LC, SC and CC. This study showed the feasibility of extracting information on cancer stage and biological TNM from ICD-10 records for use with ICD-11. The success rate in extracting stage and biological TNM information was approximately 33% and 11%, respectively, although the rate varied depending on the type of cancer. Although extracting the codes was feasible, it should be kept in mind that the rate could improve further depending on insurance policies and reimbursement guidelines. Furthermore, this study highlighted the need to specify more detailed protocols on cancer stage and biological TNM in ICD-11, because the specific codes mentioned above are missing from ICD-11. It is recommended that the WHO consider the current TNM structure (international standards) as a basis for extension codes in ICD-11.

Data availability statement

Data are available upon reasonable request. Data may be obtained from a third party and are not publicly available. Data from this study are national health insurance claims stored on a secure server by the Health Insurance Review & Assessment Service (HIRA) in Korea and cannot be used without permission. Individual researchers may request permission for their use through HIRA per existing protocols and administrative procedures.

Ethics statements

Patient consent for publication

Ethics approval

The study was approved by the Institutional Review Board in Korea on 12 May 2022 (IRB number: 2022-049-001).

Acknowledgments

We thank the HIRA Research Institute, Health Insurance Review and Assessment Service (HIRA) for providing the insurance claim data and their administrative and technical support.

References

Footnotes

  • Twitter @dwhan2

  • Contributors Y-TP had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. Y-TP is the guarantor of the study. Y-TP and K-HK conceptualised the study and design. Y-TP, DH and HK retrieved and analysed the data. Y-TP, DH, HK, HJY, B-RK and J-YJ conducted the literature review. Y-TP, DH, HK, HJY, CL, B-RK and K-HK contributed to data interpretation. Y-TP, DH, CL and K-HK wrote the first draft of the manuscript. Y-TP, DH, HJY, CL, B-RK, J-YJ and K-HK reviewed and edited the first draft and significantly contributed to the manuscript. All authors have read and agreed to the published version of the manuscript.

  • Funding This study was supported by the Health Insurance Review and Assessment Service (HIRA)’s internal research funds in South Korea. Award/grant number N/A.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.