POTENTIAL APPLICATIONS OF LINGUISTIC STEGANOGRAPHIC TRIGGER-CONTAINERS FOR TEXT WATERMARKING PURPOSES

The paper analyzes the feasibility of using linguistic steganographic trigger-containers as a means of linguistic-based text watermarking. The proposed approach is based on previous experimental research on how Russian native speakers perceive text junctures. We posit that the specific minimal text modifications addressed in the paper may be used for text watermarking with the aim of tracking information leaks for the purposes of taking legal action, enforcing non-disclosure agreements, and testing for internal vulnerabilities. We analyze the viability of altering paragraph juncture points in Russian texts and, through the corresponding trigger-containers, of using this alteration as a means of linguistic watermarking.


Introduction
There are two main approaches to protecting sensitive information: cryptography and steganography. Cryptography alters a message by encrypting it with a cipher. The message is thereby protected, but the attempt to conceal or protect information remains evident. Decryption restores the encrypted message to its initial state [1].
At the opposite end is steganography: the practice of concealing the very fact that confidential information is being transferred. While steganography may on some level use cryptographic ciphers, the core of steganographic information protection is a subterfuge-based approach that hides the message within plain text in such a manner that the altered text becomes virtually indistinguishable from the original unaltered text [2].
While the practice of steganography has existed throughout human history, there is little research that determines the efficacy of its different methods. In our work we set out to create a strong, data-driven, scientific groundwork based on an experimental approach that involves native speakers. While at this stage we analyze Russian written texts, the methodology may be extended to other languages.
Our previous research has allowed us to gain and interpret data on how Russian native speakers choose to re-separate into words and paragraphs a meaningful text that has been altered by deleting the spaces between words and removing capital letters, punctuation marks, and paragraph breaks.
This data was initially gathered for the purposes of trigger-container based steganographic inquiry. Trigger-containers are minimal, distinct text alterations that refer to a previously agreed-upon message, which is received upon discovering the corresponding alteration [3].
The results of our perception experiment have indicated that altering the location of paragraph junctures (at least for Russian texts) appears to be a suitable variable for these purposes. We are now ready to propose a new application for these trigger-containers in text watermarking.

Background
Our research approach is based on using experimental perceptual methods and statistically analyzing the resulting data. As mentioned in the introduction, the stimulus was a meaningful Russian text that had been transformed into a continuous string of letters. The initial text was a test text for the purposes of text companding [4]. The participants (n = 102) were tasked with restoring the text by separating it into paragraphs, phrases, syntagmas and words. For the purposes of this paper, we shall ignore the phrase and syntagma juncture data and focus on paragraph and word junctures. Effectively, the resulting data approximates the range of juncture points that Russian native speakers consider possible for words and paragraphs in Russian texts. In other words, the data allows us to gain insight into how far we may manipulate a linguistic variable in the form of word and paragraph junctures without raising suspicion.
We processed the data and calculated the probability of a given juncture being restored (if it corresponded to one in the initial text) or inserted (if it was not present in the initial text) by the participants of our experiment. This probability shows, in essence, how necessary Russian native speakers perceive a given juncture to be, which is valuable for discovering optimal trigger-containers for specific applications.
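As a minimal sketch of this calculation, assume that each participant's response is recorded as the set of character offsets at which they placed a juncture. The function name, the data layout and the example values below are illustrative assumptions, not the format or the data of the experiment.

```python
from collections import Counter


def juncture_rates(responses, original_junctures):
    """Share of participants who marked each juncture position.

    responses: one set of character offsets per participant (assumed layout).
    original_junctures: offsets of junctures present in the initial text.
    Returns {offset: (rate, kind)}, where kind is "restored" if the juncture
    existed in the initial text and "inserted" otherwise.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("at least one participant response is required")
    # Count each offset at most once per participant.
    counts = Counter(pos for marked in responses for pos in set(marked))
    return {
        pos: (count / n, "restored" if pos in original_junctures else "inserted")
        for pos, count in sorted(counts.items())
    }


# Made-up responses from four hypothetical participants.
example_responses = [{5, 12}, {5}, {5, 20}, {12}]
example_rates = juncture_rates(example_responses, original_junctures={5, 20})
```

Offsets that no participant marked do not appear in the result, so a rate of zero is represented by absence.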

Results, Proposed Approach and Discussion
In Table 1 we present the statistical data for paragraph and word junctures. The data for "Word Anomalies" covers all word junctures with a restoration/insertion rate below 100%. In other words, if all participants unanimously agreed that a juncture between words should exist at a certain point, that juncture is not counted in "Word Anomalies", because a unanimously agreed-upon juncture does not leave the ambivalence that trigger-containers need in order to function.
We highlight the sample variance, count and mean parameters as noteworthy. We observed a trend in which sample variance rises as the size of the element into which the string of characters is separated decreases. Although the raw data shows lower sample variance for "Words" than for "Paragraphs", the trend becomes plausible once we discard the junctures in the "Words" paradigm that were inserted unanimously by our participants (Fig. 1).

Fig. 1. Word anomalies paradigm juncture insertion rate analysis
The x-axis lists the word junctures by order of appearance in the text. The y-axis shows the insertion rate. As is evident, the mean juncture insertion rate in the initial "Words" paradigm was higher than 94%, while the mean of the filtered "Word Anomalies" selection drops to 76%, presenting us with enough ambivalence to consider certain unique cases of separating texts into words as viable steganographic trigger-containers.
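The filtering step that yields the "Word Anomalies" selection can be sketched as follows. The rates are invented for illustration and are not the experimental data, and the function names are hypothetical.

```python
from statistics import mean, variance


def word_anomalies(rates, unanimous=100.0):
    """Keep only junctures whose insertion rate (in percent) is below 100%."""
    return [rate for rate in rates if rate < unanimous]


def summarize(rates):
    """Count, mean and sample variance; needs at least two rates."""
    return {
        "count": len(rates),
        "mean": mean(rates),
        "sample_variance": variance(rates),  # n - 1 in the denominator
    }


# Invented insertion rates (percent) for eight word junctures.
all_words = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 90.0, 60.0]
anomalies = word_anomalies(all_words)

words_summary = summarize(all_words)
anomalies_summary = summarize(anomalies)
```

In this toy input the mean drops and the sample variance rises once the unanimous junctures are discarded, which is the direction of the effect described above.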
Paragraph junctures, however, may be even more promising. The initial stimulus text contained only 5 paragraphs, yet the count parameter for "Paragraphs" shows that Russian native speakers identified 27 possible junctures for separating the text into paragraphs (Fig. 2). The x-axis lists the paragraph junctures by order of appearance in the text. The y-axis shows the insertion rate. As we can see, no paragraph juncture had an insertion rate higher than 80%. Some lower-scoring cases might be suboptimal for the purposes of steganography, but those in the 20-80% range demonstrate, by virtue of their existence, how variable the separation of Russian texts into paragraphs is. We therefore posit that manipulating paragraph junctures is a promising method of steganographic information protection in the context of trigger-containers, as it can be applied by implanting a single alteration into a text.
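One way to shortlist junctures for use as trigger-containers is to keep those whose insertion rate falls inside the 20-80% band mentioned above. Treating that band as a hard selection rule is an assumption made for this sketch, and the rates below are invented rather than taken from Fig. 2.

```python
def candidate_junctures(rates_by_juncture, low=20.0, high=80.0):
    """Return juncture numbers whose insertion rate lies in [low, high] percent."""
    return [
        juncture
        for juncture, rate in sorted(rates_by_juncture.items())
        if low <= rate <= high
    ]


# Invented insertion rates (percent) keyed by juncture number.
example_paragraph_rates = {1: 78.0, 2: 12.0, 3: 45.0, 4: 5.0, 5: 20.0}
selected = candidate_junctures(example_paragraph_rates)
```

The thresholds are parameters so that a stricter or looser band can be tried without changing the function.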
To further increase the ambivalence of a juncture, we superimpose one paradigm over the other. For this overlay we consider only the junctures that appear in both paradigms. One item of note that we must address immediately is juncture 21, where we observe an anomaly: more native speakers placed a paragraph juncture at this point than a word juncture. This can be explained as follows. One cannot start a new paragraph in the middle of a word in Russian. It may be different in some other languages, but we are currently unaware of such cases. What is notable here is that the word containing this anomalous juncture is a compound word consisting of two meaningful roots. This makes it realistic to place a paragraph break seemingly in the middle of a word (Fig. 3).
Having discussed the data and the findings, we may now proceed to how they relate to text watermarking. The goal of watermarking is to insert an alteration into a text or image before sending or publishing it. For most applications, this is done to protect copyright. In broad terms, the performance of a watermarking method can be described by four factors: imperceptibility, robustness, capacity and security (see, e.g., [5]). Imperceptibility is the visual similarity of the watermarked text to the non-watermarked original. Robustness is how resistant the watermark is to attacks. Capacity is how much data the watermark can contain. Finally, security relates to the watermark's resistance to manipulation.
One of the approaches to watermarking is linguistic-based, where natural language methods are used to embed watermarks into a text. Linguistic watermarking is usually divided into syntactic and semantic approaches. The former alter the syntax of a text, whereas the latter alter its semantic elements. Ultimately, both approaches aim to alter the linguistic features of a text without altering its meaning.
We propose using paragraph juncture alteration as a method of linguistic watermarking. Our data shows that the location of paragraph junctures is easily shifted and is connected to a high degree of detection ambivalence. There is no consensus among Russian native speakers on the topic of which locations within the text must be used when separating a text into distinct paragraphs. Therefore, altering this variable satisfies the factor of imperceptibility.
We believe that the optimal application of our proposed watermarking method would be source tracking. In this application, a different watermark produced by the same technique is embedded into each distributed copy of a sensitive confidential text. In practice, this could serve two purposes. The first is tracking parties that break non-disclosure agreements by sharing the original text with third parties or the general public. The second is conducting internal security drills and discovering unreliable assets within a company by sending out falsified, watermarked pseudo-confidential information and tracking any data leaks that follow. The former application is especially useful when the parties in question need to be brought before a court of law, as watermark personalization provides a reliable way to establish a connection between the leaked document and the source of the leak.
The robustness of the proposed method depends on how many copies of a sensitive document a third party has in their possession. If a third party managed to gain access to multiple variations of the same watermarked document, it would become fairly trivial for them to recognize that paragraph junctures may be a watermark and to destroy it by removing paragraph junctures altogether or, in an even more troubling scenario, to alter them, intentionally or by chance, in a way that draws suspicion to an innocent party. If this method is to function, security measures must be in place that prevent any single party from ever possessing more than one copy of a confidential document watermarked in this manner. A possible mitigation would be to embed a second, false watermark that uses a different embedding method and is deliberately detectable but technically meaningless, which may misdirect the attacker.
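The multiple-copy threat can be illustrated with a short sketch. It assumes that a watermark is described by the number of sentences in each paragraph and that the attacker holds two hypothetical copies of the same document; the function names and the two layouts are illustrative only.

```python
from itertools import accumulate


def break_positions(code):
    """Sentence counts after which a paragraph break occurs.

    code: paragraph lengths in sentences, e.g. (15, 15, 15).
    The end of the document is not counted as a break.
    """
    return set(accumulate(code[:-1]))


def exposed_breaks(code_a, code_b):
    """Breaks present in exactly one of two copies, as an attacker would see them."""
    return sorted(break_positions(code_a) ^ break_positions(code_b))


# Two hypothetical copies of the same 45-sentence document.
copy_a = (15, 15, 15)
copy_b = (13, 16, 16)
differing = exposed_breaks(copy_a, copy_b)
```

Every break that differs between the two copies is exposed by a plain comparison, which is why a single party must never hold more than one variation.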
Capacity is an interesting factor to consider. In our proposed application, capacity plays little role because the method relies on trigger-containers. By their very nature, trigger-containers may transmit a technically unlimited volume of data, since the message is predetermined and is simply waiting for the trigger-container to appear. This is further bolstered by the fact that in our proposed method the watermark only contains information about the source of the leak, and each variation of the trigger-container watermark can be easily correlated with that source.
Regarding the final factor, security, it is evident that an attacker who examines different variations of the same watermark side by side would have little trouble altering or destroying it. Furthermore, the watermark may be destroyed simply by changing the formatting of the text [6]. However, this is where the data received from the native speakers must be considered: the range of possible locations of paragraph junctures in Russian texts is vast and may be readily manipulated without raising suspicion.
One possible implementation of the proposed trigger-container based paragraph-juncture watermark could be represented in the following manner (Table 2). In Table 2, the "Receiving party ID" column contains 5 fictional receiving parties of a hypothetical sensitive document. Let us imagine that the confidential document contains 45 sentences. We count in sentences because every paragraph juncture in Russian is also a sentence juncture. The "Watermark code" column contains a simplified visualization of how the watermark is individualized for each party. For example, for "John Adams" the text would be separated into three paragraphs, each consisting of 15 sentences. For "Microcosm Ltd." the first paragraph would contain 13 sentences, whereas the second and third would both have 16 sentences, and so on.
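The scheme described above can be sketched in a few lines. The sketch assumes that the document is already split into sentences and that a watermark code is a tuple of paragraph lengths; the function names are hypothetical, and only the two codes spelled out in the text are used.

```python
def embed(sentences, code):
    """Split a list of sentences into paragraphs whose lengths follow the code."""
    if sum(code) != len(sentences) or any(length < 1 for length in code):
        raise ValueError("code must be positive lengths that sum to the sentence count")
    paragraphs, start = [], 0
    for length in code:
        paragraphs.append(sentences[start:start + length])
        start += length
    return paragraphs


def extract_code(paragraphs):
    """Recover the watermark code from a paragraph layout."""
    return tuple(len(paragraph) for paragraph in paragraphs)


def identify(paragraphs, table):
    """Return the receiving party whose code matches the layout, or None."""
    code = extract_code(paragraphs)
    for party, party_code in table.items():
        if party_code == code:
            return party
    return None


# The two watermark codes given in the text for a 45-sentence document.
table = {"John Adams": (15, 15, 15), "Microcosm Ltd.": (13, 16, 16)}
sentences = [f"Sentence {i}." for i in range(1, 46)]  # placeholder sentences

leaked = embed(sentences, table["Microcosm Ltd."])
source = identify(leaked, table)
```

A layout that matches no entry returns None, which corresponds to a watermark that has been altered or destroyed.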
The visualization shows how simple this method would be to implement. It could feasibly be automated for larger-scale use, but such an algorithm would need to handle extreme cases, so some limits on the variation would probably have to be imposed. One solution would be to have a human expert check the results for feasibility. For additional security, the expert can be provided only with the watermark code and not with the ID of the receiving party. The number of possible combinations can be changed by altering the initial number of paragraphs. However, it should be noted that the entropy of this way of watermarking is finite, and beyond a certain number of recipients it may be necessary to increase the text size to accommodate more paragraph junctures.
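A rough upper bound on the number of combinations can be computed under a simplifying assumption that is ours, not a result of the experiment: every sentence boundary is treated as an eligible paragraph juncture. The hypothetical `min_length` parameter stands in for the kind of limit on variation mentioned above.

```python
from math import comb, log2


def count_codes(sentences, paragraphs, min_length=1):
    """Number of ways to split a text into consecutive paragraphs.

    Counts ordered splits of `sentences` sentences into `paragraphs`
    paragraphs, each holding at least `min_length` sentences.
    """
    if paragraphs < 1 or min_length < 1:
        return 0
    # Reserve (min_length - 1) sentences per paragraph, then count
    # splits of the remainder into non-empty parts.
    free = sentences - paragraphs * (min_length - 1)
    if free < paragraphs:
        return 0
    return comb(free - 1, paragraphs - 1)


unrestricted = count_codes(45, 3)                # any paragraph length
restricted = count_codes(45, 3, min_length=10)   # at least 10 sentences each
four_paragraphs = count_codes(45, 4)
bits = log2(unrestricted)                        # upper bound on entropy in bits
```

For the 45-sentence example this gives 946 distinct three-paragraph layouts, or 136 if every paragraph must contain at least 10 sentences. The number of usable layouts is likely lower, since only some junctures are perceptually plausible.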

Conclusion
A visual perception experiment carried out with Russian native speakers enabled us to determine that paragraph junctures in Russian written texts can be manipulated by changing their location. Combined with trigger-containers, this steganographically protects sensitive information in written texts by inserting a trigger (a minimal alteration) into the text, with the meaning of the secret message agreed upon beforehand, and allows us to design a possible watermarking method based on paragraph juncture trigger-containers. The proposed approach is envisaged to be most useful for source tracking in scenarios where the source of a leak needs to be determined. The method based on this approach is to move paragraph junctures in the text and to create a table that, for every variation of the watermark, maps the paragraph location coordinates in the text to the identifier of the receiving party.
Our preliminary analysis indicates that this method will have high capacity and imperceptibility. As for robustness and security, the main threat when employing this watermarking technique is an attacker gaining multiple copies of the watermarked document intended for different receiving parties. If that happens, the attacker can determine that the paragraph juncture is the watermark and destroy or modify it. To mitigate this, it might be possible to insert an additional false watermark with a higher detection rate that diverts attention from the linguistic trigger-container watermark.
Currently, the method based on the proposed approach is intended for Russian texts that are written, printed or electronic. It is unclear whether it will prove effective in other languages. To determine that, additional research is needed to examine how native speakers of other languages perceive the necessary paragraph junctures in their languages.