Insight: How our text generation works

Aug 04 2015

Automated text generation can seem mysterious to anyone who does not work with it every day, so we want to give you some insight. We have written a training that demonstrates the basic functions of our text generation language ATML3, using car tires as an example. Imagine you run an online shop that sells car tires. You have all the data about your tires, and now you want to generate human-readable texts from it. Each product should get varied texts that are completely unique, so that duplicate content is avoided.

First of all, we categorize your data to give our text engine a structured basis. (For clarity, we have left out some of the data here.)

ID | manufacturer | brand name          | tire_type | vehicle_type | load_index | speed_rating
1  | Michelin     | Primacy 3 FSL       | summer    | car          | 95 XL      | W
2  | Continental  | Crosscontact Winter | winter    | SUV          | 110XL      | H

The following ATML3 training consists of four sections:

  1. Properties, where you map and process the given data.
  2. ProductTypes, where you can, for example, define the sentence order.
  3. Sentences, where you define which statements can be made from the data and which logical conclusions can be drawn from it.
  4. Lookups, which can be thought of as a dictionary for your texts.

1. Properties

Let us start with the properties. What we basically do here is prepare the data so that we can generate texts from it later. We pick the data from the database, check whether it is suitable for our needs, build the logic, create data groups, and define vocabulary and phrases.

You will always find a short code snippet with a description below it. Please note that our training was written for German text generation; an English translation is provided where it is needed for understanding.

codesnippet_properties

The property 'lastindex_nicht_korrekt' becomes true if the numeric value for 'load_index' is between 19 and 204. If it is, the data will be used for the automated text.
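The range check described above can be sketched in Python. This is not ATML3 and not our actual implementation; the function names are hypothetical, and treating the bounds 19 and 204 as inclusive is an assumption. The sketch also assumes that values such as '95 XL' are reduced to their leading number before the comparison.

```python
import re
from typing import Optional

# Bounds taken from the description above; inclusive bounds are an assumption.
LOAD_INDEX_MIN = 19
LOAD_INDEX_MAX = 204


def load_index_value(raw: str) -> Optional[int]:
    """Extract the leading number from values such as '95 XL' or '110XL'."""
    match = re.match(r"\s*(\d+)", raw)
    return int(match.group(1)) if match else None


def load_index_in_range(raw: str) -> bool:
    """Return True if the numeric load index lies between 19 and 204."""
    value = load_index_value(raw)
    return value is not None and LOAD_INDEX_MIN <= value <= LOAD_INDEX_MAX


if __name__ == "__main__":
    for sample in ("95 XL", "110XL", "250", "n/a"):
        print(sample, load_index_in_range(sample))
```

Both sample rows from the table (load indexes 95 and 110) fall inside the range, while a value such as 250 or a non-numeric entry does not.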

codesnippet_names_manufacturers

The 'mappingExpression' picks the data from the corresponding data fields in our table above. The 'truthExpression' verifies the value in the data field; in this case it checks whether the data field contains a value at all. 'voc' determines the vocabulary item that will appear in the final sentence. In this case it is simply the value of the data fields, that is, the name of the tire and the manufacturer. In our example there is only one 'voc' each for 'name' and 'manufacturer', but there can be several 'vocs' as well.
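To make the interplay of the three parts more tangible, here is a minimal Python sketch. It only mirrors the idea of mapping, truth check, and vocabulary; the class `Property` and the helper `has_value` are hypothetical names and do not reflect real ATML3 syntax.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def has_value(value: Optional[str]) -> bool:
    """Truth check: is there any value in the data field at all?"""
    return value is not None and value.strip() != ""


@dataclass
class Property:
    name: str
    mapping: Callable[[dict], Optional[str]]  # picks the value from a data row
    truth: Callable[[Optional[str]], bool]    # decides whether the value is usable

    def voc(self, row: dict) -> Optional[str]:
        """Return the word for the final sentence, or None if the check fails."""
        value = self.mapping(row)
        return value if self.truth(value) else None


name = Property("name", lambda row: row.get("name"), has_value)
manufacturer = Property("manufacturer", lambda row: row.get("manufacturer"), has_value)

if __name__ == "__main__":
    row = {"manufacturer": "Michelin", "name": "Primacy 3 FSL"}
    print(name.voc(row), "/", manufacturer.voc(row))
```

If a data field is empty or missing, `voc` returns `None` in this sketch, so a sentence that needs the value can be skipped.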

codesnippet_vehicle_type_voc

Here the data is picked from the data field vehicle_type, but it is replaced by the corresponding vocabulary item from the lookup table that you find at the end of the training. For example, if the vehicle_type is 'car', it is replaced by 'Pkw' (German for 'passenger car').
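In Python terms, this replacement is a simple dictionary lookup. The sketch below is illustrative only: the table contains just the one entry mentioned in the text (spelled 'Pkw' as in the sample output at the end), and falling back to the raw value for unknown vehicle types is an assumption, not documented ATML3 behavior.

```python
# Illustrative lookup table; only the 'car' entry is taken from the text.
VEHICLE_TYPE_LOOKUP = {
    "car": "Pkw",
}


def vehicle_type_voc(row: dict) -> str:
    """Replace the raw vehicle_type with its German vocabulary item."""
    raw = row["vehicle_type"]
    # Assumption: unknown values are passed through unchanged.
    return VEHICLE_TYPE_LOOKUP.get(raw, raw)


if __name__ == "__main__":
    print(vehicle_type_voc({"vehicle_type": "car"}))
    print(vehicle_type_voc({"vehicle_type": "SUV"}))
```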

codesnippet_nr5

What you see above is a so-called 'no.5 command'. It connects to our world knowledge database and requests information about the speed rating of the tire, based on current industry standards. If the value for speed_rating in our database is V, VR, W, ZR or Y, a sentence about the maximum speed will be generated (see 'Geschwindigkeit_max' in section 3. Sentences below).
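The following Python sketch imitates this step with a small dictionary standing in for the world knowledge database. The dictionary, the function names, and the handling of VR and ZR are assumptions for illustration; the speed limits for H, V, W and Y are the usual values of the tire speed index.

```python
from typing import Optional

# Stand-in for the world knowledge database (maximum speed in km/h).
SPEED_KNOWLEDGE = {"H": 210, "V": 240, "W": 270, "Y": 300}

# Ratings for which the maximum-speed sentence is generated (from the text).
TRIGGER_RATINGS = {"V", "VR", "W", "ZR", "Y"}


def triggers_max_speed(speed_rating: str) -> bool:
    """Should the 'Geschwindigkeit_max' sentence be generated?"""
    return speed_rating.strip().upper() in TRIGGER_RATINGS


def max_speed_kmh(speed_rating: str) -> Optional[int]:
    """Look up the speed limit; VR and ZR are open-ended classes,
    so this sketch returns None for them."""
    return SPEED_KNOWLEDGE.get(speed_rating.strip().upper())


if __name__ == "__main__":
    for rating in ("W", "H"):
        print(rating, triggers_max_speed(rating), max_speed_kmh(rating))
```

For the two sample tires, rating W triggers the sentence with a limit of 270 km/h, while rating H does not trigger it.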

codesnippet_phrases

Different phrases are defined for each manufacturer; you find them in the variable 'voc'. The phrases have to be written in the correct grammatical case so that they fit into the final sentence. To keep the overview clear, only two manufacturers are listed here; more are possible, of course. Several phrases can be added for each manufacturer. When the final sentence is formed, the phrase is chosen at random. The English translations of the added phrases are:

Bridgestone: 'optimal security and reliability' and 'highest efficiency and first-class quality'

Continental: 'quality with tradition' and 'highest quality standards'

codesnippet_manufacturer_phrases

'manufacturer_phrases' is a group property: it combines the properties that hold the phrases for each manufacturer.
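A rough Python equivalent of such a group with random phrase selection could look like this. The dictionary layout and the function name are hypothetical; the phrases are the English translations given above, not the German originals used in the training.

```python
import random
from typing import Optional

# Group of phrase lists, one list per manufacturer (English translations).
MANUFACTURER_PHRASES = {
    "Bridgestone": [
        "optimal security and reliability",
        "highest efficiency and first-class quality",
    ],
    "Continental": [
        "quality with tradition",
        "highest quality standards",
    ],
}


def manufacturer_phrase(manufacturer: str, rng=random) -> Optional[str]:
    """Pick one phrase at random; return None if no phrases are defined."""
    phrases = MANUFACTURER_PHRASES.get(manufacturer)
    return rng.choice(phrases) if phrases else None


if __name__ == "__main__":
    print(manufacturer_phrase("Continental"))
    print(manufacturer_phrase("Michelin"))  # no phrases defined in this sketch
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is handy for testing.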

2. ProductTypes

The section that follows the properties in our ATML3 training is productTypes. Here we define the sentence order for our texts. It is also possible to define several alternative sentence orders (not shown in our example).

codesnippet_productTypes

The lines marked in green are the sentence names (see 3. Sentences), arranged in the desired order. The English translations are: 'checking value range', 'introduction', 'speed_max', 'cta' (call to action).
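Conceptually, the sentence order is just an ordered list of sentence names that the engine walks through. The Python sketch below illustrates this under assumptions: the exact spelling of the sentence names and the function `assemble_text` are hypothetical, and sentences that were not generated are represented by `None`.

```python
from typing import Dict, List, Optional

# Sentence names in the desired order (spelling is illustrative).
SENTENCE_ORDER: List[str] = [
    "Wertebereich_pruefen",
    "Einleitung",
    "Geschwindigkeit_max",
    "CtA",
]


def assemble_text(rendered: Dict[str, Optional[str]]) -> str:
    """Join the generated sentences in the configured order.

    Sentences that are missing or None (not triggered) are skipped.
    """
    parts = [rendered.get(name) for name in SENTENCE_ORDER]
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    print(assemble_text({
        "CtA": "Kaufen Sie jetzt einen Reifen von Michelin.",
        "Einleitung": "Wenn Sie einen Pkw-Reifen suchen, ...",
        "Wertebereich_pruefen": None,
    }))
```

The order in which the sentences are stored does not matter; only `SENTENCE_ORDER` decides where each one appears in the final text.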

3. Sentences

This third section is, in a sense, the heart of our training. Here you will again find the sentence names marked in green that were arranged in order in the productTypes section.

codesnippet_wertebereichpruefen

The first sentence is 'checking the value range' ('Wertebereich prüfen'). Via the 'triggers' you can define rules for when the sentence is generated. If the trigger is 'Auto', the sentence is always generated. In this case it is only generated if the variable 'lastindex_nicht_korrekt' (explained in 1. Properties) is true. In 'variants' you can store different formulations of your sentence, even in different languages. Last but not least, 'text' is the actual text that will be shown in the end. It includes so-called 'containers' (marked by brackets) for variable content. In this case it consists of one single container.
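The trigger logic can be pictured as a small decision function. This Python sketch is an assumption about the general idea, not ATML3 syntax: a trigger is either the literal 'Auto' or the name of a boolean property, and `should_generate` is a hypothetical helper.

```python
from typing import Dict


def should_generate(trigger: str, properties: Dict[str, bool]) -> bool:
    """Decide whether a sentence is generated.

    'Auto' always fires; any other trigger is treated as the name of a
    property and fires only if that property is true.
    """
    if trigger == "Auto":
        return True
    # Assumption: an unknown property counts as false.
    return bool(properties.get(trigger, False))


if __name__ == "__main__":
    props = {"lastindex_nicht_korrekt": True}
    print(should_generate("Auto", props))
    print(should_generate("lastindex_nicht_korrekt", props))
    print(should_generate("lastindex_nicht_korrekt", {}))
```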

codesnippet_introduction

The sentence called 'introduction' ('Einleitung') is always generated. Its 'text' includes several containers:

[appeal:reader,id=reader] makes it possible to switch between a formal and an informal way of addressing the reader, which is especially important in German.

[vehicle_type_voc;trailing:-] inserts the correct vocabulary item for vehicle_type, as predefined in our properties, into the text. 'trailing' appends an additional character, in this case a hyphen.

[G:verb=suchen,grammar-from=reader] and [G:verb=sein,grammar-from=reader] make it possible to inflect verbs in a grammatically correct way. Again, this is very useful in German when you have to switch between a formal and an informal way of addressing the customer. In this case the verbs 'to search' and 'to be' are inflected.

Via 'synonyms' you can add synonyms for words, for example occasionally changing 'great' into 'excellent'.

In the end, the sentence can look like this (the English translation does not correspond 1:1):

Wenn [Sie/du] einen [Pkw-]Reifen [suchen/suchst], [sind/bist] [Sie/du] mit dem [N1] hervorragend beraten.

If [you] are [searching] for a [passenger car] tire, the [N1] is a great choice.
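How the containers work together can be illustrated with a small Python sketch that fills the introduction template for both forms of address. Everything here is a simplified stand-in: the table `FORMS` hard-codes the pronoun and the inflected forms of 'suchen' and 'sein', whereas the real containers derive them grammatically, and `render_introduction` is a hypothetical name.

```python
# Pronoun and verb forms per form of address (hard-coded for this sketch).
FORMS = {
    "formal": {"pronoun": "Sie", "suchen": "suchen", "sein": "sind"},
    "informal": {"pronoun": "du", "suchen": "suchst", "sein": "bist"},
}


def render_introduction(vehicle_voc: str, tire_name: str, appeal: str = "formal") -> str:
    """Fill the 'Einleitung' template for the chosen form of address."""
    form = FORMS[appeal]
    return (
        f"Wenn {form['pronoun']} einen {vehicle_voc}-Reifen {form['suchen']}, "
        f"{form['sein']} {form['pronoun']} mit dem {tire_name} hervorragend beraten."
    )


if __name__ == "__main__":
    print(render_introduction("Pkw", "Primacy 3 FSL", "formal"))
    print(render_introduction("Pkw", "Primacy 3 FSL", "informal"))
```

The hyphen after the vehicle type corresponds to the 'trailing' option; in this sketch it is simply part of the template string.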

codesnippet_speedmax_cta

As described in the properties section, the sentence 'speed_max' ('Geschwindigkeit_max') uses data from the world knowledge database and is only generated for specific speed_rating_tire_classes. If it is generated, the sentence states the maximum speed you can drive with this kind of tire.

The sentence 'CtA' is the 'Call to Action' at the end of the text. It works like the 'introduction' ('Einleitung') and uses variable data.

4. Lookups

The lookups work like a dictionary in which the software can look up words. In this case they are used to translate values from English into German.

codesnippet_lookups

The final output

After defining all these properties, sentences, synonyms and so on, we can finally generate human-readable texts from them for your (imaginary) online shop for car tires. Below you see a German text as it could be created from our example training. Please note that the English translation given here is only an approximation of the German version and will not be generated automatically from the training above (this will be added in a later blog post).

Wenn Sie einen Pkw-Reifen suchen, sind Sie mit dem Primacy 3 FSL bestens beraten. Der Geschwindigkeitsindex W lässt Geschwindigkeiten bis zu 270 km/h zu. Kaufen Sie jetzt einen Reifen von Michelin.

If you are looking for a passenger car tire, the Primacy 3 FSL is a great choice. The speed index W allows speeds of up to 270 km/h. Buy a tire from Michelin now.
