CSV Encoding

Each CSV file is encoded as follows:

  • field separator is ;
  • line separator is \n

Some fields in these CSV files contain arbitrary text. For example, a Facebook or Instagram username, or the content of a message, can contain arbitrary Unicode characters (emojis, etc.), which tend to trigger bugs in CSV readers that are not used to handling such content.

Furthermore, CSV readers should specifically expect these fields to contain:

  • the field or line separators we use (; and \n)
  • the line separators used on the various OSes where the user wrote their message (\n on Unix and macOS, \r on classic Mac OS, \r\n on Windows)

To handle these cases, we automatically quote fields that need it with ".
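
The quoting rule above can be sketched with Python's standard csv module. This is only an illustration of the described behaviour, not the exporter's actual code: the function name encode_rows and the sample rows are hypothetical, and csv.QUOTE_MINIMAL is assumed to approximate "quote fields that need it".

```python
import csv
import io


def encode_rows(rows):
    """Encode rows with ';' between fields and '\\n' between lines.

    Hypothetical helper: QUOTE_MINIMAL quotes only the fields that contain
    the field separator, the quote character, or the line separator.
    """
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=";",          # field separator
        quotechar='"',          # quote character
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",    # line separator
    )
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data: one plain field, one with ';', one with '\n'.
encoded = encode_rows([
    ["1", "plain text"],
    ["2", "contains ; separator"],
    ["3", "line one\nline two"],
])
print(encoded)
```

Only the second and third rows end up quoted; the plain field is written as-is.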

Although we do not export Twitter data, the following examples illustrate the process we apply to every other social channel.

For example, a tweet that contains a ; such as this example will be encoded as follows.

"@fhouste Bonjour, IDCAB dessert uniquement les gares françaises > https://t.co/3aTQkDyf86 :) Bonne fin de journée !"

Similarly, a tweet that contains a \n such as this other example will be encoded as follows.

"Do you want a phone that is cheaper than Apple? 
Then you should Buy from Grapple. 
- Chloe 😀😀😀"

CSV readers that do not correctly handle line breaks within quoted fields will be unable to read this data.
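
As a minimal sketch of a reader that copes with such fields, Python's standard csv module can be configured with the separators listed at the top of this section. The function name read_export, the id;message header and the sample row are hypothetical. When reading from a real file, the csv documentation recommends opening it with newline="" so that line breaks inside quoted fields are preserved; the file's character encoding (for example UTF-8) is an assumption to check against the actual export.

```python
import csv
import io


def read_export(stream):
    """Parse a ';'-separated export whose quoted fields may contain
    ';' or line breaks. Hypothetical helper, not part of any export tool."""
    reader = csv.reader(stream, delimiter=";", quotechar='"')
    return list(reader)


# Made-up sample: the header and the id column are assumptions.
sample = (
    "id;message\n"
    '1;"Do you want a phone that is cheaper than Apple?\n'
    "Then you should Buy from Grapple.\n"
    '- Chloe 😀"\n'
)

# newline="" disables newline translation, as recommended for csv input.
rows = read_export(io.StringIO(sample, newline=""))
for row in rows:
    print(row)
```

The sample spans four physical lines but parses into two records, because the line breaks inside the quoted field are kept as part of the message.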