Monday, September 10, 2012

Essential string processing functions in Common Lisp

Standard Common Lisp is somewhat lacking in several important areas of string processing, and it sometimes shows. Today I needed to do some heavy processing on a large body of regular text, so I will write down here some functions which are AFAIK considered "standard" in modern languages but are not so easily accessible or intuitive in CL.

In all of the following code snippets the symbol input stands for the input string.

  1. Trimming spaces, tabs and newlines from a string

    (string-trim '(#\Space #\Newline #\Return #\Linefeed #\Tab) input)

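    All named characters are listed in the HyperSpec, section 13.1.7 Character Names. Only Space and Newline are standard names; Tab, Linefeed and Return are semi-standard, though the common implementations support them.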
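
    A small sketch of how this behaves on a sample value (the variable *whitespace* and the sample string are made up for illustration). string-left-trim and string-right-trim are the standard one-sided variants:

    (defparameter *whitespace* '(#\Space #\Newline #\Return #\Linefeed #\Tab)
      "Characters treated as whitespace in this example.")

    ;; Only the two ends are trimmed; interior whitespace is left alone.
    (string-trim *whitespace* "  hello world  ")        ; => "hello world"
    ;; One-sided variants.
    (string-left-trim *whitespace* "  hello world  ")   ; => "hello world  "
    (string-right-trim *whitespace* "  hello world  ")  ; => "  hello world"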

  2. Replacing by regular expressions

    Provided by the CL-PPCRE package.

    In the next snippet I remove all tokens enclosed in square brackets from the input string:

    (ql:quickload :cl-ppcre)
    (cl-ppcre:regex-replace-all "\\[[^]]+\\]" input "")

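    regex-replace replaces only the first match, while regex-replace-all replaces every match; honestly, I have rarely needed the former. Also, note the doubled backslashes (\\[ instead of \[): the backslash is an escape character inside Lisp string literals, so it has to be escaped itself to reach the regex engine.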
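
    To make the difference between the two functions concrete, here is a sketch on a made-up sample string (it assumes Quicklisp is installed, as in the snippet above):

    (ql:quickload :cl-ppcre)

    ;; regex-replace stops after the first match.
    (cl-ppcre:regex-replace "\\[[^]]+\\]" "a [x] b [y] c" "")
    ;; => "a  b [y] c"

    ;; regex-replace-all replaces every match.
    (cl-ppcre:regex-replace-all "\\[[^]]+\\]" "a [x] b [y] c" "")
    ;; => "a  b  c"

    Both functions also return a second value telling whether anything matched; only the primary value is shown here.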

  3. Splitting a string by a separator character

    Provided by the CL-UTILITIES package.

    In the next snippet I split the input string by commas:

    (ql:quickload :cl-utilities)
    (cl-utilities:split-sequence #\, input)
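
    A sketch of what the call returns, on a made-up sample string. The :remove-empty-subseqs keyword is, AFAIK, accepted by the split-sequence shipped in CL-UTILITIES; treat it as an assumption if your version differs:

    (ql:quickload :cl-utilities)

    ;; Empty fields between adjacent separators are kept by default.
    (cl-utilities:split-sequence #\, "a,b,,c")
    ;; => ("a" "b" "" "c")

    ;; Drop the empty fields.
    (cl-utilities:split-sequence #\, "a,b,,c" :remove-empty-subseqs t)
    ;; => ("a" "b" "c")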

  4. Applying the same modification to every string in a given list

    In the next snippet I trim spaces around all strings in the list input-list:

    (map 'list
         (lambda (input) (string-trim " " input))
         input-list)

    However, it is much better to wrap the string transformation in a separate function and pass just its name to the mapping call:

    (defun trim-spaces (input)
      "Remove leading and trailing spaces from the input string."
      (string-trim '(#\Space) input))
    (map 'list #'trim-spaces input-list)
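
    For example, on a made-up list of strings (mapcar is the standard list-only equivalent of map 'list):

    (defun trim-spaces (input)
      "Remove leading and trailing spaces from INPUT."
      (string-trim '(#\Space) input))

    (map 'list #'trim-spaces '("  a " "b  " " c"))  ; => ("a" "b" "c")
    (mapcar #'trim-spaces '("  a " "b  " " c"))     ; => ("a" "b" "c")
    ;; map can also build other sequence types.
    (map 'vector #'trim-spaces '("  a " "b  "))     ; => #("a" "b")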

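    Do not forget that a string is just a sequence of characters, so all sequence-operating functions work on a string such as "abcd" just as they work on a list of characters such as '(#\a #\b #\c #\d). This applies only to sequence-operating functions, however: a list of characters is not a string, and string-specific functions will not accept it in place of one.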
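
    A sketch of the distinction between strings and lists of characters (sample values are made up):

    ;; Sequence functions accept both strings and lists of characters
    ;; and return a sequence of the same kind.
    (remove #\a "banana")                        ; => "bnn"
    (remove #\a '(#\b #\a #\n))                  ; => (#\b #\n)
    (reverse "abcd")                             ; => "dcba"
    (length '(#\a #\b #\c #\d))                  ; => 4

    ;; A list of characters is not a string; convert it with coerce
    ;; before handing it to string functions.
    (coerce '(#\a #\b #\c #\d) 'string)          ; => "abcd"
    (string-upcase (coerce '(#\a #\b) 'string))  ; => "AB"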

  5. Removing characters from a string by condition

    In the next snippet I leave only the alphanumeric characters in the input string:

    (remove-if-not #'alphanumericp input)

    There is also remove-if, which removes the characters that satisfy the predicate instead of keeping them.

    As with map, you can build arbitrarily complex predicates either with lambdas or by wrapping them in separate functions.
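
    For instance, a sketch that keeps letters, digits and spaces. The helper name word-char-p and the sample strings are made up for illustration:

    (defun word-char-p (char)
      "True if CHAR is alphanumeric or a space."
      (or (alphanumericp char) (char= char #\Space)))

    (remove-if-not #'word-char-p "Hello, World! 42")  ; => "Hello World 42"
    ;; The same predicate written inline as a lambda.
    (remove-if-not (lambda (char)
                     (or (alphanumericp char) (char= char #\Space)))
                   "Hello, World! 42")                ; => "Hello World 42"
    ;; remove-if drops the characters that satisfy the predicate.
    (remove-if #'digit-char-p "a1b2c3")               ; => "abc"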