Package hudson.util

Class TextFile

  • public class TextFile
    extends Object
    Represents a text file. Provides convenience methods for reading and writing to it.
    Author: Kohsuke Kawaguchi
    • Field Detail

      • file

        public final File file
    • Constructor Detail

      • TextFile

        public TextFile(@NonNull
                        File file)
    • Method Detail

      • exists

        public boolean exists()
      • lines

        public Stream<String> lines()
                             throws IOException
        Read all lines from the file as a Stream. Bytes from the file are decoded into characters using the UTF-8 charset. If timely disposal of file system resources is required, the try-with-resources construct should be used to ensure that BaseStream.close() is invoked after the stream operations are completed.
        Returns: the lines from the file as a Stream
        Throws: IOException - if an I/O error occurs opening the file
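
The try-with-resources advice above can be sketched as follows. So that the example runs without Jenkins on the classpath, it uses `java.nio.file.Files.lines` as a stand-in, on the assumption that it behaves like `TextFile.lines()` as documented here (UTF-8 decoding, a stream that holds the file open until closed). The class name `LinesSketch` and the helper `countNonBlank` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class LinesSketch {
    // Counts the non-blank lines of a UTF-8 text file.
    // The stream is closed when the try block exits, even if the pipeline throws.
    static long countNonBlank(Path path) throws IOException {
        // Assumption: stand-in for `new TextFile(path.toFile()).lines()`.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.isBlank()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("textfile-lines", ".txt");
        try {
            Files.writeString(tmp, "alpha\n\nbeta\n");
            System.out.println(countNonBlank(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With the real class, the same pattern would presumably read `try (Stream<String> lines = textFile.lines()) { ... }`.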
      • head

        public String head(int numChars)
                    throws IOException
        Reads the first numChars characters, or the entire file if it contains fewer.
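
A minimal sketch of the "first N characters or until EOF" behaviour, not the actual implementation. This page does not say which charset `head` uses, so the sketch takes one as an explicit parameter; that parameter, the class name `HeadSketch`, and the use of `Files.newBufferedReader` are all assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeadSketch {
    // Returns up to numChars characters from the start of the file.
    static String head(Path path, int numChars, Charset cs) throws IOException {
        char[] buf = new char[numChars];
        int filled = 0;
        try (Reader r = Files.newBufferedReader(path, cs)) {
            // A single read() may return fewer chars than requested, so loop
            // until the buffer is full or the reader reports end of file.
            while (filled < numChars) {
                int n = r.read(buf, filled, numChars - filled);
                if (n < 0) {
                    break; // EOF before numChars characters were available
                }
                filled += n;
            }
        }
        return new String(buf, 0, filled);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("textfile-head", ".txt");
        try {
            Files.writeString(tmp, "hello world");
            System.out.println(head(tmp, 5, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```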
      • fastTail

        public String fastTail(int numChars,
                               Charset cs)
                        throws IOException
        Efficiently reads the last N characters (or fewer, if the whole file is shorter than that).

        This method first tries to read just the tail section of the file to get the necessary characters. To handle variable-length multi-byte encodings (such as UTF-8), we read a larger chunk than strictly necessary.

        Some multi-byte encodings, such as Shift-JIS, do not allow the first and second bytes of a single character to be unambiguously identified, so we may decode incorrectly if we start reading in the middle of a multi-byte character. All the CJK multi-byte encodings that I know of are self-correcting: because they are ASCII-compatible, any ASCII or control character brings the decoding back in sync, so in the worst case we just have some garbage at the beginning that needs to be discarded. To accommodate this, we read an additional 1024 bytes.

        Other encodings, such as UTF-8, are better in that the character boundary is unambiguous, so there can be at most one garbage character. To deal with UTF-16 and UTF-32, we read at a 4-byte boundary (all the constants and multipliers are multiples of 4).

        Note that it is possible to construct a contrived input that fools this algorithm, and in this method we are willing to live with a small possibility of that to avoid reading the whole text. In practice, such an input is very unlikely.

        So all in all, this algorithm should work decently, and it works quite efficiently on a large text.
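
The approach described above can be sketched roughly as follows. This illustrates the idea and is not the actual source of `fastTail`. The 1024-byte margin comes from the description; the budget of 4 bytes per character is an assumption chosen to be consistent with "all the constants and multipliers are multiples of 4". The class name `FastTailSketch` and the method name `tail` are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FastTailSketch {
    static final int BYTES_PER_CHAR = 4;   // assumed worst-case budget per character
    static final int RESYNC_MARGIN = 1024; // extra bytes so the decoder can get back in sync

    static String tail(Path path, int numChars, Charset cs) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            long len = raf.length();
            // The window is a multiple of 4, so for UTF-16/UTF-32 files the read
            // starts on a code-unit boundary measured back from the end of the file.
            long window = (long) numChars * BYTES_PER_CHAR + RESYNC_MARGIN;
            long pos = Math.max(0, len - window);
            raf.seek(pos);
            // The sketch assumes the window fits in an int-sized array.
            byte[] bytes = new byte[(int) (len - pos)];
            raf.readFully(bytes);
            // Charset.decode replaces malformed input instead of throwing, so a
            // read that starts mid-character only yields garbage at the front.
            String decoded = cs.decode(ByteBuffer.wrap(bytes)).toString();
            // Keep the last numChars characters; leading garbage is cut off here.
            return decoded.substring(Math.max(0, decoded.length() - numChars));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("textfile-tail", ".txt");
        try {
            Files.writeString(tmp, "日本語".repeat(1000));
            System.out.println(tail(tmp, 10, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Only the trailing window is read, so the cost depends on `numChars` rather than on the file size. The trade-off is the one noted above: a contrived input could still leave the decoder out of sync within the window.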