textmogrify

Textmogrify is a pre-alpha text manipulation library that hopefully works well with fs2.

Usage

This library is currently available for Scala binary versions 2.13 and 3.

To use the latest version, include the following in your build.sbt:

libraryDependencies ++= Seq(
  "pink.cozydev" %% "textmogrify" % "0.0.7"
)

Lucene

The core functionality of textmogrify is currently implemented with Lucene, via the textmogrify-lucene module.

libraryDependencies ++= Seq(
  "pink.cozydev" %% "textmogrify-lucene" % "0.0.7"
)

The Lucene module lets you use a Lucene Analyzer to modify text; additionally, it provides helpers for using Analyzers with an fs2 Stream.
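
As a quick preview of both (a minimal sketch; the Basics and Pipelines sections below cover these pieces in detail):

import textmogrify.lucene.AnalyzerBuilder
import cats.effect.IO
import fs2.Stream

// Lift the tokenizer Resource into a Stream, then tokenize each element
val preview: Stream[IO, Vector[String]] =
  Stream
    .resource(AnalyzerBuilder.default.withLowerCasing.tokenizer[IO])
    .flatMap(tokenize => Stream("Hello World", "Jalapeños Rock").evalMap(tokenize))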

Basics

Typically, you use AnalyzerBuilder to configure an Analyzer and then call .tokenizer[F] to get a Resource[F, String => F[Vector[String]]]:

import textmogrify.lucene.AnalyzerBuilder
import cats.effect.IO

val tokenizer = AnalyzerBuilder.default.withLowerCasing.withASCIIFolding.tokenizer[IO]

val tokens: IO[Vector[String]] = tokenizer.use(
  f => f("I Like Jalapeños")
)

Because this documentation is running in mdoc, we'll import an IO runtime and run explicitly:

import cats.effect.unsafe.implicits.global

tokens.unsafeRunSync()
// res0: Vector[String] = Vector("i", "like", "jalapenos")

We can see that our text was lowercased and the Unicode ñ was replaced with an ASCII n.
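
To see the effect of the folding step in isolation, we could build the same tokenizer without withASCIIFolding (a sketch; the output comment is our expectation, not verified mdoc output):

val noFolding = AnalyzerBuilder.default.withLowerCasing.tokenizer[IO]

noFolding.use(f => f("I Like Jalapeños")).unsafeRunSync()
// We'd expect the ñ to be preserved: Vector("i", "like", "jalapeños")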

Languages

Textmogrify comes with support for multiple languages. When setting up an AnalyzerBuilder, you'll have access to language-specific options once you call one of the helper language methods like english or french. Specifying a language preserves any configuration set beforehand.

val base = AnalyzerBuilder.default.withLowerCasing.withASCIIFolding

val en = base.english.withPorterStemmer.tokenizer[IO]
val fr = base.french.withFrenchLightStemmer.tokenizer[IO]
val es = base.spanish.withSpanishLightStemmer.tokenizer[IO]

All of en, fr, and es will both lowercase and ASCII-fold their inputs, in addition to using their language-specific stemmers.
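
As a quick sanity check of the English pipeline (a sketch; the tokens in the comment are our expectation of the Porter stemmer's behavior, not verified output):

en.use(f => f("Jumping Jalapeños")).unsafeRunSync()
// We'd expect lowercasing, folding, and stemming: Vector("jump", "jalapeno")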

Pipelines

Another common use is to construct a Pipe, or Stream-to-Stream function, using an Analyzer. Let's say we have some messages we want to analyze and index as part of a search component. Given a raw Msg type and an analyzed Doc type, we want to transform a Stream[F, Msg] into a Stream[F, Doc].

import fs2.Stream

case class Msg(id: Int, msg: String)
case class Doc(id: Int, tokens: Vector[String])

val input = Stream(
  Msg(0, "How do i trim my cats nails?"),
  Msg(1, "trimming cat nail"),
  Msg(2, "cat scratching furniture"),
)
import fs2.Pipe

val normalizeMsgs: Pipe[IO, Msg, Doc] = msgs => {
  val tokenizer = AnalyzerBuilder.english
    .withLowerCasing
    .withCustomStopWords(Set("how", "do", "i", "my"))
    .withPorterStemmer
    .tokenizer[IO]
  Stream.resource(tokenizer)
    .flatMap(f => msgs.evalMap(m => f(m.msg).map(ts => Doc(m.id, ts))))
}

We can then run our stream of Msg through our tokenizer Pipe to get our Docs:

val docs: Stream[IO, Doc] = input.through(normalizeMsgs)
docs.compile.toList.unsafeRunSync()
// res1: List[Doc] = List(
//   Doc(id = 0, tokens = Vector("trim", "cat", "nail")),
//   Doc(id = 1, tokens = Vector("trim", "cat", "nail")),
//   Doc(id = 2, tokens = Vector("cat", "scratch", "furnitur"))
// )

Be careful not to construct the tokenizer within a loop; we want to create it once and reuse it throughout the Pipe.
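
For contrast, here is a sketch of the anti-pattern: calling use inside evalMap acquires and releases the underlying Analyzer for every single message.

// Avoid: the Resource is acquired and released once per message
val perMessage: Pipe[IO, Msg, Doc] = msgs => {
  val tok = AnalyzerBuilder.english.withPorterStemmer.tokenizer[IO]
  msgs.evalMap(m => tok.use(f => f(m.msg)).map(ts => Doc(m.id, ts)))
}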