# textmogrify
Textmogrify is a pre-alpha text manipulation library that hopefully works well with fs2.
## Usage
This library is currently available for Scala binary versions 2.13 and 3.

To use the latest version, include the following in your `build.sbt`:
```scala
libraryDependencies ++= Seq(
  "pink.cozydev" %% "textmogrify" % "0.0.7"
)
```
## Lucene
Currently, the core functionality of textmogrify is implemented with Lucene via the Lucene module.
```scala
libraryDependencies ++= Seq(
  "pink.cozydev" %% "textmogrify-lucene" % "0.0.7"
)
```
The Lucene module lets you use a Lucene `Analyzer` to modify text; it also provides helpers for using `Analyzer`s with an fs2 `Stream` (see the Pipelines section below).
### Basics
Typical usage is to use the `AnalyzerBuilder` to configure an `Analyzer` and call `.tokenizer[F]` to get a `Resource[F, String => F[Vector[String]]]`:
```scala
import textmogrify.lucene.AnalyzerBuilder
import cats.effect.IO

val tokenizer = AnalyzerBuilder.default.withLowerCasing.withASCIIFolding.tokenizer[IO]
val tokens: IO[Vector[String]] = tokenizer.use(f => f("I Like Jalapeños"))
```
Because this documentation is running in mdoc, we'll import an IO runtime and run explicitly:
```scala
import cats.effect.unsafe.implicits.global

tokens.unsafeRunSync()
// res0: Vector[String] = Vector("i", "like", "jalapenos")
```
We can see that our text was lowercased and the unicode `ñ` replaced with an ASCII `n`.
### Languages
Textmogrify comes with support for multiple languages.
When setting up an `AnalyzerBuilder`, you'll have access to language-specific options once you call one of the helper language methods like `english` or `french`. Specifying a language preserves any configuration set beforehand.
```scala
val base = AnalyzerBuilder.default.withLowerCasing.withASCIIFolding
val en = base.english.withPorterStemmer.tokenizer[IO]
val fr = base.french.withFrenchLightStemmer.tokenizer[IO]
val es = base.spanish.withSpanishLightStemmer.tokenizer[IO]
```
All of `en`, `fr`, and `es` will both lowercase and ASCII-fold their inputs, in addition to using their language-specific stemmers.
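As a quick illustration, we can run the English tokenizer on some accented input (a hypothetical snippet; the exact stems depend on Lucene's Porter stemmer):

```scala
// Tokens are lowercased and ASCII-folded before stemming,
// so "Jalapeños" is folded to "jalapenos" before the stem is taken.
val enTokens: IO[Vector[String]] = en.use(f => f("Eating Jalapeños"))
```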
### Pipelines
Another common use is to construct a `Pipe`, or `Stream`-to-`Stream` function, using an `Analyzer`.
Let's say we have some messages we want to analyze and index as part of some search component.
Given a raw `Msg` type and an analyzed `Doc` type, we want to transform a `Stream[F, Msg]` into a `Stream[F, Doc]`.
```scala
import fs2.Stream

case class Msg(id: Int, msg: String)
case class Doc(id: Int, tokens: Vector[String])

val input = Stream(
  Msg(0, "How do i trim my cats nails?"),
  Msg(1, "trimming cat nail"),
  Msg(2, "cat scratching furniture"),
)
```
```scala
import fs2.Pipe

val normalizeMsgs: Pipe[IO, Msg, Doc] = msgs => {
  val tokenizer = AnalyzerBuilder.english
    .withLowerCasing
    .withCustomStopWords(Set("how", "do", "i", "my"))
    .withPorterStemmer
    .tokenizer[IO]
  Stream.resource(tokenizer)
    .flatMap(f => msgs.evalMap(m => f(m.msg).map(ts => Doc(m.id, ts))))
}
```
We can then run our stream of `Msg`s through our tokenizer `Pipe` to get our `Doc`s:
```scala
val docs: Stream[IO, Doc] = input.through(normalizeMsgs)

docs.compile.toList.unsafeRunSync()
// res1: List[Doc] = List(
//   Doc(id = 0, tokens = Vector("trim", "cat", "nail")),
//   Doc(id = 1, tokens = Vector("trim", "cat", "nail")),
//   Doc(id = 2, tokens = Vector("cat", "scratch", "furnitur"))
// )
```
Be careful not to construct the `tokenizer` within a loop; we want to create it once and reuse it throughout the `Pipe`.
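For contrast, here is a sketch of the pattern to avoid (the `badNormalizeMsgs` name is hypothetical), where a fresh tokenizer `Resource` is acquired and released for every single message:

```scala
// Anti-pattern: builds and tears down an Analyzer per message.
val badNormalizeMsgs: Pipe[IO, Msg, Doc] = msgs =>
  msgs.evalMap { m =>
    AnalyzerBuilder.english.withPorterStemmer
      .tokenizer[IO]
      .use(f => f(m.msg).map(ts => Doc(m.id, ts)))
  }
```

Hoisting the `Resource` out of the loop with `Stream.resource`, as `normalizeMsgs` does above, pays the setup cost once for the whole stream.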