Concepts¶
A zero-to-sixty guide on getting up to speed with the KTypes Library.
Getting Started¶
Install the KTypes library with pip
$ python3 -m pip install ktypes
To use the KTypes library, import the types
submodule:
from ktypes import types
All known types, primitive and user-defined, are available as attributes of the
types
module. This provides a simple interface by which the formal types can be
accessed while avoiding name collisions between Python objects and KTypes types.
Primitive Types¶
Currently KTypes supports three primitive types; str
, int
, and float
which
can be accessed by attributing the types
module. Using the same syntax to construct
instances of standard Python objects, we can construct a new token of a known type.
All we need to supply is a data instance matching the type we want to construct (e.g.
for an int, we would supply a Python int), but when in doubt, KTypes constructors all
support ingesting a string format as the data instance”
x = types.int(42)
y = types.int("53")
print(x)
# 42 : int
print(y)
# 53 : int
Primitive types support basic operations (+|-|*|/
) in the obvious way. If an
operation is not generally well defined (e.g. subtracting strings), a library error
will be thrown. Similarly, applying a basic operation on two tokens of different types
will always throw an error. Unlike C/C++ or Java, KTypes does not support type coercion.
x = types.int(12)
y = types.int(4)
a = types.str("hello")
b = types.str("world")
print(x * y)
# 48 : int
print(a + b)
# helloworld : str
Primitive types can be converted back to standard Python objects to perform unsafe
operations as well by using the .value
attribute of the token. In general,
this is not encouraged. All type conversions should be done using the KTypes type-casting
system (underdevelopment)
x = types.int(9)
y = types.int(3 + x.value)
print(x.value)
# 3
print(y)
# 12 : int
Functions¶
In addition to type annotations for data, KTypes also supports type annotations for
functions defined between formally typed data. Using the types.function
decorator,
we can mark standard Python methods as formally typed functions and use these functions
in formally typed settings.
In order for the function decorator to behave properly, decorated functions must be fully type specified using the annotation syntax introduced in Python 3
@types.function
def twice(x : types.int) -> types.int:
return x + x
print(twice)
twice : int -> int
In the example above, we decorate the fully-annotated method twice
to convert
it into a formally typed function. The syntax for notating formal function types
is borrowed from Haskell, as evidenced by the __str__
method defined for the decorated
twice
formal function object. As this notation implies, multiple argument formal
function are in fact curried functions. Formal functions can be evaluated with either the
traditional syntax, or with the more ‘correct’ curried notation
@types.function
def sum(x : types.int, y : types.int) -> types.int:
return x + y
x = types.int(10)
y = types.int(20)
print(sum)
# sum : int -> int -> int
print(sum(x, y))
# 30 : int
print(sum(x)(y))
# 30 : int
As expected, partial currying of an argument returns a partially evaluated function
which can is still a formal function, and which can be used later. Borrowing the sum
function from above:
add5 = sum(types.int(5))
print(add5)
# klambda<sum> : int -> int
y = add5(types.int(15))
print(y)
# 20 : int
While there is no requirement that only formal function are used during data manipulations, best practice would have it that one ought always use formal functions to manipulate data. The strength of a formal type system is in the transparency and confidence it provides when manipulating data. These benefits would be lost if standard Python methods were used instead.
Constructions¶
Beyond the primitive types and formal function types, the KTypes library also supports the type-theoretic construction of product and coproduct types, or, equivalently, “and” and “or” types, respectively. These methods of constructing new types gives the KTypes ecosystem the additional level or expressivity and richness required to represent most data schemas.
Product Types¶
The first construction is the (Cartesian) Product. Whereas Formal Type Theory defines this procedure to be a binary operation on two component types, due to the prevalence of product types in real world applications, the KTypes library treats all product types as n-products, thus avoiding the unnecessary syntax required to iteratively apply a binary product type constructor.
Constructing a product type is remarkably simple. A dict
is used to encode all
information required to express the n-product, and passing the dict into the types
module is sufficient for the library to complete the type inference and construct the
appropriate type.
Because two components of an n-product can be the same type, a unique identifier would be
required to distinguish between them. Solving for this problem, all n-products are defined
with named components. A dict
handles this association perfectly.
schema = {
"name": types.str,
"age": types.int,
}
The above schema
defines the information necessary to infer a 2-product type
containing both str
and int
types. To the object oriented programmer, this
type of construction should bear a striking resemblance to defining a class
. The
keys in the dict
correspond to the attribute names and the values correspond to the
formal type of each attribute.
Given such a dict
such as schema
from above, constructing a product type is easy:
types.person = schema
print(types.person)
# person
This syntax has two key requirements which are not immediately obvious.
- The attribute name of
types
which we are assigning to name the new product type, in this case,person
, must not already be associated with a type. In other words, redefining named types is not a supported action, but defining a new type is. - A
dict
must be supplied on the right hand side, and it must be well-formed. That is, it must have attribute names as keys and known types as values. Note that the known type values need not be primitive types.
Note also that the order provided in the dict
is relevant to the definition of
the n-product type, a different order will product a different n-product type.
A token of an n-product type can be constructed by supplying a dict
very similar
to the schema
which was used to construct the n-product type, but replacing the
values of the dict with actual tokens of the required type. Once a new type is defined,
it can be accessed by attributing the types module. N-product types are constructed
by the same syntax which constructs instances of standard Python objects. Note that
the instance data supplied must be the appropriate dict
or a type mismatch error
will be thrown.
data = {
"name": types.str("Charles"),
"age": types.int("21")
}
person_token = types.person(data)
print(person_token)
# [Charles : str, 21 : int] : person
Coproduct Types¶
Coproduct types serve the same role as unions in C/C++ and Haskell. Following the
logic of left/right injections which, in a formal Type-Theoretic setting, construct
tokens of the coproduct type, the coproduct constructor is a binary operation on
two known types. Currently the coproduct constructor is non-commutative, and thus
the coproduct types A | B
and B | A
are distinct.
Coproduct types can also be understood as “or” types, hence the bar “or-operator” used in the syntax of the coproduct constructor. As a result, constructing a coproduct type is quite intuitive:
types.person_or_errormsg = types.person | types.str
print(types.person_or_errormsg)
# person | str
A token of a coproduct type can be constructed by passing in one of the allowable types of the coproduct into the constructor. Building on the previous examples:
coprod_token1 = types.person_or_errormsg(types.str("error"))
coprod_token2 = types.person_or_errormsg(person_token))
print(coprod_token1)
# inr(error : str) : person | str
print(coprod_token2)
# inl([Charles : str, 21 : int] : person) : person | str
Product Functions¶
Beyond grouping relevant data together, product types are also useful as they permit formally typed functions to be defined over them. Functions defined over product types can be single-domain typed to avoid any unnecessary complications with currying and multiple arguments.
The KTypes library includes a built-in product function constructor which takes in a curried formal function and outputs a single-domain typed formal function defined over the product type inferred from the input function’s signature.
As an example, give some function f : int -> int -> int
, supplying f
as the
input to the KTypes product function constructor would yield a single-domain typed
formal function defined on the product type int % int
. In other words, we would
obtain f' : int & int -> int
which only takes one argument, a token of the product
type int & int
.
The product function construct can be called from the types
module as demonstrated
below:
# define the product type
types.int_pair = {
"a": types.int,
"b": types.int,
}
@types.function
def sum(x : types.int, y : types.int) -> types.int:
return x + y
prod_sum = types.ind_prod(sum)
print(sum)
# sum : int -> int -> int
print(prod_sum)
# klambda : int_pair -> int
Coproduct Functions¶
Similarly, the KTypes library gives the power to define functions out of coproduct
types. The syntax is aliased to the pipe (|
) operator, and provides a modular
approach that allows reduces if/else
clutter when defining such functions.
Given a coproduct type A | B
and two functions f : A -> C
and g : B -> C
the KTypes library interprets the well-defined expression f | g
as defining
a new function out of the coproduct type A | B
whose operation is uniquely defined
by evaluating either f
or g
. In other words, f | g
is a formally typed
function and we express it as f | g : A | B -> C
.
Note
While Formal Type Theory does not distinguish between simple types, their closure under products/coproducts, and formal function types, the KTypes library does not currently permit the inclusion of formal functions in products/coproducts. Effectively, the library enforces a strict differentiation between data and data manipulations.
Thus, there should be no ambiguity when the pipe operator is used. When A
is a
simple type and f
is a formal function, the operation A | f
is not well
defined an will throw an exception.
In defining formal functions out of coproduct types in this way, there are three
requirements to ensure the well-formed-ness of the expression f | g
.
- Both functions
f
andg
must land in the same type. - Both functions
f
andg
must not be defined over the same domain. - Both functions
f
andg
must be single-domain typed functions. They may not curry arguments.
Of course it is assumed that both f
and g
are formal functions notated by the
types.function
decorator. The follow example shows this construction in practice:
types.int_or_str = types.int | types.str
@types.function
def f(x : types.int) -> types.str:
return types.str("It is an integer")
@types.function
def g(x : types.str) -> types.str:
return types.str("String is '") + x + types.str("'")
to_str = f | g
print(to_str)
# klambda : int | str -> str
token = types.int_or_str("hello")
print(to_str(token))
# It was a string 'hello' : str
Predicates¶
The KTypes library also provides functionality to define subtypes out of pre-existing types within the type universe. Subtypes are built from existing types but enforce a custom defined boolean predicate which is checked at runtime to ensure type-correctness.
Predicates can be user defined functions or lambda expressions, which take in as their
sole argument the piece or raw data which should be evaluated, and returns true or false
depending on if the raw data matches. Predicated types are created as follows using the
.where(...)
attribute method.
def is_na(raw_data):
return raw_data == "N/A"
types.nan = types.str.where(predicate=is_na)
print(types.nan.matches("hello"))
# False
print(types.nan.matches("N/A"))
# True
In the example above, we see that types.nan
is a user-defined subtype of the
primitive type str
where the only token of the type is "N/A"
. Because
predicates are simple Python methods, the provide additional power and usability
to the KTypes framework. More complex predicates can enforce regex matching, gate
ranges and specific values, or even be used to model enums.
Parsing¶
The primary goal of the KTypes library is to facilitate the ingesting and processing of ad-hoc data formats. These are sources of data which result from possibly real-world sources which may be incomplete, corrupt, or unstandardized. Our goals is to bring these ad-hoc data formats into a well typed environment where reasoning and manipulations can be made precise.
As a result, the KTypes framework takes inspiration and follows much of the design as the PADS project. Specifically, we introduce a built-in parser to assist in the automated ingestion of ad-hoc data formats. Data can be directly parsed into a type using an instance of the the types.parser class, which takes in a known type and a parse format specification string.
A parse format specification string is a Python str which describes how a known type should be translated into a string raw data representation. Currently, the parser is only compatible with product types built from primitive/or types, but this is an active domain for development.
Each attribute of the product type should be enclosed in $
(dollar signs), and
any non-enclosed characters are treated as delimiters which complete the data
representation. The following example demonstrates a typical .csv
format for the
user
product type.
types.user = {
"name": types.str,
"age": types.int,
"email": types.str,
}
parse_format = "$name$, $age$, $email$\n"
Given the user
type and the parse_format
as above, constructing a KTypes
parser is simple:
parser = types.parser(types.user, parse_format)
The type specification and parse format specification are abstracted into separate entities to allow various data formats all to be parsed into the same underlying type, which reduces the need to modify downstream code. Thus, changes to the data source can in general be handled by updating the parse format, reducing the risk of propagating changes to downstream code and logic.
The parser
can then be used to parse raw string data into well-formed known types.
Specifically, parser
will create tokens of the user
type.
A parser has two methods of parsing raw string data: instance and stream
Instance¶
The .parse_instance(...)
method takes in a Python str
object and greedily
parses out a single token. The following example illustrates how it works. Given the
parser
as defined above:
raw_data = "John Howerson, 42, howerson@email.org\n ignored content"
result = parser.parse_instance(raw_data)
print(result)
# [John Howerson : str, 42 : int, howerson@email.org : str] : user
The output of .parse_instance(...)
, if the parsing succeeds, is a well-formed
token of the type supplied when the parser
was defined. This can then be used
as an input to any formal function and all KTypes manipulations are supported.
Notice that raw_data
must not totally match the type, and greedy parsing ensures
that the longest-first matching substring of raw_data
is used. Longest-first is
the longest substring after the first match, where all in-between substrings are
also matches.
In other words, the parse will keep consuming characters until a match is encountered. If the next (lookahead) character causes a match failure, then this substring will be used as the raw data instance; otherwise, the parser will continue consuming characters and building up the raw data instance until the lookahead token induces a match failure.
Stream¶
The KTypes parser is also available as a stream parser. Using the .parse_stream(...)
method, the parser will ingest the supplied raw string data instance and store all parsed
tokens as a list. This is a persistent list, so long as the reset
argument is not
set to True
.
The .parse_stream(...)
method returns a window to this persistent list after each
call, thus all tokens can be obtained by saving the final return value of the method.
The following example demonstrates these two features.
raw_data1 = "John Howerson, 42, howerson@email.org\nBob Billins, 23, bill@mail.com\n"
result = parser.parse_stream(raw_data)
print(result)
for r in result:
print(r)
# [John Howerson : str, 42 : int, howerson@email.org : str] : user
# [Bob Billins : str, 23 : int, bill@mail.com : str] : user
result = parser.parse_stream("James Ray")
result = parser.parse_stream(", 31", ")
result = parser.parse_stream("ray.j@mailmain.edu\n")
for r in result:
print(r):
# [John Howerson : str, 42 : int, howerson@email.org : str] : user
# [Bob Billins : str, 23 : int, bill@mail.com : str] : user
# [James Raw : str, 31 : int, ray.j@mailmain.edu : str] : user