mirror of
https://github.com/kubernetes-sigs/kustomize.git
synced 2026-05-17 18:25:26 +00:00
429 lines
18 KiB
Markdown
## What is this?
### In short
Be the GoDoc.org of k8s configuration files.

### More explicitly
Support k8s document indexing from open-source configurations in order to make
it easy for people to learn to use a new feature, explore k8s configs in a
central hub, and see some metrics about kustomize use.

We want to support three main classes of queries:

1. Structured document queries: how should I use the following fields?
   - Grace periods: `spec:template:spec:terminationGracePeriodSeconds`
   - Kustomize inline patch: `patches:patch`

2. Key-value queries: how is a more specific case of a structured
   configuration used?
   - HorizontalPodAutoscalers: `kind=HorizontalPodAutoscaler`
   - Patches on StatefulSets: `patches:target:kind=StatefulSet`

3. Full text search: search the comments and the document text of any
   type of k8s config file.

## Road map
There is a lot that can be added to improve this application. More details,
along with general thoughts and comments, can be found in the Roadmap.md file
in this directory. This README covers only the parts of the project that are
mostly complete and ready to be iterated on.

## Running this project
Everything is configured using Kubernetes, so it should be easy to spin this
up on any k8s cluster. Everything should just work (TM).

The config files live in the `config` directory.

```
config
├── base
│   └── kustomization.yaml
├── crawler
│   ├── base
│   │   ├── github_api_secret.txt
│   │   └── kustomization.yaml
│   ├── cronjob
│   │   ├── cronjob.yaml
│   │   └── kustomization.yaml
│   └── job
│       ├── job.yaml
│       └── kustomization.yaml
├── elastic
│   └── ...
├── redis
│   ├── document_keystore
│   │   ├── kustomization.yaml
│   │   ├── redis.yaml
│   │   └── service.yaml
│   └── http_cache
│       ├── kustomization.yaml
│       ├── redis.yaml
│       └── service.yaml
├── webapp
│   ├── backend
│   │   ├── deployment.yaml
│   │   ├── kustomization.yaml
│   │   └── service.yaml
│   └── frontend
│       ├── deployment.yaml
│       ├── kustomization.yaml
│       └── service.yaml
└── schema_files
    └── kustomization_index
        ├── es_index_mappings.json
        └── es_index_settings.json
```

To get everything up and running you have to:

1. Get an instance of Elasticsearch working, and configure the
   configMapGenerator in `config/base` to point to the right endpoint(s). The
   following configurations need this value to be populated:
   - `config/crawler/cronjob` to run periodic crawls.
   - `config/crawler/job` to run crawls on demand.
   - `config/webapp/backend` to run the search server.

2. Configure the Elasticsearch indices:
```
kustomize build config/schema_files/kustomization_index | kubectl apply -f -
```
This will run a `curl` command that reads JSON data from a ConfigMap and sets
up the schema. If you want to make more complex modifications to the schema,
refer to the Elastic docs to figure out whether the mapping can be added to the
current index, or whether you will need to copy the existing index into a
different one with the appropriate mappings. Modifications can be made by
writing a simple program with the Elasticsearch Go library, or with any HTTP
command sent to the appropriate server endpoint from within the cluster.
Unfortunately I did not have the time to write helper tools for this. Feel free
to contact me if you need help with modifying Elasticsearch configs. I'm by no
means an expert, but I can try to help.

3. (Optional) Run the Redis HTTP cache for the crawler:
```
kubectl apply -k config/redis/http_cache
```
This will create a deployment for the cache, and a service. The crawler should
be configured to connect to the `http_cache` if it exists, but you can always
check the logs to make sure that it connects, and that the identifiers in the
crawler configuration match those of the service endpoint.

Please be aware that the cache does not have a persistent volume.

4. Configure the main Redis instance:
```
kubectl apply -k config/redis/document_keystore
```
This will create a StatefulSet with a volume of 4GiB for a Redis instance.

5. Get an access token from GitHub.

To be able to kindly ask GitHub for its data on k8s config files, you'll need
to create an access token. From my understanding, this is the only way to run
these code search queries (without first specifying a repository).

To generate a token, go to your GitHub account's Settings > Developer
Settings > Personal access tokens. It should look like this.

[Screenshot: GitHub personal access tokens page]

From here you want to generate a new token with the following configuration:

[Screenshot: token configuration]

If you have uses for any other data from this token (org data, or something
else) you can pick and choose, but be careful since it can grant this
application access to your notifications, etc. However, any such extension
is explicitly a non-goal and would not be maintained by this project.

6. Launch the crawler:
```
kustomize build config/crawler/cronjob | kubectl apply -f -
```
This will run the crawler periodically, every day, according to the cron timing
rules in the cronjob.yaml file.

Instead, to get the crawler running now, you can run:
```
kustomize build config/crawler/job | kubectl apply -f -
```
which will launch a non-periodic version of the crawler. It will take a few
minutes for the crawler to split the search, but then config files should
start to get populated within 20 minutes. The first crawl may take a while,
since it has to fetch rate-limited endpoints for each new file it finds.
Later updates should be significantly faster.

7. Launch the search backend:
```
kustomize build config/webapp/backend | kubectl apply -f -
```

8. Launch the search frontend:
```
kustomize build config/webapp/frontend | kubectl apply -f -
```

## Notes about the components

### Elasticsearch
I will add a basic working setup soon. I just did the lazy thing and used an
already packaged solution. Most clouds provide their own Elastic environments.
Elasticsearch is also working on their own implementation, which might be
worth checking out. Please note that it comes with its own license agreement.

### Redis
There are two Redis instances that are used in this application.

One of them is configured to have on-disk persistence, so make sure to have
that set up in your Kubernetes cluster. Also note that it is running on a
single master node (i.e. it does not automatically shard keys to multiple head
nodes as part of a highly available cluster). Since it's storing a sparse
graph, I can't imagine this being much of an issue, but it's probably worth
mentioning.

The other Redis instance is running as an HTTP (RFC 7234) cache for ETags from
GitHub (or any other document store from which we could crawl/index). This one
does not require full persistent storage on disk. The caching strategy is an
LRU cache, which is probably a good starting point. It might be worth
investigating other cache policies, but I think LRU will work well since
documents may or may not expire, and the amount of memory allocated for keys is
fairly large, so eviction of frequently used documents seems unlikely.

### Nginx + Angular
There is a Dockerfile included for generating the container image with Nginx
(using the default package) and adding all of the supporting compiled Angular
files. Any modifications to the code base should be compatible with this setup,
so all that's needed is to rebuild the container image, and possibly modify
the image tags in the k8s file.

### Supporting Go binaries
There are a few Go binaries that each have their own Dockerfile to build
containers in which to run them on k8s, namely the crawler and the search
service. Their configurations are not optimal (read: they need to be cleaned
up), but they are functional.

## Technical details

### Overall design and implementation

There are a few components that all run together in order to get the overall
application to work smoothly. This section provides a brief overview of each
component, and the following sections go into more detail.

The overall structure is outlined in the following figure:

[Figure: overall structure of the application]

#### Crawler
The leftmost component is a crawler with an HTTP cache of GitHub queries. It
does two things. First, it looks at the list of documents in Elasticsearch and
tries to update them. In doing so, it maintains a set of newly updated files to
exclude them from other parts of the crawl.

Second, to find newly added documents, the crawler crawls any new dependencies
introduced in the document updating step, and it also queries GitHub for the
most recently indexed kustomization.\* files. Each new file is processed
for efficient text queries and put into the document index. Any new dependency
will also incur more crawl operations. Finally, a graph representation of the
documents and their dependencies is built in Redis to be used for graph
algorithms such as PageRank and component analysis.

#### Data library
There are a few helper libraries for dealing with Elasticsearch, Redis and
documents. They are neither persistent nor centralized. They act as small
components that help to package common pieces of code. Eventually it may make
sense to merge all of it together and build a proper persistent model around
this while providing an external API for document insertion/deletion, but
that is definitely out of scope in terms of getting this to run. However,
the current model has limitations when it comes to minimizing the API surface
for the different components of the application. For now this problem is
mostly mitigated by having the query server connect only to a data node of the
Elasticsearch cluster, but the problem of knowing what is accessible and what
isn't is left to the programmer instead of being clearly and explicitly
supported by the API.

#### Server
Uses the data library to communicate with the data store and answer queries.
Processes user-entered text queries into somewhat optimized Elasticsearch
queries. Provides a few endpoints to get different metrics and, eventually, to
allow registration of remote repositories.

The server is exposed through a service so that users of the application can
submit queries and get their results.

#### Nginx + Angular
Communicates directly with the backend server to forward user queries and
their results. Presents the results on an interface. It's still pretty simple
looking, but it seems usable (to me).

### Crawling GitHub
With the use of API keys, GitHub allows account owners to search for files
using their API.

The search endpoints allow for the use of metadata search
that is fairly useful/powerful. For instance they provide a `filename:` keyword
that permits us to look for `kustomization.yaml`, `kustomization.yml`, etc.
This enables the fetching of a list of kustomization documents, from which
we can get the actual content from another endpoint
(raw.githubusercontent.com).

However, the search API is fairly limited. There is a restriction on the number
of documents that can be retrieved with this method. One possible way to
mitigate this would be to periodically query GitHub for results, sorted by the
last indexed time. This would allow you to collect most documents from this
point forwards. The downside is that it may require a large number of
requests to their API, since you cannot know when new files will be added.
Furthermore, there is a possibility that you would not be able to get all of
the files either, depending on the velocity of growth.

The approach that was taken to mitigate this is to use the `filesize:` keyword
and to shard the search space into contiguous buckets of appropriate size in
order to get all of the documents. This is fairly efficient, since you can find
a good enough way to shard the documents in
`lg(max file size) * number of documents / 1000` API queries. Moreover, since
queries are paginated with at most 100 results per query, this solution is
competitive with getting the optimal (non-contiguous) sharding of result sets.
Furthermore, filesize queries can be cached to minimize the total number of
API queries needed to shard the search space. This is done by querying for file
size intervals that always start with 0..X and binary searching over the
`filesize:` space. This lets you reuse a lot of queries when you're looking for
the next range, since its upper and lower bounds fall within a range that has
already been queried, which leaves a smaller number of new queries to make. I
think this is only true because file sizes are power law distributed, so
searches will typically require fewer queries as they progress from left to
right.

However, this method in no way depends on intervals of the form 0..X, as
the number of documents in the many intervals of the range search could be
added together to also make this work. This approach just seemed simpler to
implement, maintain, and debug, so it was preferred.

To get an idea of how efficient this method is: sharding the search space of
7000 documents takes only ~90 API range queries, which should only take
a few minutes, while actually fetching the documents and their relevant
metadata (creation time, etc.) will take several hours. Furthermore, this
could be made more efficient if a prior distribution is approximated.
This prior could be scaled to the number of documents that need to be fetched,
and then finding a shard that has an adequate number of results would only
take a few queries per shard. It could probably be supported in a constant
number of size queries if the size of each shard is halved, which shouldn't
have a terrible performance impact on the retrieval. However, there were
more pressing things to implement. I might revisit this later.

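The `filesize:` sharding described in this section can be sketched in Go as
follows. This is an illustration under stated assumptions, not the crawler's
code: `counter`, `newCounter` and `shard` are hypothetical names, and the
`query` function is a stand-in for a real `filesize:0..X` search against the
GitHub API (here it just counts entries in a sorted slice). The sketch keeps
the two ideas from the text: every count is a prefix query of the form 0..X,
and results are cached so that the binary search for the next shard can reuse
them.

```go
package main

import (
	"fmt"
	"sort"
)

// counter answers "how many files have a size in 0..x" and caches every
// answer, so a repeated prefix query costs no additional API call.
type counter struct {
	query func(x int) int // stand-in for a `filesize:0..x` search
	cache map[int]int
	calls int // number of uncached queries actually issued
}

// newCounter builds a counter backed by an in-memory list of file sizes.
func newCounter(sizes []int) *counter {
	sorted := append([]int(nil), sizes...)
	sort.Ints(sorted)
	return &counter{
		query: func(x int) int { return sort.SearchInts(sorted, x+1) },
		cache: map[int]int{},
	}
}

func (c *counter) upTo(x int) int {
	if x < 0 {
		return 0
	}
	if n, ok := c.cache[x]; ok {
		return n
	}
	c.calls++
	n := c.query(x)
	c.cache[x] = n
	return n
}

// shard splits the sizes 0..maxSize into contiguous ranges [lo, hi] that each
// contain at most limit files, using a binary search per range.
func shard(c *counter, maxSize, limit int) [][2]int {
	var shards [][2]int
	for lo := 0; lo <= maxSize; {
		below := c.upTo(lo - 1)
		// Smallest offset at which the range lo..lo+off overflows the limit.
		off := sort.Search(maxSize-lo+1, func(off int) bool {
			return c.upTo(lo+off)-below > limit
		})
		hi := lo + off - 1
		if hi < lo {
			hi = lo // one single size already exceeds the limit; cannot split it
		}
		shards = append(shards, [2]int{lo, hi})
		lo = hi + 1
	}
	return shards
}

func main() {
	c := newCounter([]int{1, 1, 2, 5, 5, 5, 9, 20})
	fmt.Println(shard(c, 20, 3)) // prints "[[0 4] [5 8] [9 20]]"
	fmt.Println("uncached queries:", c.calls)
}
```

With a real API, `limit` would be the number of results a single search can
return, and each resulting range would then be fetched page by page.
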
### Document Indexing and Processing
In order to support simple text queries, the structured documents must be
processed in some way that makes searching them easy. The current method
is to recursively traverse the map of configurations to generate each sub-path,
and each key-value pair for the leaf nodes of the recursion tree.

However, note that this means that a document has to be valid YAML/JSON
in order for indexing to happen. The rest of the document is treated
as mostly text and uses default text settings from Elasticsearch.

What this means is that for the following YAML document:

```yaml
resources:
- service.yaml
- deployment.yaml

configMapGenerator:
- name: app-configuration
  files:
  - config.yaml

patchesJson6902:
- target:
    version: v1
    kind: StatefulSet
    name: ss-name
  path: ss-patch.yaml
- target:
    version: v1
    kind: Deployment
    name: dep-name
  path: dep-patch.yaml
```

the flattened structure would look like:
```
{
  "identifiers": [
    "resources",
    "configMapGenerator",
    "configMapGenerator:name",
    "configMapGenerator:files",
    "patchesJson6902",
    "patchesJson6902:target",
    "patchesJson6902:target:version",
    "patchesJson6902:target:kind",
    "patchesJson6902:target:name",
    "patchesJson6902:path"
  ],
  "values": [
    "resources=service.yaml",
    "resources=deployment.yaml",
    "configMapGenerator:name=app-configuration",
    "configMapGenerator:files=config.yaml",
    "patchesJson6902:target:version=v1",
    "patchesJson6902:target:kind=StatefulSet",
    "patchesJson6902:target:name=ss-name",
    "patchesJson6902:path=ss-patch.yaml",
    "patchesJson6902:target:kind=Deployment",
    "patchesJson6902:target:name=dep-name",
    "patchesJson6902:path=dep-patch.yaml"
  ],
  ...
}
```

Note that paths and values are deduplicated, so each one appears only once.

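The recursive traversal described above can be sketched in Go like this. It is
a simplified illustration rather than the indexer itself: `flatten`,
`sortedKeys` and `index` are hypothetical names, the input is JSON so that the
example only needs the standard library (a YAML document would first have to
be decoded with a third-party package), and the output is sorted
alphabetically instead of following document order.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatten walks a decoded document and records every key path ("a:b:c") in
// ids and every leaf as "path=value" in vals. Both are sets, so repeated
// paths and values are stored only once.
func flatten(node interface{}, path string, ids, vals map[string]bool) {
	switch n := node.(type) {
	case map[string]interface{}:
		for k, v := range n {
			p := k
			if path != "" {
				p = path + ":" + k
			}
			ids[p] = true
			flatten(v, p, ids, vals)
		}
	case []interface{}:
		// List positions are not part of the path; elements share it.
		for _, v := range n {
			flatten(v, path, ids, vals)
		}
	default:
		vals[fmt.Sprintf("%s=%v", path, n)] = true
	}
}

// sortedKeys returns the members of a set in alphabetical order.
func sortedKeys(set map[string]bool) []string {
	keys := make([]string, 0, len(set))
	for k := range set {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// index decodes a JSON document and returns its identifiers and values.
func index(doc []byte) (identifiers, values []string, err error) {
	var root interface{}
	if err := json.Unmarshal(doc, &root); err != nil {
		return nil, nil, err
	}
	ids, vals := map[string]bool{}, map[string]bool{}
	flatten(root, "", ids, vals)
	return sortedKeys(ids), sortedKeys(vals), nil
}

func main() {
	doc := []byte(`{
	  "resources": ["service.yaml", "deployment.yaml"],
	  "patchesJson6902": [
	    {"target": {"kind": "StatefulSet"}, "path": "ss-patch.yaml"},
	    {"target": {"kind": "Deployment"}, "path": "dep-patch.yaml"}
	  ]
	}`)
	ids, vals, err := index(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids)
	fmt.Println(vals)
}
```
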
On the search side, exact queries are prioritized, but the document paths
and key=value pairs are also analyzed with 3-grams to allow some amount of
fuzzy search. The reason that Levenshtein distance was not used instead is that
the search covers multiple fields at the same time, which is a use case where
Elasticsearch does not support proper fuzzy searching.

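To show why 3-grams give some tolerance to typos, here is a small Go sketch.
It only illustrates the idea: in the application the 3-grams are produced by
an Elasticsearch analyzer, and the `trigrams` and `overlap` functions below
are hypothetical helpers, not part of the project.

```go
package main

import "fmt"

// trigrams returns every overlapping 3-character window of s.
func trigrams(s string) []string {
	r := []rune(s)
	if len(r) < 3 {
		return nil
	}
	out := make([]string, 0, len(r)-2)
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, string(r[i:i+3]))
	}
	return out
}

// overlap returns the fraction of the query's 3-grams that also occur in
// the candidate. A misspelled query still shares most of its 3-grams with
// the intended term, which is what makes the match "fuzzy".
func overlap(query, candidate string) float64 {
	q := trigrams(query)
	if len(q) == 0 {
		return 0
	}
	have := map[string]bool{}
	for _, g := range trigrams(candidate) {
		have[g] = true
	}
	hits := 0
	for _, g := range q {
		if have[g] {
			hits++
		}
	}
	return float64(hits) / float64(len(q))
}

func main() {
	fmt.Println(trigrams("kind"))
	// The query below is missing an "e"; most of its 3-grams still match.
	fmt.Printf("%.2f\n", overlap("kind=StatfulSet", "kind=StatefulSet"))
	fmt.Printf("%.2f\n", overlap("kind=StatfulSet", "kind=Deployment"))
}
```
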
### Document Search
Given a text query, each token is considered separately. Each token is fed
through a handful of analyzers on the Elasticsearch side, and is compared
against the inverted index of each document field. Elasticsearch then
determines the best matching documents. Text ordering is largely insignificant.
This makes sense for the structured search, but may leave room for improvement
for the text-only search within the document.

Each token _must_ be matched, so each white space character acts as a
conjunction of individual queries. There are also ways of telling
Elasticsearch that some things _should_ match, but I think for now it makes
more sense to leave it as is.

I think this behavior is sufficient to make the search feel fairly intuitive
while providing support for fairly complex use cases.

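As a rough sketch of what "each token must match" looks like as an
Elasticsearch request body, the Go snippet below builds a `bool` query with
one `must` clause per whitespace-separated token. The `buildQuery` function
and the field names `identifiers`, `values` and `document` are assumptions
made for this example; the real server may structure its queries and index
fields differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// buildQuery splits the text on white space and requires every token to
// match at least one of the given fields (a conjunction of multi_match
// queries inside bool/must).
func buildQuery(text string, fields []string) map[string]interface{} {
	must := []interface{}{}
	for _, tok := range strings.Fields(text) {
		must = append(must, map[string]interface{}{
			"multi_match": map[string]interface{}{
				"query":  tok,
				"fields": fields,
			},
		})
	}
	return map[string]interface{}{
		"query": map[string]interface{}{
			"bool": map[string]interface{}{"must": must},
		},
	}
}

func main() {
	// Field names are hypothetical; they mirror the flattened document shape.
	fields := []string{"identifiers", "values", "document"}
	q := buildQuery("kind=StatefulSet patches:target", fields)
	out, err := json.MarshalIndent(q, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Moving a clause from `must` to `should` is the Elasticsearch way of saying
that a token is preferred but not required.
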
### Metrics Computation
From each kustomization document that is indexed, we can find its
resources that are publicly available. This includes other kustomizations.
From this, we can build a directed graph of dependencies and reverse
dependencies.

This opens up the possibility of adding a plethora of graph metrics that can
give the project maintainers feedback and insight into how people are using
their tools.

Some of these are useful for getting an idea of how large the dependency
graphs actually grow in practice, and can be used to find _popular_
kustomizations within the corpus. This lends itself to implementing PageRank
to help bubble up popular results as good search results. I unfortunately
did not have the time to implement the algorithm, but I do plan to revisit
this sometime soon to add a few good and efficient implementations of useful
graph algorithms. See the Roadmap.md for a more complete list of features that
could be added and how I think they could be implemented.
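
As a small illustration of the dependency and reverse dependency graph, the Go
sketch below inverts a map of dependencies and ranks documents by how many
other documents depend on them. This is only a simple stand-in for a
popularity metric (it is not PageRank, which is not implemented), and the
`graph`, `reverse` and `popular` names as well as the sample paths are made up
for the example. In the application the graph lives in Redis.

```go
package main

import (
	"fmt"
	"sort"
)

// graph maps a document to the resources it depends on.
type graph map[string][]string

// reverse builds the reverse dependency graph: for each resource, the
// sorted list of documents that reference it.
func reverse(deps graph) graph {
	rev := graph{}
	for doc, targets := range deps {
		for _, t := range targets {
			rev[t] = append(rev[t], doc)
		}
	}
	for _, users := range rev {
		sort.Strings(users)
	}
	return rev
}

// popular orders referenced documents by their number of dependents,
// most depended-upon first, breaking ties alphabetically.
func popular(deps graph) []string {
	rev := reverse(deps)
	names := make([]string, 0, len(rev))
	for n := range rev {
		names = append(names, n)
	}
	sort.Slice(names, func(i, j int) bool {
		a, b := names[i], names[j]
		if len(rev[a]) != len(rev[b]) {
			return len(rev[a]) > len(rev[b])
		}
		return a < b
	})
	return names
}

func main() {
	deps := graph{
		"overlays/prod": {"base"},
		"overlays/dev":  {"base", "monitoring"},
		"base":          {},
	}
	fmt.Println(reverse(deps)["base"]) // prints "[overlays/dev overlays/prod]"
	fmt.Println(popular(deps))         // prints "[base monitoring]"
}
```
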