First draft of documentation for internal/tools

2026-06-10 08:20:59 +00:00 · 2019-08-21 16:48:18 -07:00
parent 66fa2de073
commit 351df67e39
5 changed files with 604 additions and 0 deletions
--- a/internal/tools/README.md
+++ b/internal/tools/README.md
@@ -0,0 +1,428 @@
+## What is this?
+### In short
+Be the GoDoc.org of k8s configuration files.
+
+### More explicitly
+Support k8s document indexing from open-source configurations in order to make
+it easy for people to learn to use a new feature, explore k8s configs in a
+central hub, and see some metrics about kustomize use.
+
+We want to support three main classes of queries:
+
+1. Structured document queries: how should I use the following fields?
+   - Grace periods: `spec:template:spec:terminationGracePeriodSeconds`
+   - Kustomize inline patch: `patches:patch`
+
+2. Key-value queries: how should I handle a more specific use case of a
+   structured configuration?
+   - HorizontalPodAutoscalers: `kind=HorizontalPodAutoscaler`
+   - Patches on StatefulSets: `patches:target:kind=StatefulSet`
+
+3. Full text search: search the comments and the document text from any
+   type of k8s config file.
+
+## Road map
+There is a lot that could be added to improve this application. More details,
+along with general thoughts and comments, can be found in the `ROADMAP.md`
+file in this directory. This README covers only the parts of the project that
+are mostly complete and ready to iterate on.
+
+## Running this project
+Everything is configured using kubernetes, so it should be easy for people to
+spin this up on any k8s cluster. Everything should just work (TM).
+
+The config files live in the `config` directory.
+
+```
+config
+├── base
+│   └── kustomization.yaml
+├── crawler
+│   ├── base
+│   │   ├── github_api_secret.txt
+│   │   └── kustomization.yaml
+│   ├── cronjob
+│   │   ├── cronjob.yaml
+│   │   └── kustomization.yaml
+│   └── job
+│       ├── job.yaml
+│       └── kustomization.yaml
+├── elastic
+│   └── ...
+├── redis
+│   ├── document_keystore
+│   │   ├── kustomization.yaml
+│   │   ├── redis.yaml
+│   │   └── service.yaml
+│   └── http_cache
+│       ├── kustomization.yaml
+│       ├── redis.yaml
+│       └── service.yaml
+├── webapp
+│   ├── backend
+│   │   ├── deployment.yaml
+│   │   ├── kustomization.yaml
+│   │   └── service.yaml
+│   └── frontend
+│       ├── deployment.yaml
+│       ├── kustomization.yaml
+│       └── service.yaml
+└── schema_files
+    └── kustomization_index
+        ├── es_index_mappings.json
+        └── es_index_settings.json
+```
+
+To get everything up and running you have to:
+
+1. Get an instance of Elasticsearch running, and configure the
+   `configMapGenerator` in `config/base` to point to the right endpoint(s). The
+   configurations that need this value to be populated are the following:
+    - `config/crawler/cronjob` to run periodic crawls.
+    - `config/crawler/job` to run crawls on demand.
+    - `config/webapp/backend` to run the search server.
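+
+As a rough sketch of step 1, the generator in `config/base/kustomization.yaml`
+could look like the following. The ConfigMap name, the key, and the endpoint
+URL are placeholders chosen for illustration (assumptions, not values taken
+from this repository); use whatever names the crawler and backend manifests
+actually reference.
+
+```yaml
+# config/base/kustomization.yaml (illustrative sketch)
+configMapGenerator:
+- name: elasticsearch-config      # hypothetical name
+  literals:
+  # hypothetical key and endpoint; point this at your own cluster
+  - es-url=http://elasticsearch.example.svc:9200
+```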
+
+2. Configure the elasticsearch indices:
+```
+kustomize build config/schema_files/kustomization_index | kubectl apply -f -
+```
+This will run a `curl` command that reads JSON data from a ConfigMap and sets
+up the schema. If you want to make more complex modifications to the schema,
+refer to the Elasticsearch docs to figure out whether the mapping can be added
+to the current index, or whether you will need to copy the existing index into
+a new one with the appropriate mappings. Modifications can be made by writing a
+simple program with the Elasticsearch Go library, or by sending an HTTP request
+to the appropriate server endpoint from within the cluster. Unfortunately I did
+not have the time to write helper tools for this. Feel free to contact me if
+you need help with modifying Elasticsearch configs. I'm by no means an expert,
+but I can try to help.
+
+3. (Optional) Run the Redis HTTP cache for the crawler:
+```
+kubectl apply -k config/redis/http_cache
+```
+  This will create a deployment for the cache, and a service. The crawler should
+  be configured to connect to the `http_cache` if it exists, but you can always
+  check the logs to make sure it connects, and that the identifiers match in the
+  crawler configuration and for the service endpoint.
+
+  Please be aware that the cache does not have a persistent volume.
+
+4. Configure the main redis instance:
+```
+kubectl apply -k config/redis/document_keystore
+```
+  This will create a StatefulSet with a volume of 4GiB for a redis instance.
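+
+How the 4GiB volume is requested is not shown in this README; in a StatefulSet
+it is typically done through `volumeClaimTemplates`. A minimal sketch, assuming
+a made-up claim name and a cluster with a default StorageClass (the actual
+`redis.yaml` may differ):
+
+```yaml
+# Illustrative fragment of a StatefulSet spec; the claim name is hypothetical.
+volumeClaimTemplates:
+- metadata:
+    name: redis-data
+  spec:
+    accessModes: ["ReadWriteOnce"]
+    resources:
+      requests:
+        storage: 4Gi
+```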
+
+5. Get an access token from GitHub.
+
+To be able to kindly ask GitHub for its data on k8s config files, you'll need
+to create an access\_token. From my understanding, this is the only way to do
+these code search queries (without first specifying a repository).
+
+To generate a token, go to your GitHub account's Settings > Developer
+settings > Personal access tokens. It should look like this.
+
+![GitHub Token 1](
+https://sigs.k8s.io/kustomize/internal/tools/pictures/github_token.png)
+
+From here you want to generate a new token with the following
+configuration:
+
+![GitHub Token 2](
+https://sigs.k8s.io/kustomize/internal/tools/pictures/token_config.png)
+
+If you have uses for any other data from this token (org data, or something
+else), you can pick and choose, but be careful since it can grant this
+application access to your notifications, etc. However, any such extension
+is explicitly a non-goal and would not be maintained by this project.
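+
+The `config/crawler/base` directory contains a `github_api_secret.txt` file
+next to its kustomization, so presumably the token goes there and is exposed to
+the crawler as a Secret. A minimal sketch of such a generator, where the secret
+name and the key are made up for illustration:
+
+```yaml
+# config/crawler/base/kustomization.yaml (illustrative sketch)
+secretGenerator:
+- name: github-access-token       # hypothetical name
+  files:
+  # the key "token" is an assumption; the file holds the raw token
+  - token=github_api_secret.txt
+```
+
+Whatever the real names are, avoid committing the token file to a public
+repository.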
+
+6. Launch the crawler:
+```
+kustomize build config/crawler/cronjob | kubectl apply -f -
+```
+This will run the crawler every day according to the cron timing
+rules in the cronjob.yaml file.
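+
+For reference, the timing rule is the `schedule` field of the CronJob. The
+fragment below is only a sketch: the resource name and the 03:00 schedule are
+assumptions, and the actual `cronjob.yaml` may differ.
+
+```yaml
+# Illustrative fragment; not the contents of the real cronjob.yaml.
+apiVersion: batch/v1beta1   # batch/v1 on Kubernetes 1.21 and later
+kind: CronJob
+metadata:
+  name: crawler             # hypothetical name
+spec:
+  # fields: minute hour day-of-month month day-of-week
+  # this value runs the job every day at 03:00
+  schedule: "0 3 * * *"
+```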
+
+To get the crawler running right away instead, you can run:
+```
+kustomize build config/crawler/job | kubectl apply -f -
+```
+which will launch a non-periodic version of the crawler. It will take a few
+minutes for the crawler to split the search, but then config files should
+start to get populated within 20 minutes. It may take a while to do the
+first crawl, since it has to fetch rate-limited endpoints for each new file it
+finds. It should get significantly faster to update in the future.
+
+7. Launch the search backend:
+```
+kustomize build config/webapp/backend | kubectl apply -f -
+```
+
+8. Launch the search frontend:
+```
+kustomize build config/webapp/frontend | kubectl apply -f -
+```
+
+## Notes about the components
+
+### Elasticsearch
+I will add a basic working setup soon. I just did the lazy thing and used an
+already packaged solution. Most clouds provide their own Elasticsearch
+environments; however, Elastic is also working on its own implementation of a
+[k8s operator](https://www.elastic.co/elasticsearch-kubernetes), which might
+be worth checking out. Please note that it comes with its own license
+agreement.
+
+### Redis
+There are two Redis instances that are used in this application.
+
+One of them is configured to have on disk persistence, so make sure to have
+that set up in your kubernetes cluster. Also note that it is running on a
+single master node (i.e. it does not automatically shard keys to multiple head
+nodes as part of a highly available cluster). Since it's storing a sparse
+graph, I can't imagine this being much of an issue, but it's probably worth
+mentioning.
+
+The other Redis instance is running as an HTTP (RFC 7234) cache for etags from
+GitHub (or any other document store from which we could crawl/index). This one
+does not require full persistent storage on disk. The caching strategy is an
+LRU cache which is probably a good starting point. It might be worth it to
+investigate other cache policies, but I think LRU will work well since
+documents may or may not expire anyway, and the amount of memory allocated for
+keys is fairly large, so eviction of frequently used documents seems unlikely
+anyway.
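+
+For reference, an LRU policy in Redis is selected with the `maxmemory` and
+`maxmemory-policy` settings. The fragment below is only a sketch of how the
+cache container could be configured; the image tag and the 1gb limit are
+assumptions, not the values used by `config/redis/http_cache`.
+
+```yaml
+# Illustrative container fragment; image tag and memory limit are assumptions.
+containers:
+- name: redis
+  image: redis:5
+  command: ["redis-server"]
+  # evict the least recently used keys once 1gb of memory is in use
+  args: ["--maxmemory", "1gb", "--maxmemory-policy", "allkeys-lru"]
+```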
+
+### Nginx + Angular
+There is a Dockerfile included for generating the container image with Nginx
+(using the default package) and adding all of the supporting compiled angular
+files. Any modifications to the code-base should be compatible with this setup,
+so all that's needed is to rebuild the container image, and possibly modify
+the image tags in the k8s file.
+
+### Supporting Go binaries
+There are a few go binaries that each have their own Dockerfile to build
+containers in which to run them on k8s, namely the crawler and the search
+service. Their configurations are not optimal (read: needs to be cleaned up),
+but they are functional.
+
+## Technical details
+
+### Overall design and implementation
+
+There are a few components that are all running together in order to get
+the overall application to work smoothly. This section will provide a brief
+overview of each component with the following sections going into more details.
+
+The overall structure is outlined in the following figure:
+![overview](
+https://sigs.k8s.io/kustomize/internal/tools/pictures/sys_arch.png)
+
+#### Crawler
+The leftmost component consists of a crawler with an HTTP cache of GitHub
+queries. The crawler does two things. It first looks at the list of documents
+in Elasticsearch and tries to update them. In doing so, it maintains a set of
+newly updated files to exclude them from other parts of the crawl.
+
+To find newly added documents, the crawler crawls any new dependencies
+introduced in the document updating step and it also queries GitHub for the
+most recently indexed kustomization.\* files. Each new file will be processed
+for efficient text queries and put into the document index. Any new dependency
+will also incur more crawl operations. Finally, a graphical
+representation of the documents and their dependencies is built in Redis to be
+used for graph algorithms such as PageRank and component analysis.
+
+#### Data library
+There are a few helper libraries for dealing with Elasticsearch, Redis, and
+documents. They are neither persistent nor centralized; they act as small
+components that package common pieces of code. Eventually it may make
+sense to merge all of them together and build a proper persistent model around
+them while providing an external API for document insertion/deletion, but
+that is definitely out of scope in terms of getting this to run. However,
+the current model has limitations when it comes to minimizing the
+API surface for the different components of the application. For now this
+problem is mostly mitigated by having the query server only connected to
+a data node of the Elasticsearch cluster, but the problem of knowing what
+is accessible and what isn't is left to the programmer instead of being
+clearly and explicitly supported by the API.
+
+#### Server
+Uses the data library to communicate with the data store and answer queries.
+Processes the user entered text queries into somewhat optimized elasticsearch
+queries. Provides a few endpoints to get different metrics and to eventually
+allow for registration of remote repositories.
+
+The server is exposed through a Service so that users of the application can
+submit queries and get the results.
+
+#### Nginx + Angular
+Communicates directly with the backend server to forward user queries and
+their results. Presents the results on an interface. It's still pretty simple
+looking but it seems usable (to me).
+
+
+### Crawling GitHub
+With the use of API keys, GitHub allows account owners to search for files
+using their API.
+
+The search endpoints allow for the use of metadata search
+that is fairly useful/powerful. For instance they provide a `filename:` keyword
+that permits us to look for `kustomization.yaml`, `kustomization.yml`, etc.
+This enables the fetching of a list of kustomization documents, from which
+we can get the actual content from another endpoint
+(raw.githubusercontent.com).
+
+However, the search API is fairly limited. There is a restriction on the number
+of documents that can be retrieved with this method (the API returns at most
+1000 results per search). One possible way to
+mitigate this would be to periodically query GitHub for results, sorted by the
+last indexed time. This would allow you to collect most documents from this
+point forwards. The downside is that it may require a large number of
+requests to their API since you cannot know when new files will be added.
+Furthermore, you might still not be able to get all of the files, depending on
+the velocity of growth.
+
+The approach that was taken to mitigate this is to use the `size:` qualifier
+and to shard the search space into contiguous file size buckets, each small
+enough to be fully retrievable, in order to get all of the documents. This is
+fairly efficient, since you can find a good enough way to shard the documents in
+`lg(max file size) * number of documents / 1000` API queries. Moreover, since
+queries are paginated with at most 100 results per query, this solution is
+competitive with getting the optimal (non-contiguous) sharding of result sets.
+Furthermore, file size queries can be cached to minimize the total number of
+queries sent to the API in order to shard the search space. This is done by
+querying for file size intervals that always start at 0 (`0..X`) and binary
+searching over the `size:` space. This allows a lot of queries to be reused
+when looking for the next range, since its bounds fall within a range that has
+already been queried, which leaves fewer new queries to make. I think this is
+only true because file sizes are power law distributed, so searches will
+typically require fewer queries as they progress from left to right.
+
+However, this method in no way depends on intervals of the form 0..X, as
+the number of documents in the many intervals of the range search could be
+added together to also make this work. This approach just seemed simpler to
+implement, maintain, and debug so it was preferred.
+
+To get an idea of how efficient this method is: sharding the search space of
+7000 documents only takes ~90 API range queries, which should only take
+a few minutes, while actually fetching the documents and their relevant
+metadata (creation time, etc.) will take several hours. Furthermore, this
+could be made more efficient if a prior distribution is approximated.
+This prior could be scaled to the number of documents that need to be fetched,
+and then finding a shard that has an adequate number of results would only
+take a few queries per shard. It could probably be supported in a constant
+number of size queries if the size of each shard is halved, which shouldn't
+have a terrible performance impact on the retrieval. However, there were
+more pressing things to implement. I might revisit this later.
+
+### Document Indexing and Processing
+In order to support simple text queries the structured documents must be
+processed in some way that makes searching them easy. The current method
+is to recursively traverse the map of configurations to generate each sub-path
+and each key-value pair for the leaf nodes of the recursion tree.
+
+However, note that this means that a document has to be valid yaml/json
+format in order for indexing to happen. The rest of the document is treated
+as mostly text and uses default text settings from Elasticsearch.
+
+What this means is that for the following yaml document:
+
+```yaml
+resources:
+- service.yaml
+- deployment.yaml
+
+configmapGenerator:
+- name: app-configuration
+  files:
+  - config.yaml
+
+patchesJson6902:
+- target:
+    version: v1
+    kind: StatefulSet
+    name: ss-name
+  path: ss-patch.yaml
+- target:
+    version: v1
+    kind: Deployment
+    name: dep-name
+  path: dep-patch.yaml
+```
+
+the flattened structure would look like:
+```json
+{
+  "identifiers": [
+    "resources",
+    "configmapGenerator",
+    "configmapGenerator:name",
+    "configmapGenerator:files",
+    "patchesJson6902",
+    "patchesJson6902:target",
+    "patchesJson6902:target:version",
+    "patchesJson6902:target:kind",
+    "patchesJson6902:target:name",
+    "patchesJson6902:path"
+  ],
+  "values": [
+    "resources=service.yaml",
+    "resources=deployment.yaml",
+    "configmapGenerator:name=app-configuration",
+    "configmapGenerator:files=config.yaml",
+    "patchesJson6902:target:version=v1",
+    "patchesJson6902:target:kind=StatefulSet",
+    "patchesJson6902:target:name=ss-name",
+    "patchesJson6902:path=ss-patch.yaml",
+    "patchesJson6902:target:kind=Deployment",
+    "patchesJson6902:target:name=dep-name",
+    "patchesJson6902:path=dep-patch.yaml"
+  ],
+  ...
+}
+```
+
+Note that unique paths and values are deduplicated.
+
+On the search side, exact queries will be prioritized, but the document paths
+and key=value pairs will also be analyzed with 3-grams to have some amount of
+fuzzy search. Levenshtein distance was not used instead because the search
+covers multiple fields at the same time, which is a use case where
+Elasticsearch does not support proper fuzzy searching.
+
+### Document Search
+Given a text query, each token is considered separately. Each token is fed
+through a handful of analyzers on the Elasticsearch side and compared
+with the inverted index of each document field. Elasticsearch then determines
+the best matching documents. Token ordering is largely insignificant. This makes
+sense for the structured search, but may leave room for improvement for the
+text-only search within the document.
+
+Each token _must_ be matched, so each white space character acts as a
+conjunction of individual queries. There are also ways of telling
+Elasticsearch that some things _should_ match, but I think for now it makes
+more sense to leave it as is.
+
+I think this behavior is sufficient to make the search feel fairly intuitive
+while providing support for fairly complex use cases.
+
+### Metrics Computation
+From each kustomization document that is indexed, we can find its
+resources that are publicly available. This includes other kustomizations.
+From this, we can build a directed graph of dependencies and reverse
+dependencies.
+
+This opens up the possibility to add a plethora of graph metrics that can
+give the project maintainers feedback and insight into how people are using
+their tools.
+
+Some of these are useful, such as getting an idea of how large the dependency
+graphs actually grow in practice, and can be used to find _popular_
+kustomizations within the corpus. This lends itself to implementing PageRank
+to help bubble up popular results as good search results. I unfortunately
+did not have the time to implement the algorithm, but I do plan to revisit
+this sometime soon to add a few good and efficient implementations of useful
+graph algorithms. See `ROADMAP.md` for a more
+complete list of features that could be added and how I think they could be
+implemented.
--- a/internal/tools/ROADMAP.md
+++ b/internal/tools/ROADMAP.md
@@ -0,0 +1,176 @@
+# Road map and comments about this work
+
+From working on this project, here is a collection of thoughts and suggestions
+for future improvements. For any questions about this, or to request help, do
+not hesitate to contact @damienr74 on GitHub; my email should be listed.
+
+I think this project has the potential to help the k8s community promote best
+practices. If this becomes popular, it could become easier to find
+*subjectively good* configurations. This can act as a way to guide newcomers
+to k8s config features that are easy to maintain, practical, and tested in some
+real world environment. However, a lot of work remains to be done if this is
+to happen. Extracting and ranking semantic-level information from open
+source configuration files is definitely not trivial, and will require a lot of
+thought and consideration from the experts, as well as a study of the patterns
+that successful k8s projects follow. This is outside of my scope, since I have
+little to no experience with k8s other than working on this project; however,
+if you have ideas, I can probably suggest approaches to implement them, having
+worked a lot on this project.
+
+### Improving configuration files and container configs
+I did not have a lot of time to refactor the images to use ConfigMaps for
+everything. This is a good thing to improve and should be fairly easy. Another
+thing that could improve the experience of launching this would be to make all
+of the Go utilities subcommands of the same binary/container image. This
+would reduce the number of things that have to be rebuilt in order to get
+it running, and it would make the application (and its components) more
+self-contained. (This also has some disadvantages, so I'll let someone else
+decide.)
+
+### Adding graph metrics
+From the Redis graph representation, we are able to run a multitude of graph
+algorithms (not all of which are implemented).
+
+The simplest one would be to find connected components with a union-find
+structure (the one used in Kruskal's algorithm), and to compute graph metrics
+on each component. Here are some of the metrics that may be useful:
+
+- Average size and histograms of the sizes of each component.
+
+- Average size and histograms of the node with the highest in-degree (rdeps) of
+  each component.
+
+- Average size and histograms of the number of repositories in a connected
+  component.
+
+- Any other metric that may be helpful to measure the scale of the kustomize
+  import graph.
+
+Another cool thing that may be helpful would be to output the graph
+representation of deps/rdeps. This should be fairly easy to do with graphviz/dot,
+so if anyone really wants this, I (damienr74) should be able to do it. Feel free
+to send me an email or to @ mention me in an issue.
+
+Note: DFS could also be used to find connected components, but I think
+union-find is preferable, since the results can be stored and modified very
+efficiently. The only challenging part would be to implement deleting of edges
+and nodes from a component efficiently, but I know it is possible to support
+these operations with a union find structure.
+
+### Implementing PageRank
+The graph is set up to be able to efficiently compute PageRank since the edge
+weights are real valued, and the graph representation is sparse, which means
+that it will fit in the memory of a single machine and make the processing
+much more efficient.
+
+It could also be implemented as a Redis script, but I feel like there's
+something fundamentally wrong with implementing PageRank in Lua. :P
+
+### Implement feature tracking
+Each day, when the crawler finds and indexes these structured documents,
+it should insert aggregate data to a separate index. This data could look like the
+following:
+
+```json
+{
+  "kind": "kustomization",
+  "added_identifiers": [
+    {
+      "identifier": "some:new:k8s:feature",
+      "documents": [
+        "docID1",
+        "docID100",
+        "docID45",
+        ...
+      ]
+    },
+    {
+      "identifier": "another:k8s:feature",
+      "documents": [
+        ...
+      ]
+    },
+    ...
+  ],
+
+  "removed_identifiers": [
+    {
+      "identifier": "some:deprecated:field",
+      "documents": [
+        ...
+      ]
+    }
+  ]
+}
+```
+
+This would make it fairly easy to get deep insight into:
+- the speed at which things can effectively be deprecated.
+- how many people are migrating to current best practices.
+- how many documents get updated frequently/rarely.
+- detailed cross sections of growth/regression over conjunctions of features.
+- a world of possibilities.
+
+This is also something that I would be interested to work on sometime soon, so
+feel free to contact me (damienr74) or ask questions about this.
+
+As needed, it could be a good idea to also aggregate past data with a larger
+granularity. For instance, each month the past 30 days can be aggregated into
+roughly week-long durations, and every year these weekly aggregations can be
+converted into monthly summaries, depending on how much data this ends up being
+and how much you want to pay for the storage of this data.
+
+Another cool way to compress this data would be to dynamically compress this
+data into a logarithmic number of buckets with decreasing granularity. But it
+seems like overkill for the amount of data that we'd likely get.
+
+### The UI probably needs a lot of work
+I'm not much of a UI/UX person and have little to no experience in developing
+these types of applications. If anyone with Angular experience wants to dive in
+and completely restructure the app to make the UI/UX/Code health better that
+would be greatly appreciated.
+
+### Query tuning probably still has to be adjusted
+I'm also not an expert in Elasticsearch. From what I could read in the docs,
+I think I've made sane decisions in converting user queries into meaningful
+Elasticsearch queries, but I'm sure there are a lot of improvements that remain
+to be done in order to get more accurate results.
+
+
+### Some other signals that indicate the presence of a good configuration file
+There are lots of heuristics that could be used to achieve this. Here are a
+couple in no particular order:
+
+- Penalize for the number of yaml `---` document splits. I'm not sure what the
+  general consensus is, but I think it's better to separate them, since it
+  makes git commits less noisy, it's a trivial transformation, and it makes
+  config files smaller. However, I can understand the argument that it's somewhat
+  practical to keep an overall view of the configurations together (maybe).
+
+- Penalize the number of unique identifiers in a structured document. I think
+  this makes sense, since we don't want to have someone game the search engine
+  to match documents with every possible path from the k8s docs. PageRank might
+  help with this to some extent, but with a small corpus it would be fairly easy
+  to game.
+
+- Assign weights to the usefulness of certain fields. It would be good to
+  promote documents that use `configMapKeyRef`, liveness probes, etc.
+
+These are the main ones I can think of, but I'm sure there are a *ton* of
+ways to achieve this.
+
+If the corpus gets large enough, we might even be able to use *blockchains*,
+*machine learning*, and maybe even self-driving cars.
+
+### Add more support for indexing of other k8s/kustomize related data
+One thing that jumps to mind is the use of kustomize plugins. They are easy
+to track since they all have an unused global variable, `var KustomizePlugin`.
+It would be easy to run the pluginator command and generate godocs for each
+Go file with this unique identifier.
+
+For the sake of completeness, here is the full GitHub query that we can use to
+find these:
+`api.github.com/search/code?q=var+KustomizePlugin+extension%3A.go&access_token=access_token`
+
+Godoc will not show much, since most packages will be using package main, but
+using pluginator we can make it a properly named package such that Godoc would
+actually generate the relevant documentation.
--- a/internal/tools/pictures/github_token.png
+++ b/internal/tools/pictures/github_token.png
--- a/internal/tools/pictures/sys_arch.png
+++ b/internal/tools/pictures/sys_arch.png
--- a/internal/tools/pictures/token_config.png
+++ b/internal/tools/pictures/token_config.png